# Boyer-Moore string search algorithm (Python)

Other implementations: Java | Python

# Introduction

A basic (without any good-suffix-shift rule) implementation of the Boyer-Moore string matching algorithm, with right-to-left scan and a standard bad-character-shift rule. This algorithm has a sub-linear typical runtime (see Gusfield, p. 17). It can be extended using a refined version of the bad-character-shift rule which improves efficiency for small alphabets, e.g. for usage in bioinformatics (see Gusfield, p. 18) and by the strong good-suffix rule for provable worst-case linear runtime (see Gusfield, p. 20). As an alternative for even faster matching (dependent on the pattern length, not the text length, after linear-time preprocessing) consider suffix-tree based algorithms (see Gusfield, p. 89).

# Matching

We match a pattern of length n in a text of length m:

```<<lengths>>=
m = len(text)
n = len(pattern)
```

Preprocess the pattern for the right-to-left-scan and bad-character-shift rules by finding the right-most positions of all characters in the pattern:

```<<preprop_call>>=
```

We align p and t, starting on index 0 (meaning the beginning of the pattern is aligned with position 0, i.e. the beginning, of the text), and shift p to the left, until we reach the end of t:

```<<align_start>>=
alignedAt = 0
while alignedAt + (n - 1) < m:
```

On each aligned position, we scan the pattern from right to left, comparing the aligned characters at the current position in the text x and at the current position in the pattern y:

```<<loop>>=
for indexInPattern in xrange(n-1, -1, -1):
indexInText = alignedAt + indexInPattern
x = text[indexInText]
y = pattern[indexInPattern]
```

If the pattern is longer than the text, we have no match here:

```<<break>>=
if indexInText >= m:
break
```

In the case of a mismatch, we do the shifting:

```<<mismatch>>=
if x != y:
```

We first retrieve the right-most index of the mismatching text-character in the pattern:

```<<get_index>>=
r = rightMostIndexes.get(x)
```

If the mismatching character in the text is not in the pattern we can shift until we are aligned behind the mismatch-position, resulting in sub-linear runtime, as this will result in some characters never being inspected:

```<<big_skip>>=
if x not in rightMostIndexes:
alignedAt = indexInText + 1
```

Else we shift the pattern to the right until the right-most occurrence of x in the pattern is under the mismatch position in the text (if this shift is a forward shift, i.e. to the right), as this is the next possible place where an occurrence of the pattern can begin in the text:

```<<small_skip>>=
else:
shift = indexInText - (alignedAt + r)
alignedAt += (shift > 0 and shift or alignedAt + 1)
```

If the characters are equal and the pattern has been scanned completely from right to left, we have a match at the currently aligned position in the text. We store the match and shift the pattern one position to the right:

```<<match>>=
elif indexInPattern == 0:
matches.append(alignedAt)
alignedAt += 1
```

# Preprocessing

For each character in the string to preprocess, we store its right-most position by scanning the string from right to left, storing the character as a key and its position as a value in a hash-map, if it is not in the map already:

```<<preprop>>=
map = { }
for i in xrange(len(pattern)-1, -1, -1):
c = pattern[i]
if c not in map:
map[c] = i
```

# Usage

A bit of basic testing: match ana in bananas, print the matches found and simulate a simple unit test.

```<<usage>>=
matches = match("ana", "bananas")
for integer in matches:
print "Match at:", integer
print (matches == [1, 3] and "OK" or "Failed")
```

# Program

This results in the full program when we put the pieces together:

```<<boyer_moore.py>>=
def match(pattern, text):
matches = []
lengths
preprop_call
align_start
loop
break
mismatch
get_index
big_skip
small_skip
break
match
return matches