# Matching with allowed mismatch (Java)

## Overview

An implementation of string matching with a defined number of at most k mismatches based on longest common extension (lce) computation as described in Gusfield 1999:200. This is a form of inexact string matching that allows no insertions or deletions but only matches and mismatches. When used with lce computation with suffix trees it has a runtime complexity of O(km), where m is the length of the text. This implementation uses a simple computation of the lce.

## Implementation

The approach is very similar to matching with wildcards: For every position in the text, up to k lce queries are executed, and if the lce reaches the end of the pattern, then more than k mismatches are needed to match. As at most k lce computations are required the runtime complexity is O(km), where k is the length of the text (Gusfield 1999:200). The method takes three parameters: the text, the pattern and the number of allowed mismatches. It returns a collection of strings, the matching substrings. The algorithm, as described in Gusfield 1999:200, consists of four steps:

• Step 1: Set j to 1 and h to i and count to 0.
```<<init>>=
int j = 0;
int h = i;
int count = 0;
int n = p.length();
```
• Step 2: Compute the length L of the longest common extension starting at positions j of P and h of T:
```<<compute_lce>>=
int L = SimpleLongestCommonExtension.longestCommonExtension(p, j, t, h);
```
• Step 3: If j + L = n + 1, then a k-mismatch of P occurs in T starting at i (in fact, only count mismatches occur); stop.
```<<collect>>=
if (j + 1 + L == n + 1) {
break;
}
```
• Step 4: If count >= k, then increment count by one, set j to j + L + 1, set h to h + L + 1 and go to step 2. If count = k + 1, then a k-mismatch of P does not occur stating at i; stop.
```<<match>>=
else if (count < k) {
count++;
j = j + L + 1;
h = h + L + 1;
} else if (count == k) {
break;
}
```

## Usage

• A JUnit 4 unit test to demonstrate the usage:
```<<test>>=
Collection<String> results = getMatches("abentbananaend", "bend", 2);
assertEquals(Arrays.asList("bent", "bana", "aend"), results);
```
• The complete program:
```<<MatchingWithAllowedMismatch.java>>=
import static org.junit.Assert.assertEquals;
import org.junit.Test;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
public class MatchingWithAllowedMismatch {
public static Collection<String> getMatches(String t, String p, int k) {
Collection<String> result = new ArrayList<String>();
for (int i = 0; i < t.length(); i++) {
init
while (true) {
compute_lce
collect
match
}
}
return result;
}
@Test
public void testGetMismatches() {
test
}
}
```
• The required simple implementation of longest common extension computation:
```<<SimpleLongestCommonExtension.java>>=
Longest common extension (Java)#8162#SimpleLongestCommonExtension.java
```

## References

• Gusfield, Dan (1999), Algorithms on Strings, Sequences and Trees. Cambridge: University Press.