Matching with allowed mismatch (Java)
From LiteratePrograms
Contents |
Overview
An implementation of string matching with a defined number of at most k mismatches based on longest common extension (lce) computation as described in Gusfield 1999:200. This is a form of inexact string matching that allows no insertions or deletions but only matches and mismatches. When used with lce computation with suffix trees it has a runtime complexity of O(km), where m is the length of the text. This implementation uses a simple computation of the lce.
Implementation
The approach is very similar to matching with wildcards: For every position in the text, up to k lce queries are executed, and if the lce reaches the end of the pattern, then more than k mismatches are needed to match. As at most k lce computations are required the runtime complexity is O(km), where k is the length of the text (Gusfield 1999:200). The method takes three parameters: the text, the pattern and the number of allowed mismatches. It returns a collection of strings, the matching substrings. The algorithm, as described in Gusfield 1999:200, consists of four steps:
- Step 1: Set j to 1 and h to i and count to 0.
<<init>>= int j = 0; int h = i; int count = 0; int n = p.length();
- Step 2: Compute the length L of the longest common extension starting at positions j of P and h of T:
<<compute_lce>>= int L = SimpleLongestCommonExtension.longestCommonExtension(p, j, t, h);
- Step 3: If j + L = n + 1, then a k-mismatch of P occurs in T starting at i (in fact, only count mismatches occur); stop.
<<collect>>= if (j + 1 + L == n + 1) { result.add(t.substring(i, i + n)); break; }
- Step 4: If count >= k, then increment count by one, set j to j + L + 1, set h to h + L + 1 and go to step 2. If count = k + 1, then a k-mismatch of P does not occur stating at i; stop.
<<match>>= else if (count < k) { count++; j = j + L + 1; h = h + L + 1; } else if (count == k) { break; }
Usage
- A JUnit 4 unit test to demonstrate the usage:
<<test>>= Collection<String> results = getMatches("abentbananaend", "bend", 2); assertEquals(Arrays.asList("bent", "bana", "aend"), results);
- The complete program:
<<MatchingWithAllowedMismatch.java>>= import static org.junit.Assert.assertEquals; import org.junit.Test; import java.util.ArrayList; import java.util.Arrays; import java.util.Collection; public class MatchingWithAllowedMismatch { public static Collection<String> getMatches(String t, String p, int k) { Collection<String> result = new ArrayList<String>(); for (int i = 0; i < t.length(); i++) { init while (true) { compute_lce collect match } } return result; } @Test public void testGetMismatches() { test } }
- The required simple implementation of longest common extension computation:
<<SimpleLongestCommonExtension.java>>= Longest common extension (Java)#8162#SimpleLongestCommonExtension.java
References
- Gusfield, Dan (1999), Algorithms on Strings, Sequences and Trees. Cambridge: University Press.
Download code |