assume we have the text t = abcabaabcabac and we're searching for the pattern p = abaa.

in this text pattern p occurs only once when the shift s = 3.
like i always do, i'm implementing the algorithms as described in introduction to algorithms, second edition(p. 906).
the naive brute force algorithm
except the brute force algorithm, all algorithms such as rabin-karp, knuth-morris-pratt, boyer-moore, finite automation perform some preprocessing based on the pattern and then find the valid shifts which is called; "matching" phase.the idea of brute force algorithm is very simple. find all the valid shifts using a loop that checks the condition p[1 .. m] = t[s + 1 .. s + m] for each of the n - m + 1 possible values of s.
here it's in java.
int naive_string_search(String t, String p) {
int n = t.length();
int m = p.length();
int j = 0;
for(int s = 0; s <= n-m; s++) {
while((j < m) && (t.charAt(s+j) == p.charAt(j))) j++;
if(j == m) return s;
}
return -1;
}
running time of brute force algorithm is o((n-m+1) m). the running time of naive string search algorithm is equal to its matching time, since there's no preprocessing.as you can see, naive string search algorithm isn't an optimal solution for such problem. we can say naive string search is the bubble sort of the string search algorithms.
rabin-karp algorithm
the rabin-karp algorithm created by michael rabin and richard karp in 1987 that uses hashing. it uses θ(m) preprocessing time and its worst-case running time is θ((n-m+1)m).rabin-karp algorithm comes with the idea of if two strings are equal then their hash values are also equal.
here's the rabin-karp algorithm in java.
int calculate_hash(int h, int m, int d, int q) {
for(int i = 0; i < m-1; i++)
h = (h*d) % q;
return h;
}
boolean match(int s, int m, String text, String pattern) {
for(int i = 0; i < m; i++) {
if(text.charAt(s+i) != pattern.charAt(i))
return false;
}
return true;
}
int rabin_karp_matcher(String text, String pattern, int d, int q) {
int n = text.length();
int m = pattern.length();
int h = calculate_hash(1, m, d, q); // h = d^m-1 mod q
int p = 0;
int t = 0;
// this is the preprocessing part
for(int i = 0; i < m; i++) {
p = (d*p + pattern.charAt(i)) % q;
t = (d*t + text.charAt(i)) % q;
}
//eof preprocessing
//and the matching part
for(int s = 0; s <= n-m; s++) {
if(p == t) {
if(match(s, m, text, pattern)) return s;
}
if(s < n-m) {
t = (d*(t - ((text.charAt(s)*h) % q)) + text.charAt(s+m)) % q;
if(t < 0) t += q;
}
}
return -1;
}
it works as follows;
1) the inputs are text, and pattern we want to match.
2) d is the radix we use(which is |Σ|)
3) q is the prime.
all characters are interpreted as radix-d digits. h is the value of the high-order digit position of an m-digit window. matching part iterates through the all possible shifts s, the following variant;
ts = text[s + 1 .. s + m] mod q.
if p = ts, then we check to see if p[1 .. m] = t[s + 1 .. s + m] to rule out the spurious hit.
in the second part of the series i'll cover boyer-moore, knuth-morris-pratt and finite automation algorithms.
related link:
introduction to algorithms - string searching, knuth-morris-pratt