introduction to algorithms - string searching, part 1

03.03.2010

every text editor, and ide performs string searches. basically, there always a text and the users searches for a pattern in the text. if you use an efficient algorithm can aid the responsiveness of the text-editing program. in this post i'll cover the most known string search algorithms. string search algorithms are also used to search for particular patterns in dna sequences.

assume we have the text t = abcabaabcabac and we're searching for the pattern p = abaa.

in this text pattern p occurs only once when the shift s = 3.

like i always do, i'm implementing the algorithms as described in introduction to algorithms, second edition(p. 906).

the naive brute force algorithm

except the brute force algorithm, all algorithms such as rabin-karp, knuth-morris-pratt, boyer-moore, finite automation perform some preprocessing based on the pattern and then find the valid shifts which is called; "matching" phase.

the idea of brute force algorithm is very simple. find all the valid shifts using a loop that checks the condition p[1 .. m] = t[s + 1 .. s + m] for each of the n - m + 1 possible values of s.

here it's in java.
int naive_string_search(String t, String p) {
  int n = t.length();
  int m = p.length();
  int j = 0;
  for(int s = 0; s <= n-m; s++) {
    while((j < m) && (t.charAt(s+j) == p.charAt(j))) j++;
    if(j == m) return s;
  }
  return -1;
}
running time of brute force algorithm is o((n-m+1) m). the running time of naive string search algorithm is equal to its matching time, since there's no preprocessing.

as you can see, naive string search algorithm isn't an optimal solution for such problem. we can say naive string search is the bubble sort of the string search algorithms.

rabin-karp algorithm

the rabin-karp algorithm created by michael rabin and richard karp in 1987 that uses hashing. it uses θ(m) preprocessing time and its worst-case running time is θ((n-m+1)m).

rabin-karp algorithm comes with the idea of if two strings are equal then their hash values are also equal.

here's the rabin-karp algorithm in java.
int calculate_hash(int h, int m, int d, int q) {
  for(int i = 0; i < m-1; i++)
    h = (h*d) % q;
  return h;
}

boolean match(int s, int m, String text, String pattern) {
  for(int i = 0; i < m; i++) {
    if(text.charAt(s+i) != pattern.charAt(i))
      return false;
  }
  return true;
}

int rabin_karp_matcher(String text, String pattern, int d, int q) {
  int n = text.length();
  int m = pattern.length();
  int h = calculate_hash(1, m, d, q); // h = d^m-1 mod q
  
  int p = 0;
  int t = 0;
  
  // this is the preprocessing part
  for(int i = 0; i < m; i++) {
    p = (d*p + pattern.charAt(i)) % q;
    t = (d*t + text.charAt(i)) % q;
  }
  //eof preprocessing
  
  
  //and the matching part
  for(int s = 0; s <= n-m; s++) {
    if(p == t) {
      if(match(s, m, text, pattern)) return s;
    }
    if(s < n-m) {
      t = (d*(t - ((text.charAt(s)*h) % q)) + text.charAt(s+m)) % q;
      if(t < 0) t += q;
    }
  }
  return -1;
}

it works as follows;
1) the inputs are text, and pattern we want to match.
2) d is the radix we use(which is |Σ|)
3) q is the prime.

all characters are interpreted as radix-d digits. h is the value of the high-order digit position of an m-digit window. matching part iterates through the all possible shifts s, the following variant;

ts = text[s + 1 .. s + m] mod q.

if p = ts, then we check to see if p[1 .. m] = t[s + 1 .. s + m] to rule out the spurious hit.

in the second part of the series i'll cover boyer-moore, knuth-morris-pratt and finite automation algorithms.

related link:
introduction to algorithms - string searching, knuth-morris-pratt
blog comments powered by Disqus