|Special thanks to Jakeboxer.com and every other sources in the references for helping me to understand KMP Algorithm.|
What is the KMP Algorithm?
The KMP Algorithm (or Knuth, Morris, and Pratt string searching algorithm) cleverly uses the previous comparison data. It can search for a pattern in O(n) time as it never re-compares a text symbol that has matched a pattern symbol. However, it uses a partial match table to analyze the pattern structure. Construction of a partial match table takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n) which is a linear time complexity and where m is size of pattern string and n is size of main string.
|The Partial Match Table
The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.
Here’s the partial match table for the pattern “abababca”:
char: | a | b | a | b | a | b | c | a |
We need to know about proper prefixes and proper suffixes.
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.
Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.