What is KMP Algorithm for pattern matching?
I see the link I have posted was already shared. Good to know others also found it useful!
KMP is an advanced pattern matching algorithm. I have seen people getting confused while learning it. Here is an article about KMP algorithm which explains the concept & time complexity in simple terms. I mean it is simpler than most other KMP resources :)
KMP is an advanced algorithm & lot of resources are present in the internet. I would recommend https://www.w3spot.com/2020/07/kmp-algorithm-explained-in-plain-english.html as a primer. It can be as simple as it gets if you follow through the example there.
KMP(Knuth Morris Pratt) Pattern SearchingThe Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by a mismatching character. Following are some examples.
txt[] = “AAAAAAAAAAAAAAAAAB” pat[] = “AAAAB”
txt[] = “ABABABCABABABCABABABC” pat[] = “ABABAC” (not a worst case, but a bad case for Naive)The KMP matching algorithm uses degenerating property (pattern having same sub-patterns appearing more than once in the pattern) of the pattern and improves the worst case complexity to O(n). The basic idea behind KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text of the next window. We take advantage of this information to avoid matching the characters that we know will anyway match. Let us consider below example to understand this.
Matching Overviewtxt = “AAAAABAAABA”pat = “AAAA”
We compare first window of txt with pattxt = “AAAAABAAABA”pat = “AAAA” [Initial position]We find a match. This is same as Naive String Matching.
In the next step, we compare next window of txt with pat.txt = “AAAAABAAABA”pat = “AAAA” [Pattern shifted one position]This is where KMP does optimization over Naive. In this second window, we only compare fourth A of pattern with fourth character of current window of text to decide whether current window matches or not. Since we know first three characters will anyway match, we skipped matching first three characters.
Need of Preprocessing?An important question arises from the above explanation, how to know how many characters to be skipped. To know this, we pre-process pattern and prepare an integer array lps[] that tells us the count of characters to be skipped.
//program can be convert in c#// C++ program for implementation of KMP pattern searching // algorithm void computeLPSArray(char* pat, int M, int* lps); // Prints occurrences of txt[] in pat[] void KMPSearch(char* pat, char* txt) { int M = strlen(pat); int N = strlen(txt); // create lps[] that will hold the longest prefix suffix // values for pattern int lps[M]; // Preprocess the pattern (calculate lps[] array) computeLPSArray(pat, M, lps); int i = 0; // index for txt[] int j = 0; // index for pat[] while (i <; N) { if (pat[j] == txt[i]) { j++; i++; } if (j == M) { printf("Found pattern at index %d ", i - j); j = lps[j - 1]; } // mismatch after j matches else if (i <; N && pat[j] != txt[i]) { // Do not match lps[0..lps[j-1]] characters, // they will match anyway if (j != 0) j = lps[j - 1]; else i = i + 1; } } } // Fills lps[] for given patttern pat[0..M-1] void computeLPSArray(char* pat, int M, int* lps) { // length of the previous longest prefix suffix int len = 0; lps[0] = 0; // lps[0] is always 0 // the loop calculates lps[i] for i = 1 to M-1 int i = 1; while (i <; M) { if (pat[i] == pat[len]) { len++; lps[i] = len; i++; } else // (pat[i] != pat[len]) { // This is tricky. Consider the example. // AAACAAAA and i = 7. The idea is similar // to search step. if (len != 0) { len = lps[len - 1]; // Also, note that we do not increment // i here } else // if (len == 0) { lps[i] = 0; i++; } } } } // Driver program to test above function int main() { char txt[] = "ABABDABACDABABCABAB"; char pat[] = "ABABCABAB"; KMPSearch(pat, txt); return 0; }
//program can be convert in c#
// C++ program for implementation of KMP pattern searching
// algorithm
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of txt[] in pat[]
void KMPSearch(char* pat, char* txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int lps[M];
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
int j = 0; // index for pat[]
while (i <; N) {
if (pat[j] == txt[i]) {
j++;
i++;
}
if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
// mismatch after j matches
else if (i <; N && pat[j] != txt[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
else
i = i + 1;
// Fills lps[] for given patttern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
// length of the previous longest prefix suffix
int len = 0;
lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i <; M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
else // (pat[i] != pat[len])
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = lps[len - 1];
// Also, note that we do not increment
// i here
else // if (len == 0)
lps[i] = 0;
// Driver program to test above function
int main()
char txt[] = "ABABDABACDABABCABAB";
char pat[] = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
Output:Found pattern at index 10
//Whats your opinion ??
Knuth Morris Pratt (KMP) is an algorithm, which checks the characters from left to right. When a pattern has a sub-pattern appears more than one in the sub-pattern, it uses that property to improve the time complexity, also for in the worst case.