Efficient Pattern Matching: The Knuth-Morris-Pratt Algorithm
Pattern matching is a fundamental problem in computer science, with application ranging from text processing to bioinformatics. Given a text string and a pattern string, the goal is to determine whether the pattern appears somewhere within the text, and if so, locate its position. We can solve this problem using two ways.
Before looking into the KMP algorithm, let's first understand the na?ve approach to pattern matching. The na?ve algorithm compares the pattern to all sub-strings of the text, starting from each position, to check if there is a match. While this algorithm is conceptually simple, in worst case, this algorithm has time complexity of O(n2). The reason for this is, when the pattern doesn't appear in the text at all or appears only at the very end then the algorithm will have to compare the entire pattern against the text. Let's look into it visually to understand its limitations.
As we can see above, at every iteration the algorithm compares whole string which is not very efficient.
Now let's shift our focus to the KMP algorithm, in order to understand KMP algorithm, we first need to understand the concept of prefix function.
What is a Prefix Function?
Given a string s, the prefix function pi[i] for position i (1-based indexing) is defined as the length of the longest proper prefix of the substring s[0:i] that is also a suffix of this substring. Here, a "proper prefix" is a prefix that is not equal to the string itself.
Let's simplify this with an example. Consider the string "abcabcd." For each position i in this string, the prefix function pi[i] would be as follows:
Why is the Prefix Function Important?
The prefix function is crucial in the KMP algorithm because it helps determine how far we can shift the pattern when a mismatch occurs. This allows us to avoid unnecessary comparisons, making the algorithm more efficient than the na?ve approach.
领英推荐
How is the Prefix Function Computed?
The prefix function is computed using a simple algorithm that iterates through the string. Here's a high-level overview of how it works:
This algorithm efficiently computes the prefix function for each position in the string, providing us with valuable information that helps optimize the pattern matching process in the KMP algorithm.
function compute_prefix(pattern):
m = length of pattern
pi = array of length m
k = 0
for q = 1 to m - 1:
while k > 0 and pattern[k] != pattern[q]:
k = pi[k - 1]
if pattern[k] == pattern[q]:
k = k + 1
pi[q] = k
return pi
Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm uses the prefix function to avoid unnecessary comparisons while searching for a pattern within a text. It achieves this by precomputing the prefix function for the pattern and then using it to determine the next position to compare when a mismatch occurs.
Let's look into some visualize to better understand the KMP algorithm.
By using the prefix function to determine how far to shift the pattern pointer when a mismatch occurs, the KMP algorithm achieves a time complexity of O(m + n), where m is the length of the pattern and n is the length of the text, making it much more efficient than the na?ve approach.
function kmp_search(text, pattern):
n = length of text
m = length of pattern
pi = compute_prefix_function(pattern)
q = 0
for i = 0 to n - 1:
while q > 0 and pattern[q] != text[i]:
q = pi[q - 1]
if pattern[q] == text[i]:
q = q + 1
if q == m:
print("Pattern found at index", i - m + 1)
q = pi[q - 1]
function compute_prefix_function(pattern):
m = length of pattern
pi = array of length m
k = 0
for q = 1 to m - 1:
while k > 0 and pattern[k] != pattern[q]:
k = pi[k - 1]
if pattern[k] == pattern[q]:
k = k + 1
pi[q] = k
return pi
In conclusion, the KMP algorithm provides an efficient solution to the pattern matching problem by leveraging the prefix function. By avoiding unnecessary comparisons, the KMP algorithm achieves a linear time complexity, making it suitable for large texts and patterns.