Efficient Pattern Matching: The Knuth-Morris-Pratt Algorithm

Efficient Pattern Matching: The Knuth-Morris-Pratt Algorithm

Pattern matching is a fundamental problem in computer science, with application ranging from text processing to bioinformatics. Given a text string and a pattern string, the goal is to determine whether the pattern appears somewhere within the text, and if so, locate its position. We can solve this problem using two ways.

  1. Na?ve string matching
  2. Knuth-Morris-Pratt algorithm

Before looking into the KMP algorithm, let's first understand the na?ve approach to pattern matching. The na?ve algorithm compares the pattern to all sub-strings of the text, starting from each position, to check if there is a match. While this algorithm is conceptually simple, in worst case, this algorithm has time complexity of O(n2). The reason for this is, when the pattern doesn't appear in the text at all or appears only at the very end then the algorithm will have to compare the entire pattern against the text. Let's look into it visually to understand its limitations.

Na?ve string matching algorithm (animation taken from lobb.nz)

As we can see above, at every iteration the algorithm compares whole string which is not very efficient.

Now let's shift our focus to the KMP algorithm, in order to understand KMP algorithm, we first need to understand the concept of prefix function.

What is a Prefix Function?

Given a string s, the prefix function pi[i] for position i (1-based indexing) is defined as the length of the longest proper prefix of the substring s[0:i] that is also a suffix of this substring. Here, a "proper prefix" is a prefix that is not equal to the string itself.

Let's simplify this with an example. Consider the string "abcabcd." For each position i in this string, the prefix function pi[i] would be as follows:

  • pi[1] = 0 (no proper prefix exists)
  • pi[2] = 0 (no proper prefix exists)
  • pi[3] = 0 (no proper prefix exists)
  • pi[4] = 1 (the proper prefix "a" is also a suffix)
  • pi[5] = 0 (no proper prefix exists)
  • pi[6] = 1 (the proper prefix "ab" is also a suffix)
  • pi[7] = 2 (the proper prefix "abc" is also a suffix)
  • pi[8] = 3 (the proper prefix "abca" is also a suffix)

Why is the Prefix Function Important?

The prefix function is crucial in the KMP algorithm because it helps determine how far we can shift the pattern when a mismatch occurs. This allows us to avoid unnecessary comparisons, making the algorithm more efficient than the na?ve approach.

How is the Prefix Function Computed?

The prefix function is computed using a simple algorithm that iterates through the string. Here's a high-level overview of how it works:

  • Start with k = 0 and iterate through each position q in the string (starting from q = 1).
  • If the character at position k is equal to the character at position q, increment k and set pi[q] = k.
  • If the characters don't match and k > 0, update k to pi[k-1] and repeat the comparison.
  • If the characters don't match and k == 0, set pi[q] = 0.

This algorithm efficiently computes the prefix function for each position in the string, providing us with valuable information that helps optimize the pattern matching process in the KMP algorithm.

function compute_prefix(pattern):
    m = length of pattern
    pi = array of length m
    k = 0
    for q = 1 to m - 1:
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k = k + 1
        pi[q] = k
    return pi        

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm uses the prefix function to avoid unnecessary comparisons while searching for a pattern within a text. It achieves this by precomputing the prefix function for the pattern and then using it to determine the next position to compare when a mismatch occurs.

Let's look into some visualize to better understand the KMP algorithm.

KMP algorithm visualization with prefix table (Animation taken from cmps-people.ok.ubc.ca

By using the prefix function to determine how far to shift the pattern pointer when a mismatch occurs, the KMP algorithm achieves a time complexity of O(m + n), where m is the length of the pattern and n is the length of the text, making it much more efficient than the na?ve approach.

function kmp_search(text, pattern):
    n = length of text
    m = length of pattern
    pi = compute_prefix_function(pattern)
    q = 0
    for i = 0 to n - 1:
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q = q + 1
        if q == m:
            print("Pattern found at index", i - m + 1)
            q = pi[q - 1]

function compute_prefix_function(pattern):
    m = length of pattern
    pi = array of length m
    k = 0
    for q = 1 to m - 1:
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k = k + 1
        pi[q] = k
    return pi        

In conclusion, the KMP algorithm provides an efficient solution to the pattern matching problem by leveraging the prefix function. By avoiding unnecessary comparisons, the KMP algorithm achieves a linear time complexity, making it suitable for large texts and patterns.

要查看或添加评论,请登录

Ali Raza的更多文章

社区洞察

其他会员也浏览了