登录查看更多内容

Efficient Pattern Matching: The Knuth-Morris-Pratt Algorithm

Ali Raza

Software Engineer | Seeking MS or PhD Opportunities

发布日期: 2024年4月25日

Pattern matching is a fundamental problem in computer science, with application ranging from text processing to bioinformatics. Given a text string and a pattern string, the goal is to determine whether the pattern appears somewhere within the text, and if so, locate its position. We can solve this problem using two ways.

Na?ve string matching
Knuth-Morris-Pratt algorithm

Before looking into the KMP algorithm, let's first understand the na?ve approach to pattern matching. The na?ve algorithm compares the pattern to all sub-strings of the text, starting from each position, to check if there is a match. While this algorithm is conceptually simple, in worst case, this algorithm has time complexity of O(n2). The reason for this is, when the pattern doesn't appear in the text at all or appears only at the very end then the algorithm will have to compare the entire pattern against the text. Let's look into it visually to understand its limitations.

Na?ve string matching algorithm (animation taken from lobb.nz)

As we can see above, at every iteration the algorithm compares whole string which is not very efficient.

Now let's shift our focus to the KMP algorithm, in order to understand KMP algorithm, we first need to understand the concept of prefix function.

What is a Prefix Function?

Given a string s, the prefix function pi[i] for position i (1-based indexing) is defined as the length of the longest proper prefix of the substring s[0:i] that is also a suffix of this substring. Here, a "proper prefix" is a prefix that is not equal to the string itself.

Let's simplify this with an example. Consider the string "abcabcd." For each position i in this string, the prefix function pi[i] would be as follows:

pi[1] = 0 (no proper prefix exists)
pi[2] = 0 (no proper prefix exists)
pi[3] = 0 (no proper prefix exists)
pi[4] = 1 (the proper prefix "a" is also a suffix)
pi[5] = 0 (no proper prefix exists)
pi[6] = 1 (the proper prefix "ab" is also a suffix)
pi[7] = 2 (the proper prefix "abc" is also a suffix)
pi[8] = 3 (the proper prefix "abca" is also a suffix)

Why is the Prefix Function Important?

The prefix function is crucial in the KMP algorithm because it helps determine how far we can shift the pattern when a mismatch occurs. This allows us to avoid unnecessary comparisons, making the algorithm more efficient than the na?ve approach.

领英推荐

Unraveling the Threads of Graph Theory: From Bridges…

Data & Analytics 2 个月前

Digital Data Storage in DNA: Use of Modified…

TriLink BioTechnologies, part of Maravai LifeSciences 1 年前

BOLD Perspectives: Data Storage in DNA -- with Dina…

BOLD Community 1 年前

How is the Prefix Function Computed?

The prefix function is computed using a simple algorithm that iterates through the string. Here's a high-level overview of how it works:

Start with k = 0 and iterate through each position q in the string (starting from q = 1).
If the character at position k is equal to the character at position q, increment k and set pi[q] = k.
If the characters don't match and k > 0, update k to pi[k-1] and repeat the comparison.
If the characters don't match and k == 0, set pi[q] = 0.

This algorithm efficiently computes the prefix function for each position in the string, providing us with valuable information that helps optimize the pattern matching process in the KMP algorithm.

function compute_prefix(pattern):
    m = length of pattern
    pi = array of length m
    k = 0
    for q = 1 to m - 1:
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k = k + 1
        pi[q] = k
    return pi

Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm uses the prefix function to avoid unnecessary comparisons while searching for a pattern within a text. It achieves this by precomputing the prefix function for the pattern and then using it to determine the next position to compare when a mismatch occurs.

Let's look into some visualize to better understand the KMP algorithm.

KMP algorithm visualization with prefix table (Animation taken from cmps-people.ok.ubc.ca

By using the prefix function to determine how far to shift the pattern pointer when a mismatch occurs, the KMP algorithm achieves a time complexity of O(m + n), where m is the length of the pattern and n is the length of the text, making it much more efficient than the na?ve approach.

function kmp_search(text, pattern):
    n = length of text
    m = length of pattern
    pi = compute_prefix_function(pattern)
    q = 0
    for i = 0 to n - 1:
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q = q + 1
        if q == m:
            print("Pattern found at index", i - m + 1)
            q = pi[q - 1]

function compute_prefix_function(pattern):
    m = length of pattern
    pi = array of length m
    k = 0
    for q = 1 to m - 1:
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k = k + 1
        pi[q] = k
    return pi

In conclusion, the KMP algorithm provides an efficient solution to the pattern matching problem by leveraging the prefix function. By avoiding unnecessary comparisons, the KMP algorithm achieves a linear time complexity, making it suitable for large texts and patterns.

要查看或添加评论，请登录

Ali Raza的更多文章

TF-IDF (Term Frequency - Inverse Document Frequency) in NLP.

2024年7月20日

TF-IDF (Term Frequency - Inverse Document Frequency) in NLP.

TF-IDF is a statistical measure used in information retrieval and text mining to evaluate the importance of a word in a…
Natural Language Processing Basics with spaCy (Part 2)

2024年7月13日

Natural Language Processing Basics with spaCy (Part 2)

In the first part we went through the concepts of Tokenization Case Folding Stop Word Removal Stemming Lemmatization…
Natural Language Processing Basics with spaCy (Part 1)

2024年7月7日

Natural Language Processing Basics with spaCy (Part 1)

1. What is NLP? Natural Language Processing or NLP is an area of computer science and artificial intelligence concerned…

2 条评论
Fundamentals of Regular Expressions in Python

2024年6月29日

Fundamentals of Regular Expressions in Python

Regular expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule…

Efficient Pattern Matching: The Knuth-Morris-Pratt Algorithm

Ali Raza

Software Engineer | Seeking MS or PhD Opportunities

What is a Prefix Function?

Why is the Prefix Function Important?

领英推荐

How is the Prefix Function Computed?

Knuth-Morris-Pratt (KMP) Algorithm

Ali Raza的更多文章

社区洞察

其他会员也浏览了

Computational Biology Software Market Geographical Expansion & Analysis Growth Development, Status, Recorded during 2024 to 2033

Dimensionality Reduction: A Single-Cell Example

The Legacy of Algorithms: The Rise of Computer Science and AI

AI Axioms - A New Manifesto for Modular Mathematics

Life Science resources/database. How many?

Mathematics and Cybernetics - applied aspects

The Legacy of Algorithms: The Industrial Revolution and the Birth of Modern Computing

Linear Regression: Bridging Celestial Mechanics and Predictive Analytics

What is the current thinking in complexity science, applied mathematics, and computational social science on analysis of social media

The Mathematics of Design: What Typing Monkeys Reveal About Life's Origins

What is a Prefix Function?

Why is the Prefix Function Important?

领英推荐

How is the Prefix Function Computed?

Knuth-Morris-Pratt (KMP) Algorithm

Ali Raza的更多文章

TF-IDF (Term Frequency - Inverse Document Frequency) in NLP.

Natural Language Processing Basics with spaCy (Part 2)

Natural Language Processing Basics with spaCy (Part 1)

Fundamentals of Regular Expressions in Python

社区洞察

其他会员也浏览了

Computational Biology Software Market Geographical Expansion & Analysis Growth Development, Status, Recorded during 2024 to 2033

Dimensionality Reduction: A Single-Cell Example

The Legacy of Algorithms: The Rise of Computer Science and AI

AI Axioms - A New Manifesto for Modular Mathematics

Life Science resources/database. How many?

Mathematics and Cybernetics - applied aspects

The Legacy of Algorithms: The Industrial Revolution and the Birth of Modern Computing

Linear Regression: Bridging Celestial Mechanics and Predictive Analytics

What is the current thinking in complexity science, applied mathematics, and computational social science on analysis of social media

The Mathematics of Design: What Typing Monkeys Reveal About Life's Origins