Feb 25

String Matching: Brute Force and KMP Algorithm

Mindli Team

AI-Generated Content

Finding a specific pattern within a larger body of text is a fundamental operation in computing, critical for everything from search engines and text editors to DNA sequence analysis and network intrusion detection. At its core, string matching is the problem of finding all occurrences of a short string, called the pattern, within a longer string, called the text. Mastering the efficiency of this task is a cornerstone of algorithmic thinking. We will explore two key approaches: the intuitive but slow brute-force method and the sophisticated, linear-time Knuth-Morris-Pratt (KMP) algorithm, which cleverly reuses information to avoid wasted work.

The Brute-Force (Naive) Pattern Matching Method

The most straightforward way to search for a pattern is the brute-force algorithm. Imagine sliding the pattern over the text, one character at a time, and at each starting position, checking if every character in the pattern matches the corresponding character in the text. This method makes no assumptions and requires no preprocessing.

The algorithm works as follows:

  1. Align the pattern with the beginning of the text.
  2. Compare characters from left to right.
  3. If all characters match, record the starting index as a match.
  4. Regardless of the outcome (full match or mismatch), shift the pattern exactly one position to the right.
  5. Repeat steps 2-4 until the pattern slides past the end of the text.

In pseudocode, for text T[0..n-1] and pattern P[0..m-1]:

for i from 0 to n-m:
    j = 0
    while j < m and T[i+j] == P[j]:
        j = j + 1
    if j == m:
        print("Pattern found at index", i)

The primary drawback is its time complexity. In the worst case (for example, searching for "AAAAB" in "AAAAAAAA") the inner loop runs its full length at every alignment before failing on the last character. With n - m + 1 alignments and up to m comparisons each, the worst-case time complexity is O(nm), where n is the text length and m is the pattern length. While simple to implement, this is quadratic when m grows in proportion to n, which makes it inefficient for long texts and patterns.
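
The steps and pseudocode above can be sketched as runnable Python. The function name `brute_force_search` and the comparison counter it returns are illustrative additions, not part of the original description; the counter is there only to make the worst case visible.

```python
def brute_force_search(text, pattern):
    """Return (match_indices, comparison_count) for a naive scan.

    The comparison counter is an illustrative addition for this sketch.
    """
    n, m = len(text), len(pattern)
    matches = []
    comparisons = 0
    for i in range(n - m + 1):          # every possible alignment
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                   # mismatch: give up on this alignment
            j += 1
        if j == m:                      # inner loop finished without a mismatch
            matches.append(i)
    return matches, comparisons


print(brute_force_search("ABABABAC", "ABABAC"))   # ([2], 13)
print(brute_force_search("AAAAAAAA", "AAAAB"))    # ([], 20)
```

In the second call each of the 4 alignments performs all 5 comparisons before failing, which is the (n - m + 1) * m behavior described above.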

Introducing the Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm solves the same problem but achieves a dramatically better worst-case time complexity of O(n + m). It does this by using information from previous comparisons to skip positions that cannot possibly yield a match. The key insight is that when a mismatch occurs, the pattern itself often contains information that allows us to slide it forward by more than one character without missing a potential match.

This intelligence comes from preprocessing the pattern to construct a failure function, also called the LPS array (for "longest proper prefix which is also a suffix"). This auxiliary array, of length m, stores for each position j in the pattern the length of the longest proper prefix of P[0..j] that is also a suffix of P[0..j]. A proper prefix is a prefix that is not the whole string itself, so for "ABA" the value is 1 (the prefix "A" is also a suffix).
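
To make the definition concrete, here is a minimal sketch that computes the array directly from the definition, with no cleverness at all. The name `lps_by_definition` is hypothetical, and this version is deliberately slow; it is only useful as a reference to check a faster implementation against.

```python
def lps_by_definition(pattern):
    """Compute the failure function straight from its definition.

    For each j, try every proper prefix length k of P[0..j] and keep the
    longest one whose prefix equals the suffix of the same length.
    """
    table = []
    for j in range(len(pattern)):
        prefix = pattern[: j + 1]            # P[0..j]
        best = 0
        for k in range(1, len(prefix)):      # proper: k < len(prefix)
            if prefix[:k] == prefix[-k:]:
                best = k
        table.append(best)
    return table


print(lps_by_definition("ABABAC"))   # [0, 0, 1, 2, 3, 0]
```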

Instead of always resetting the pattern's index j to 0 after a mismatch at some j > 0, we look up F[j-1], which gives the next pattern index to compare against the same text character. This means we never move the index i in the text backward; we only move forward. This property is what enables the linear time scan of the text.

Constructing the KMP Failure Function

The failure function is the engine of the KMP algorithm, and building it efficiently is crucial. It is constructed by matching the pattern against itself. We initialize F[0] = 0 because a single character has no non-empty proper prefix. Then we use two pointers: i (the end of the prefix P[0..i] currently being processed) and j (the length of the candidate prefix that is also a suffix). The logic mirrors the main KMP search but is applied to the pattern itself.

Let's construct the failure function for the pattern P = "ABABAC".

  1. F[0] = 0 (For "A").
  2. For i=1, P[1]='B'. Compare to P[0]='A'. Mismatch. j is 0, so F[1] = 0.
  3. For i=2, P[2]='A'. Compare to P[0]='A'. Match! So F[2] = j + 1 = 1.
  4. For i=3, P[3]='B'. Since P[3] matches P[1] (because j=1), we set F[3] = j + 1 = 2.
  5. For i=4, P[4]='A'. Matches P[2] (j=2). Set F[4] = 3.
  6. For i=5, P[5]='C'. Mismatch with P[3] (j=3). We don't reset j to 0. Instead, we use the failure function recursively: j = F[j-1] = F[2] = 1. Now compare P[5] with P[1] ('B'). Mismatch. j = F[0] = 0. Compare P[5] with P[0] ('A'). Mismatch. j is 0, so F[5] = 0.

The final failure function (LPS array) is [0, 0, 1, 2, 3, 0].
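
The walkthrough above can be sketched in Python as follows. The function name `build_failure` is an illustrative choice; the variables `i` and `j` play the same roles as in the trace.

```python
def build_failure(pattern):
    """Build the failure function (LPS array) in O(m) time."""
    m = len(pattern)
    failure = [0] * m                   # failure[0] stays 0
    j = 0                               # length of the current candidate prefix
    for i in range(1, m):
        # On a mismatch, fall back to the next shorter candidate
        # instead of resetting j to 0.
        while j > 0 and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1                      # candidate prefix extended by one
        failure[i] = j
    return failure


print(build_failure("ABABAC"))   # [0, 0, 1, 2, 3, 0]
```

Although the inner `while` loop looks like it could make this quadratic, j increases by at most one per iteration of the outer loop, so the total number of fallbacks across the whole run is bounded by m.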

The KMP Search Process

With the failure function F precomputed, the main search becomes elegant and efficient. We traverse the text with index i and the pattern with index j. When characters T[i] and P[j] match, we increment both pointers. When they mismatch:

  • If j > 0, we use the failure function to update j = F[j-1] and re-compare T[i] with the new P[j] without moving i.
  • If j == 0, we simply increment i to move forward in the text.

Searching for "ABABAC" in "ABABABAC":

  1. Match A, B, A, B, A. (i and j advance to 5 and 5).
  2. Mismatch: T[5]='B' vs P[5]='C'.
  3. Since j=5, we set j = F[4] = 3. i stays at 5.
  4. Compare T[5]='B' with P[3]='B'. Match.
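  5. Continue: T[6]='A' matches P[4] (i=7, j=5), then T[7]='C' matches P[5] (i=8, j=6). Since j now equals the pattern length 6, a match is reported starting at index i - j = 2. To look for further occurrences, set j = F[j-1] and keep scanning.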
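
A minimal sketch of the full search, assuming the failure function is built as described in the previous section. The names `build_failure` and `kmp_search` are illustrative, and returning an empty list for an empty pattern is an assumption of this sketch rather than something the text specifies. The `for` loop over the text plays the role of the index i, which makes it explicit that i never moves backward.

```python
def build_failure(pattern):
    """Failure function (LPS array) for the pattern."""
    failure = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j
    return failure


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []                       # assumption: empty pattern has no matches
    failure = build_failure(pattern)
    matches = []
    j = 0                               # number of pattern characters matched so far
    for i, ch in enumerate(text):       # i only ever moves forward
        while j > 0 and ch != pattern[j]:
            j = failure[j - 1]          # mismatch: fall back within the pattern
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):           # full match ending at text index i
            matches.append(i - j + 1)
            j = failure[j - 1]          # continue, allowing overlapping matches
    return matches


print(kmp_search("ABABABAC", "ABABAC"))   # [2]
print(kmp_search("AAAA", "AA"))           # [0, 1, 2]
```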

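The text pointer i only moves forward. The total number of character comparisons during the search is O(n), in fact at most 2n: every comparison either advances i, which happens at most n times, or falls back via j = F[j-1], and since j grows by at most one per text character it can fall back at most n times in total. Including the O(m) preprocessing, the overall complexity is O(n + m).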
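
One way to check the linear bound empirically is to count character comparisons during the search. The sketch below uses the explicit two-pointer loop from the description above; `kmp_count_comparisons` is a hypothetical helper written for this illustration, and it assumes a non-empty pattern.

```python
def build_failure(pattern):
    """Failure function (LPS array) for the pattern."""
    failure = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j
    return failure


def kmp_count_comparisons(text, pattern):
    """Count text-vs-pattern comparisons in a KMP scan (non-empty pattern)."""
    failure = build_failure(pattern)
    n, m = len(text), len(pattern)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                  # full match: fall back and keep scanning
                j = failure[j - 1]
        elif j > 0:
            j = failure[j - 1]          # retry the same text character
        else:
            i += 1                      # nothing matched: move on in the text
    return comparisons


print(kmp_count_comparisons("ABABABAC", "ABABAC"))   # 9
print(kmp_count_comparisons("AAAAAAAA", "AAAAB"))    # 12
```

Both counts stay below 2n = 16 for these 8-character texts.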

Common Pitfalls

  1. Misunderstanding the Failure Function's Purpose: A common error is to think the failure function stores where to restart the search in the text. It does not. It stores where to restart matching within the pattern after a mismatch, allowing the text pointer to remain fixed. Remember, the text index i never decreases.
  2. Incorrect Failure Function Construction: Implementing the LPS construction incorrectly is the most frequent bug. Specifically, forgetting to handle the recursive fallback using j = F[j-1] when a mismatch occurs during the construction phase will produce a wrong table. Always trace through an example like "AABAABAAA" to verify your logic.
  3. Off-by-One Errors in Indices: Whether using 0-based or 1-based indexing, consistency is key. A classic mistake is accessing F[j] when you need F[j-1] after a mismatch. In our standard 0-based implementation, when a mismatch happens at pattern index j, the next comparison should be with pattern index F[j-1].
  4. Assuming KMP is Always Faster in Practice: While KMP has a superior worst-case guarantee, the brute-force method can be faster for very short patterns or texts due to its extreme simplicity and low constant factors. KMP's strength is its predictable, linear performance, especially with repetitive patterns in long texts.
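
The failure-function construction pitfall above can be illustrated on "AABAABAAA". In this sketch, `build_failure_buggy` is a deliberately wrong, hypothetical variant that resets j to 0 on a mismatch instead of falling back through j = F[j-1]; it is shown only for contrast with a correct construction.

```python
def build_failure(pattern):
    """Correct construction: fall back through failure[j - 1] on mismatch."""
    failure = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = failure[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        failure[i] = j
    return failure


def build_failure_buggy(pattern):
    """WRONG on purpose: resets j to 0 instead of falling back."""
    failure = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        if pattern[i] == pattern[j]:
            j += 1
        else:
            j = 0                       # bug: discards shorter valid candidates
        failure[i] = j
    return failure


print(build_failure("AABAABAAA"))        # [0, 1, 0, 1, 2, 3, 4, 5, 2]
print(build_failure_buggy("AABAABAAA"))  # [0, 1, 0, 1, 2, 3, 4, 5, 0]
```

The two tables differ only in the last entry: after the mismatch at the final 'A', the correct version falls back from 5 to 2 to 1 and then extends to 2, while the buggy version drops straight to 0.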

Summary

  • Brute-force matching slides the pattern one position at a time, checking all characters, resulting in O(nm) worst-case time complexity. It is simple but inefficient for large inputs.
  • The KMP algorithm preprocesses the pattern in O(m) time to build a failure function (LPS array) that records, for each prefix P[0..j] of the pattern, the length of its longest proper prefix that is also a suffix.
  • Using this function, KMP performs the main search in O(n) time by avoiding redundant comparisons; the text pointer never moves backward, leading to a combined O(n + m) complexity.
  • The core of KMP is understanding that after a mismatch, knowledge of the pattern's structure allows you to resume matching from an advanced position within the pattern itself, not from its beginning.
  • Implementation requires careful attention to the construction of the failure function and the management of indices during the search to avoid off-by-one errors.
