Feb 25

Algo: Z-Algorithm for String Matching

Mindli Team

AI-Generated Content

String matching is a cornerstone of computer science, essential for tasks from search engines to genomics. Brute-force checking works, but it can take $O(nm)$ time for a text of length $n$ and a pattern of length $m$. The Z-Algorithm offers an elegant, linear-time solution by preprocessing a string into a data structure called the Z-array. Mastering this algorithm gives you a versatile tool for pattern matching and deeper string analysis, with a surprisingly intuitive implementation.

Understanding the Z-Array

Before tackling the algorithm, you must understand its output. Given a string $S$ of length $n$, the Z-array is an array $Z$ where each element $Z[i]$ (for $0 \le i < n$) stores the length of the longest substring starting at position $i$ that is also a prefix of $S$.

Formally, $Z[i] = k$ means the substring $S[i \ldots i+k-1]$ matches exactly the prefix $S[0 \ldots k-1]$. By definition, $Z[0]$ is often set to $0$ or the entire string's length, as the comparison starts with itself. Let's solidify this with an example. For the string S = "aabcaabxaaz":

  • $Z[1] = 1$ because S[1]="a" matches S[0]="a", but S[2]="b" does not match S[1]="a".
  • $Z[4] = 3$ because the substring starting at index 4, "aab", matches the three-character prefix "aab".
  • $Z[5] = 1$ (just "a").
  • $Z[8] = 2$ ("aa") and $Z[9] = 1$ ("a").
  • $Z[2]$, $Z[3]$, $Z[6]$, $Z[7]$, and $Z[10]$ are all 0, as those characters do not match the starting 'a'.

This array becomes the key to unlocking fast comparisons. It efficiently encodes where prefixes re-occur within the string itself.
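
To make the definition concrete, here is a minimal Python sketch that builds the Z-array directly from the definition, comparing characters one at a time. It is deliberately the slow version (quadratic in the worst case), meant only for checking values by hand. The function name `z_naive` and the choice to store the full length in `Z[0]` are assumptions made for this illustration.

```python
def z_naive(s: str) -> list[int]:
    """Compute the Z-array straight from the definition (not linear time)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # assumed convention: Z[0] holds the full string length
    for i in range(1, n):
        k = 0
        # Extend while the substring at i keeps agreeing with the prefix.
        while i + k < n and s[i + k] == s[k]:
            k += 1
        z[i] = k
    return z


print(z_naive("aabcaabxaaz"))  # [11, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

The printed values agree with the example above, including $Z[4] = 3$ for the repeated "aab".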

The Mechanics of the Z-Algorithm

The Z-Algorithm computes this array in $O(n)$ time using a windowing strategy. The core idea is to maintain an interval $[L, R]$ representing the Z-box, the rightmost prefix-matching substring we have encountered so far. $L$ and $R$ are the left and right boundaries of this matching segment.

We iterate $i$ from index $1$ to $n-1$, and for each $i$, we calculate $Z[i]$ using previously computed information:

  1. Case 1: $i > R$ (Outside the current Z-box)

There is no prior information to use. We must compare characters starting at $S[i]$ with characters starting at $S[0]$ from scratch until a mismatch, counting the length. This establishes a new Z-box: $L = i$, $R = i + Z[i] - 1$.

  2. Case 2: $i \le R$ (Inside the current Z-box)

We can use the already computed value from the prefix. Let $k = i - L$. This corresponds to the mirrored position in the prefix.

  • Sub-case A: $Z[k] < R - i + 1$

The full mirrored substring fits inside the current Z-box. We can confidently set $Z[i] = Z[k]$ without new character comparisons. The Z-box boundaries remain unchanged.

  • Sub-case B: $Z[k] \ge R - i + 1$

The mirrored match reaches or exceeds the boundary of our known Z-box. We know characters match up to $R$, but we must check beyond $R$ by comparing $S[R+1]$ with $S[R-i+1]$ and continuing onward, extending the match. We then update the Z-box: $L = i$, $R$ = new matching boundary.

This reuse of past computations is what achieves linear time. Every successful character comparison in Case 1 or Sub-case B moves $R$ one step to the right, and $R$ never moves back, so the total number of comparisons over the whole run is $O(n)$.
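
The two cases above can be sketched in Python as follows. This is one possible way to write it, not a canonical implementation: the names `z_array`, `left`, and `right` are illustrative, the Z-box is kept as an inclusive interval to match the description, and `Z[0]` is assumed to hold the full length.

```python
def z_array(s: str) -> list[int]:
    """Linear-time Z-array using an inclusive Z-box [left, right]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # assumed convention for Z[0]
    left, right = 0, 0  # no useful Z-box yet, so index 1 falls into Case 1
    for i in range(1, n):
        if i > right:
            # Case 1: outside the Z-box, compare from scratch.
            k = 0
            while i + k < n and s[i + k] == s[k]:
                k += 1
            z[i] = k
            if k > 0:
                left, right = i, i + k - 1
        else:
            k = i - left              # mirrored position in the prefix
            remaining = right - i + 1  # characters left inside the box
            if z[k] < remaining:
                # Sub-case A: the mirrored match fits inside the box.
                z[i] = z[k]
            else:
                # Sub-case B: matched up to right, so extend past it.
                j = right + 1
                while j < n and s[j] == s[j - i]:
                    j += 1
                z[i] = j - i
                left, right = i, j - 1
    return z


print(z_array("aaaaa"))        # [5, 4, 3, 2, 1]
print(z_array("aabcaabxaaz"))  # [11, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Note that `right` only ever increases, which is the property the linear-time argument relies on.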

Applying the Z-Algorithm for Pattern Matching

The direct application of the Z-array is in linear-time pattern matching. Given a pattern $P$ of length $m$ and a text $T$ of length $n$, the goal is to find all occurrences of $P$ in $T$.

The technique is to form a new string: `S = P + "$" + T`, where `"$"` is a character not present in either $P$ or $T$, preventing matches that straddle the boundary between the pattern and text. We then compute the Z-array for this concatenated string $S$.

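The result is simple: any index $i$ in the T portion of $S$ where $Z[i]$ equals $m$ indicates that a match for pattern $P$ begins at position `i - m - 1` in the original text $T$. For example, if P="ab" and T="cabxab", we form `S="ab$cabxab"`. The Z-array has $Z[4] = 2$ and $Z[7] = 2$ (with $m=2$), indicating matches at text indices $4 - 2 - 1 = 1$ and $7 - 2 - 1 = 4$. The whole process runs in $O(m + n)$ time.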
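
As a rough sketch of the concatenation technique, the snippet below searches for a pattern using a compact Z-array routine. The names `z_array`, `find_occurrences`, and the `sep` parameter are hypothetical and chosen only for this example. The separator check is an added safeguard, since the method assumes the delimiter appears in neither string.

```python
def z_array(s: str) -> list[int]:
    """Compact Z-array with an inclusive Z-box [left, right]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # assumed convention for Z[0]
    left = right = 0
    for i in range(1, n):
        # Inside the box, start from the mirrored value capped at the box edge.
        k = min(z[i - left], right - i + 1) if i <= right else 0
        while i + k < n and s[i + k] == s[k]:
            k += 1
        z[i] = k
        if i + k - 1 > right:
            left, right = i, i + k - 1
    return z


def find_occurrences(pattern: str, text: str, sep: str = "$") -> list[int]:
    """Return every start index in text where pattern occurs."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    if m == 0:
        return []
    s = pattern + sep + text
    z = z_array(s)
    # The text portion starts at index m + 1 of the combined string.
    return [i - m - 1 for i in range(m + 1, len(s)) if z[i] == m]


print(find_occurrences("ab", "cabxab"))  # [1, 4]
print(find_occurrences("aa", "aaaa"))    # [0, 1, 2]
```

The second call shows that overlapping occurrences are reported too, since each text index is tested independently.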

Z-Algorithm vs. KMP and Other Matchers

You will often see the Z-Algorithm compared to the Knuth-Morris-Pratt (KMP) algorithm. Both preprocess strings in linear time to enable pattern matching. However, their approaches and outputs differ.

KMP builds a Longest Prefix Suffix (LPS) array (or failure function) that indicates, for a mismatch, where to resume matching in the pattern. The Z-Algorithm builds the more general Z-array, which directly describes self-similarity within any string. For pure pattern matching, both run in $O(m + n)$ time. Many practitioners find the Z-Algorithm conceptually simpler to implement than KMP, as it avoids the more complex state transitions of the failure function.

The Z-array itself is also more versatile. Beyond pattern matching, it can be used to find the shortest repeating substring, count distinct substrings, and solve various string palindrome problems. KMP's LPS array is more specialized for the matching task. In competitive programming and interview settings, the Z-Algorithm is often favored for its straightforward logic and multi-purpose output.
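
As one illustration of this versatility, the sketch below finds the shortest block whose repetition makes up the whole string. It reads "shortest repeating substring" as "shortest unit that tiles the string", which is an assumption about the intended problem. The names `z_array` and `shortest_repeating_unit` are hypothetical. The idea: a length $p$ works when $p$ divides $n$ and the suffix starting at $p$ matches the prefix all the way to the end, that is, $Z[p] = n - p$.

```python
def z_array(s: str) -> list[int]:
    """Compact Z-array with an inclusive Z-box [left, right]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # assumed convention for Z[0]
    left = right = 0
    for i in range(1, n):
        k = min(z[i - left], right - i + 1) if i <= right else 0
        while i + k < n and s[i + k] == s[k]:
            k += 1
        z[i] = k
        if i + k - 1 > right:
            left, right = i, i + k - 1
    return z


def shortest_repeating_unit(s: str) -> str:
    """Shortest block that, repeated a whole number of times, equals s."""
    n = len(s)
    z = z_array(s)
    for p in range(1, n):
        # p must divide n, and s[p:] must equal the prefix of length n - p.
        if n % p == 0 and z[p] == n - p:
            return s[:p]
    return s  # no shorter unit exists, so the string is its own unit


print(shortest_repeating_unit("abcabcabc"))  # abc
print(shortest_repeating_unit("abcd"))       # abcd
```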

Common Pitfalls

  1. Off-by-One Errors in the Z-Box Logic: The most frequent implementation errors involve indexing within the i <= R case. Confusing R - i with R - i + 1 (the length of the box from i) or mis-setting L and R can break the algorithm. Always trace through a simple example like "aaaaa" to verify your indices. Remember, $R$ is the last index of the Z-box, not the length.
  2. Forgetting the Delimiter in Pattern Matching: When concatenating strings for pattern search (`P + "$" + T`), do not omit the delimiter. Without it, the prefix being matched can run past the end of the pattern into the text, so a $Z[i]$ value may exceed $m$ and a test for $Z[i] = m$ can miss real occurrences. The delimiter creates a guaranteed mismatch, cleanly separating the two domains.
  3. Misinterpreting the Z-Array Values: It's easy to confuse $Z[i]$ with a "match starting at i." It is specifically the length of the match with the prefix ($S[0 \ldots Z[i]-1]$). Do not use a raw Z-array computed on a text to find arbitrary substrings; it only finds prefixes. For general substring search, you must use the concatenation method described above.
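
The delimiter pitfall is easy to reproduce. The following sketch runs the same equality test with and without a separator. The helper `match_positions` is hypothetical and exists only to show the difference; passing an empty separator simulates forgetting the delimiter.

```python
def z_array(s: str) -> list[int]:
    """Compact Z-array with an inclusive Z-box [left, right]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # assumed convention for Z[0]
    left = right = 0
    for i in range(1, n):
        k = min(z[i - left], right - i + 1) if i <= right else 0
        while i + k < n and s[i + k] == s[k]:
            k += 1
        z[i] = k
        if i + k - 1 > right:
            left, right = i, i + k - 1
    return z


def match_positions(pattern: str, text: str, sep: str) -> list[int]:
    """Report text indices where Z[i] == m; sep="" omits the delimiter."""
    m = len(pattern)
    s = pattern + sep + text
    z = z_array(s)
    offset = m + len(sep)  # where the text portion begins
    return [i - offset for i in range(offset, len(s)) if z[i] == m]


print(match_positions("a", "aa", "$"))  # [0, 1]  both occurrences found
print(match_positions("a", "aa", ""))   # [1]     occurrence at 0 is missed
```

Without the separator the combined string is "aaa", so the value at the first text index is 2 rather than 1 and the equality test skips it.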

Summary

  • The Z-Algorithm preprocesses a string in $O(n)$ time to produce a Z-array, where $Z[i]$ is the length of the longest substring starting at $i$ that matches the string's prefix.
  • Its efficiency stems from maintaining a Z-box $[L, R]$, which allows the reuse of previously computed matches to avoid redundant character comparisons.
  • For pattern matching, concatenate the pattern, a unique delimiter, and the text (`P+"$"+T`), then report a match wherever $Z[i]$ equals the pattern length $m$.
  • Compared to KMP, the Z-Algorithm is often seen as simpler to implement and its Z-array output is more versatile for a wider range of string analysis problems beyond simple matching.
