String Matching: Rabin-Karp Algorithm

Searching for patterns within text is a fundamental operation in computing, from finding words in a document to identifying sequences in genetic data. The naive approach of checking every possible position is simple but slow, costing $O(nm)$ time in the worst case for a text of length $n$ and a pattern of length $m$. The Rabin-Karp algorithm uses hashing to filter out most mismatching positions quickly, achieving an expected running time of $O(n + m)$. Its power lies in a rolling hash function that can be updated in constant time as the search window slides, which makes it well suited to multi-pattern search and large-scale text analysis.

The Core Idea: Fingerprinting with Rolling Hashes

At its heart, Rabin-Karp treats a string as a number in a large base (e.g., base 256 for ASCII characters). It computes a numerical "fingerprint"—a hash value—for both the pattern and for each contiguous substring of the text that is the same length as the pattern. Instead of comparing characters directly, it first compares these hash values. If the hash values differ, the substrings cannot be equal. If they match, a potential match is found, but a full character-by-character comparison is still required to confirm it, due to the possibility of a hash collision (two different strings producing the same hash).

The algorithm's efficiency stems from how it computes these fingerprints. A rolling hash function allows the hash for the next text window to be computed from the hash of the current window in constant time, $O(1)$, rather than recalculating it from scratch in $O(m)$ time. This is the engine that drives the algorithm's performance. A common rolling hash is a polynomial rolling hash using a large prime number. For a string $S$ of length $m$, its hash $h(S)$ can be defined as:

$h(S) = (S[0] \cdot b^{m-1} + S[1] \cdot b^{m-2} + \cdots + S[m-1] \cdot b^{0}) \bmod p$

where $b$ is the base (e.g., 256) and $p$ is a chosen prime number.
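As a minimal sketch of the definition above, the polynomial hash can be evaluated with Horner's rule, so no explicit powers of $b$ are needed. The function name `poly_hash`, the default values of `b` and `p`, and the use of `ord()` to turn characters into numbers are illustrative assumptions rather than part of the original description.

```python
def poly_hash(s: str, b: int = 256, p: int = 1_000_000_007) -> int:
    """Polynomial hash of s: sum of ord(s[i]) * b^(m-1-i), reduced mod p."""
    h = 0
    for ch in s:
        # Horner's rule: shift the running value up one power of b,
        # then add the numeric value of the next character.
        h = (h * b + ord(ch)) % p
    return h


print(poly_hash("ab"))  # 97 * 256 + 98 = 24930
```

Reducing modulo `p` inside the loop keeps the running value below `p` at every step, which matters in languages with fixed-width integers.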

The Algorithm Step-by-Step

Let's walk through the Rabin-Karp procedure for a single pattern $P$ of length $m$ in a text $T$ of length $n$.

Preprocessing: Calculate Initial Hashes

Choose a base $b$ (typically the size of the alphabet) and a large prime number $p$ to keep the hash value within a manageable integer range.
Compute the hash value for the pattern, $\mathrm{hash}_P$.
Compute the hash value for the first $m$ characters of the text, $\mathrm{hash}_T$.

Searching: Slide and Compare

Iterate $i$ from $0$ to $n - m$:
If $\mathrm{hash}_P = \mathrm{hash}_T$, perform a character-by-character comparison of $P$ and $T[i \ldots i + m - 1]$. If they match exactly, record position $i$ as a match.
Before moving to the next iteration, compute the hash for the next text window. If $i < n - m$, calculate $\mathrm{hash}_T$ for $T[i + 1 \ldots i + m]$ using the rolling update formula:

$\mathrm{hash}_{T_{\mathrm{new}}} = ((\mathrm{hash}_{T_{\mathrm{old}}} - T[i] \cdot b^{m-1}) \cdot b + T[i + m]) \bmod p$

This formula removes the contribution of the leaving character $T[i]$ and adds the contribution of the entering character $T[i + m]$.

The mathematical update is the critical step. Subtracting $T[i] \cdot b^{m-1}$ removes the highest-order term from the old hash. Multiplying the entire result by $b$ shifts all remaining terms up by one power (like shifting a number left in base $b$). Finally, adding $T[i + m]$ introduces the new lowest-order term.
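The preprocessing and search steps above can be sketched in a few lines of Python. This is one possible reading of the procedure, not a reference implementation: the function name `rabin_karp`, the defaults `b = 256` and `p = 1_000_000_007`, and the use of `ord()` as the character value are assumptions made for illustration.

```python
def rabin_karp(text: str, pattern: str, b: int = 256, p: int = 1_000_000_007) -> list:
    """Return every index i at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(b, m - 1, p)  # b^(m-1) mod p, the weight of the leaving character
    hash_p = hash_t = 0
    for j in range(m):
        hash_p = (hash_p * b + ord(pattern[j])) % p
        hash_t = (hash_t * b + ord(text[j])) % p
    matches = []
    for i in range(n - m + 1):
        # Equal hashes are only a filter; confirm with a direct comparison.
        if hash_p == hash_t and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Drop text[i], shift by b, append text[i + m].
            # Python's % returns a non-negative result for a positive modulus.
            hash_t = ((hash_t - ord(text[i]) * high) * b + ord(text[i + m])) % p
    return matches


print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Precomputing `high` once keeps each window update at a constant number of arithmetic operations.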

Handling Hash Collisions and Choosing Parameters

A hash collision occurs when two distinct strings produce an identical hash value. In Rabin-Karp, this leads to a spurious hit—an unnecessary and costly full string comparison. The algorithm remains correct because every hash match is verified, but too many collisions destroy its efficiency, degrading it back to $O (nm)$ time in the worst case.
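To make the cost of collisions concrete, the sketch below counts spurious hits (windows whose hash equals the pattern's hash although the text differs) for a small and a large modulus. The helper `count_spurious_hits`, the random DNA-like test text, the fixed seed, and both modulus values are assumptions chosen for illustration; the exact counts depend on the input, so none are quoted here.

```python
import random


def count_spurious_hits(text: str, pattern: str, p: int, b: int = 256) -> int:
    """Count windows whose hash equals the pattern's hash but whose text differs."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(b, m - 1, p)
    hash_p = hash_t = 0
    for j in range(m):
        hash_p = (hash_p * b + ord(pattern[j])) % p
        hash_t = (hash_t * b + ord(text[j])) % p
    spurious = 0
    for i in range(n - m + 1):
        if hash_p == hash_t and text[i:i + m] != pattern:
            spurious += 1  # hash matched, verification would fail
        if i < n - m:
            hash_t = ((hash_t - ord(text[i]) * high) * b + ord(text[i + m])) % p
    return spurious


rng = random.Random(0)  # fixed seed so repeated runs see the same text
text = "".join(rng.choice("ACGT") for _ in range(5000))
pattern = "ACGTACGT"
small = count_spurious_hits(text, pattern, p=101)
large = count_spurious_hits(text, pattern, p=1_000_000_007)
print(f"spurious hits with p=101: {small}; with p=1_000_000_007: {large}")
```

With only 101 possible hash values, roughly one window in a hundred is expected to collide with the pattern by chance, and each such window triggers a wasted $O(m)$ comparison.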

The choice of the prime modulus $p$ is paramount for controlling collisions. A larger prime $p$ makes collisions less likely but increases the risk of integer overflow during computation. In practice, reducing modulo $p$ after every arithmetic step keeps each stored value below $p$, although intermediate products can still be as large as roughly $p \cdot b$. We must also handle negative results after the subtraction step in the rolling hash by adding $p$ before taking the final modulus.

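For example, if the remainder computed for $\mathrm{hash}_{T_{\mathrm{new}}}$ comes out negative, as can happen in languages whose remainder operator keeps the sign of the dividend, we add $p$ to bring it back into range before proceeding. This ensures the hash value remains a valid non-negative integer modulo $p$.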
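A small worked example may help here. Python's own `%` already returns a non-negative result for a positive modulus, so the sketch below uses `math.fmod` to imitate the truncating remainder found in C, C++, and Java, and then shows an update written so that no intermediate value goes negative. The function name `roll` and the toy parameters ($b = 10$, $p = 13$, digits used as character values) are assumptions for illustration.

```python
import math


def roll(old_hash: int, out_val: int, in_val: int, high: int, b: int, p: int) -> int:
    """One rolling-hash step written so no intermediate value is negative.

    high is b^(m-1) mod p; out_val and in_val are the numeric values of the
    leaving and entering characters.
    """
    # Adding p before subtracting keeps the difference non-negative, which is
    # what languages with a truncating remainder operator need.
    reduced = (old_hash + p - (out_val * high) % p) % p
    return (reduced * b + in_val) % p


# Toy numbers: window "314" slides to "141" with b = 10 and p = 13.
b, p = 10, 13
high = pow(b, 2, p)                 # 100 mod 13 = 9
old = 314 % p                       # 2
naive = (old - 3 * high) * b + 1    # -249, negative before the modulus
print(math.fmod(naive, p))          # -2.0, the sign a truncating % would give
print(roll(old, 3, 1, high, b, p))  # 11, which equals 141 % 13
```

Adding $p$ to the truncated remainder $-2$ also gives $11$, which is the correction described above.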

Extension to Multi-Pattern Search

One of Rabin-Karp's most significant advantages is its elegant extension to searching for multiple patterns simultaneously. The naive approach would require running a single-pattern algorithm $k$ times, costing $O(k(n + m_{\mathrm{avg}}))$. When the patterns share a common length, Rabin-Karp can do this in expected $O(n + \sum m_i)$ time; patterns of different lengths need one rolling hash per distinct length.

The procedure is straightforward:

Precompute the hash value for each pattern in the set.
Store these pattern hashes in a hash set or hash table for $O (1)$ lookup.
As you slide the window across the text, compute the rolling hash $\mathrm{hash}_T$ for each text window.
For each $\mathrm{hash}_T$, check if it exists in the set of pattern hashes.
If a hash match is found, perform a full string comparison against all patterns that share that hash value to confirm the match and identify which pattern was found.
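The procedure above might be sketched as follows for patterns that all have the same length, which is what a single sliding window requires. The function name `rabin_karp_multi`, the dictionary from hash value to candidate patterns, and the default `b` and `p` are illustrative assumptions.

```python
def rabin_karp_multi(text: str, patterns: list, b: int = 256, p: int = 1_000_000_007) -> list:
    """Return (index, pattern) pairs for equal-length patterns found in text."""
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(q) != m for q in patterns):
        raise ValueError("this sketch assumes non-empty patterns of one shared length")
    n = len(text)
    if m > n:
        return []

    def full_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * b + ord(ch)) % p
        return h

    # Map each hash to the patterns that produce it, so collisions between
    # patterns are still resolved by direct comparison.
    by_hash = {}
    for q in set(patterns):
        by_hash.setdefault(full_hash(q), []).append(q)

    high = pow(b, m - 1, p)
    hash_t = full_hash(text[:m])
    found = []
    for i in range(n - m + 1):
        for q in by_hash.get(hash_t, ()):  # O(1) expected lookup per window
            if text[i:i + m] == q:
                found.append((i, q))
        if i < n - m:
            hash_t = ((hash_t - ord(text[i]) * high) * b + ord(text[i + m])) % p
    return found


print(rabin_karp_multi("the cat sat on the mat", ["cat", "mat", "dog"]))
# [(4, 'cat'), (19, 'mat')]
```

Storing a list of patterns per hash value, rather than a bare set of hashes, is what lets the verification step report which pattern matched.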

This makes Rabin-Karp highly effective for tasks like checking a document against a database of prohibited phrases or searching for numerous genetic markers in a DNA sequence.

Practical Applications

The Rabin-Karp algorithm shines in real-world applications where text streams are large or patterns are numerous.

Plagiarism Detection: Software can break a source document into overlapping $n$-grams (phrases of $n$ words), compute their hashes, and efficiently check them against a vast database of known works using the multi-pattern approach.
DNA Sequence Matching: Genomic sequences (strings over the alphabet {A, C, G, T}) are enormous. Rabin-Karp is used to find specific gene sequences or markers within long DNA strands. The small, fixed alphabet (base 4 or 5) fits the rolling hash model well.
File and Data Deduplication: Systems can identify duplicate blocks of data by computing rolling hashes of file chunks and comparing them, a process known as fingerprinting.

Common Pitfalls

Ignoring Hash Collisions: Treating a hash match as a definite string match is a critical error. Correction: Always follow a hash match with a definitive character-by-character comparison. The hash is only a fast filter.
Poor Choice of Prime Modulus: Using a small prime (like 101) or a non-prime number leads to frequent collisions. Correction: Use a large prime number relative to your base and expected input size (e.g., $10^9 + 7$ or $10^9 + 9$ are common choices in competitive programming) to distribute hash values more uniformly.
Incorrect Rolling Hash Update: Mis-calculating the contribution of the exiting character, especially its power of $b$, is a common implementation bug. Correction: Precompute $b^{m-1} \bmod p$ at the start and use it in the update formula. Always ensure the subtraction step is handled correctly with modular arithmetic to avoid negative values.
Integer Overflow: Even with modulus, intermediate calculations like (hash * base) + new_char can overflow standard 32-bit integers. Correction: Use a data type with a larger bit width (e.g., 64-bit integers like long long in C++) for intermediate calculations before applying the final modulus operation.
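The overflow pitfall can be checked with simple arithmetic. Python integers do not overflow, so the sketch below only computes the largest intermediate values and compares them with the 32-bit and 64-bit signed limits; the parameters $b = 256$ and $p = 10^9 + 7$ are assumed for illustration.

```python
b = 256
p = 1_000_000_007
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Largest value of (hash * b + new_char) when hash < p and new_char < b.
worst_append = (p - 1) * b + (b - 1)
# Largest product in the removal step, T[i] * (b^(m-1) mod p).
worst_remove = (b - 1) * (p - 1)

print(worst_append)               # 256000001791
print(worst_append > INT32_MAX)   # True: too large for a signed 32-bit integer
print(worst_append <= INT64_MAX)  # True: fits in a signed 64-bit integer
print(worst_remove <= INT64_MAX)  # True
```

Under these assumptions a 64-bit intermediate is sufficient, because every stored hash is reduced below $p$ before the next multiplication.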

Summary

The Rabin-Karp algorithm uses hashing to accelerate string matching, achieving an expected $O(n + m)$ time complexity by filtering comparisons with a rolling hash function.
Its core innovation is the constant-time $O(1)$ hash update when sliding the search window, derived from a polynomial hash function computed modulo a large prime.
Because hash collisions are possible, every hash match requires a full string verification to keep the algorithm correct, and a good hash function with a large prime modulus keeps those verifications rare.
The algorithm extends naturally to multi-pattern search by using a hash set of pattern hashes, making it vastly more efficient than running multiple single-pattern searches.
Key applications include plagiarism detection, DNA sequence analysis, and data deduplication, where its ability to process streaming windows of text efficiently is a major advantage.
