The Rabin-Karp algorithm has a fascinating complexity profile: it achieves O(n + m) expected time in typical cases, but can degrade to O(n × m) worst case under specific conditions. This dual nature distinguishes it from algorithms like KMP that guarantee O(n + m) always.
Understanding when and why Rabin-Karp performs well—and when it might struggle—is crucial for making informed algorithm choices. This page provides a rigorous complexity analysis that equips you to predict Rabin-Karp's behavior in any scenario.
By the end of this page, you will: (1) derive the preprocessing time complexity, (2) rigorously analyze expected running time, (3) understand worst-case scenarios and their probability, (4) compare with other string matching algorithms, and (5) know when Rabin-Karp is the optimal choice.
Before pattern matching begins, Rabin-Karp performs several preprocessing steps. Let's analyze each one.
Step 1: Compute pattern hash
for each character c in pattern: // m iterations
hash = (hash * base + ord(c)) % mod // O(1) each
Total: O(m) time
Step 2: Compute base^(m-1) mod q
// Method 1: Iterative (naïve)
power = 1
for i = 0 to m-2: // m-1 iterations
power = (power * base) % mod // O(1) each
// Method 2: Fast exponentiation
power = modpow(base, m-1, mod) // O(log m)
Iterative: O(m) time. Fast exponentiation: O(log m) time.
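The two methods can be compared directly. Here is a minimal sketch, assuming illustrative `base` and `mod` values; Python's built-in three-argument `pow` performs fast modular exponentiation:

```python
# Illustrative values, not prescribed by the algorithm
base, mod, m = 256, 10**9 + 7, 12

# Method 1: iterative, m - 1 multiplications -> O(m)
power = 1
for _ in range(m - 1):
    power = (power * base) % mod

# Method 2: binary exponentiation via the built-in
# three-argument pow -> O(log m)
fast_power = pow(base, m - 1, mod)

assert power == fast_power
```

For small m the difference is negligible; the O(log m) method matters mainly when the same helper is reused with very long patterns.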
Step 3: Compute first window hash
for i = 0 to m-1: // m iterations
window_hash = (window_hash * base + ord(text[i])) % mod // O(1)
Total: O(m) time
| Step | Operation | Time Complexity |
|---|---|---|
| Pattern hash | Hash m characters | O(m) |
| Base power | Compute b^(m-1) mod q | O(m) or O(log m) |
| First window hash | Hash first m characters of text | O(m) |
| Total preprocessing | — | O(m) |
Total preprocessing: O(m)
The preprocessing time is linear in pattern length, independent of text length. This is important: we do O(m) work once, then O(1) work per position during matching.
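The three steps can be combined into one routine. This is a sketch; the `preprocess` name and the base/mod defaults are illustrative assumptions, not fixed by the algorithm:

```python
def preprocess(pattern, text, base=256, mod=10**9 + 7):
    """All O(m) preprocessing in one place: pattern hash,
    base^(m-1) mod q, and the first window hash."""
    m = len(pattern)
    pattern_hash = 0
    for c in pattern:                       # Step 1: O(m)
        pattern_hash = (pattern_hash * base + ord(c)) % mod
    high_power = pow(base, m - 1, mod)      # Step 2: O(log m)
    window_hash = 0
    for c in text[:m]:                      # Step 3: O(m)
        window_hash = (window_hash * base + ord(c)) % mod
    return pattern_hash, high_power, window_hash
```

For example, `preprocess("AB", "ABAB")` returns equal pattern and window hashes, since the first window of the text is also "AB".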
Space complexity of preprocessing:
Rabin-Karp has excellent space efficiency—it uses only constant extra memory beyond the input strings.
The matching phase is where Rabin-Karp's complexity depends on the input. Let's analyze it carefully.
Main loop structure:
for i = 0 to n - m: // (n - m + 1) iterations
if window_hash == pattern_hash: // O(1) comparison
if verify(text, pattern, i): // O(m) when called
report match
roll_hash() // O(1)
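Combined with the preprocessing, the loop above can be sketched as a complete matcher; `base` and `mod` are illustrative choices, and verification guarantees no false positives regardless of collisions:

```python
def rabin_karp_match(text, pattern, base=256, mod=10**9 + 7):
    """Return 0-based positions where pattern occurs in text.
    O(1) per slide; O(m) verification only on hash equality."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    ph = wh = 0
    for i in range(m):                       # O(m) preprocessing
        ph = (ph * base + ord(pattern[i])) % mod
        wh = (wh * base + ord(text[i])) % mod
    hp = pow(base, m - 1, mod)               # base^(m-1) mod q
    matches = []
    for i in range(n - m + 1):               # n - m + 1 iterations
        if wh == ph:                         # O(1) hash comparison
            if text[i:i + m] == pattern:     # O(m) verification
                matches.append(i)
        if i < n - m:                        # O(1) hash roll
            wh = ((wh - ord(text[i]) * hp) * base + ord(text[i + m])) % mod
    return matches
```

Usage: `rabin_karp_match("hello world", "world")` returns `[6]`, and the all-A worst case discussed below reports a match at every position.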
Per-iteration breakdown: the hash comparison is O(1), the hash roll is O(1), and verification costs O(m) but runs only when the hashes are equal.
Key question: How often is verification performed?
Verification is called whenever window_hash == pattern_hash. This happens when the window is an actual match, or when a non-matching window collides with the pattern's hash (a spurious hit).
Let k = number of actual matches + number of spurious hits
Matching phase time complexity:
Total time = Loop iterations × O(1) + Verifications × O(m)
= (n - m + 1) × O(1) + k × O(m)
= O(n) + O(km)
= O(n + km)
Where k = (actual matches) + (spurious hits)
The critical insight:
The complexity depends entirely on k:
For typical inputs with few matches and rare collisions, k is very small. For pathological inputs, k can be large.
KMP guarantees O(n + m) regardless of input. Rabin-Karp only guarantees this in the expected case. However, Rabin-Karp's simplicity and ability to handle multiple patterns can make it preferable despite the weaker guarantee.
Let's derive the expected running time rigorously, assuming our hash function distributes values uniformly.
Probability model: assume the hash function behaves uniformly, so a non-matching window collides with the pattern's hash with probability 1/q, independently at each position.
Expected number of spurious hits:
For positions where the text window doesn't match the pattern (at most N = n - m + 1 positions):
E[spurious hits] = N × P(collision)
= N × (1/q)
= (n - m + 1) / q
With q = 10^9 and n = 10^6:
E[spurious hits] = 10^6 / 10^9 = 0.001
We expect essentially zero spurious hits for typical inputs!
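As a quick sanity check on the arithmetic, with the values used in the text:

```python
# q = 10^9, n = 10^6, and a small illustrative m
n, m, q = 10**6, 10, 10**9
expected_spurious = (n - m + 1) / q
print(expected_spurious)  # about 0.001
```

In other words, a million-character scan is expected to trigger roughly one spurious verification per thousand runs.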
Expected verification count:
Let A = number of actual matches (depends on pattern and text).
E[verifications] = A + E[spurious hits]
= A + (n - m + 1) / q
≈ A (when q is large)
Expected total time:
E[total time] = Preprocessing + Matching
= O(m) + O(n) + E[verifications] × O(m)
= O(m) + O(n) + O(Am) + O((n-m+1)/q × m)
= O(m + n + Am + nm/q)
When q ≫ nm (which is typical):
E[total time] = O(n + m + Am)
If A is O(1) (few matches), this is O(n + m).
| Scenario | Value of A | Expected Time | Simplified |
|---|---|---|---|
| Pattern not found | A = 0 | O(n + m + 0) | O(n + m) |
| Single match | A = 1 | O(n + m + m) | O(n + m) |
| Few matches | A = O(1) | O(n + m + m) | O(n + m) |
| Many matches | A = Θ(n/m) | O(n + m + n) | O(n) |
| Pattern matches everywhere | A = Θ(n) | O(n + m + nm) | O(nm) |
For most practical applications—searching for unique identifiers, keywords, or specific patterns—the number of matches A is very small, often 0 or 1. In these cases, Rabin-Karp runs in O(n + m) time, which is optimal for single-pattern matching.
Understanding the worst case helps identify when Rabin-Karp might underperform and when alternatives should be considered.
Worst case scenario:
The algorithm achieves O(nm) when verification is required at every position. This happens when the pattern actually matches at (nearly) every position, or when a spurious hit occurs at every position.
Example 1: Pattern matches everywhere
Text: AAAAAAAAA (n = 9)
Pattern: AA (m = 2)
Matches at: 0, 1, 2, 3, 4, 5, 6, 7 (8 matches!)
Work = 8 verifications × O(m) each = 8 × 2 = 16 character comparisons, i.e., Θ(nm) in general
This is unavoidable—if the pattern truly appears everywhere, we must verify everywhere.
Example 2: Adversarial hash collisions
An attacker who knows the hash parameters can construct input where every substring hashes to the pattern's hash value, forcing O(m) verification at each of n-m+1 positions.
```python
import time

def worst_case_demo():
    """Demonstrate Rabin-Karp worst case behavior."""
    # Worst case: pattern matches at almost every position
    sizes = [1000, 2000, 4000, 8000]
    print("Pattern matching everywhere (worst case):")
    print("-" * 50)
    for n in sizes:
        text = "A" * n
        pattern = "AA"  # Matches at positions 0, 1, ..., n-2
        start = time.time()
        # Simplified Rabin-Karp counting verifications
        base, mod = 31, 10**9 + 7
        ph = (ord('A') * base + ord('A')) % mod
        wh = ph  # First window is also "AA"
        bp = base  # base^(m-1) with m = 2
        match_count = 0
        verify_count = 0
        for i in range(n - 1):
            if wh == ph:
                verify_count += 1
                if text[i:i+2] == pattern:
                    match_count += 1
            if i < n - 2:
                wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + 2])) % mod
                if wh < 0:
                    wh += mod
        elapsed = time.time() - start
        print(f"n={n}: {match_count} matches, {verify_count} verifications, "
              f"{elapsed*1000:.2f}ms")
    print()
    print("Notice: Doubling n roughly doubles time (linear in n)")
    print("But work per position is O(m)=O(2), so total is O(nm)")

worst_case_demo()
```

Probability of worst case via collisions:
For the worst case to occur through collisions alone (not actual matches), every single position must have a spurious hit:
P(all positions collide) = (1/q)^N
= (1/10^9)^(n-m+1)
≈ 0 (exponentially small)
This is essentially impossible with random input. Worst case through collisions requires adversarial construction.
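Since the probability itself underflows any floating-point type, a sketch in log space shows just how small it is (values as in the text):

```python
import math

# Order-of-magnitude check for P(all positions collide) = (1/q)^(n-m+1).
# The probability itself underflows a float straight to 0.0.
n, m, q = 10**6, 10, 10**9
positions = n - m + 1
log10_p = positions * math.log10(1 / q)
print(f"P(all collide) ~ 10^{log10_p:.0f}")  # exponent around -9 million
```

A probability with roughly nine million leading zeros is, for every practical purpose, zero.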
Practical implications: on random or typical input, the collision-driven worst case is vanishingly unlikely, so only adversarial inputs crafted against known hash parameters are a concern. Choosing the base or modulus randomly at runtime removes even that risk.
Rabin-Karp has excellent space efficiency. Let's analyze it carefully.
Variables used: the pattern hash, the rolling window hash, the precomputed power b^(m-1) mod q, and a handful of counters, each requiring O(1) space.
Total auxiliary space: O(1)
Rabin-Karp uses constant extra memory regardless of input size. This is optimal for a pattern matching algorithm.
| Algorithm | Auxiliary Space | Total Space (including input) |
|---|---|---|
| Naive | O(1) | O(n + m) |
| Rabin-Karp | O(1) | O(n + m) |
| KMP | O(m) | O(n + m) |
| Boyer-Moore | O(m + σ) | O(n + m) |
| Suffix Array | O(n) | O(n) |
Memory access patterns:
Rabin-Karp has favorable memory access patterns: the text is scanned strictly sequentially, each character is read a constant number of times, and sequential scans make full use of cache lines and hardware prefetching.
For very large texts that don't fit in memory, Rabin-Karp can be implemented as a streaming algorithm, processing the text one character at a time without buffering.
Streaming implementation:
while has_more_input():
char = read_next_char()
update_hash(char, outgoing_char)
if hash_matches():
verify_or_report_possible_match()
outgoing_char = char_at(current_pos - m + 1)
This enables pattern matching on texts larger than available memory.
For streaming, you need to buffer the current window (m characters) to verify matches or enable outgoing character access. This adds O(m) space. Pure streaming without verification (Monte Carlo mode) truly uses O(1) space.
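A sketch of such a streaming matcher, assuming the input arrives as an iterator of characters and buffering only the m-character window (the function name and base/mod are illustrative):

```python
from collections import deque

def stream_search(chars, pattern, base=256, mod=10**9 + 7):
    """Streaming Rabin-Karp sketch over an iterator of characters.

    Buffers only the current m-character window (O(m) space) so hits
    can be verified. Yields 0-based match positions."""
    m = len(pattern)
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    hp = pow(base, m - 1, mod)  # base^(m-1) mod q for the outgoing char
    window = deque()
    wh = 0
    for pos, c in enumerate(chars):
        if len(window) == m:                 # window full: drop outgoing char
            out = window.popleft()
            wh = (wh - ord(out) * hp) % mod
        window.append(c)                     # absorb incoming char
        wh = (wh * base + ord(c)) % mod
        if len(window) == m and wh == ph:
            if ''.join(window) == pattern:   # O(m) verification from buffer
                yield pos - m + 1
```

Because the input is consumed one character at a time, the same function works on file handles, sockets, or generators without ever materializing the full text.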
Let's compare Rabin-Karp's complexity with other major string matching algorithms to understand when each is preferable.
| Algorithm | Preprocessing | Expected | Worst Case | Space |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(nm) | O(1) |
| Rabin-Karp | O(m) | O(n + m) | O(nm) | O(1) |
| KMP | O(m) | O(n + m) | O(n + m) | O(m) |
| Boyer-Moore | O(m + σ) | O(n/m) | O(nm) | O(m + σ) |
| Rabin-Karp (k patterns) | O(km) | O(n + km) | O(nm) | O(k) |
Key observations:
Rabin-Karp vs Naive: Same worst case, but vastly better expected case. The hash filter eliminates most comparisons.
Rabin-Karp vs KMP: KMP guarantees O(n+m) always. Rabin-Karp only achieves this in expectation. However: Rabin-Karp is simpler to implement, uses O(1) auxiliary space versus KMP's O(m), and extends naturally to multiple patterns.
Rabin-Karp vs Boyer-Moore: Boyer-Moore can be sublinear O(n/m) when pattern is long and mismatches occur early. Rabin-Karp is always at least O(n). Boyer-Moore is preferred for single long patterns.
Multiple patterns: Rabin-Karp shines when searching for k patterns simultaneously—preprocessing is O(km) but matching remains O(n + km) expected.
Use Rabin-Karp when: (1) searching for multiple patterns, (2) simplicity is valued, (3) input is not adversarial. Use KMP when: worst-case guarantees are required. Use Boyer-Moore when: pattern is long and alphabet is large.
Asymptotic complexity tells part of the story, but practical performance depends on additional factors. Let's examine them.
Factor 1: Constant factors in hash computation
Each hash update involves: one or two multiplications, an addition, a subtraction, and a modulo reduction, all constant-time integer operations.
The modulo operation (%) is often the slowest—it involves integer division. Using a power-of-two modulus enables bit masking instead:
hash % (2^k) = hash & ((1 << k) - 1) // Much faster!
However, power-of-two moduli have worse collision properties than primes.
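The identity holds for any non-negative hash value; a quick check:

```python
# For non-negative h and a power-of-two modulus 2^k,
# reduction is a bitwise AND with 2^k - 1.
k = 30
mod = 1 << k          # 2^30
mask = mod - 1
for h in (0, 1, 12345, 2**40 + 17, 987654321987654321):
    assert h % mod == h & mask
```

Note the non-negativity requirement: after the subtraction step of a rolling hash, the value must be renormalized before masking, since Python's `%` and a bitwise AND disagree on negative operands.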
Factor 2: Branch prediction
The main loop has a conditional check:
if window_hash == pattern_hash: // Usually false
verify()
Since matches are rare, the branch predictor learns to assume 'not equal' and proceeds efficiently. Spurious hits cause branch mispredictions, adding ~10-20 cycles each.
Factor 3: Cache behavior
Rabin-Karp reads text sequentially, which is cache-friendly: sequential access maximizes cache-line utilization and lets the hardware prefetcher stay ahead of the scan.
Factor 4: Integer overflow handling
In languages without arbitrary-precision integers (C++, Java), intermediate products in the hash update can overflow. Use 64-bit arithmetic with a modulus small enough that (q - 1) × base + max_char fits, or fall back to 128-bit intermediates.
```python
import time

def benchmark_mod_operations():
    """Compare different modulo approaches."""
    n = 10_000_000
    # Prime modulus (standard)
    mod_prime = 10**9 + 7
    # Power of two modulus (fast but more collisions)
    mod_pow2 = 2**30
    mask = mod_pow2 - 1
    values = list(range(n))
    # Prime modulo
    start = time.time()
    result1 = [v % mod_prime for v in values]
    time_prime = time.time() - start
    # Power of 2 modulo with masking
    start = time.time()
    result2 = [v & mask for v in values]
    time_pow2 = time.time() - start
    print(f"Prime modulo ({mod_prime}): {time_prime*1000:.2f}ms")
    print(f"Power-of-2 mask ({mod_pow2}): {time_pow2*1000:.2f}ms")
    print(f"Speedup: {time_prime/time_pow2:.2f}x")

benchmark_mod_operations()
```

These micro-optimizations rarely matter in practice. Focus first on correct implementation, then measure if performance is insufficient. The standard prime modulus approach is robust and typically fast enough.
Another way to understand Rabin-Karp's efficiency is through amortized analysis—analyzing total work over many operations.
Work accounting by character:
Consider accounting for work based on each character in the text:
Each text character contributes O(1) to hash computation: it is folded in once when it enters the window and subtracted out once when it leaves.
Each verification touches m characters: across k hits, verification work totals O(km).
Total work over all characters:
Hashing work = n characters × O(1) = O(n)
Verification work = (matches + collisions) × O(m)
The hashing work is perfectly amortized to O(1) per character. The verification work depends on how many times we verify.
Banker's method analysis:
Imagine each text character comes with 2 'credits':
Each hash update costs exactly 2 credits (one incoming char, one outgoing char). Since we have n characters with 2 credits each, we have 2n credits total.
Hash updates = n - m + 1
Credits needed = 2(n - m + 1) < 2n
Credits available = 2n
✓ All hashing is paid for
Verification credits:
Verification is 'extra' work not covered by character credits. Each hit, whether an actual match or a spurious one, pays separately for one O(m) verification.
This explains why complexity is O(n) + O(km).
Amortized analysis shows average cost per operation over a sequence. It doesn't guarantee individual operation costs. A single verification costs O(m), even though amortized hash cost is O(1). For real-time systems requiring bounded per-operation cost, this distinction matters.
We've thoroughly analyzed Rabin-Karp's complexity from multiple angles. Let's consolidate the key insights:
| Scenario | Recommendation | Complexity |
|---|---|---|
| Single pattern, trusted input | ✓ Use Rabin-Karp | O(n + m) expected |
| Single pattern, adversarial input | Use KMP instead | O(n + m) guaranteed |
| Multiple patterns | ✓✓ Strong choice | O(n + km) expected |
| Very long pattern | Consider Boyer-Moore | O(n/m) possible |
| Memory-constrained | ✓ Use Rabin-Karp | O(1) space |
What's next:
The final page explores Rabin-Karp's unique strength: efficiently searching for multiple patterns simultaneously. This is where Rabin-Karp truly outshines other algorithms, enabling powerful applications like plagiarism detection, DNA sequence analysis, and multi-keyword search.
You now have a rigorous understanding of Rabin-Karp's complexity—when it achieves O(n + m), when it degrades to O(nm), and why the expected case is usually excellent. This analysis enables informed algorithm selection for your specific use cases.