The Rabin-Karp algorithm has a fascinating complexity profile: it achieves O(n + m) expected time in typical cases, but can degrade to O(n × m) worst case under specific conditions. This dual nature distinguishes it from algorithms like KMP that guarantee O(n + m) always.
Understanding when and why Rabin-Karp performs well—and when it might struggle—is crucial for making informed algorithm choices. This page provides a rigorous complexity analysis that equips you to predict Rabin-Karp's behavior in any scenario.
By the end of this page, you will: (1) derive the preprocessing time complexity, (2) rigorously analyze expected running time, (3) understand worst-case scenarios and their probability, (4) compare with other string matching algorithms, and (5) know when Rabin-Karp is the optimal choice.
Before pattern matching begins, Rabin-Karp performs several preprocessing steps. Let's analyze each one.
Step 1: Compute pattern hash
for each character c in pattern: // m iterations
hash = (hash * base + ord(c)) % mod // O(1) each
Total: O(m) time
Step 2: Compute base^(m-1) mod q
// Method 1: Iterative (naïve)
power = 1
for i = 0 to m-2: // m-1 iterations
power = (power * base) % mod // O(1) each
// Method 2: Fast exponentiation
power = modpow(base, m-1, mod) // O(log m)
Iterative: O(m) time. Fast exponentiation: O(log m) time.
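The two methods can be compared directly. Here is a minimal sketch, assuming illustrative `base` and `mod` values; Python's built-in three-argument `pow` performs fast modular exponentiation:

```python
# Illustrative values, not prescribed by the algorithm
base, mod, m = 256, 10**9 + 7, 12

# Method 1: iterative, m - 1 multiplications -> O(m)
power = 1
for _ in range(m - 1):
    power = (power * base) % mod

# Method 2: binary exponentiation via the built-in
# three-argument pow -> O(log m)
fast_power = pow(base, m - 1, mod)

assert power == fast_power
```

For small m the difference is negligible; the O(log m) method matters mainly when the same helper is reused with very long patterns.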
Step 3: Compute first window hash
for i = 0 to m-1: // m iterations
window_hash = (window_hash * base + ord(text[i])) % mod // O(1)
Total: O(m) time
| Step | Operation | Time Complexity |
|---|---|---|
| Pattern hash | Hash m characters | O(m) |
| Base power | Compute b^(m-1) mod q | O(m) or O(log m) |
| First window hash | Hash first m characters of text | O(m) |
| Total preprocessing | — | O(m) |
Total preprocessing: O(m)
The preprocessing time is linear in pattern length, independent of text length. This is important: we do O(m) work once, then O(1) work per position during matching.
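The three steps can be combined into one routine. This is a sketch; the `preprocess` name and the base/mod defaults are illustrative assumptions, not fixed by the algorithm:

```python
def preprocess(pattern, text, base=256, mod=10**9 + 7):
    """All O(m) preprocessing in one place: pattern hash,
    base^(m-1) mod q, and the first window hash."""
    m = len(pattern)
    pattern_hash = 0
    for c in pattern:                       # Step 1: O(m)
        pattern_hash = (pattern_hash * base + ord(c)) % mod
    high_power = pow(base, m - 1, mod)      # Step 2: O(log m)
    window_hash = 0
    for c in text[:m]:                      # Step 3: O(m)
        window_hash = (window_hash * base + ord(c)) % mod
    return pattern_hash, high_power, window_hash
```

For example, `preprocess("AB", "ABAB")` returns equal pattern and window hashes, since the first window of the text is also "AB".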
Space complexity of preprocessing:
Rabin-Karp has excellent space efficiency—it uses only constant extra memory beyond the input strings.
The matching phase is where Rabin-Karp's complexity depends on the input. Let's analyze it carefully.
Main loop structure:
for i = 0 to n - m: // (n - m + 1) iterations
if window_hash == pattern_hash: // O(1) comparison
if verify(text, pattern, i): // O(m) when called
report match
roll_hash() // O(1)
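Combined with the preprocessing, the loop above can be sketched as a complete matcher; `base` and `mod` are illustrative choices, and verification guarantees no false positives regardless of collisions:

```python
def rabin_karp_match(text, pattern, base=256, mod=10**9 + 7):
    """Return 0-based positions where pattern occurs in text.
    O(1) per slide; O(m) verification only on hash equality."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    ph = wh = 0
    for i in range(m):                       # O(m) preprocessing
        ph = (ph * base + ord(pattern[i])) % mod
        wh = (wh * base + ord(text[i])) % mod
    hp = pow(base, m - 1, mod)               # base^(m-1) mod q
    matches = []
    for i in range(n - m + 1):               # n - m + 1 iterations
        if wh == ph:                         # O(1) hash comparison
            if text[i:i + m] == pattern:     # O(m) verification
                matches.append(i)
        if i < n - m:                        # O(1) hash roll
            wh = ((wh - ord(text[i]) * hp) * base + ord(text[i + m])) % mod
    return matches
```

Usage: `rabin_karp_match("hello world", "world")` returns `[6]`, and the all-A worst case discussed below reports a match at every position.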
Per-iteration breakdown: the hash comparison is O(1), the hash roll is O(1), and verification costs O(m) but runs only when the hashes are equal.
Key question: How often is verification performed?
Verification is called whenever window_hash == pattern_hash. This happens when the window is an actual match, or when a non-matching window collides with the pattern's hash (a spurious hit).
Let k = number of actual matches + number of spurious hits
Matching phase time complexity:
Total time = Loop iterations × O(1) + Verifications × O(m)
= (n - m + 1) × O(1) + k × O(m)
= O(n) + O(km)
= O(n + km)
Where k = (actual matches) + (spurious hits)
The critical insight:
The complexity depends entirely on k:
For typical inputs with few matches and rare collisions, k is very small. For pathological inputs, k can be large.
KMP guarantees O(n + m) regardless of input. Rabin-Karp only guarantees this in the expected case. However, Rabin-Karp's simplicity and ability to handle multiple patterns can make it preferable despite the weaker guarantee.
Let's derive the expected running time rigorously, assuming our hash function distributes values uniformly.
Probability model: assume the hash function behaves uniformly, so a non-matching window collides with the pattern's hash with probability 1/q, independently at each position.
Expected number of spurious hits:
For positions where the text window doesn't match the pattern (at most N = n - m + 1 positions):
E[spurious hits] = N × P(collision)
= N × (1/q)
= (n - m + 1) / q
With q = 10^9 and n = 10^6:
E[spurious hits] = 10^6 / 10^9 = 0.001
We expect essentially zero spurious hits for typical inputs!
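As a quick sanity check on the arithmetic, with the values used in the text:

```python
# q = 10^9, n = 10^6, and a small illustrative m
n, m, q = 10**6, 10, 10**9
expected_spurious = (n - m + 1) / q
print(expected_spurious)  # about 0.001
```

In other words, a million-character scan is expected to trigger roughly one spurious verification per thousand runs.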
Expected verification count:
Let A = number of actual matches (depends on pattern and text).
E[verifications] = A + E[spurious hits]
= A + (n - m + 1) / q
≈ A (when q is large)
Expected total time:
E[total time] = Preprocessing + Matching
= O(m) + O(n) + E[verifications] × O(m)
= O(m) + O(n) + O(Am) + O((n-m+1)/q × m)
= O(m + n + Am + nm/q)
When q ≫ nm (which is typical):
E[total time] = O(n + m + Am)
If A is O(1) (few matches), this is O(n + m).
| Scenario | Value of A | Expected Time | Simplified |
|---|---|---|---|
| Pattern not found | A = 0 | O(n + m + 0) | O(n + m) |
| Single match | A = 1 | O(n + m + m) | O(n + m) |
| Few matches | A = O(1) | O(n + m + m) | O(n + m) |
| Many matches | A = Θ(n/m) | O(n + m + n) | O(n) |
| Pattern matches everywhere | A = Θ(n) | O(n + m + nm) | O(nm) |
For most practical applications—searching for unique identifiers, keywords, or specific patterns—the number of matches A is very small, often 0 or 1. In these cases, Rabin-Karp runs in O(n + m) time, which is optimal for single-pattern matching.
Understanding the worst case helps identify when Rabin-Karp might underperform and when alternatives should be considered.
Worst case scenario:
The algorithm achieves O(nm) when verification is required at every position. This happens when the pattern actually matches at (nearly) every position, or when a spurious hit occurs at every position.
Example 1: Pattern matches everywhere
Text: AAAAAAAAA (n = 9)
Pattern: AA (m = 2)
Matches at: 0, 1, 2, 3, 4, 5, 6, 7 (8 matches!)
Work = 8 verifications × O(m) each = 8 × 2 = 16 character comparisons, i.e., Θ(nm) in general
This is unavoidable—if the pattern truly appears everywhere, we must verify everywhere.
Example 2: Adversarial hash collisions
An attacker who knows the hash parameters can construct input where every substring hashes to the pattern's hash value, forcing O(m) verification at each of n-m+1 positions.
```python
import time

def worst_case_demo():
    """Demonstrate Rabin-Karp worst case behavior."""
    # Worst case: pattern matches at almost every position
    sizes = [1000, 2000, 4000, 8000]
    print("Pattern matching everywhere (worst case):")
    print("-" * 50)
    for n in sizes:
        text = "A" * n
        pattern = "AA"  # Matches at positions 0, 1, ..., n-2
        start = time.time()
        # Simplified Rabin-Karp counting verifications
        base, mod = 31, 10**9 + 7
        ph = (ord('A') * base + ord('A')) % mod
        wh = ph  # First window is also "AA"
        bp = base  # base^(m-1) with m = 2
        match_count = 0
        verify_count = 0
        for i in range(n - 1):
            if wh == ph:
                verify_count += 1
                if text[i:i+2] == pattern:
                    match_count += 1
            if i < n - 2:
                wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + 2])) % mod
                if wh < 0:
                    wh += mod
        elapsed = time.time() - start
        print(f"n={n}: {match_count} matches, {verify_count} verifications, "
              f"{elapsed*1000:.2f}ms")
    print()
    print("Notice: Doubling n roughly doubles time (linear in n)")
    print("But work per position is O(m)=O(2), so total is O(nm)")

worst_case_demo()
```

Probability of worst case via collisions:
For the worst case to occur through collisions alone (not actual matches), every single position must have a spurious hit:
P(all positions collide) = (1/q)^N
= (1/10^9)^(n-m+1)
≈ 0 (exponentially small)
This is essentially impossible with random input. Worst case through collisions requires adversarial construction.
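Since the probability itself underflows any floating-point type, a sketch in log space shows just how small it is (values as in the text):

```python
import math

# Order-of-magnitude check for P(all positions collide) = (1/q)^(n-m+1).
# The probability itself underflows a float straight to 0.0.
n, m, q = 10**6, 10, 10**9
positions = n - m + 1
log10_p = positions * math.log10(1 / q)
print(f"P(all collide) ~ 10^{log10_p:.0f}")  # exponent around -9 million
```

A probability with roughly nine million leading zeros is, for every practical purpose, zero.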
Practical implications: on random or typical input, the collision-driven worst case is vanishingly unlikely, so only adversarial inputs crafted against known hash parameters are a concern. Choosing the base or modulus randomly at runtime removes even that risk.
Rabin-Karp has excellent space efficiency. Let's analyze it carefully.
Variables used: the pattern hash, the rolling window hash, the precomputed power b^(m-1) mod q, and a handful of counters, each requiring O(1) space.
Total auxiliary space: O(1)
Rabin-Karp uses constant extra memory regardless of input size. This is optimal for a pattern matching algorithm.
| Algorithm | Auxiliary Space | Total Space (including input) |
|---|---|---|
| Naive | O(1) | O(n + m) |
| Rabin-Karp | O(1) | O(n + m) |
| KMP | O(m) | O(n + m) |
| Boyer-Moore | O(m + σ) | O(n + m) |
| Suffix Array | O(n) | O(n) |
Memory access patterns:
Rabin-Karp has favorable memory access patterns: the text is scanned strictly sequentially, each character is read a constant number of times, and sequential scans make full use of cache lines and hardware prefetching.
For very large texts that don't fit in memory, Rabin-Karp can be implemented as a streaming algorithm, processing the text one character at a time without buffering.
Streaming implementation:
while has_more_input():
char = read_next_char()
update_hash(char, outgoing_char)
if hash_matches():
verify_or_report_possible_match()
outgoing_char = char_at(current_pos - m + 1)
This enables pattern matching on texts larger than available memory.
For streaming, you need to buffer the current window (m characters) to verify matches or enable outgoing character access. This adds O(m) space. Pure streaming without verification (Monte Carlo mode) truly uses O(1) space.
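A sketch of such a streaming matcher, assuming the input arrives as an iterator of characters and buffering only the m-character window (the function name and base/mod are illustrative):

```python
from collections import deque

def stream_search(chars, pattern, base=256, mod=10**9 + 7):
    """Streaming Rabin-Karp sketch over an iterator of characters.

    Buffers only the current m-character window (O(m) space) so hits
    can be verified. Yields 0-based match positions."""
    m = len(pattern)
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    hp = pow(base, m - 1, mod)  # base^(m-1) mod q for the outgoing char
    window = deque()
    wh = 0
    for pos, c in enumerate(chars):
        if len(window) == m:                 # window full: drop outgoing char
            out = window.popleft()
            wh = (wh - ord(out) * hp) % mod
        window.append(c)                     # absorb incoming char
        wh = (wh * base + ord(c)) % mod
        if len(window) == m and wh == ph:
            if ''.join(window) == pattern:   # O(m) verification from buffer
                yield pos - m + 1
```

Because the input is consumed one character at a time, the same function works on file handles, sockets, or generators without ever materializing the full text.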
Let's compare Rabin-Karp's complexity with other major string matching algorithms to understand when each is preferable.
| Algorithm | Preprocessing | Expected | Worst Case | Space |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(nm) | O(1) |
| Rabin-Karp | O(m) | O(n + m) | O(nm) | O(1) |
| KMP | O(m) | O(n + m) | O(n + m) | O(m) |
| Boyer-Moore | O(m + σ) | O(n/m) | O(nm) | O(m + σ) |
| Rabin-Karp (k patterns) | O(km) | O(n + km) | O(nm) | O(k) |
Key observations:
Rabin-Karp vs Naive: Same worst case, but vastly better expected case. The hash filter eliminates most comparisons.
Rabin-Karp vs KMP: KMP guarantees O(n+m) always. Rabin-Karp only achieves this in expectation. However: Rabin-Karp is simpler to implement, uses O(1) auxiliary space versus KMP's O(m), and extends naturally to multiple patterns.
Rabin-Karp vs Boyer-Moore: Boyer-Moore can be sublinear O(n/m) when pattern is long and mismatches occur early. Rabin-Karp is always at least O(n). Boyer-Moore is preferred for single long patterns.
Multiple patterns: Rabin-Karp shines when searching for k patterns simultaneously—preprocessing is O(km) but matching remains O(n + km) expected.
Use Rabin-Karp when: (1) searching for multiple patterns, (2) simplicity is valued, (3) input is not adversarial. Use KMP when: worst-case guarantees are required. Use Boyer-Moore when: pattern is long and alphabet is large.
Asymptotic complexity tells part of the story, but practical performance depends on additional factors. Let's examine them.
Factor 1: Constant factors in hash computation
Each hash update involves: one or two multiplications, an addition, a subtraction, and a modulo reduction, all constant-time integer operations.
The modulo operation (%) is often the slowest—it involves integer division. Using a power-of-two modulus enables bit masking instead:
hash % (2^k) = hash & ((1 << k) - 1) // Much faster!
However, power-of-two moduli have worse collision properties than primes.
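The identity holds for any non-negative hash value; a quick check:

```python
# For non-negative h and a power-of-two modulus 2^k,
# reduction is a bitwise AND with 2^k - 1.
k = 30
mod = 1 << k          # 2^30
mask = mod - 1
for h in (0, 1, 12345, 2**40 + 17, 987654321987654321):
    assert h % mod == h & mask
```

Note the non-negativity requirement: after the subtraction step of a rolling hash, the value must be renormalized before masking, since Python's `%` and a bitwise AND disagree on negative operands.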
Factor 2: Branch prediction
The main loop has a conditional check:
if window_hash == pattern_hash: // Usually false
verify()
Since matches are rare, the branch predictor learns to assume 'not equal' and proceeds efficiently. Spurious hits cause branch mispredictions, adding ~10-20 cycles each.
Factor 3: Cache behavior
Rabin-Karp reads text sequentially, which is cache-friendly: sequential access maximizes cache-line utilization and lets the hardware prefetcher stay ahead of the scan.
Factor 4: Integer overflow handling
In languages without arbitrary-precision integers (C++, Java), intermediate products in the hash update can overflow. Use 64-bit arithmetic with a modulus small enough that (q - 1) × base + max_char fits, or fall back to 128-bit intermediates.
```python
import time

def benchmark_mod_operations():
    """Compare different modulo approaches."""
    n = 10_000_000
    # Prime modulus (standard)
    mod_prime = 10**9 + 7
    # Power of two modulus (fast but more collisions)
    mod_pow2 = 2**30
    mask = mod_pow2 - 1
    values = list(range(n))
    # Prime modulo
    start = time.time()
    result1 = [v % mod_prime for v in values]
    time_prime = time.time() - start
    # Power of 2 modulo with masking
    start = time.time()
    result2 = [v & mask for v in values]
    time_pow2 = time.time() - start
    print(f"Prime modulo ({mod_prime}): {time_prime*1000:.2f}ms")
    print(f"Power-of-2 mask ({mod_pow2}): {time_pow2*1000:.2f}ms")
    print(f"Speedup: {time_prime/time_pow2:.2f}x")

benchmark_mod_operations()
```

These micro-optimizations rarely matter in practice. Focus first on correct implementation, then measure if performance is insufficient. The standard prime modulus approach is robust and typically fast enough.
Another way to understand Rabin-Karp's efficiency is through amortized analysis—analyzing total work over many operations.
Work accounting by character:
Consider accounting for work based on each character in the text:
Each text character contributes O(1) to hash computation: it is folded in once when it enters the window and subtracted out once when it leaves.
Each verification touches m characters: across k hits, verification work totals O(km).
Total work over all characters:
Hashing work = n characters × O(1) = O(n)
Verification work = (matches + collisions) × O(m)
The hashing work is perfectly amortized to O(1) per character. The verification work depends on how many times we verify.
Banker's method analysis:
Imagine each text character comes with 2 'credits':
Each hash update costs exactly 2 credits (one incoming char, one outgoing char). Since we have n characters with 2 credits each, we have 2n credits total.
Hash updates = n - m + 1
Credits needed = 2(n - m + 1) < 2n
Credits available = 2n
✓ All hashing is paid for
Verification credits:
Verification is 'extra' work not covered by character credits. Each hit, whether an actual match or a spurious one, pays separately for one O(m) verification.
This explains why complexity is O(n) + O(km).
Amortized analysis shows average cost per operation over a sequence. It doesn't guarantee individual operation costs. A single verification costs O(m), even though amortized hash cost is O(1). For real-time systems requiring bounded per-operation cost, this distinction matters.
We've thoroughly analyzed Rabin-Karp's complexity from multiple angles. Let's consolidate the key insights:
| Scenario | Recommendation | Complexity |
|---|---|---|
| Single pattern, trusted input | ✓ Use Rabin-Karp | O(n + m) expected |
| Single pattern, adversarial input | Use KMP instead | O(n + m) guaranteed |
| Multiple patterns | ✓✓ Strong choice | O(n + km) expected |
| Very long pattern | Consider Boyer-Moore | O(n/m) possible |
| Memory-constrained | ✓ Use Rabin-Karp | O(1) space |
What's next:
The final page explores Rabin-Karp's unique strength: efficiently searching for multiple patterns simultaneously. This is where Rabin-Karp truly outshines other algorithms, enabling powerful applications like plagiarism detection, DNA sequence analysis, and multi-keyword search.
You now have a rigorous understanding of Rabin-Karp's complexity—when it achieves O(n + m), when it degrades to O(nm), and why the expected case is usually excellent. This analysis enables informed algorithm selection for your specific use cases.