Throughout this module, we've seen how Rabin-Karp achieves O(n + m) expected time for single-pattern matching—comparable to KMP, though without the same worst-case guarantee. So why choose Rabin-Karp?
The answer lies in its killer feature: the ability to efficiently search for multiple patterns simultaneously. While KMP would require separate passes for each pattern, Rabin-Karp can check k patterns at each position with minimal overhead, achieving O(n + km) expected time instead of O(kn + km).
This capability makes Rabin-Karp the algorithm of choice for many real-world applications: plagiarism detection, malware signature scanning, multi-keyword search, and bioinformatics sequence analysis.
By the end of this page, you will: (1) understand why Rabin-Karp excels at multi-pattern matching, (2) implement efficient multi-pattern search, (3) analyze the complexity and tradeoffs, (4) explore real-world applications, and (5) compare with specialized algorithms like Aho-Corasick.
Let's formalize the multi-pattern matching problem and understand why naive approaches are inefficient.
Problem definition:
Given:
- A text T of length n
- A set of k patterns P₁, P₂, ..., Pₖ with total combined length M
Find: All occurrences of any pattern in the text.
Naive approach: Run single-pattern algorithm k times
for each pattern P in patterns:
find_all_occurrences(text, P) // O(n + m) per pattern
Total time: O(k × n + M) = O(kn + M)
For 1000 patterns and a 1MB text, this means ~1 billion operations. Completely impractical.
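For concreteness, here is the naive baseline as runnable code (a minimal sketch; the function name is ours for illustration):

```python
def naive_multi_pattern(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Run one separate pass over the text per pattern: k full text scans."""
    results = {}
    for p in patterns:
        # Each pattern re-reads the entire text from the beginning
        results[p] = [i for i in range(len(text) - len(p) + 1)
                      if text[i:i + len(p)] == p]
    return results
```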
The fundamental inefficiency:
Running separate searches means reading the text k times. If the text is large (e.g., a genome), this is extraordinarily wasteful.
Pattern 1: Scan entire text → O(n)
Pattern 2: Scan entire text → O(n)
...
Pattern k: Scan entire text → O(n)
─────────────────────────────────
Total: k × O(n) = O(kn)
The goal:
Can we scan the text just once and check all patterns at each position? If the per-position check costs O(1) instead of O(k), the total drops from O(kn) to roughly O(n) for scanning plus O(M) for preprocessing.
Rabin-Karp achieves exactly this through hash-based filtering.
| Approach | Time Complexity | Text Scans | Practical for Large k? |
|---|---|---|---|
| Naive (k separate runs) | O(kn + M) | k | No |
| Rabin-Karp multi-pattern | O(n + M) expected | 1 | Yes |
| Aho-Corasick | O(n + M + z) | 1 | Yes |
The text is often much larger than the combined pattern length. Reading the text once instead of k times can mean the difference between seconds and hours. Rabin-Karp's ability to check multiple patterns with a single hash lookup per position is what makes this possible.
Rabin-Karp's multi-pattern matching uses a simple but powerful insight: store all pattern hashes in a set, then check if each window's hash is in the set.
Algorithm structure:
Preprocessing:
- Compute the hash of each of the k patterns: O(m) per pattern
- Store all pattern hashes in a hash set: O(1) expected per insertion

Matching:
- Slide a window over the text, updating its rolling hash in O(1) per position
- Look up each window hash in the pattern hash set: O(1) expected
- On a hash hit, verify the window against the candidate patterns character by character
Why this works:
Pattern hashes: {H(P₁), H(P₂), ..., H(Pₖ)} // Stored in hash set
At each position i:
window_hash = H(text[i:i+m])
if window_hash in pattern_hashes: // O(1) lookup!
verify and report matches
The key is that checking if a hash exists in a set takes O(1) average time, regardless of how many patterns we have!
```python
def rabin_karp_multi_pattern(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Find all occurrences of multiple patterns in text.
    Assumes all patterns have the same length.

    Returns dict mapping each pattern to list of match positions.

    Time: O(n + km) expected where k = number of patterns
    Space: O(k) for pattern hash storage
    """
    if not patterns:
        return {}

    n = len(text)
    m = len(patterns[0])  # Assumes equal-length patterns

    # Validate all patterns have same length
    if not all(len(p) == m for p in patterns):
        raise ValueError("All patterns must have the same length")

    if m > n:
        return {p: [] for p in patterns}

    base = 31
    mod = 10**9 + 7

    # Preprocessing: compute all pattern hashes
    pattern_hashes = {}  # hash -> list of patterns with that hash
    for pattern in patterns:
        h = 0
        for c in pattern:
            h = (h * base + ord(c)) % mod
        # Multiple patterns might have same hash (collision)
        if h not in pattern_hashes:
            pattern_hashes[h] = []
        pattern_hashes[h].append(pattern)

    # Initialize result
    results = {p: [] for p in patterns}

    # Compute first window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Precompute base^(m-1)
    base_power = pow(base, m - 1, mod)

    # Scan text once
    for i in range(n - m + 1):
        # O(1) hash set lookup
        if window_hash in pattern_hashes:
            # Verify against all patterns with this hash
            window_text = text[i:i+m]
            for pattern in pattern_hashes[window_hash]:
                if window_text == pattern:
                    results[pattern].append(i)

        # Roll hash
        if i < n - m:
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
            if window_hash < 0:
                window_hash += mod

    return results


# Example usage
text = "ABABCABABCABABCAB"
patterns = ["ABABC", "BABCA", "ABCAB", "CABAB"]

results = rabin_karp_multi_pattern(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
    else:
        print(f"'{pattern}' not found")
```

The basic multi-pattern Rabin-Karp assumes all patterns have the same length. Real-world applications often require searching for patterns of varying lengths.
Challenge:
With different pattern lengths, we need different window sizes. Rolling hashes for different window sizes don't share computation.
Solutions:
Approach 1: Group by length
Group patterns by length and run separate searches for each group:
patterns_by_length = group_patterns(patterns)
for length, patterns_of_this_length in patterns_by_length:
run_rabin_karp(text, patterns_of_this_length, window_size=length)
Complexity: O(L × n + M), where L = number of distinct pattern lengths (one text scan per length)
This is efficient when there are few distinct lengths.
```python
from collections import defaultdict


def rabin_karp_variable_length(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Multi-pattern Rabin-Karp supporting variable-length patterns.
    Groups patterns by length and runs separate searches per group.

    Efficient when number of distinct lengths is small.
    """
    if not patterns:
        return {}

    n = len(text)
    base = 31
    mod = 10**9 + 7

    # Group patterns by length
    by_length = defaultdict(list)
    for pattern in patterns:
        by_length[len(pattern)].append(pattern)

    results = {p: [] for p in patterns}

    for m, patterns_of_length in by_length.items():
        if m > n:
            continue

        # Build hash set for this length
        pattern_hashes = defaultdict(list)
        for pattern in patterns_of_length:
            h = 0
            for c in pattern:
                h = (h * base + ord(c)) % mod
            pattern_hashes[h].append(pattern)

        # Compute first window hash
        window_hash = 0
        for i in range(m):
            window_hash = (window_hash * base + ord(text[i])) % mod

        base_power = pow(base, m - 1, mod)

        # Scan text for this length
        for i in range(n - m + 1):
            if window_hash in pattern_hashes:
                window_text = text[i:i+m]
                for pattern in pattern_hashes[window_hash]:
                    if window_text == pattern:
                        results[pattern].append(i)

            if i < n - m:
                outgoing = ord(text[i])
                incoming = ord(text[i + m])
                window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
                if window_hash < 0:
                    window_hash += mod

    return results


# Example with variable-length patterns
text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
patterns = ["THE", "QUICK", "BROWN FOX", "LAZY"]

results = rabin_karp_variable_length(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
```

Approach 2: Minimum-length filtering
Compute hash for the minimum pattern length. Use this as a preliminary filter, then verify longer patterns:
min_length = min(len(p) for p in patterns)
preliminary_hash = rolling_hash(text, min_length)
for each position where preliminary hash matches any pattern prefix:
check all patterns whose prefix matched
This is more complex but reduces text scans to one.
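Here is a minimal sketch of Approach 2 (the function name and structure are ours for illustration): each pattern is indexed by the hash of its first min-length characters, so a single rolling-hash scan filters candidates for every pattern at once.

```python
from collections import defaultdict

def rabin_karp_min_length_filter(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Single-scan multi-pattern search using the shortest pattern length
    as a preliminary hash filter. Longer candidates are verified in full."""
    if not patterns:
        return {}
    patterns = list(dict.fromkeys(patterns))  # deduplicate
    n = len(text)
    base, mod = 31, 10**9 + 7
    m_min = min(len(p) for p in patterns)
    if m_min > n:
        return {p: [] for p in patterns}

    # Index each pattern by the hash of its first m_min characters
    prefix_hashes = defaultdict(list)
    for p in patterns:
        h = 0
        for c in p[:m_min]:
            h = (h * base + ord(c)) % mod
        prefix_hashes[h].append(p)

    results = {p: [] for p in patterns}

    # One rolling-hash scan with window size m_min
    window_hash = 0
    for i in range(m_min):
        window_hash = (window_hash * base + ord(text[i])) % mod
    base_power = pow(base, m_min - 1, mod)

    for i in range(n - m_min + 1):
        if window_hash in prefix_hashes:
            # Prefix hash matched: verify each candidate pattern in full
            for p in prefix_hashes[window_hash]:
                if text[i:i + len(p)] == p:
                    results[p].append(i)
        if i < n - m_min:
            window_hash = ((window_hash - ord(text[i]) * base_power) * base
                           + ord(text[i + m_min])) % mod
    return results
```

The tradeoff: if many patterns share a common min-length prefix, verification work at hot positions grows, so this pays off mainly when prefixes are diverse.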
Approach 3: Use Aho-Corasick
For truly optimal variable-length multi-pattern matching, consider Aho-Corasick (briefly discussed later). It achieves O(n + M + z) where z = total match count.
In many applications, patterns cluster around a few lengths. For example, malware signatures might mostly be 20-30 bytes, keyword lists might be 5-15 characters. The 'group by length' approach handles these cases efficiently.
Let's rigorously analyze the complexity of multi-pattern Rabin-Karp.
Parameters:
- n = text length
- m = pattern length (equal-length case)
- k = number of patterns
- M = km = total combined pattern length
- q = hash modulus
Preprocessing:
for each of k patterns: // k iterations
compute hash: // O(m) each
add to hash set: // O(1) expected
Total preprocessing: O(km) = O(M)
Matching:
for each of n-m+1 positions: // O(n) iterations
hash lookup in set: // O(1) expected
if match: verify // O(?) - see below
roll hash: // O(1)
Base matching: O(n) (without considering verification)
Verification analysis:
Verification occurs when the window hash is found in the pattern hash set. This happens for true matches (the window really equals a pattern) and for spurious hits (a hash collision with some pattern's hash).
Expected spurious hits:
With k patterns and modulus q:
P(window hash matches any pattern hash) ≤ k/q
For n-m+1 windows:
E[spurious hits] ≤ (n-m+1) × k/q ≈ nk/q
With q = 10^9, n = 10^6, k = 100:
E[spurious hits] ≈ 10^6 × 100 / 10^9 = 0.1
Negligible spurious hits expected!
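This bound is plain arithmetic, so it is easy to evaluate for other parameter choices (the second parameter set below is a hypothetical larger workload):

```python
# E[spurious hits] ≈ n * k / q for n windows, k patterns, modulus q
for n, k, q in [(10**6, 100, 10**9 + 7), (10**9, 10**4, 10**9 + 7)]:
    print(f"n={n:.0e}, k={k}: expected spurious hits ≈ {n * k / q:.2f}")
```

At the larger scale a single modulus yields thousands of spurious hits, which is exactly when double hashing (used in the complete implementation later on this page) becomes worthwhile.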
Total expected time:
Preprocessing + Matching + Verification
= O(M) + O(n) + O(A × m)
= O(n + M + Am)
Where A = actual match count. For typical scenarios with few matches: O(n + M)
| Phase | Operation | Complexity |
|---|---|---|
| Preprocessing | Hash k patterns | O(M) = O(km) |
| Matching | Scan n positions, O(1) lookup each | O(n) |
| Verification | Verify A matches + spurious hits | O(Am) expected |
| Total | — | O(n + M) expected |
Naive approach: O(kn + M) — scans text k times. Rabin-Karp: O(n + M) — scans text once. For k = 1000 patterns and 1MB text, this is the difference between ~1 billion operations and ~1 million operations. A 1000× improvement!
Multi-pattern Rabin-Karp powers numerous real-world systems. Let's explore the key applications in depth.
Application 1: Plagiarism Detection
Problem: Given a document and a corpus of existing documents, find copied passages.
Approach: extract overlapping word n-grams from the target document, then use multi-pattern Rabin-Karp to search every corpus document for all of them at once:
# Pseudo-code for plagiarism detection
grams = extract_ngrams(target_document, n=5)
for corpus_doc in corpus:
matches = rabin_karp_multi(corpus_doc, grams)
if len(matches) / len(corpus_doc) > threshold:
flag_as_plagiarism(target_document, corpus_doc, matches)
Rabin-Karp makes this feasible for large corpora.
Application 2: Malware Signature Scanning
Problem: Scan files for known malware byte sequences (signatures).
Challenges:
- Signature databases contain thousands to millions of byte sequences
- Files can be large, and scanning must keep up with disk throughput
- Signatures vary in length

Solution: hash every signature once at startup, then scan each file a single time, checking the rolling hash of each window against the full signature set.
Application 3: Intrusion Detection Systems (IDS)
Network traffic is scanned for known attack patterns such as exploit strings and suspicious byte sequences.
Rabin-Karp enables line-rate scanning of network packets against thousands of attack signatures.
Application 4: DNA Sequence Analysis
Bioinformatics applications:
- Searching a genome for a set of known motifs or marker sequences
- Locating many probe or primer sequences in large DNA datasets in a single pass
```python
def plagiarism_detection(target: str, corpus: list[str], n: int = 5) -> dict:
    """
    Detect potential plagiarism by finding shared n-grams.

    Args:
        target: Document to check for plagiarism
        corpus: List of reference documents
        n: Size of word n-grams to use

    Returns:
        Dict with plagiarism analysis for each corpus document
    """
    # Extract word n-grams from target
    target_words = target.split()
    if len(target_words) < n:
        return {}

    # Create n-gram patterns
    ngrams = set()
    for i in range(len(target_words) - n + 1):
        ngram = ' '.join(target_words[i:i+n])
        ngrams.add(ngram)

    # For equal-length search, group by length
    # Here we simplify by assuming space-separated words
    results = {}
    for doc_id, doc in enumerate(corpus):
        doc_name = f"Document {doc_id}"

        # Find matches using Rabin-Karp
        matches = rabin_karp_variable_length(doc, list(ngrams))

        match_count = sum(len(positions) for positions in matches.values())
        match_ratio = match_count / max(1, len(doc.split()) - n + 1)

        results[doc_name] = {
            'match_count': match_count,
            'match_ratio': match_ratio,
            'plagiarism_flag': match_ratio > 0.1,  # 10% threshold
            'matched_ngrams': [ng for ng, pos in matches.items() if pos]
        }

    return results


# Example usage
target = "The quick brown fox jumps over the lazy dog"
corpus = [
    "The quick brown fox runs through the forest",
    "This is a completely different document about cats",
    "The quick brown fox jumps over the lazy sleeping dog"
]

analysis = plagiarism_detection(target, corpus, n=3)
for doc, info in analysis.items():
    print(f"{doc}: {info['match_count']} matches, " +
          f"ratio={info['match_ratio']:.2%}, " +
          f"flagged={info['plagiarism_flag']}")
```

Production plagiarism detectors (like Turnitin) and antivirus scanners use highly optimized variants of these algorithms, often combined with other techniques like locality-sensitive hashing for approximate matching and SIMD instructions for speed.
For production systems handling millions of patterns or gigabytes of text, additional optimizations become important.
Optimization 1: Bloom filter pre-screening
Instead of a hash set, use a Bloom filter for pattern hashes:
bloom = BloomFilter(size=1000000, hash_count=7)
for pattern in patterns:
bloom.add(hash(pattern))
# During matching
if bloom.might_contain(window_hash): # Fast!
if window_hash in exact_hash_set: # Rare
verify()
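Here is a minimal Bloom filter sketch to make the pre-screening concrete (class and method names are ours; a production system would use a tuned library implementation):

```python
import hashlib

class SimpleBloomFilter:
    """Compact set-membership filter: false positives possible,
    false negatives impossible."""

    def __init__(self, size: int = 1_000_000, hash_count: int = 7):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item: int):
        # Derive hash_count bit positions from slices of one SHA-256 digest
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.hash_count):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], 'big') % self.size

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Demo: pre-screen window hashes before the exact set lookup
bloom = SimpleBloomFilter()
exact_hash_set = {12345, 67890}            # hypothetical pattern hashes
for h in exact_hash_set:
    bloom.add(h)

for window_hash in (12345, 11111):
    if bloom.might_contain(window_hash):   # fast bit tests
        if window_hash in exact_hash_set:  # exact check, rarely reached
            print(f"hash {window_hash}: verify candidate patterns")
```

The win is cache behavior: the bit array is far smaller than a hash set of the same capacity, so the common case (no match) touches less memory.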
Optimization 2: SIMD-accelerated hashing
Modern CPUs can compute multiple hashes in parallel using SIMD instructions, for example by hashing several independent text chunks at once.
Optimization 3: Pattern hash precomputation
For static pattern sets (e.g., malware signatures), precompute the pattern hashes once and cache them, so each scan only loads them instead of rehashing every signature.
Optimization 4: Chunked processing for large texts
for chunk in text.chunks(CHUNK_SIZE):
matches_in_chunk = rabin_karp(chunk, patterns)
adjust_positions(matches_in_chunk, chunk.offset)
yield matches_in_chunk
Benefits (a sketch follows this list):
- Memory usage stays bounded regardless of text size
- Works on streams that don't fit in memory
- Independent chunks can be processed in parallel
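Here is a sketch of the chunked driver, reusing the equal-length rabin_karp_multi_pattern from earlier; read_chunk is a hypothetical callable that returns the next piece of text ('' at end of input). The key detail is keeping the last m-1 characters of each chunk so matches spanning a boundary aren't lost:

```python
def search_in_chunks(read_chunk, patterns: list[str], chunk_size: int = 1 << 20):
    """Stream a large text through rabin_karp_multi_pattern chunk by chunk,
    yielding (pattern, absolute_position) pairs."""
    m = len(patterns[0])   # equal-length patterns, as assumed earlier
    offset = 0             # absolute position of buffer[0] in the text
    buffer = ""
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        buffer += chunk
        for pattern, positions in rabin_karp_multi_pattern(buffer, patterns).items():
            for pos in positions:
                yield pattern, offset + pos   # adjust to absolute position
        # Keep the final m-1 characters: a match could straddle the boundary.
        # Complete matches never start inside this tail, so nothing repeats.
        keep = m - 1
        offset += len(buffer) - keep
        buffer = buffer[-keep:] if keep else ""


# Demo on an in-memory stream
import io
stream = io.StringIO("ABABC" * 1000)
hits = list(search_in_chunks(stream.read, ["ABABC"], chunk_size=64))
print(len(hits), hits[:3])   # 1000 matches at positions 0, 5, 10, ...
```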
Optimization 5: Early termination
For some applications, we don't need all matches: a content filter only needs to know whether any blocked phrase occurs, so the scan can stop at the first hit (see contains_any in the complete implementation below).
These optimizations add complexity. Always profile your specific workload first. For many applications, the basic implementation is fast enough. Only add optimizations when measurements show they're needed.
Rabin-Karp isn't the only multi-pattern algorithm. The Aho-Corasick algorithm is a specialized automaton-based approach that guarantees optimal time complexity.
Aho-Corasick overview:
- Builds a trie of all patterns, augmented with failure links, forming a finite automaton
- Scans the text exactly once, following automaton transitions; no rescanning on mismatch
- Reports all matches in guaranteed O(n + M + z) time
Complexity comparison:
| Aspect | Rabin-Karp Multi | Aho-Corasick |
|---|---|---|
| Preprocessing | O(M) | O(M × alphabet_size) |
| Matching | O(n + M) expected | O(n + z) guaranteed |
| Total | O(n + M) expected | O(n + M + z) |
| Space | O(k) pattern hashes | O(M) automaton states |
| Variable lengths | Requires grouping | Native support |
| Worst case | O(nM) | O(n + M + z) |
| Implementation | Simple | Complex |
Where z = number of matches reported
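For contrast, here is a compact Aho-Corasick sketch (our own minimal illustration of the standard construction, not a production implementation):

```python
from collections import deque, defaultdict

def aho_corasick_search(text: str, patterns: list[str]):
    """Minimal Aho-Corasick sketch. Yields (pattern, start_position)."""
    # Build the trie: goto[state] maps a character to the next state
    goto = [{}]
    out = defaultdict(list)          # patterns ending at each state
    for p in patterns:
        state = 0
        for c in p:
            if c not in goto[state]:
                goto.append({})
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        out[state].append(p)

    # Breadth-first construction of failure links
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t].extend(out[fail[t]])   # inherit matches via failure link

    # Single pass over the text
    state = 0
    for i, c in enumerate(text):
        while state and c not in goto[state]:
            state = fail[state]
        state = goto[state].get(c, 0)
        for p in out[state]:
            yield p, i - len(p) + 1


# Example usage
for p, pos in aho_corasick_search("ushers", ["he", "she", "his", "hers"]):
    print(f"'{p}' at {pos}")
```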
When to choose Rabin-Karp:
- Patterns share one or a few distinct lengths
- Simplicity and a low memory footprint matter
- Input is not adversarial, so expected-case performance is acceptable

When to choose Aho-Corasick:
- Worst-case guarantees are required (e.g., untrusted or adversarial input)
- Pattern lengths vary widely
- The pattern set is static and reused across many searches, amortizing automaton construction
Practical guidance:
For most applications, either algorithm works well. Rabin-Karp's simplicity makes it a good default choice. Consider Aho-Corasick for specialized use cases like intrusion detection or antivirus where worst-case guarantees matter.
Some systems use Rabin-Karp as a fast filter and Aho-Corasick for detailed matching. The hash filter quickly eliminates most positions, then the automaton processes the remaining candidates precisely.
Let's build a complete, production-ready multi-pattern Rabin-Karp implementation that handles edge cases and provides useful features.
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Match:
    """Represents a pattern match in text."""
    pattern: str
    position: int
    length: int

    def __repr__(self):
        return f"Match('{self.pattern}' at {self.position})"


class MultiPatternMatcher:
    """
    Production-ready multi-pattern matching using Rabin-Karp.

    Features:
    - Supports patterns of varying lengths
    - Double hashing for collision resistance
    - Streaming-capable design
    - Match count and position tracking
    """

    def __init__(self, patterns: list[str],
                 base1: int = 31, mod1: int = 10**9 + 7,
                 base2: int = 37, mod2: int = 10**9 + 9):
        if not patterns:
            raise ValueError("At least one pattern required")

        self.patterns = list(set(patterns))  # Deduplicate
        self.base1, self.mod1 = base1, mod1
        self.base2, self.mod2 = base2, mod2

        # Group patterns by length
        self.by_length: dict[int, dict[tuple, list[str]]] = defaultdict(dict)
        for pattern in self.patterns:
            m = len(pattern)
            h1 = self._hash(pattern, base1, mod1)
            h2 = self._hash(pattern, base2, mod2)
            key = (h1, h2)
            if key not in self.by_length[m]:
                self.by_length[m][key] = []
            self.by_length[m][key].append(pattern)

        self.lengths = sorted(self.by_length.keys())

    def _hash(self, s: str, base: int, mod: int) -> int:
        """Compute polynomial hash of string."""
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    def search(self, text: str) -> list[Match]:
        """Find all pattern occurrences in text."""
        return list(self.search_iter(text))

    def search_iter(self, text: str) -> Iterator[Match]:
        """Iterate over pattern occurrences (memory-efficient)."""
        n = len(text)

        for m in self.lengths:
            if m > n:
                continue

            patterns_at_length = self.by_length[m]

            # Compute initial window hashes
            h1 = self._hash(text[:m], self.base1, self.mod1)
            h2 = self._hash(text[:m], self.base2, self.mod2)

            # Base powers for rolling
            bp1 = pow(self.base1, m - 1, self.mod1)
            bp2 = pow(self.base2, m - 1, self.mod2)

            for i in range(n - m + 1):
                key = (h1, h2)
                if key in patterns_at_length:
                    window = text[i:i+m]
                    for pattern in patterns_at_length[key]:
                        if window == pattern:
                            yield Match(pattern, i, m)

                # Roll hashes
                if i < n - m:
                    out = ord(text[i])
                    inp = ord(text[i + m])
                    h1 = ((h1 - out * bp1) * self.base1 + inp) % self.mod1
                    h2 = ((h2 - out * bp2) * self.base2 + inp) % self.mod2
                    if h1 < 0:
                        h1 += self.mod1
                    if h2 < 0:
                        h2 += self.mod2

    def contains_any(self, text: str) -> bool:
        """Check if text contains any pattern (early termination)."""
        for _ in self.search_iter(text):
            return True
        return False

    def count_matches(self, text: str) -> dict[str, int]:
        """Count occurrences of each pattern."""
        counts = {p: 0 for p in self.patterns}
        for match in self.search_iter(text):
            counts[match.pattern] += 1
        return counts


# Example usage
matcher = MultiPatternMatcher([
    "hello", "world", "hell", "algorithm"
])

text = """Hello, World! This is a test of the hello algorithm.
The world of algorithms is vast. Hello again!"""

print("All matches:")
for match in matcher.search(text.lower()):
    print(f"  {match}")

print(f"Contains any: {matcher.contains_any(text.lower())}")
print(f"Match counts: {matcher.count_matches(text.lower())}")
```

We've explored Rabin-Karp's killer feature in depth. Let's consolidate the key insights:
- A single scan of the text can check k patterns at once via an O(1) hash-set lookup per position
- Expected total time is O(n + M), versus O(kn + M) for k separate searches
- Variable-length patterns are handled by grouping patterns by length (or by prefix filtering)
- Double hashing makes spurious hits vanishingly rare in practice
- Aho-Corasick trades implementation complexity for worst-case guarantees
Module complete:
You've now mastered the Rabin-Karp algorithm comprehensively:
- The rolling hash technique that updates a window hash in O(1)
- Collision handling and double hashing
- Expected-case and worst-case complexity analysis
- Multi-pattern search, the algorithm's killer feature
Rabin-Karp is a beautiful example of how a simple idea (fingerprinting) combined with clever mathematics (rolling hash) produces a practical, powerful algorithm. Its multi-pattern capability makes it irreplaceable in many real-world systems.
Congratulations! You've completed the Rabin-Karp Algorithm module. You now understand the rolling hash technique, collision handling, complexity analysis, and—crucially—the multi-pattern searching advantage that makes Rabin-Karp the algorithm of choice for many real-world applications.