Throughout this module, we've seen how Rabin-Karp achieves O(n + m) expected time for single-pattern matching—comparable to KMP, though without the same worst-case guarantee. So why choose Rabin-Karp?
The answer lies in its killer feature: the ability to efficiently search for multiple patterns simultaneously. While KMP would require separate passes for each pattern, Rabin-Karp can check k patterns at each position with minimal overhead, achieving O(n + km) expected time instead of O(kn + km).
This capability makes Rabin-Karp the algorithm of choice for many real-world applications: plagiarism detection, malware signature scanning, multi-keyword search, and bioinformatics sequence analysis.
By the end of this page, you will: (1) understand why Rabin-Karp excels at multi-pattern matching, (2) implement efficient multi-pattern search, (3) analyze the complexity and tradeoffs, (4) explore real-world applications, and (5) compare with specialized algorithms like Aho-Corasick.
Let's formalize the multi-pattern matching problem and understand why naive approaches are inefficient.
Problem definition:
Given:
- A text T of length n
- A set of k patterns P₁, P₂, ..., Pₖ with total combined length M
Find: All occurrences of any pattern in the text.
Naive approach: Run single-pattern algorithm k times
for each pattern P in patterns:
find_all_occurrences(text, P) // O(n + m) per pattern
Total time: O(k × n + M) = O(kn + M)
For 1000 patterns and a 1MB text, this means ~1 billion operations. Completely impractical.
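For concreteness, here is the naive baseline as runnable code (a minimal sketch; the function name is ours for illustration):

```python
def naive_multi_pattern(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Run one separate pass over the text per pattern: k full text scans."""
    results = {}
    for p in patterns:
        # Each pattern re-reads the entire text from the beginning
        results[p] = [i for i in range(len(text) - len(p) + 1)
                      if text[i:i + len(p)] == p]
    return results
```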
The fundamental inefficiency:
Running separate searches means reading the text k times. If the text is large (e.g., a genome), this is extraordinarily wasteful.
Pattern 1: Scan entire text → O(n)
Pattern 2: Scan entire text → O(n)
...
Pattern k: Scan entire text → O(n)
─────────────────────────────────
Total: k × O(n) = O(kn)
The goal:
Can we scan the text just once and check all patterns at each position? If the per-position check costs O(1) instead of O(k), the total drops from O(kn) to roughly O(n) for scanning plus O(M) for preprocessing.
Rabin-Karp achieves exactly this through hash-based filtering.
| Approach | Time Complexity | Text Scans | Practical for Large k? |
|---|---|---|---|
| Naive (k separate runs) | O(kn + M) | k | No |
| Rabin-Karp multi-pattern | O(n + M) expected | 1 | Yes |
| Aho-Corasick | O(n + M + z) | 1 | Yes |
The text is often much larger than the combined pattern length. Reading the text once instead of k times can mean the difference between seconds and hours. Rabin-Karp's ability to check multiple patterns with a single hash lookup per position is what makes this possible.
Rabin-Karp's multi-pattern matching uses a simple but powerful insight: store all pattern hashes in a set, then check if each window's hash is in the set.
Algorithm structure:
Preprocessing:
- Compute the hash of each of the k patterns: O(m) per pattern
- Store all pattern hashes in a hash set: O(1) expected per insertion

Matching:
- Slide a window over the text, updating its rolling hash in O(1) per position
- Look up each window hash in the pattern hash set: O(1) expected
- On a hash hit, verify the window against the candidate patterns character by character
Why this works:
Pattern hashes: {H(P₁), H(P₂), ..., H(Pₖ)} // Stored in hash set
At each position i:
window_hash = H(text[i:i+m])
if window_hash in pattern_hashes: // O(1) lookup!
verify and report matches
The key is that checking if a hash exists in a set takes O(1) average time, regardless of how many patterns we have!
```python
def rabin_karp_multi_pattern(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Find all occurrences of multiple patterns in text.
    Assumes all patterns have the same length.

    Returns dict mapping each pattern to list of match positions.

    Time: O(n + km) expected where k = number of patterns
    Space: O(k) for pattern hash storage
    """
    if not patterns:
        return {}

    n = len(text)
    m = len(patterns[0])  # Assumes equal-length patterns

    # Validate all patterns have same length
    if not all(len(p) == m for p in patterns):
        raise ValueError("All patterns must have the same length")

    if m > n:
        return {p: [] for p in patterns}

    base = 31
    mod = 10**9 + 7

    # Preprocessing: compute all pattern hashes
    pattern_hashes = {}  # hash -> list of patterns with that hash
    for pattern in patterns:
        h = 0
        for c in pattern:
            h = (h * base + ord(c)) % mod
        # Multiple patterns might have same hash (collision)
        if h not in pattern_hashes:
            pattern_hashes[h] = []
        pattern_hashes[h].append(pattern)

    # Initialize result
    results = {p: [] for p in patterns}

    # Compute first window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Precompute base^(m-1)
    base_power = pow(base, m - 1, mod)

    # Scan text once
    for i in range(n - m + 1):
        # O(1) hash set lookup
        if window_hash in pattern_hashes:
            # Verify against all patterns with this hash
            window_text = text[i:i+m]
            for pattern in pattern_hashes[window_hash]:
                if window_text == pattern:
                    results[pattern].append(i)

        # Roll hash
        if i < n - m:
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
            if window_hash < 0:
                window_hash += mod

    return results


# Example usage
text = "ABABCABABCABABCAB"
patterns = ["ABABC", "BABCA", "ABCAB", "CABAB"]

results = rabin_karp_multi_pattern(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
    else:
        print(f"'{pattern}' not found")
```

The basic multi-pattern Rabin-Karp assumes all patterns have the same length. Real-world applications often require searching for patterns of varying lengths.
Challenge:
With different pattern lengths, we need different window sizes. Rolling hashes for different window sizes don't share computation.
Solutions:
Approach 1: Group by length
Group patterns by length and run separate searches for each group:
patterns_by_length = group_patterns(patterns)
for length, patterns_of_this_length in patterns_by_length:
run_rabin_karp(text, patterns_of_this_length, window_size=length)
Complexity: O(L × n + M), where L = number of distinct pattern lengths (one text scan per length)
This is efficient when there are few distinct lengths.
```python
from collections import defaultdict


def rabin_karp_variable_length(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Multi-pattern Rabin-Karp supporting variable-length patterns.
    Groups patterns by length and runs separate searches per group.

    Efficient when number of distinct lengths is small.
    """
    if not patterns:
        return {}

    n = len(text)
    base = 31
    mod = 10**9 + 7

    # Group patterns by length
    by_length = defaultdict(list)
    for pattern in patterns:
        by_length[len(pattern)].append(pattern)

    results = {p: [] for p in patterns}

    for m, patterns_of_length in by_length.items():
        if m > n:
            continue

        # Build hash set for this length
        pattern_hashes = defaultdict(list)
        for pattern in patterns_of_length:
            h = 0
            for c in pattern:
                h = (h * base + ord(c)) % mod
            pattern_hashes[h].append(pattern)

        # Compute first window hash
        window_hash = 0
        for i in range(m):
            window_hash = (window_hash * base + ord(text[i])) % mod

        base_power = pow(base, m - 1, mod)

        # Scan text for this length
        for i in range(n - m + 1):
            if window_hash in pattern_hashes:
                window_text = text[i:i+m]
                for pattern in pattern_hashes[window_hash]:
                    if window_text == pattern:
                        results[pattern].append(i)

            if i < n - m:
                outgoing = ord(text[i])
                incoming = ord(text[i + m])
                window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
                if window_hash < 0:
                    window_hash += mod

    return results


# Example with variable-length patterns
text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
patterns = ["THE", "QUICK", "BROWN FOX", "LAZY"]

results = rabin_karp_variable_length(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
```

Approach 2: Minimum-length filtering
Compute hash for the minimum pattern length. Use this as a preliminary filter, then verify longer patterns:
min_length = min(len(p) for p in patterns)
preliminary_hash = rolling_hash(text, min_length)
for each position where preliminary hash matches any pattern prefix:
check all patterns whose prefix matched
This is more complex but reduces text scans to one.
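Here is a minimal sketch of Approach 2 (the function name and structure are ours for illustration): each pattern is indexed by the hash of its first min-length characters, so a single rolling-hash scan filters candidates for every pattern at once.

```python
from collections import defaultdict

def rabin_karp_min_length_filter(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Single-scan multi-pattern search using the shortest pattern length
    as a preliminary hash filter. Longer candidates are verified in full."""
    if not patterns:
        return {}
    patterns = list(dict.fromkeys(patterns))  # deduplicate
    n = len(text)
    base, mod = 31, 10**9 + 7
    m_min = min(len(p) for p in patterns)
    if m_min > n:
        return {p: [] for p in patterns}

    # Index each pattern by the hash of its first m_min characters
    prefix_hashes = defaultdict(list)
    for p in patterns:
        h = 0
        for c in p[:m_min]:
            h = (h * base + ord(c)) % mod
        prefix_hashes[h].append(p)

    results = {p: [] for p in patterns}

    # One rolling-hash scan with window size m_min
    window_hash = 0
    for i in range(m_min):
        window_hash = (window_hash * base + ord(text[i])) % mod
    base_power = pow(base, m_min - 1, mod)

    for i in range(n - m_min + 1):
        if window_hash in prefix_hashes:
            # Prefix hash matched: verify each candidate pattern in full
            for p in prefix_hashes[window_hash]:
                if text[i:i + len(p)] == p:
                    results[p].append(i)
        if i < n - m_min:
            window_hash = ((window_hash - ord(text[i]) * base_power) * base
                           + ord(text[i + m_min])) % mod
    return results
```

The tradeoff: if many patterns share a common min-length prefix, verification work at hot positions grows, so this pays off mainly when prefixes are diverse.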
Approach 3: Use Aho-Corasick
For truly optimal variable-length multi-pattern matching, consider Aho-Corasick (briefly discussed later). It achieves O(n + M + z) where z = total match count.
In many applications, patterns cluster around a few lengths. For example, malware signatures might mostly be 20-30 bytes, keyword lists might be 5-15 characters. The 'group by length' approach handles these cases efficiently.
Let's rigorously analyze the complexity of multi-pattern Rabin-Karp.
Parameters:
- n = text length
- m = pattern length (equal-length case)
- k = number of patterns
- M = km = total combined pattern length
- q = hash modulus
Preprocessing:
for each of k patterns: // k iterations
compute hash: // O(m) each
add to hash set: // O(1) expected
Total preprocessing: O(km) = O(M)
Matching:
for each of n-m+1 positions: // O(n) iterations
hash lookup in set: // O(1) expected
if match: verify // O(?) - see below
roll hash: // O(1)
Base matching: O(n) (without considering verification)
Verification analysis:
Verification occurs when the window hash is found in the pattern hash set. This happens for true matches (the window really equals a pattern) and for spurious hits (a hash collision with some pattern's hash).
Expected spurious hits:
With k patterns and modulus q:
P(window hash matches any pattern hash) ≤ k/q
For n-m+1 windows:
E[spurious hits] ≤ (n-m+1) × k/q ≈ nk/q
With q = 10^9, n = 10^6, k = 100:
E[spurious hits] ≈ 10^6 × 100 / 10^9 = 0.1
Negligible spurious hits expected!
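This bound is plain arithmetic, so it is easy to evaluate for other parameter choices (the second parameter set below is a hypothetical larger workload):

```python
# E[spurious hits] ≈ n * k / q for n windows, k patterns, modulus q
for n, k, q in [(10**6, 100, 10**9 + 7), (10**9, 10**4, 10**9 + 7)]:
    print(f"n={n:.0e}, k={k}: expected spurious hits ≈ {n * k / q:.2f}")
```

At the larger scale a single modulus yields thousands of spurious hits, which is exactly when double hashing (used in the complete implementation later on this page) becomes worthwhile.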
Total expected time:
Preprocessing + Matching + Verification
= O(M) + O(n) + O(A × m)
= O(n + M + Am)
Where A = actual match count. For typical scenarios with few matches: O(n + M)
| Phase | Operation | Complexity |
|---|---|---|
| Preprocessing | Hash k patterns | O(M) = O(km) |
| Matching | Scan n positions, O(1) lookup each | O(n) |
| Verification | Verify A matches + spurious hits | O(Am) expected |
| Total | — | O(n + M) expected |
Naive approach: O(kn + M) — scans text k times. Rabin-Karp: O(n + M) — scans text once. For k = 1000 patterns and 1MB text, this is the difference between ~1 billion operations and ~1 million operations. A 1000× improvement!
Multi-pattern Rabin-Karp powers numerous real-world systems. Let's explore the key applications in depth.
Application 1: Plagiarism Detection
Problem: Given a document and a corpus of existing documents, find copied passages.
Approach: extract overlapping word n-grams from the target document, then use multi-pattern Rabin-Karp to search every corpus document for all of them at once:
# Pseudo-code for plagiarism detection
grams = extract_ngrams(target_document, n=5)
for corpus_doc in corpus:
matches = rabin_karp_multi(corpus_doc, grams)
if len(matches) / len(corpus_doc) > threshold:
flag_as_plagiarism(target_document, corpus_doc, matches)
Rabin-Karp makes this feasible for large corpora.
Application 2: Malware Signature Scanning
Problem: Scan files for known malware byte sequences (signatures).
Challenges:
- Signature databases contain thousands to millions of byte sequences
- Files can be large, and scanning must keep up with disk throughput
- Signatures vary in length

Solution: hash every signature once at startup, then scan each file a single time, checking the rolling hash of each window against the full signature set.
Application 3: Intrusion Detection Systems (IDS)
Network traffic is scanned for known attack patterns such as exploit strings and suspicious byte sequences.
Rabin-Karp enables line-rate scanning of network packets against thousands of attack signatures.
Application 4: DNA Sequence Analysis
Bioinformatics applications:
- Searching a genome for a set of known motifs or marker sequences
- Locating many probe or primer sequences in large DNA datasets in a single pass
```python
def plagiarism_detection(target: str, corpus: list[str], n: int = 5) -> dict:
    """
    Detect potential plagiarism by finding shared n-grams.

    Args:
        target: Document to check for plagiarism
        corpus: List of reference documents
        n: Size of word n-grams to use

    Returns:
        Dict with plagiarism analysis for each corpus document
    """
    # Extract word n-grams from target
    target_words = target.split()
    if len(target_words) < n:
        return {}

    # Create n-gram patterns
    ngrams = set()
    for i in range(len(target_words) - n + 1):
        ngram = ' '.join(target_words[i:i+n])
        ngrams.add(ngram)

    # For equal-length search, group by length
    # Here we simplify by assuming space-separated words
    results = {}
    for doc_id, doc in enumerate(corpus):
        doc_name = f"Document {doc_id}"

        # Find matches using Rabin-Karp
        matches = rabin_karp_variable_length(doc, list(ngrams))

        match_count = sum(len(positions) for positions in matches.values())
        match_ratio = match_count / max(1, len(doc.split()) - n + 1)

        results[doc_name] = {
            'match_count': match_count,
            'match_ratio': match_ratio,
            'plagiarism_flag': match_ratio > 0.1,  # 10% threshold
            'matched_ngrams': [ng for ng, pos in matches.items() if pos]
        }

    return results


# Example usage
target = "The quick brown fox jumps over the lazy dog"
corpus = [
    "The quick brown fox runs through the forest",
    "This is a completely different document about cats",
    "The quick brown fox jumps over the lazy sleeping dog"
]

analysis = plagiarism_detection(target, corpus, n=3)
for doc, info in analysis.items():
    print(f"{doc}: {info['match_count']} matches, " +
          f"ratio={info['match_ratio']:.2%}, " +
          f"flagged={info['plagiarism_flag']}")
```

Production plagiarism detectors (like Turnitin) and antivirus scanners use highly optimized variants of these algorithms, often combined with other techniques like locality-sensitive hashing for approximate matching and SIMD instructions for speed.
For production systems handling millions of patterns or gigabytes of text, additional optimizations become important.
Optimization 1: Bloom filter pre-screening
Instead of a hash set, use a Bloom filter for pattern hashes:
bloom = BloomFilter(size=1000000, hash_count=7)
for pattern in patterns:
bloom.add(hash(pattern))
# During matching
if bloom.might_contain(window_hash): # Fast!
if window_hash in exact_hash_set: # Rare
verify()
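Here is a minimal Bloom filter sketch to make the pre-screening concrete (class and method names are ours; a production system would use a tuned library implementation):

```python
import hashlib

class SimpleBloomFilter:
    """Compact set-membership filter: false positives possible,
    false negatives impossible."""

    def __init__(self, size: int = 1_000_000, hash_count: int = 7):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item: int):
        # Derive hash_count bit positions from slices of one SHA-256 digest
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.hash_count):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], 'big') % self.size

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Demo: pre-screen window hashes before the exact set lookup
bloom = SimpleBloomFilter()
exact_hash_set = {12345, 67890}            # hypothetical pattern hashes
for h in exact_hash_set:
    bloom.add(h)

for window_hash in (12345, 11111):
    if bloom.might_contain(window_hash):   # fast bit tests
        if window_hash in exact_hash_set:  # exact check, rarely reached
            print(f"hash {window_hash}: verify candidate patterns")
```

The win is cache behavior: the bit array is far smaller than a hash set of the same capacity, so the common case (no match) touches less memory.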
Optimization 2: SIMD-accelerated hashing
Modern CPUs can compute multiple hashes in parallel using SIMD instructions, for example by hashing several independent text chunks at once.
Optimization 3: Pattern hash precomputation
For static pattern sets (e.g., malware signatures), precompute the pattern hashes once and cache them, so each scan only loads them instead of rehashing every signature.
Optimization 4: Chunked processing for large texts
for chunk in text.chunks(CHUNK_SIZE):
matches_in_chunk = rabin_karp(chunk, patterns)
adjust_positions(matches_in_chunk, chunk.offset)
yield matches_in_chunk
Benefits (a sketch follows this list):
- Memory usage stays bounded regardless of text size
- Works on streams that don't fit in memory
- Independent chunks can be processed in parallel
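Here is a sketch of the chunked driver, reusing the equal-length rabin_karp_multi_pattern from earlier; read_chunk is a hypothetical callable that returns the next piece of text ('' at end of input). The key detail is keeping the last m-1 characters of each chunk so matches spanning a boundary aren't lost:

```python
def search_in_chunks(read_chunk, patterns: list[str], chunk_size: int = 1 << 20):
    """Stream a large text through rabin_karp_multi_pattern chunk by chunk,
    yielding (pattern, absolute_position) pairs."""
    m = len(patterns[0])   # equal-length patterns, as assumed earlier
    offset = 0             # absolute position of buffer[0] in the text
    buffer = ""
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        buffer += chunk
        for pattern, positions in rabin_karp_multi_pattern(buffer, patterns).items():
            for pos in positions:
                yield pattern, offset + pos   # adjust to absolute position
        # Keep the final m-1 characters: a match could straddle the boundary.
        # Complete matches never start inside this tail, so nothing repeats.
        keep = m - 1
        offset += len(buffer) - keep
        buffer = buffer[-keep:] if keep else ""


# Demo on an in-memory stream
import io
stream = io.StringIO("ABABC" * 1000)
hits = list(search_in_chunks(stream.read, ["ABABC"], chunk_size=64))
print(len(hits), hits[:3])   # 1000 matches at positions 0, 5, 10, ...
```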
Optimization 5: Early termination
For some applications, we don't need all matches: a content filter only needs to know whether any blocked phrase occurs, so the scan can stop at the first hit (see contains_any in the complete implementation below).
These optimizations add complexity. Always profile your specific workload first. For many applications, the basic implementation is fast enough. Only add optimizations when measurements show they're needed.
Rabin-Karp isn't the only multi-pattern algorithm. The Aho-Corasick algorithm is a specialized automaton-based approach that guarantees optimal time complexity.
Aho-Corasick overview:
- Builds a trie of all patterns, augmented with failure links, forming a finite automaton
- Scans the text exactly once, following automaton transitions; no rescanning on mismatch
- Reports all matches in guaranteed O(n + M + z) time
Complexity comparison:
| Aspect | Rabin-Karp Multi | Aho-Corasick |
|---|---|---|
| Preprocessing | O(M) | O(M × alphabet_size) |
| Matching | O(n + M) expected | O(n + z) guaranteed |
| Total | O(n + M) expected | O(n + M + z) |
| Space | O(k) pattern hashes | O(M) automaton states |
| Variable lengths | Requires grouping | Native support |
| Worst case | O(nM) | O(n + M + z) |
| Implementation | Simple | Complex |
Where z = number of matches reported
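For contrast, here is a compact Aho-Corasick sketch (our own minimal illustration of the standard construction, not a production implementation):

```python
from collections import deque, defaultdict

def aho_corasick_search(text: str, patterns: list[str]):
    """Minimal Aho-Corasick sketch. Yields (pattern, start_position)."""
    # Build the trie: goto[state] maps a character to the next state
    goto = [{}]
    out = defaultdict(list)          # patterns ending at each state
    for p in patterns:
        state = 0
        for c in p:
            if c not in goto[state]:
                goto.append({})
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        out[state].append(p)

    # Breadth-first construction of failure links
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t].extend(out[fail[t]])   # inherit matches via failure link

    # Single pass over the text
    state = 0
    for i, c in enumerate(text):
        while state and c not in goto[state]:
            state = fail[state]
        state = goto[state].get(c, 0)
        for p in out[state]:
            yield p, i - len(p) + 1


# Example usage
for p, pos in aho_corasick_search("ushers", ["he", "she", "his", "hers"]):
    print(f"'{p}' at {pos}")
```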
When to choose Rabin-Karp:
- Patterns share one or a few distinct lengths
- Simplicity and a low memory footprint matter
- Input is not adversarial, so expected-case performance is acceptable

When to choose Aho-Corasick:
- Worst-case guarantees are required (e.g., untrusted or adversarial input)
- Pattern lengths vary widely
- The pattern set is static and reused across many searches, amortizing automaton construction
Practical guidance:
For most applications, either algorithm works well. Rabin-Karp's simplicity makes it a good default choice. Consider Aho-Corasick for specialized use cases like intrusion detection or antivirus where worst-case guarantees matter.
Some systems use Rabin-Karp as a fast filter and Aho-Corasick for detailed matching. The hash filter quickly eliminates most positions, then the automaton processes the remaining candidates precisely.
Let's build a complete, production-ready multi-pattern Rabin-Karp implementation that handles edge cases and provides useful features.
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Match:
    """Represents a pattern match in text."""
    pattern: str
    position: int
    length: int

    def __repr__(self):
        return f"Match('{self.pattern}' at {self.position})"


class MultiPatternMatcher:
    """
    Production-ready multi-pattern matching using Rabin-Karp.

    Features:
    - Supports patterns of varying lengths
    - Double hashing for collision resistance
    - Streaming-capable design
    - Match count and position tracking
    """

    def __init__(self, patterns: list[str],
                 base1: int = 31, mod1: int = 10**9 + 7,
                 base2: int = 37, mod2: int = 10**9 + 9):
        if not patterns:
            raise ValueError("At least one pattern required")

        self.patterns = list(set(patterns))  # Deduplicate
        self.base1, self.mod1 = base1, mod1
        self.base2, self.mod2 = base2, mod2

        # Group patterns by length
        self.by_length: dict[int, dict[tuple, list[str]]] = defaultdict(dict)
        for pattern in self.patterns:
            m = len(pattern)
            h1 = self._hash(pattern, base1, mod1)
            h2 = self._hash(pattern, base2, mod2)
            key = (h1, h2)
            if key not in self.by_length[m]:
                self.by_length[m][key] = []
            self.by_length[m][key].append(pattern)

        self.lengths = sorted(self.by_length.keys())

    def _hash(self, s: str, base: int, mod: int) -> int:
        """Compute polynomial hash of string."""
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    def search(self, text: str) -> list[Match]:
        """Find all pattern occurrences in text."""
        return list(self.search_iter(text))

    def search_iter(self, text: str) -> Iterator[Match]:
        """Iterate over pattern occurrences (memory-efficient)."""
        n = len(text)

        for m in self.lengths:
            if m > n:
                continue

            patterns_at_length = self.by_length[m]

            # Compute initial window hashes
            h1 = self._hash(text[:m], self.base1, self.mod1)
            h2 = self._hash(text[:m], self.base2, self.mod2)

            # Base powers for rolling
            bp1 = pow(self.base1, m - 1, self.mod1)
            bp2 = pow(self.base2, m - 1, self.mod2)

            for i in range(n - m + 1):
                key = (h1, h2)
                if key in patterns_at_length:
                    window = text[i:i+m]
                    for pattern in patterns_at_length[key]:
                        if window == pattern:
                            yield Match(pattern, i, m)

                # Roll hashes
                if i < n - m:
                    out = ord(text[i])
                    inp = ord(text[i + m])
                    h1 = ((h1 - out * bp1) * self.base1 + inp) % self.mod1
                    h2 = ((h2 - out * bp2) * self.base2 + inp) % self.mod2
                    if h1 < 0:
                        h1 += self.mod1
                    if h2 < 0:
                        h2 += self.mod2

    def contains_any(self, text: str) -> bool:
        """Check if text contains any pattern (early termination)."""
        for _ in self.search_iter(text):
            return True
        return False

    def count_matches(self, text: str) -> dict[str, int]:
        """Count occurrences of each pattern."""
        counts = {p: 0 for p in self.patterns}
        for match in self.search_iter(text):
            counts[match.pattern] += 1
        return counts


# Example usage
matcher = MultiPatternMatcher([
    "hello", "world", "hell", "algorithm"
])

text = """Hello, World! This is a test of the hello algorithm.
The world of algorithms is vast. Hello again!"""

print("All matches:")
for match in matcher.search(text.lower()):
    print(f"  {match}")

print(f"Contains any: {matcher.contains_any(text.lower())}")
print(f"Match counts: {matcher.count_matches(text.lower())}")
```

We've explored Rabin-Karp's killer feature in depth. Let's consolidate the key insights:
- A single scan of the text can check k patterns at once via an O(1) hash-set lookup per position
- Expected total time is O(n + M), versus O(kn + M) for k separate searches
- Variable-length patterns are handled by grouping patterns by length (or by prefix filtering)
- Double hashing makes spurious hits vanishingly rare in practice
- Aho-Corasick trades implementation complexity for worst-case guarantees
Module complete:
You've now mastered the Rabin-Karp algorithm comprehensively:
- The rolling hash technique that updates a window hash in O(1)
- Collision handling and double hashing
- Expected-case and worst-case complexity analysis
- Multi-pattern search, the algorithm's killer feature
Rabin-Karp is a beautiful example of how a simple idea (fingerprinting) combined with clever mathematics (rolling hash) produces a practical, powerful algorithm. Its multi-pattern capability makes it irreplaceable in many real-world systems.
Congratulations! You've completed the Rabin-Karp Algorithm module. You now understand the rolling hash technique, collision handling, complexity analysis, and—crucially—the multi-pattern searching advantage that makes Rabin-Karp the algorithm of choice for many real-world applications.