Throughout this module, we've seen how Rabin-Karp achieves O(n + m) expected time for single-pattern matching—comparable to KMP, though without the same worst-case guarantee. So why choose Rabin-Karp?

The answer lies in its killer feature: the ability to efficiently search for multiple patterns simultaneously. While KMP would require separate passes for each pattern, Rabin-Karp can check k patterns at each position with minimal overhead, achieving O(n + km) expected time instead of O(kn + km).

This capability makes Rabin-Karp the algorithm of choice for many real-world applications: plagiarism detection, malware signature scanning, multi-keyword search, and bioinformatics sequence analysis.
By the end of this page, you will: (1) understand why Rabin-Karp excels at multi-pattern matching, (2) implement efficient multi-pattern search, (3) analyze the complexity and tradeoffs, (4) explore real-world applications, and (5) compare with specialized algorithms like Aho-Corasick.
Let's formalize the multi-pattern matching problem and understand why naive approaches are inefficient.

Problem definition:

Given:
- Text T of length n
- Set of k patterns P₁, P₂, ..., Pₖ with total length M = Σ|Pᵢ|

Find: All occurrences of any pattern in the text.

Naive approach: Run single-pattern algorithm k times

```
for each pattern P in patterns:
    find_all_occurrences(text, P)   // O(n + m) per pattern
```

Total time: O(k × n + M) = O(kn + M)

For 1000 patterns and a 1MB text, this means ~1 billion operations. Completely impractical.
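For concreteness, here is a minimal runnable version of this baseline; the function name naive_multi_search is just illustrative. It calls str.find once per pattern, so it re-scans the text k times:

```python
def naive_multi_search(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Naive baseline: one full pass over the text per pattern."""
    results: dict[str, list[int]] = {}
    for pattern in patterns:                         # k separate passes over the text
        positions = []
        start = text.find(pattern)
        while start != -1:
            positions.append(start)
            start = text.find(pattern, start + 1)    # step by 1 to allow overlapping matches
        results[pattern] = positions
    return results
```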
The fundamental inefficiency:

Running separate searches means reading the text k times. If the text is large (e.g., a genome), this is extraordinarily wasteful.

```
Pattern 1: Scan entire text → O(n)
Pattern 2: Scan entire text → O(n)
...
Pattern k: Scan entire text → O(n)
─────────────────────────────────
Total: k × O(n) = O(kn)
```

The goal:

Can we scan the text just once and check all patterns at each position? If the per-position check can be made O(1) instead of O(k), the total text-scanning cost drops to roughly O(n) instead of O(kn).

Rabin-Karp achieves exactly this through hash-based filtering.
| Approach | Time Complexity | Text Scans | Practical for Large k? |
|---|---|---|---|
| Naive (k separate runs) | O(kn + M) | k | No |
| Rabin-Karp multi-pattern | O(n + M) expected | 1 | Yes |
| Aho-Corasick | O(n + M + z) | 1 | Yes |
The text is often much larger than the combined pattern length. Reading the text once instead of k times can mean the difference between seconds and hours. Rabin-Karp's ability to check multiple patterns with a single hash lookup per position is what makes this possible.
Rabin-Karp's multi-pattern matching uses a simple but powerful insight: store all pattern hashes in a set, then check if each window's hash is in the set.

Algorithm structure:

1. Preprocessing:
   - Compute hash of each pattern Pᵢ
   - Store all hashes in a hash set (for O(1) lookup)
   - Note: assumes all patterns have the same length m

2. Matching:
   - Compute initial window hash
   - For each position:
     - Check if window hash is in the pattern hash set (O(1))
     - If yes: verify which pattern(s) match
     - Roll the hash to next position

Why this works:

```
Pattern hashes: {H(P₁), H(P₂), ..., H(Pₖ)}   // Stored in hash set

At each position i:
    window_hash = H(text[i:i+m])
    if window_hash in pattern_hashes:         // O(1) lookup!
        verify and report matches
```

The key is that checking if a hash exists in a set takes O(1) average time, regardless of how many patterns we have!
```python
def rabin_karp_multi_pattern(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Find all occurrences of multiple patterns in text.
    Assumes all patterns have the same length.

    Returns dict mapping each pattern to list of match positions.

    Time: O(n + km) expected where k = number of patterns
    Space: O(k) for pattern hash storage
    """
    if not patterns:
        return {}

    n = len(text)
    m = len(patterns[0])  # Assumes equal-length patterns

    # Validate all patterns have same length
    if not all(len(p) == m for p in patterns):
        raise ValueError("All patterns must have the same length")

    if m > n:
        return {p: [] for p in patterns}

    base = 31
    mod = 10**9 + 7

    # Preprocessing: compute all pattern hashes
    pattern_hashes = {}  # hash -> list of patterns with that hash
    for pattern in patterns:
        h = 0
        for c in pattern:
            h = (h * base + ord(c)) % mod
        # Multiple patterns might have same hash (collision)
        if h not in pattern_hashes:
            pattern_hashes[h] = []
        pattern_hashes[h].append(pattern)

    # Initialize result
    results = {p: [] for p in patterns}

    # Compute first window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Precompute base^(m-1)
    base_power = pow(base, m - 1, mod)

    # Scan text once
    for i in range(n - m + 1):
        # O(1) hash set lookup
        if window_hash in pattern_hashes:
            # Verify against all patterns with this hash
            window_text = text[i:i+m]
            for pattern in pattern_hashes[window_hash]:
                if window_text == pattern:
                    results[pattern].append(i)

        # Roll hash
        if i < n - m:
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
            if window_hash < 0:
                window_hash += mod

    return results


# Example usage
text = "ABABCABABCABABCAB"
patterns = ["ABABC", "BABCA", "ABCAB", "CABAB"]

results = rabin_karp_multi_pattern(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
    else:
        print(f"'{pattern}' not found")
```

The basic multi-pattern Rabin-Karp assumes all patterns have the same length. Real-world applications often require searching for patterns of varying lengths.

Challenge:

With different pattern lengths, we need different window sizes. Rolling hashes for different window sizes don't share computation.

Solutions:

Approach 1: Group by length

Group patterns by length and run separate searches for each group:

```
patterns_by_length = group_patterns(patterns)
for length, patterns_of_this_length in patterns_by_length:
    run_rabin_karp(text, patterns_of_this_length, window_size=length)
```

Complexity: O(L × n) where L = number of distinct lengths

This is efficient when there are few distinct lengths.
```python
from collections import defaultdict


def rabin_karp_variable_length(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """
    Multi-pattern Rabin-Karp supporting variable-length patterns.
    Groups patterns by length and runs separate searches per group.
    Efficient when number of distinct lengths is small.
    """
    if not patterns:
        return {}

    n = len(text)
    base = 31
    mod = 10**9 + 7

    # Group patterns by length
    by_length = defaultdict(list)
    for pattern in patterns:
        by_length[len(pattern)].append(pattern)

    results = {p: [] for p in patterns}

    for m, patterns_of_length in by_length.items():
        if m > n:
            continue

        # Build hash set for this length
        pattern_hashes = defaultdict(list)
        for pattern in patterns_of_length:
            h = 0
            for c in pattern:
                h = (h * base + ord(c)) % mod
            pattern_hashes[h].append(pattern)

        # Compute first window hash
        window_hash = 0
        for i in range(m):
            window_hash = (window_hash * base + ord(text[i])) % mod

        base_power = pow(base, m - 1, mod)

        # Scan text for this length
        for i in range(n - m + 1):
            if window_hash in pattern_hashes:
                window_text = text[i:i+m]
                for pattern in pattern_hashes[window_hash]:
                    if window_text == pattern:
                        results[pattern].append(i)

            if i < n - m:
                outgoing = ord(text[i])
                incoming = ord(text[i + m])
                window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
                if window_hash < 0:
                    window_hash += mod

    return results


# Example with variable-length patterns
text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
patterns = ["THE", "QUICK", "BROWN FOX", "LAZY"]

results = rabin_karp_variable_length(text, patterns)
for pattern, positions in results.items():
    if positions:
        print(f"'{pattern}' found at positions: {positions}")
```

Approach 2: Minimum-length filtering

Compute hash for the minimum pattern length. Use this as a preliminary filter, then verify longer patterns (a minimal sketch follows below):

```
min_length = min(len(p) for p in patterns)
preliminary_hash = rolling_hash(text, min_length)

for each position where preliminary hash matches any pattern prefix:
    check all patterns whose prefix matched
```

This is more complex but reduces text scans to one.

Approach 3: Use Aho-Corasick

For truly optimal variable-length multi-pattern matching, consider Aho-Corasick (briefly discussed later). It achieves O(n + M + z) where z = total match count.
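Here is a minimal sketch of Approach 2, using the same polynomial hash as the rest of this page; the function name prefix_filter_search is illustrative only, and it assumes a non-empty pattern list. Full patterns are verified by direct comparison after a prefix-hash hit:

```python
def prefix_filter_search(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """One text scan: hash windows of the shortest pattern length, use those
    prefix hashes as a filter, then verify full (possibly longer) patterns."""
    base, mod = 31, 10**9 + 7
    n = len(text)
    w = min(len(p) for p in patterns)          # filter window size

    # Map prefix hash -> patterns whose first w characters have that hash
    by_prefix: dict[int, list[str]] = {}
    for p in patterns:
        h = 0
        for c in p[:w]:
            h = (h * base + ord(c)) % mod
        by_prefix.setdefault(h, []).append(p)

    results = {p: [] for p in patterns}
    if w == 0 or w > n:
        return results

    # Rolling hash over windows of length w
    h = 0
    for c in text[:w]:
        h = (h * base + ord(c)) % mod
    bp = pow(base, w - 1, mod)

    for i in range(n - w + 1):
        for p in by_prefix.get(h, []):
            # Prefix hash matched: verify the full pattern by direct comparison
            if text[i:i + len(p)] == p:
                results[p].append(i)
        if i < n - w:
            h = ((h - ord(text[i]) * bp) * base + ord(text[i + w])) % mod
    return results
```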
In many applications, patterns cluster around a few lengths. For example, malware signatures might mostly be 20-30 bytes, keyword lists might be 5-15 characters. The 'group by length' approach handles these cases efficiently.
Let's rigorously analyze the complexity of multi-pattern Rabin-Karp.

Parameters:
- n = text length
- k = number of patterns
- m = pattern length (assuming equal lengths)
- M = total pattern length = k × m

Preprocessing:

```
for each of k patterns:       // k iterations
    compute hash:             // O(m) each
    add to hash set:          // O(1) expected
```

Total preprocessing: O(km) = O(M)

Matching:

```
for each of n-m+1 positions:  // O(n) iterations
    hash lookup in set:       // O(1) expected
    if match: verify          // O(?) - see below
    roll hash:                // O(1)
```

Base matching: O(n) (without considering verification)
Verification analysis:

Verification occurs when window hash is found in pattern hash set. This happens for:
1. Actual matches (we want these)
2. Hash collisions (spurious hits)

Expected spurious hits:

With k patterns and modulus q:

```
P(window hash matches any pattern hash) ≤ k/q
```

For n-m+1 windows:

```
E[spurious hits] ≤ (n-m+1) × k/q ≈ nk/q
```

With q = 10^9, n = 10^6, k = 100:

```
E[spurious hits] ≈ 10^6 × 100 / 10^9 = 0.1
```

Negligible spurious hits expected!

Total expected time:

```
Preprocessing + Matching + Verification
= O(M) + O(n) + O(A × m)
= O(n + M + Am)
```

Where A = actual match count. For typical scenarios with few matches: O(n + M)
| Phase | Operation | Complexity |
|---|---|---|
| Preprocessing | Hash k patterns | O(M) = O(km) |
| Matching | Scan n positions, O(1) lookup each | O(n) |
| Verification | Verify A matches + spurious hits | O(Am) expected |
| Total | — | O(n + M) expected |
Naive approach: O(kn + M) — scans text k times. Rabin-Karp: O(n + M) — scans text once. For k = 1000 patterns and 1MB text, this is the difference between ~1 billion operations and ~1 million operations. A 1000× improvement!
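The back-of-envelope figures above can be reproduced in a few lines of Python; the sizes below are illustrative assumptions (m = 20 in particular is arbitrary):

```python
# Illustrative sanity check of the operation counts and spurious-hit estimate
n, k, m, q = 10**6, 1_000, 20, 10**9 + 7   # assumed sizes
M = k * m                                  # total pattern length
naive_ops = k * n + M                      # O(kn + M): one text scan per pattern
multi_ops = n + M                          # O(n + M): single scan plus pattern hashing
spurious = (n - m + 1) * k / q             # E[spurious hits] ≈ nk/q
print(f"naive ≈ {naive_ops:.1e} ops, multi-pattern ≈ {multi_ops:.1e} ops")
print(f"expected spurious hits ≈ {spurious:.3f}")
```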
Multi-pattern Rabin-Karp powers numerous real-world systems. Let's explore the key applications in depth.

Application 1: Plagiarism Detection

Problem: Given a document and a corpus of existing documents, find copied passages.

Approach:
- Extract n-grams (e.g., 5-word phrases) from the target document
- Search for each n-gram in corpus documents
- High density of matches indicates plagiarism

```python
# Pseudo-code for plagiarism detection
grams = extract_ngrams(target_document, n=5)
for corpus_doc in corpus:
    matches = rabin_karp_multi(corpus_doc, grams)
    if len(matches) / len(corpus_doc) > threshold:
        flag_as_plagiarism(target_document, corpus_doc, matches)
```

Rabin-Karp makes this feasible for large corpora.
Application 2: Malware Signature Scanning

Problem: Scan files for known malware byte sequences (signatures).

Challenges:
- Thousands of signatures to check
- Files can be gigabytes in size
- Must be fast for real-time scanning

Solution (a byte-stream sketch follows after this list):
- Store all signature hashes in a set
- Stream file through rolling hash
- O(file_size) regardless of signature count

Application 3: Intrusion Detection Systems (IDS)

Network traffic is scanned for attack patterns:
- SQL injection patterns
- Cross-site scripting (XSS) markers
- Known exploit payloads

Rabin-Karp enables line-rate scanning of network packets against thousands of attack signatures.

Application 4: DNA Sequence Analysis

Bioinformatics applications:
- Find all occurrences of known gene sequences
- Identify transcription factor binding sites
- Search for primer sequences in genomes
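As a concrete illustration of the streaming idea from Application 2, here is a minimal sketch assuming all signatures share one fixed length; scan_stream and its parameters are illustrative names, and candidate offsets still need verification against the actual signature bytes:

```python
from collections import deque


def scan_stream(stream, signature_hashes: set[int], sig_len: int,
                base: int = 256, mod: int = (1 << 61) - 1):
    """Stream a file through a rolling hash; yield byte offsets whose window
    hash matches a known signature hash. Hits are candidates only."""
    bp = pow(base, sig_len - 1, mod)
    window = deque()              # last sig_len bytes seen
    h = 0                         # rolling hash of the current window
    pos = 0                       # number of bytes consumed so far
    while True:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break
        for b in chunk:
            if len(window) == sig_len:
                h = (h - window.popleft() * bp) % mod   # drop the oldest byte
            window.append(b)
            h = (h * base + b) % mod                    # add the new byte
            pos += 1
            if len(window) == sig_len and h in signature_hashes:
                yield pos - sig_len                     # candidate start offset


# Usage sketch: hash each signature's bytes with the same base/mod, then
#   with open(path, "rb") as f:
#       for offset in scan_stream(f, sig_hashes, sig_len): ...
```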
```python
def plagiarism_detection(target: str, corpus: list[str], n: int = 5) -> dict:
    """
    Detect potential plagiarism by finding shared n-grams.

    Args:
        target: Document to check for plagiarism
        corpus: List of reference documents
        n: Size of word n-grams to use

    Returns:
        Dict with plagiarism analysis for each corpus document
    """
    # Extract word n-grams from target
    target_words = target.split()
    if len(target_words) < n:
        return {}

    # Create n-gram patterns
    ngrams = set()
    for i in range(len(target_words) - n + 1):
        ngram = ' '.join(target_words[i:i+n])
        ngrams.add(ngram)

    # For equal-length search, group by length
    # Here we simplify by assuming space-separated words
    results = {}

    for doc_id, doc in enumerate(corpus):
        doc_name = f"Document {doc_id}"

        # Find matches using Rabin-Karp
        matches = rabin_karp_variable_length(doc, list(ngrams))

        match_count = sum(len(positions) for positions in matches.values())
        match_ratio = match_count / max(1, len(doc.split()) - n + 1)

        results[doc_name] = {
            'match_count': match_count,
            'match_ratio': match_ratio,
            'plagiarism_flag': match_ratio > 0.1,  # 10% threshold
            'matched_ngrams': [ng for ng, pos in matches.items() if pos]
        }

    return results


# Example usage
target = "The quick brown fox jumps over the lazy dog"
corpus = [
    "The quick brown fox runs through the forest",
    "This is a completely different document about cats",
    "The quick brown fox jumps over the lazy sleeping dog"
]

analysis = plagiarism_detection(target, corpus, n=3)
for doc, info in analysis.items():
    print(f"{doc}: {info['match_count']} matches, " +
          f"ratio={info['match_ratio']:.2%}, " +
          f"flagged={info['plagiarism_flag']}")
```

Production plagiarism detectors (like Turnitin) and antivirus scanners use highly optimized variants of these algorithms, often combined with other techniques like locality-sensitive hashing for approximate matching and SIMD instructions for speed.
For production systems handling millions of patterns or gigabytes of text, additional optimizations become important.

Optimization 1: Bloom filter pre-screening

Instead of a hash set, use a Bloom filter for pattern hashes (a small Python sketch appears below, after the SIMD note):
- Space: O(k) bits instead of O(k) hash entries
- Lookup: Still O(1), often faster due to cache
- Trade-off: Small false positive rate (acceptable with verification)

```
bloom = BloomFilter(size=1000000, hash_count=7)
for pattern in patterns:
    bloom.add(hash(pattern))

# During matching
if bloom.might_contain(window_hash):    # Fast!
    if window_hash in exact_hash_set:   # Rare
        verify()
```

Optimization 2: SIMD-accelerated hashing

Modern CPUs can compute multiple hashes in parallel using SIMD instructions:
- Process 4-8 positions simultaneously
- Vectorized modular arithmetic
- 4-8× speedup on compatible hardware
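Below is a minimal, self-contained Bloom filter sketch for pre-screening window hashes. The class is illustrative (not a library API), and the mixing constants are arbitrary choices:

```python
class BloomFilter:
    """Minimal Bloom filter over integer hashes (illustrative, not production-tuned)."""

    def __init__(self, size_bits: int = 1 << 20, hash_count: int = 7):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: int):
        # Derive hash_count bit positions via double hashing of two mixed values
        h1 = (item * 0x9E3779B1) % self.size
        h2 = ((item * 0x85EBCA77) % (self.size - 1)) | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size

    def add(self, item: int) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


# Pre-screen with the Bloom filter, fall back to the exact set only on a hit
bloom, exact = BloomFilter(), set()
for h in [123456789, 987654321]:        # stand-ins for precomputed pattern hashes
    bloom.add(h)
    exact.add(h)
assert bloom.might_contain(123456789) and 123456789 in exact
```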
Optimization 3: Pattern hash precomputation

For static pattern sets (e.g., malware signatures), precompute and cache:
- Pattern hashes
- Sorted hash array for binary search (if hash set overhead is too high)
- Perfect hash function (zero collisions within pattern set)

Optimization 4: Chunked processing for large texts (a concrete sketch appears below, after Optimization 5)

```
for chunk in text.chunks(CHUNK_SIZE):
    matches_in_chunk = rabin_karp(chunk, patterns)
    adjust_positions(matches_in_chunk, chunk.offset)
    yield matches_in_chunk
```

Benefits:
- Better cache utilization
- Enables parallel processing of chunks
- Reduces memory pressure for huge texts

Optimization 5: Early termination

For some applications, we don't need all matches:
- 'Find first' returns after single match
- 'Find up to N' stops after N matches
- 'Existence check' returns boolean on first match
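Here is a minimal sketch of Optimization 4, assuming fixed-length patterns and reusing rabin_karp_multi_pattern from earlier on this page; search_in_chunks and read_chunk are illustrative names (read_chunk could be, for example, an open file's read method):

```python
def search_in_chunks(read_chunk, patterns: list[str], chunk_size: int = 1 << 20):
    """Scan a large text chunk by chunk, keeping an (m-1)-character overlap so
    matches that straddle a chunk boundary are still found."""
    m = len(patterns[0])          # fixed pattern length (assumption)
    carry = ""                    # last m-1 characters of the previous block
    offset = 0                    # absolute position of carry[0] in the full text
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        block = carry + chunk
        for pattern, positions in rabin_karp_multi_pattern(block, patterns).items():
            for p in positions:
                yield pattern, offset + p          # translate to absolute position
        # A full match (length m) can never fit inside the m-1 overlap,
        # so no match is ever reported twice.
        carry = block[-(m - 1):] if m > 1 else ""
        offset += len(block) - len(carry)
```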
These optimizations add complexity. Always profile your specific workload first. For many applications, the basic implementation is fast enough. Only add optimizations when measurements show they're needed.
Rabin-Karp isn't the only multi-pattern algorithm. The Aho-Corasick algorithm is a specialized automaton-based approach that guarantees optimal time complexity.

Aho-Corasick overview:

- Builds a finite automaton (trie + failure links) from all patterns
- Processes text in a single pass, following automaton transitions
- Reports matches as automaton reaches accepting states

Complexity comparison:
| Aspect | Rabin-Karp Multi | Aho-Corasick |
|---|---|---|
| Preprocessing | O(M) | O(M × alphabet_size) |
| Matching | O(n + M) expected | O(n + z) guaranteed |
| Total | O(n + M) expected | O(n + M + z) |
| Space | O(k) pattern hashes | O(M) automaton states |
| Variable lengths | Requires grouping | Native support |
| Worst case | O(nM) | O(n + M + z) |
| Implementation | Simple | Complex |
Where z = number of matches reported

When to choose Rabin-Karp:

1. Implementation simplicity is valued — Rabin-Karp is straightforward to implement correctly
2. Patterns change frequently — No complex automaton to rebuild
3. Patterns are equal length — Eliminates main limitation
4. Expected case is sufficient — Worst case is astronomically rare

When to choose Aho-Corasick:

1. Guaranteed linear time is required — Security-critical, real-time systems
2. Variable-length patterns naturally — No grouping needed
3. Patterns are static — Automaton built once, used many times
4. Many overlapping matches — Aho-Corasick handles naturally

Practical guidance:

For most applications, either algorithm works well. Rabin-Karp's simplicity makes it a good default choice. Consider Aho-Corasick for specialized use cases like intrusion detection or antivirus where worst-case guarantees matter. (A compact sketch of the Aho-Corasick automaton follows below.)
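For readers who want to see what the automaton looks like, here is a compact, illustrative Aho-Corasick sketch. It uses dictionary-based goto transitions, so its preprocessing cost differs from the array-based O(M × alphabet_size) bound in the table; treat it as a teaching sketch, not a tuned implementation:

```python
from collections import deque


def aho_corasick_search(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start_position, pattern) pairs for all occurrences in text."""
    # 1. Build the trie: goto[state] maps a character to the next state
    goto: list[dict[str, int]] = [{}]
    output: list[set[str]] = [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(pat)

    # 2. Build failure links breadth-first (longest proper suffix present in the trie)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]   # inherit matches from the fallback state

    # 3. Single pass over the text, following transitions and failure links
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# e.g. aho_corasick_search("ushers", ["he", "she", "his", "hers"])
# finds 'she' at 1, 'he' at 2, and 'hers' at 2
```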
Some systems use Rabin-Karp as a fast filter and Aho-Corasick for detailed matching. The hash filter quickly eliminates most positions, then the automaton processes the remaining candidates precisely.
Let's build a complete, production-ready multi-pattern Rabin-Karp implementation that handles edge cases and provides useful features.
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Match:
    """Represents a pattern match in text."""
    pattern: str
    position: int
    length: int

    def __repr__(self):
        return f"Match('{self.pattern}' at {self.position})"


class MultiPatternMatcher:
    """
    Production-ready multi-pattern matching using Rabin-Karp.

    Features:
    - Supports patterns of varying lengths
    - Double hashing for collision resistance
    - Streaming-capable design
    - Match count and position tracking
    """

    def __init__(self, patterns: list[str],
                 base1: int = 31, mod1: int = 10**9 + 7,
                 base2: int = 37, mod2: int = 10**9 + 9):
        if not patterns:
            raise ValueError("At least one pattern required")

        self.patterns = list(set(patterns))  # Deduplicate
        self.base1, self.mod1 = base1, mod1
        self.base2, self.mod2 = base2, mod2

        # Group patterns by length
        self.by_length: dict[int, dict[tuple, list[str]]] = defaultdict(dict)
        for pattern in self.patterns:
            m = len(pattern)
            h1 = self._hash(pattern, base1, mod1)
            h2 = self._hash(pattern, base2, mod2)
            key = (h1, h2)
            if key not in self.by_length[m]:
                self.by_length[m][key] = []
            self.by_length[m][key].append(pattern)

        self.lengths = sorted(self.by_length.keys())

    def _hash(self, s: str, base: int, mod: int) -> int:
        """Compute polynomial hash of string."""
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    def search(self, text: str) -> list[Match]:
        """Find all pattern occurrences in text."""
        return list(self.search_iter(text))

    def search_iter(self, text: str) -> Iterator[Match]:
        """Iterate over pattern occurrences (memory-efficient)."""
        n = len(text)

        for m in self.lengths:
            if m > n:
                continue

            patterns_at_length = self.by_length[m]

            # Compute initial window hashes
            h1 = self._hash(text[:m], self.base1, self.mod1)
            h2 = self._hash(text[:m], self.base2, self.mod2)

            # Base powers for rolling
            bp1 = pow(self.base1, m - 1, self.mod1)
            bp2 = pow(self.base2, m - 1, self.mod2)

            for i in range(n - m + 1):
                key = (h1, h2)
                if key in patterns_at_length:
                    window = text[i:i+m]
                    for pattern in patterns_at_length[key]:
                        if window == pattern:
                            yield Match(pattern, i, m)

                # Roll hashes
                if i < n - m:
                    out = ord(text[i])
                    inp = ord(text[i + m])
                    h1 = ((h1 - out * bp1) * self.base1 + inp) % self.mod1
                    h2 = ((h2 - out * bp2) * self.base2 + inp) % self.mod2
                    if h1 < 0:
                        h1 += self.mod1
                    if h2 < 0:
                        h2 += self.mod2

    def contains_any(self, text: str) -> bool:
        """Check if text contains any pattern (early termination)."""
        for _ in self.search_iter(text):
            return True
        return False

    def count_matches(self, text: str) -> dict[str, int]:
        """Count occurrences of each pattern."""
        counts = {p: 0 for p in self.patterns}
        for match in self.search_iter(text):
            counts[match.pattern] += 1
        return counts


# Example usage
matcher = MultiPatternMatcher([
    "hello", "world", "hell", "algorithm"
])

text = """Hello, World! This is a test of the hello algorithm.
The world of algorithms is vast. Hello again!"""

print("All matches:")
for match in matcher.search(text.lower()):
    print(f"  {match}")

print(f"\nContains any: {matcher.contains_any(text.lower())}")
print(f"\nMatch counts: {matcher.count_matches(text.lower())}")
```

We've explored Rabin-Karp's killer feature in depth. Let's consolidate the key insights:
Module complete:

You've now mastered the Rabin-Karp algorithm comprehensively:

1. Hashing for pattern matching — The conceptual foundation
2. Rolling hash technique — O(1) hash updates
3. Collision handling — Verification and mitigation strategies
4. Complexity analysis — Expected O(n+m), worst O(nm)
5. Multi-pattern advantage — The killer feature

Rabin-Karp is a beautiful example of how a simple idea (fingerprinting) combined with clever mathematics (rolling hash) produces a practical, powerful algorithm. Its multi-pattern capability makes it a natural fit for many real-world systems.
Congratulations! You've completed the Rabin-Karp Algorithm module. You now understand the rolling hash technique, collision handling, complexity analysis, and—crucially—the multi-pattern searching advantage that makes Rabin-Karp the algorithm of choice for many real-world applications.