So far, we've focused on finding a single pattern in text. But many real-world applications require searching for multiple patterns simultaneously: virus signatures, dictionary words, intrusion-detection rules, and more.
The distinction between single pattern matching and multiple pattern matching is fundamental because it changes the optimal algorithmic approach entirely.
By the end of this page, you will understand:
1. The formal definition of single vs. multiple pattern matching
2. Complexity analysis for naive approaches to each
3. Why single-pattern algorithms don't scale for multiple patterns
4. Specialized multi-pattern algorithms and how they work
5. A decision framework for choosing the right approach
Single pattern matching is the classical formulation we've studied:
Given: a text T of length n and a single pattern P of length m, both over an alphabet Σ.
Find: every position i (0 ≤ i ≤ n − m) at which P occurs, i.e., T[i..i+m-1] = P.
Complexity landscape:
| Algorithm | Preprocessing | Matching | Total | Space |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(nm) | O(1) |
| KMP | O(m) | O(n) | O(n + m) | O(m) |
| Rabin-Karp | O(m) | O(n)* | O(n + m)* | O(1) |
| Boyer-Moore | O(m + \|Σ\|) | O(n/m)† best, O(nm) worst | O(nm) worst | O(m + \|Σ\|) |
*Average case; worst case O(nm) due to hash collisions
†Best case when pattern doesn't occur and alphabet is large
Key insight:
For single pattern matching, we can achieve O(n + m) time—linear in the total input size. This is optimal because we must at least read the inputs.
The efficiency principle:
Efficient single-pattern algorithms work by ensuring each text character is examined at most a constant number of times, regardless of the pattern. They achieve this through pattern preprocessing that enables "smart skipping" after mismatches.
When is single pattern matching sufficient? Whenever you search for one pattern at a time: a text editor's find, a substring check, a grep for a single keyword. The demo below shows how much the choice of algorithm matters even in that case.
```typescript
/**
 * Demonstration of single pattern matching complexity
 * Comparing naive vs KMP approaches
 */

interface MatchingStats {
  matches: number[];
  comparisons: number;
  timeMs: number;
}

// Naive O(nm) approach
function naiveMatch(T: string, P: string): MatchingStats {
  const start = performance.now();
  const n = T.length, m = P.length;
  const matches: number[] = [];
  let comparisons = 0;

  for (let i = 0; i <= n - m; i++) {
    let j = 0;
    while (j < m) {
      comparisons++;
      if (T[i + j] !== P[j]) break;
      j++;
    }
    if (j === m) matches.push(i);
  }

  return { matches, comparisons, timeMs: performance.now() - start };
}

// KMP O(n + m) approach
function kmpMatch(T: string, P: string): MatchingStats {
  const start = performance.now();
  const n = T.length, m = P.length;
  const matches: number[] = [];
  let comparisons = 0;

  // Build failure function
  const lps = new Array(m).fill(0);
  let len = 0, i = 1;
  while (i < m) {
    comparisons++;
    if (P[i] === P[len]) {
      lps[i++] = ++len;
    } else if (len > 0) {
      len = lps[len - 1];
    } else {
      lps[i++] = 0;
    }
  }

  // Match
  i = 0;
  let j = 0;
  while (i < n) {
    comparisons++;
    if (T[i] === P[j]) {
      i++;
      j++;
      if (j === m) {
        matches.push(i - m);
        j = lps[j - 1];
      }
    } else if (j > 0) {
      j = lps[j - 1];
    } else {
      i++;
    }
  }

  return { matches, comparisons, timeMs: performance.now() - start };
}

// Worst-case demonstration
function demonstrateComplexity(): void {
  console.log("=== Single Pattern Matching Complexity Demo ===\n");

  // Worst case: repetitive text and pattern
  const sizes = [100, 500, 1000, 2000];

  for (const n of sizes) {
    const T = "a".repeat(n);
    const P = "a".repeat(Math.floor(n / 10)) + "b"; // Almost matches everywhere

    const naive = naiveMatch(T, P);
    const kmp = kmpMatch(T, P);

    console.log(`n = ${n}:`);
    console.log(`  Naive: ${naive.comparisons} comparisons (${naive.timeMs.toFixed(2)}ms)`);
    console.log(`  KMP:   ${kmp.comparisons} comparisons (${kmp.timeMs.toFixed(2)}ms)`);
    console.log(`  Ratio: ${(naive.comparisons / kmp.comparisons).toFixed(1)}x\n`);
  }
}

demonstrateComplexity();
```

Multiple pattern matching extends the problem to searching for several patterns at once:
Given: a text T of length n and a set of k patterns P = {P₁, P₂, ..., Pₖ} with total length M = Σmᵢ.
Find: every pair (i, j) such that pattern Pⱼ occurs in T starting at position i.
Alternatively: for each pattern Pⱼ, report the set of all positions where it occurs in T.
The naive approach and its cost:
The simplest solution is to run a single-pattern algorithm k times:
for each pattern Pᵢ in P:
find all occurrences of Pᵢ in T
Using KMP for each pattern costs O(n + mᵢ) per search, for a total of O(kn + M) where M = Σmᵢ: the text is scanned k separate times. This becomes prohibitively expensive as k grows.
When k = 10,000 patterns (common in virus scanning), searching a 1MB file takes ~10 billion operations with naive multi-pattern matching. A virus scanner checking every downloaded file would be unusably slow. This scaling problem drove the development of specialized multi-pattern algorithms.
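As a quick sanity check on those numbers, here is the arithmetic in code (a rough model that counts only per-character steps and ignores constant factors):

```typescript
// Rough cost model: naive multi-pattern ≈ k·n character steps,
// single-pass (Aho-Corasick) ≈ n steps plus preprocessing.
const k = 10_000;    // signatures in the scanner's database
const n = 1_000_000; // a 1 MB file

console.log(`k separate scans: ~${(k * n).toLocaleString()} ops`); // ~10,000,000,000
console.log(`single-pass scan: ~${n.toLocaleString()} ops`);       // ~1,000,000
```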
Better approaches exist:
Specialized multi-pattern algorithms achieve much better complexity:
| Algorithm | Preprocessing | Matching | Notes |
|---|---|---|---|
| Naive (k × single) | O(M) | O(kn) | Runs k separate searches |
| Aho-Corasick | O(M) | O(n + z) | z = total occurrences |
| Commentz-Walter | O(M + k\|Σ\|) | Sublinear on average | Boyer-Moore-style skipping; strongest with long patterns |
| Rabin-Karp Multi | O(M) | O(n + z)* | Hash-based, probabilistic |
The key insight:
Aho-Corasick and similar algorithms achieve O(n + z) matching time—independent of k! They process the text only once, regardless of how many patterns we're searching for.
This is transformative: searching for 1 pattern or 1 million patterns takes essentially the same time (just more preprocessing for the patterns).
| Scenario | k (patterns) | n (text) | Naive k×KMP | Aho-Corasick |
|---|---|---|---|---|
| Simple search | 1 | 1,000 | ~1,000 ops | ~1,000 ops |
| Multi-word search | 10 | 10,000 | ~100,000 ops | ~10,000 ops |
| Virus scan | 10,000 | 1,000,000 | ~10,000,000,000 ops | ~1,000,000 ops |
| Large dictionary | 100,000 | 1,000,000 | ~100,000,000,000 ops | ~1,000,000 ops |
The Aho-Corasick algorithm (1975) is the definitive solution for multiple pattern matching. It extends KMP's failure function concept to multiple patterns using a trie-based automaton.
Core idea:
Build a trie of all k patterns, then add failure links that generalize KMP's failure function across patterns. The result is an automaton that consumes the text in a single left-to-right pass.
The magic:
Each character of text is processed exactly once. Failure links allow jumping between patterns when a mismatch occurs, without re-reading text.
Complexity:
- Preprocessing: O(M) to build the trie and failure links
- Matching: O(n + z), where z is the number of reported matches
- Space: O(M) with failure links, or O(M · |Σ|) if transitions are fully materialized
```typescript
/**
 * Conceptual illustration of Aho-Corasick
 * This is a simplified educational implementation
 */

interface TrieNode {
  children: Map<string, TrieNode>;
  fail: TrieNode | null; // Failure link
  output: string[];      // Patterns ending here
}

function createNode(): TrieNode {
  return {
    children: new Map(),
    fail: null,
    output: [],
  };
}

/**
 * Build Aho-Corasick automaton from patterns
 */
function buildAhoCorasick(patterns: string[]): TrieNode {
  const root = createNode();
  root.fail = root; // Root's failure points to itself

  // Phase 1: Build trie
  for (const pattern of patterns) {
    let node = root;
    for (const char of pattern) {
      if (!node.children.has(char)) {
        node.children.set(char, createNode());
      }
      node = node.children.get(char)!;
    }
    node.output.push(pattern); // Mark pattern end
  }

  // Phase 2: Build failure links using BFS
  const queue: TrieNode[] = [];

  // Depth-1 nodes fail to root
  for (const child of root.children.values()) {
    child.fail = root;
    queue.push(child);
  }

  while (queue.length > 0) {
    const current = queue.shift()!;

    for (const [char, child] of current.children) {
      // Find failure for child
      let fail = current.fail!;
      while (fail !== root && !fail.children.has(char)) {
        fail = fail.fail!;
      }
      child.fail =
        fail.children.has(char) && fail.children.get(char) !== child
          ? fail.children.get(char)!
          : root;

      // Merge outputs from failure chain
      child.output = [...child.output, ...child.fail.output];

      queue.push(child);
    }
  }

  return root;
}

/**
 * Search text using Aho-Corasick automaton
 */
function ahoCorasickSearch(
  text: string,
  root: TrieNode
): Array<{ position: number; pattern: string }> {
  const results: Array<{ position: number; pattern: string }> = [];
  let node = root;

  for (let i = 0; i < text.length; i++) {
    const char = text[i];

    // Follow failure links until match or root
    while (node !== root && !node.children.has(char)) {
      node = node.fail!;
    }
    node = node.children.get(char) || root;

    // Report all patterns ending here
    for (const pattern of node.output) {
      results.push({
        position: i - pattern.length + 1,
        pattern,
      });
    }
  }

  return results;
}

// Demonstration
const patterns = ["he", "she", "his", "hers"];
const text = "ushers";

console.log("Patterns:", patterns);
console.log(`Text: "${text}"\n`);

const automaton = buildAhoCorasick(patterns);
const matches = ahoCorasickSearch(text, automaton);

console.log("Matches found:");
for (const match of matches) {
  console.log(`  "${match.pattern}" at position ${match.position}`);
}

// Output:
// "she" at position 1
// "he" at position 2
// "hers" at position 2
```

Aho-Corasick is used in real virus scanners, intrusion detection systems, and plagiarism detectors. The ClamAV virus scanner and the Snort IDS both use variants of Aho-Corasick. It is one of the most practically important algorithms in computer science.
Let's formalize the distinction between these two problem classes:
Single Pattern Matching (SPM):
Input: Text T ∈ Σ*, Pattern P ∈ Σ*
|T| = n, |P| = m
Output: Set S = {i : 0 ≤ i ≤ n-m and T[i..i+m-1] = P}
Optimal: O(n + m) time, O(m) space
Multiple Pattern Matching (MPM):
Input: Text T ∈ Σ*, Pattern set P = {P₁, P₂, ..., Pₖ} ⊂ Σ*
|T| = n, |Pᵢ| = mᵢ, M = Σmᵢ
Output: Set S = {(i, j) : 0 ≤ i ≤ n-mⱼ and T[i..i+mⱼ-1] = Pⱼ}
Optimal: O(n + M + z) time, O(M) space
where z = |S| (number of matches)
Key differences:
| Aspect | SPM | MPM |
|---|---|---|
| Output | Positions only | (Position, Pattern) pairs |
| Lower bound | Ω(n + m) | Ω(n + M + z) |
| Text passes | ≥1 | Exactly 1 (optimal) |
| Pattern dependency | Preprocessing O(m) | Preprocessing O(M)* |
| Streaming | Possible with O(m) memory | Possible with O(M) memory |
*Some algorithms achieve O(M / word_size) with bit-parallelism
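To illustrate that footnote, here is a minimal bit-parallel (Shift-And) matcher for a single pattern, assuming the pattern fits in one machine word (m ≤ 30 here, since JavaScript bitwise operations are 32-bit):

```typescript
// Shift-And: bit j of `state` is set iff pattern[0..j] matches the text
// suffix ending at the current character.
function shiftAndSearch(text: string, pattern: string): number[] {
  const m = pattern.length;
  if (m === 0 || m > 30) throw new Error("pattern length must be 1..30");

  // mask[c] has bit j set iff pattern[j] === c
  const mask = new Map<string, number>();
  for (let j = 0; j < m; j++) {
    mask.set(pattern[j], (mask.get(pattern[j]) ?? 0) | (1 << j));
  }

  const accept = 1 << (m - 1);
  const matches: number[] = [];
  let state = 0;

  for (let i = 0; i < text.length; i++) {
    state = ((state << 1) | 1) & (mask.get(text[i]) ?? 0);
    if (state & accept) matches.push(i - m + 1);
  }
  return matches;
}

console.log(shiftAndSearch("abcabc", "abc")); // [0, 3]
```

Each text character updates the entire match state with a couple of bitwise operations; packing pattern positions into word bits is exactly where the O(M / word_size) factor comes from.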
Relationship between SPM and MPM:
SPM is a special case of MPM where k = 1. Any MPM algorithm can solve SPM by using a single-element pattern set.
However, SPM algorithms can be simpler and have better constant factors for k = 1. The overhead of Aho-Corasick's trie structure is unnecessary when searching for a single pattern.
Hybrid approaches:
In practice, systems often use hybrid strategies:
- Use a single-pattern algorithm for k = 1 and switch to Aho-Corasick above a small threshold
- Prescreen with a cheap hash or Bloom filter, then verify candidates with exact matching
- Group patterns by length or prefix and dispatch each group to the best-suited algorithm
These hybrids can outperform pure approaches in specific scenarios.
In practice, running k separate KMP searches is competitive with Aho-Corasick for small k (typically k < 5-10). Beyond that threshold, Aho-Corasick's single-pass advantage dominates. The exact crossover depends on pattern lengths, text length, and implementation details.
While Aho-Corasick is the classic solution, other approaches exist with different trade-offs:
1. Rabin-Karp for Multiple Patterns
Extend Rabin-Karp to multiple patterns by storing all pattern hashes in a hash set:
Patterns: {P₁, P₂, ..., Pₖ} (all same length m)
Preprocess: Store hash(Pᵢ) for all i in hash set
Match: For each window of size m:
if hash(window) in hash_set:
verify each matching pattern
Limitation: Works best when all patterns have same length. For variable lengths, run separately for each length group.
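One workaround, sketched below, groups the patterns by length and runs the same-length search once per group. It calls the multiPatternRabinKarp function implemented later on this page, so it is a sketch rather than standalone code; total cost grows with the number of distinct pattern lengths:

```typescript
// Handle variable-length patterns: one rolling-hash pass per distinct length.
function multiLengthRabinKarp(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  const byLength = new Map<number, string[]>();
  for (const p of patterns) {
    if (!byLength.has(p.length)) byLength.set(p.length, []);
    byLength.get(p.length)!.push(p);
  }

  const results: Array<{ position: number; pattern: string }> = [];
  for (const group of byLength.values()) {
    results.push(...multiPatternRabinKarp(text, group)); // one pass per length
  }
  return results;
}
```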
2. Commentz-Walter Algorithm
Extends Boyer-Moore to multiple patterns: it builds a trie of the reversed patterns and scans each window right to left, using bad-character shift tables to skip ahead by up to the shortest pattern's length.
Better than Aho-Corasick when patterns are long and occur sparsely in the text.
3. Wu-Manber Algorithm
Designed for very large pattern sets: instead of shifting on single characters, it computes shifts from blocks of 2-3 characters, which keeps shifts long even when thousands of patterns share characters; hash tables over pattern blocks verify candidate positions. A simplified sketch of the shift table follows.
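Below is a simplified educational sketch of the SHIFT-table idea, assuming a block size of B = 2 and verifying candidate windows by direct comparison (a full Wu-Manber implementation uses HASH and PREFIX tables to narrow the candidates instead):

```typescript
// Simplified Wu-Manber sketch. Assumes every pattern has length >= B.
const B = 2;

function buildShiftTable(patterns: string[]): { shift: Map<string, number>; mMin: number } {
  const mMin = Math.min(...patterns.map(p => p.length));
  const defaultShift = mMin - B + 1;
  const shift = new Map<string, number>();
  for (const p of patterns) {
    // Only the first mMin characters of each pattern participate
    for (let q = B; q <= mMin; q++) {
      const block = p.slice(q - B, q);
      shift.set(block, Math.min(shift.get(block) ?? defaultShift, mMin - q));
    }
  }
  return { shift, mMin };
}

function wuManberSearch(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  const { shift, mMin } = buildShiftTable(patterns);
  const defaultShift = mMin - B + 1;
  const results: Array<{ position: number; pattern: string }> = [];

  let i = mMin - 1; // index of the current window's last character
  while (i < text.length) {
    const block = text.slice(i - B + 1, i + 1);
    const s = shift.get(block) ?? defaultShift;
    if (s > 0) {
      i += s; // the block rules out any pattern ending here: skip ahead
      continue;
    }
    // Shift of 0: some pattern may start at the window start; verify
    const start = i - mMin + 1;
    for (const p of patterns) {
      if (text.startsWith(p, start)) {
        results.push({ position: start, pattern: p });
      }
    }
    i += 1;
  }
  return results;
}

// Demo
console.log(wuManberSearch("abcdefxyz", ["cde", "xyz", "def"]));
// [ {position: 2, "cde"}, {position: 3, "def"}, {position: 6, "xyz"} ]
```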
| Algorithm | Best For | Avoid When |
|---|---|---|
| k × KMP | k < 5 patterns | Large pattern sets |
| Aho-Corasick | Many patterns, streaming | Very long patterns with few matches |
| Rabin-Karp Multi | Same-length patterns | Many different lengths |
| Commentz-Walter | Long patterns, sparse matches | Short patterns, dense matches |
| Wu-Manber | 10,000+ patterns | Very short patterns (< 3 chars) |
```typescript
/**
 * Multi-pattern Rabin-Karp for same-length patterns
 * Demonstrates hash-based multi-pattern matching
 */

const BASE = 256;
const MOD = 101; // Prime modulus

function computeHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * BASE + s.charCodeAt(i)) % MOD;
  }
  return h;
}

function multiPatternRabinKarp(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  if (patterns.length === 0) return [];

  // All patterns must be same length for this version
  const m = patterns[0].length;
  if (patterns.some(p => p.length !== m)) {
    throw new Error("All patterns must have same length");
  }

  const results: Array<{ position: number; pattern: string }> = [];

  // Build hash table of patterns
  const patternHashes = new Map<number, string[]>();
  for (const pattern of patterns) {
    const h = computeHash(pattern);
    if (!patternHashes.has(h)) {
      patternHashes.set(h, []);
    }
    patternHashes.get(h)!.push(pattern); // Handle collisions
  }

  if (m > text.length) return results;

  // Precompute BASE^(m-1) % MOD
  let highOrder = 1;
  for (let i = 0; i < m - 1; i++) {
    highOrder = (highOrder * BASE) % MOD;
  }

  // Compute initial window hash
  let textHash = computeHash(text.slice(0, m));

  // Check each window
  for (let i = 0; i <= text.length - m; i++) {
    // Check if hash matches any pattern
    if (patternHashes.has(textHash)) {
      const window = text.slice(i, i + m);
      for (const pattern of patternHashes.get(textHash)!) {
        if (window === pattern) { // Verify match
          results.push({ position: i, pattern });
        }
      }
    }

    // Roll the hash to next window
    if (i < text.length - m) {
      textHash =
        ((textHash - text.charCodeAt(i) * highOrder) * BASE +
          text.charCodeAt(i + m)) %
        MOD;
      if (textHash < 0) textHash += MOD;
    }
  }

  return results;
}

// Demo
const patterns = ["abc", "bcd", "xyz", "cde"];
const text = "abcdefxyzabcop";

console.log("Patterns:", patterns);
console.log(`Text: "${text}"\n`);

const matches = multiPatternRabinKarp(text, patterns);
console.log("Matches found:");
for (const match of matches) {
  console.log(`  "${match.pattern}" at position ${match.position}`);
}
```

Given a pattern matching task, how do you choose between single-pattern and multi-pattern approaches?
Decision tree:
```
How many patterns are you searching for?
│
├─► Single pattern (k = 1)
│     │
│     └─► Use: KMP, Boyer-Moore, or Rabin-Karp
│         Factors: Pattern length, alphabet size, expected matches
│
└─► Multiple patterns (k > 1)
      │
      ├─► k < 5 and patterns are long?
      │     └─► Consider: k × Boyer-Moore (may beat Aho-Corasick)
      │
      ├─► Patterns mostly same length?
      │     └─► Consider: Multi-pattern Rabin-Karp
      │
      ├─► Need streaming (process text once)?
      │     └─► Use: Aho-Corasick
      │
      ├─► Pattern set is very large (10,000+)?
      │     └─► Use: Wu-Manber or Aho-Corasick with optimizations
      │
      └─► Default choice:
            └─► Use: Aho-Corasick (robust, predictable)
```
When uncertain, start with the simplest solution that meets your requirements. For a single pattern, your language's built-in string search (indexOf, includes, find) often uses optimized C implementations that outperform hand-written algorithms. For multiple patterns, Aho-Corasick is the robust choice.
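For example, collecting every occurrence of a single pattern with indexOf takes only a few lines:

```typescript
// All occurrences of one pattern via the built-in indexOf
function findAll(text: string, pattern: string): number[] {
  const positions: number[] = [];
  let i = text.indexOf(pattern);
  while (i !== -1) {
    positions.push(i);
    i = text.indexOf(pattern, i + 1); // i + 1 also finds overlapping matches
  }
  return positions;
}

console.log(findAll("ababa", "aba")); // [0, 2]
```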
Beyond algorithmic complexity, real-world implementations must consider several practical factors:
Memory access patterns:
Aho-Corasick's trie can have poor cache locality when the automaton is large. Array-based representations with computed indices often outperform pointer-based tries in practice.
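A minimal sketch of such a layout, with hypothetical names and an arbitrary state capacity chosen for illustration:

```typescript
// Flat typed array of goto transitions instead of per-node Maps.
// For state s and input byte c: next[s * ALPHABET + c]
// (-1 means "no transition: follow the failure link").
const ALPHABET = 256;       // byte-level alphabet
const MAX_STATES = 1 << 12; // capacity chosen for this sketch

const next = new Int32Array(MAX_STATES * ALPHABET).fill(-1);
const fail = new Int32Array(MAX_STATES); // failure links; state 0 is the root

function step(state: number, byte: number): number {
  // Follow failure links until a goto transition exists or we hit the root
  while (state !== 0 && next[state * ALPHABET + byte] === -1) {
    state = fail[state];
  }
  const t = next[state * ALPHABET + byte];
  return t === -1 ? 0 : t;
}
```

The whole automaton lives in two contiguous arrays, so a text scan touches predictable memory addresses instead of chasing pointers.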
SIMD and parallelization:
Modern CPUs support SIMD instructions that can compare multiple characters simultaneously. Libraries like Hyperscan (Intel) use these for massive speedups.
Alphabet handling:
For Unicode text, the naive approach of 143,000+ children per trie node is impractical. Solutions include:
- Per-node hash maps instead of dense child arrays, trading speed for space
- Building the automaton over UTF-8 bytes, capping fan-out at 256 (sketched below)
- Alphabet compression: remapping to only the characters that actually occur in the patterns
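The byte-level idea in one snippet, using the standard TextEncoder (sample patterns are hypothetical):

```typescript
// Encode patterns as UTF-8 before building the trie, so each node needs
// at most 256 children regardless of the source alphabet.
const encoder = new TextEncoder();
const unicodePatterns = ["héllo", "世界"];
const bytePatterns: Uint8Array[] = unicodePatterns.map(p => encoder.encode(p));
console.log(bytePatterns.map(b => b.length)); // [6, 6] — bytes, not characters
```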
Dynamic pattern sets:
Classic Aho-Corasick assumes a fixed pattern set. If patterns are added or removed frequently:
- Rebuild the automaton periodically in the background and swap it in atomically
- Keep a small secondary automaton for recent additions and query both
- Treat deletions as tombstones, filtering their matches out until the next rebuild
| Library | Language | Algorithm | Notable Features |
|---|---|---|---|
| Intel Hyperscan | C/C++ | Hybrid (Aho-Corasick + SIMD) | SIMD optimization, regex support, streaming |
| Aho-Corasick (Python) | Python | Aho-Corasick | Pure Python, easy to use, reasonable speed |
| Flashtext | Python | Aho-Corasick variant | Optimized for word replacement |
| rust-aho-corasick | Rust | Aho-Corasick | High performance, streaming, match kinds |
| pyahocorasick | Python (C ext) | Aho-Corasick | C-based for speed, Python interface |
For production use, prefer well-tested libraries over custom implementations. Aho-Corasick has many edge cases and optimization opportunities that mature libraries handle correctly. Write your own only for learning or when you have very specific requirements not met by existing libraries.
You now understand the fundamental distinction between single and multiple pattern matching, why it matters, and how to choose between them. This knowledge is essential for designing efficient text processing systems. Next, we'll explore the incredible range of real-world applications that depend on these pattern matching techniques.