So far, we've focused on finding a single pattern in text. But many real-world applications require searching for multiple patterns simultaneously: virus signatures, dictionary words, intrusion-detection rules, and more.
The distinction between single pattern matching and multiple pattern matching is fundamental because it changes the optimal algorithmic approach entirely.
By the end of this page, you will understand:
1. The formal definition of single vs. multiple pattern matching
2. Complexity analysis for naive approaches to each
3. Why single-pattern algorithms don't scale for multiple patterns
4. Specialized multi-pattern algorithms and how they work
5. A decision framework for choosing the right approach
Single pattern matching is the classical formulation we've studied:
Given: a text T of length n and a single pattern P of length m, both over an alphabet Σ.
Find: every position i (0 ≤ i ≤ n − m) at which P occurs, i.e., T[i..i+m-1] = P.
Complexity landscape:
| Algorithm | Preprocessing | Matching | Total | Space |
|---|---|---|---|---|
| Naive | O(1) | O(nm) | O(nm) | O(1) |
| KMP | O(m) | O(n) | O(n + m) | O(m) |
| Rabin-Karp | O(m) | O(n)* | O(n + m)* | O(1) |
| Boyer-Moore | O(m + \|Σ\|) | O(n/m)† best, O(nm) worst | O(nm) worst | O(m + \|Σ\|) |
*Average case; worst case O(nm) due to hash collisions
†Best case when pattern doesn't occur and alphabet is large
Key insight:
For single pattern matching, we can achieve O(n + m) time—linear in the total input size. This is optimal because we must at least read the inputs.
The efficiency principle:
Efficient single-pattern algorithms work by ensuring each text character is examined at most a constant number of times, regardless of the pattern. They achieve this through pattern preprocessing that enables "smart skipping" after mismatches.
When is single pattern matching sufficient? Whenever you search for one pattern at a time: a text editor's find, a substring check, a grep for a single keyword. The demo below shows how much the choice of algorithm matters even in that case.
```typescript
/**
 * Demonstration of single pattern matching complexity
 * Comparing naive vs KMP approaches
 */

interface MatchingStats {
  matches: number[];
  comparisons: number;
  timeMs: number;
}

// Naive O(nm) approach
function naiveMatch(T: string, P: string): MatchingStats {
  const start = performance.now();
  const n = T.length, m = P.length;
  const matches: number[] = [];
  let comparisons = 0;

  for (let i = 0; i <= n - m; i++) {
    let j = 0;
    while (j < m) {
      comparisons++;
      if (T[i + j] !== P[j]) break;
      j++;
    }
    if (j === m) matches.push(i);
  }

  return { matches, comparisons, timeMs: performance.now() - start };
}

// KMP O(n + m) approach
function kmpMatch(T: string, P: string): MatchingStats {
  const start = performance.now();
  const n = T.length, m = P.length;
  const matches: number[] = [];
  let comparisons = 0;

  // Build failure function
  const lps = new Array(m).fill(0);
  let len = 0, i = 1;
  while (i < m) {
    comparisons++;
    if (P[i] === P[len]) {
      lps[i++] = ++len;
    } else if (len > 0) {
      len = lps[len - 1];
    } else {
      lps[i++] = 0;
    }
  }

  // Match
  i = 0;
  let j = 0;
  while (i < n) {
    comparisons++;
    if (T[i] === P[j]) {
      i++;
      j++;
      if (j === m) {
        matches.push(i - m);
        j = lps[j - 1];
      }
    } else if (j > 0) {
      j = lps[j - 1];
    } else {
      i++;
    }
  }

  return { matches, comparisons, timeMs: performance.now() - start };
}

// Worst-case demonstration
function demonstrateComplexity(): void {
  console.log("=== Single Pattern Matching Complexity Demo ===\n");

  // Worst case: repetitive text and pattern
  const sizes = [100, 500, 1000, 2000];

  for (const n of sizes) {
    const T = "a".repeat(n);
    const P = "a".repeat(Math.floor(n / 10)) + "b"; // Almost matches everywhere

    const naive = naiveMatch(T, P);
    const kmp = kmpMatch(T, P);

    console.log(`n = ${n}:`);
    console.log(`  Naive: ${naive.comparisons} comparisons (${naive.timeMs.toFixed(2)}ms)`);
    console.log(`  KMP:   ${kmp.comparisons} comparisons (${kmp.timeMs.toFixed(2)}ms)`);
    console.log(`  Ratio: ${(naive.comparisons / kmp.comparisons).toFixed(1)}x\n`);
  }
}

demonstrateComplexity();
```

Multiple pattern matching extends the problem to searching for several patterns at once:
Given: a text T of length n and a set of k patterns P = {P₁, P₂, ..., Pₖ} with total length M = Σmᵢ.
Find: every pair (i, j) such that pattern Pⱼ occurs in T starting at position i.
Alternatively: for each pattern Pⱼ, report the set of all positions where it occurs in T.
The naive approach and its cost:
The simplest solution is to run a single-pattern algorithm k times:
for each pattern Pᵢ in P:
find all occurrences of Pᵢ in T
Using KMP for each pattern costs O(n + mᵢ) per search, for a total of O(kn + M) where M = Σmᵢ: the text is scanned k separate times. This becomes prohibitively expensive as k grows.
When k = 10,000 patterns (common in virus scanning), searching a 1MB file takes ~10 billion operations with naive multi-pattern matching. A virus scanner checking every downloaded file would be unusably slow. This scaling problem drove the development of specialized multi-pattern algorithms.
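As a quick sanity check on those numbers, here is the arithmetic in code (a rough model that counts only per-character steps and ignores constant factors):

```typescript
// Rough cost model: naive multi-pattern ≈ k·n character steps,
// single-pass (Aho-Corasick) ≈ n steps plus preprocessing.
const k = 10_000;    // signatures in the scanner's database
const n = 1_000_000; // a 1 MB file

console.log(`k separate scans: ~${(k * n).toLocaleString()} ops`); // ~10,000,000,000
console.log(`single-pass scan: ~${n.toLocaleString()} ops`);       // ~1,000,000
```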
Better approaches exist:
Specialized multi-pattern algorithms achieve much better complexity:
| Algorithm | Preprocessing | Matching | Notes |
|---|---|---|---|
| Naive (k × single) | O(M) | O(kn) | Runs k separate searches |
| Aho-Corasick | O(M) | O(n + z) | z = total occurrences |
| Commentz-Walter | O(M + k\|Σ\|) | Sublinear on average | Boyer-Moore-style skipping; strongest with long patterns |
| Rabin-Karp Multi | O(M) | O(n + z)* | Hash-based, probabilistic |
The key insight:
Aho-Corasick and similar algorithms achieve O(n + z) matching time—independent of k! They process the text only once, regardless of how many patterns we're searching for.
This is transformative: searching for 1 pattern or 1 million patterns takes essentially the same time (just more preprocessing for the patterns).
| Scenario | k (patterns) | n (text) | Naive k×KMP | Aho-Corasick |
|---|---|---|---|---|
| Simple search | 1 | 1,000 | ~1,000 ops | ~1,000 ops |
| Multi-word search | 10 | 10,000 | ~100,000 ops | ~10,000 ops |
| Virus scan | 10,000 | 1,000,000 | ~10,000,000,000 ops | ~1,000,000 ops |
| Large dictionary | 100,000 | 1,000,000 | ~100,000,000,000 ops | ~1,000,000 ops |
The Aho-Corasick algorithm (1975) is the definitive solution for multiple pattern matching. It extends KMP's failure function concept to multiple patterns using a trie-based automaton.
Core idea:
Build a trie of all k patterns, then add failure links that generalize KMP's failure function across patterns. The result is an automaton that consumes the text in a single left-to-right pass.
The magic:
Each character of text is processed exactly once. Failure links allow jumping between patterns when a mismatch occurs, without re-reading text.
Complexity:
- Preprocessing: O(M) to build the trie and failure links
- Matching: O(n + z), where z is the number of reported matches
- Space: O(M) with failure links, or O(M · |Σ|) if transitions are fully materialized
```typescript
/**
 * Conceptual illustration of Aho-Corasick
 * This is a simplified educational implementation
 */

interface TrieNode {
  children: Map<string, TrieNode>;
  fail: TrieNode | null; // Failure link
  output: string[];      // Patterns ending here
}

function createNode(): TrieNode {
  return {
    children: new Map(),
    fail: null,
    output: [],
  };
}

/**
 * Build Aho-Corasick automaton from patterns
 */
function buildAhoCorasick(patterns: string[]): TrieNode {
  const root = createNode();
  root.fail = root; // Root's failure points to itself

  // Phase 1: Build trie
  for (const pattern of patterns) {
    let node = root;
    for (const char of pattern) {
      if (!node.children.has(char)) {
        node.children.set(char, createNode());
      }
      node = node.children.get(char)!;
    }
    node.output.push(pattern); // Mark pattern end
  }

  // Phase 2: Build failure links using BFS
  const queue: TrieNode[] = [];

  // Depth-1 nodes fail to root
  for (const child of root.children.values()) {
    child.fail = root;
    queue.push(child);
  }

  while (queue.length > 0) {
    const current = queue.shift()!;

    for (const [char, child] of current.children) {
      // Find failure for child
      let fail = current.fail!;
      while (fail !== root && !fail.children.has(char)) {
        fail = fail.fail!;
      }
      child.fail =
        fail.children.has(char) && fail.children.get(char) !== child
          ? fail.children.get(char)!
          : root;

      // Merge outputs from failure chain
      child.output = [...child.output, ...child.fail.output];

      queue.push(child);
    }
  }

  return root;
}

/**
 * Search text using Aho-Corasick automaton
 */
function ahoCorasickSearch(
  text: string,
  root: TrieNode
): Array<{ position: number; pattern: string }> {
  const results: Array<{ position: number; pattern: string }> = [];
  let node = root;

  for (let i = 0; i < text.length; i++) {
    const char = text[i];

    // Follow failure links until match or root
    while (node !== root && !node.children.has(char)) {
      node = node.fail!;
    }
    node = node.children.get(char) || root;

    // Report all patterns ending here
    for (const pattern of node.output) {
      results.push({
        position: i - pattern.length + 1,
        pattern,
      });
    }
  }

  return results;
}

// Demonstration
const patterns = ["he", "she", "his", "hers"];
const text = "ushers";

console.log("Patterns:", patterns);
console.log(`Text: "${text}"\n`);

const automaton = buildAhoCorasick(patterns);
const matches = ahoCorasickSearch(text, automaton);

console.log("Matches found:");
for (const match of matches) {
  console.log(`  "${match.pattern}" at position ${match.position}`);
}

// Output:
// "she" at position 1
// "he" at position 2
// "hers" at position 2
```

Aho-Corasick is used in real virus scanners, intrusion detection systems, and plagiarism detectors. The ClamAV virus scanner and the Snort IDS both use variants of Aho-Corasick. It is one of the most practically important algorithms in computer science.
Let's formalize the distinction between these two problem classes:
Single Pattern Matching (SPM):
Input: Text T ∈ Σ*, Pattern P ∈ Σ*
|T| = n, |P| = m
Output: Set S = {i : 0 ≤ i ≤ n-m and T[i..i+m-1] = P}
Optimal: O(n + m) time, O(m) space
Multiple Pattern Matching (MPM):
Input: Text T ∈ Σ*, Pattern set P = {P₁, P₂, ..., Pₖ} ⊂ Σ*
|T| = n, |Pᵢ| = mᵢ, M = Σmᵢ
Output: Set S = {(i, j) : 0 ≤ i ≤ n-mⱼ and T[i..i+mⱼ-1] = Pⱼ}
Optimal: O(n + M + z) time, O(M) space
where z = |S| (number of matches)
Key differences:
| Aspect | SPM | MPM |
|---|---|---|
| Output | Positions only | (Position, Pattern) pairs |
| Lower bound | Ω(n + m) | Ω(n + M + z) |
| Text passes | ≥1 | Exactly 1 (optimal) |
| Pattern dependency | Preprocessing O(m) | Preprocessing O(M)* |
| Streaming | Possible with O(m) memory | Possible with O(M) memory |
*Some algorithms achieve O(M / word_size) with bit-parallelism
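To illustrate that footnote, here is a minimal bit-parallel (Shift-And) matcher for a single pattern, assuming the pattern fits in one machine word (m ≤ 30 here, since JavaScript bitwise operations are 32-bit):

```typescript
// Shift-And: bit j of `state` is set iff pattern[0..j] matches the text
// suffix ending at the current character.
function shiftAndSearch(text: string, pattern: string): number[] {
  const m = pattern.length;
  if (m === 0 || m > 30) throw new Error("pattern length must be 1..30");

  // mask[c] has bit j set iff pattern[j] === c
  const mask = new Map<string, number>();
  for (let j = 0; j < m; j++) {
    mask.set(pattern[j], (mask.get(pattern[j]) ?? 0) | (1 << j));
  }

  const accept = 1 << (m - 1);
  const matches: number[] = [];
  let state = 0;

  for (let i = 0; i < text.length; i++) {
    state = ((state << 1) | 1) & (mask.get(text[i]) ?? 0);
    if (state & accept) matches.push(i - m + 1);
  }
  return matches;
}

console.log(shiftAndSearch("abcabc", "abc")); // [0, 3]
```

Each text character updates the entire match state with a couple of bitwise operations; packing pattern positions into word bits is exactly where the O(M / word_size) factor comes from.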
Relationship between SPM and MPM:
SPM is a special case of MPM where k = 1. Any MPM algorithm can solve SPM by using a single-element pattern set.
However, SPM algorithms can be simpler and have better constant factors for k = 1. The overhead of Aho-Corasick's trie structure is unnecessary when searching for a single pattern.
Hybrid approaches:
In practice, systems often use hybrid strategies:
- Use a single-pattern algorithm for k = 1 and switch to Aho-Corasick above a small threshold
- Prescreen with a cheap hash or Bloom filter, then verify candidates with exact matching
- Group patterns by length or prefix and dispatch each group to the best-suited algorithm
These hybrids can outperform pure approaches in specific scenarios.
In practice, running k separate KMP searches is competitive with Aho-Corasick for small k (typically k < 5-10). Beyond that threshold, Aho-Corasick's single-pass advantage dominates. The exact crossover depends on pattern lengths, text length, and implementation details.
While Aho-Corasick is the classic solution, other approaches exist with different trade-offs:
1. Rabin-Karp for Multiple Patterns
Extend Rabin-Karp to multiple patterns by storing all pattern hashes in a hash set:
Patterns: {P₁, P₂, ..., Pₖ} (all same length m)
Preprocess: Store hash(Pᵢ) for all i in hash set
Match: For each window of size m:
if hash(window) in hash_set:
verify each matching pattern
Limitation: Works best when all patterns have same length. For variable lengths, run separately for each length group.
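One workaround, sketched below, groups the patterns by length and runs the same-length search once per group. It calls the multiPatternRabinKarp function implemented later on this page, so it is a sketch rather than standalone code; total cost grows with the number of distinct pattern lengths:

```typescript
// Handle variable-length patterns: one rolling-hash pass per distinct length.
function multiLengthRabinKarp(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  const byLength = new Map<number, string[]>();
  for (const p of patterns) {
    if (!byLength.has(p.length)) byLength.set(p.length, []);
    byLength.get(p.length)!.push(p);
  }

  const results: Array<{ position: number; pattern: string }> = [];
  for (const group of byLength.values()) {
    results.push(...multiPatternRabinKarp(text, group)); // one pass per length
  }
  return results;
}
```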
2. Commentz-Walter Algorithm
Extends Boyer-Moore to multiple patterns: it builds a trie of the reversed patterns and scans each window right to left, using bad-character shift tables to skip ahead by up to the shortest pattern's length.
Better than Aho-Corasick when patterns are long and occur sparsely in the text.
3. Wu-Manber Algorithm
Designed for very large pattern sets: instead of shifting on single characters, it computes shifts from blocks of 2-3 characters, which keeps shifts long even when thousands of patterns share characters; hash tables over pattern blocks verify candidate positions. A simplified sketch of the shift table follows.
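Below is a simplified educational sketch of the SHIFT-table idea, assuming a block size of B = 2 and verifying candidate windows by direct comparison (a full Wu-Manber implementation uses HASH and PREFIX tables to narrow the candidates instead):

```typescript
// Simplified Wu-Manber sketch. Assumes every pattern has length >= B.
const B = 2;

function buildShiftTable(patterns: string[]): { shift: Map<string, number>; mMin: number } {
  const mMin = Math.min(...patterns.map(p => p.length));
  const defaultShift = mMin - B + 1;
  const shift = new Map<string, number>();
  for (const p of patterns) {
    // Only the first mMin characters of each pattern participate
    for (let q = B; q <= mMin; q++) {
      const block = p.slice(q - B, q);
      shift.set(block, Math.min(shift.get(block) ?? defaultShift, mMin - q));
    }
  }
  return { shift, mMin };
}

function wuManberSearch(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  const { shift, mMin } = buildShiftTable(patterns);
  const defaultShift = mMin - B + 1;
  const results: Array<{ position: number; pattern: string }> = [];

  let i = mMin - 1; // index of the current window's last character
  while (i < text.length) {
    const block = text.slice(i - B + 1, i + 1);
    const s = shift.get(block) ?? defaultShift;
    if (s > 0) {
      i += s; // the block rules out any pattern ending here: skip ahead
      continue;
    }
    // Shift of 0: some pattern may start at the window start; verify
    const start = i - mMin + 1;
    for (const p of patterns) {
      if (text.startsWith(p, start)) {
        results.push({ position: start, pattern: p });
      }
    }
    i += 1;
  }
  return results;
}

// Demo
console.log(wuManberSearch("abcdefxyz", ["cde", "xyz", "def"]));
// [ {position: 2, "cde"}, {position: 3, "def"}, {position: 6, "xyz"} ]
```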
| Algorithm | Best For | Avoid When |
|---|---|---|
| k × KMP | k < 5 patterns | Large pattern sets |
| Aho-Corasick | Many patterns, streaming | Very long patterns with few matches |
| Rabin-Karp Multi | Same-length patterns | Many different lengths |
| Commentz-Walter | Long patterns, sparse matches | Short patterns, dense matches |
| Wu-Manber | 10,000+ patterns | Very short patterns (< 3 chars) |
```typescript
/**
 * Multi-pattern Rabin-Karp for same-length patterns
 * Demonstrates hash-based multi-pattern matching
 */

const BASE = 256;
const MOD = 101; // Prime modulus

function computeHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * BASE + s.charCodeAt(i)) % MOD;
  }
  return h;
}

function multiPatternRabinKarp(
  text: string,
  patterns: string[]
): Array<{ position: number; pattern: string }> {
  if (patterns.length === 0) return [];

  // All patterns must be same length for this version
  const m = patterns[0].length;
  if (patterns.some(p => p.length !== m)) {
    throw new Error("All patterns must have same length");
  }

  const results: Array<{ position: number; pattern: string }> = [];

  // Build hash table of patterns
  const patternHashes = new Map<number, string[]>();
  for (const pattern of patterns) {
    const h = computeHash(pattern);
    if (!patternHashes.has(h)) {
      patternHashes.set(h, []);
    }
    patternHashes.get(h)!.push(pattern); // Handle collisions
  }

  if (m > text.length) return results;

  // Precompute BASE^(m-1) % MOD
  let highOrder = 1;
  for (let i = 0; i < m - 1; i++) {
    highOrder = (highOrder * BASE) % MOD;
  }

  // Compute initial window hash
  let textHash = computeHash(text.slice(0, m));

  // Check each window
  for (let i = 0; i <= text.length - m; i++) {
    // Check if hash matches any pattern
    if (patternHashes.has(textHash)) {
      const window = text.slice(i, i + m);
      for (const pattern of patternHashes.get(textHash)!) {
        if (window === pattern) { // Verify match
          results.push({ position: i, pattern });
        }
      }
    }

    // Roll the hash to next window
    if (i < text.length - m) {
      textHash =
        ((textHash - text.charCodeAt(i) * highOrder) * BASE +
          text.charCodeAt(i + m)) %
        MOD;
      if (textHash < 0) textHash += MOD;
    }
  }

  return results;
}

// Demo
const patterns = ["abc", "bcd", "xyz", "cde"];
const text = "abcdefxyzabcop";

console.log("Patterns:", patterns);
console.log(`Text: "${text}"\n`);

const matches = multiPatternRabinKarp(text, patterns);
console.log("Matches found:");
for (const match of matches) {
  console.log(`  "${match.pattern}" at position ${match.position}`);
}
```

Given a pattern matching task, how do you choose between single-pattern and multi-pattern approaches?
Decision tree:
```
How many patterns are you searching for?
│
├─► Single pattern (k = 1)
│     │
│     └─► Use: KMP, Boyer-Moore, or Rabin-Karp
│         Factors: Pattern length, alphabet size, expected matches
│
└─► Multiple patterns (k > 1)
      │
      ├─► k < 5 and patterns are long?
      │     └─► Consider: k × Boyer-Moore (may beat Aho-Corasick)
      │
      ├─► Patterns mostly same length?
      │     └─► Consider: Multi-pattern Rabin-Karp
      │
      ├─► Need streaming (process text once)?
      │     └─► Use: Aho-Corasick
      │
      ├─► Pattern set is very large (10,000+)?
      │     └─► Use: Wu-Manber or Aho-Corasick with optimizations
      │
      └─► Default choice:
            └─► Use: Aho-Corasick (robust, predictable)
```
When uncertain, start with the simplest solution that meets your requirements. For a single pattern, your language's built-in string search (indexOf, includes, find) often uses optimized C implementations that outperform hand-written algorithms. For multiple patterns, Aho-Corasick is the robust choice.
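For example, collecting every occurrence of a single pattern with indexOf takes only a few lines:

```typescript
// All occurrences of one pattern via the built-in indexOf
function findAll(text: string, pattern: string): number[] {
  const positions: number[] = [];
  let i = text.indexOf(pattern);
  while (i !== -1) {
    positions.push(i);
    i = text.indexOf(pattern, i + 1); // i + 1 also finds overlapping matches
  }
  return positions;
}

console.log(findAll("ababa", "aba")); // [0, 2]
```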
Beyond algorithmic complexity, real-world implementations must consider several practical factors:
Memory access patterns:
Aho-Corasick's trie can have poor cache locality when the automaton is large. Array-based representations with computed indices often outperform pointer-based tries in practice.
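A minimal sketch of such a layout, with hypothetical names and an arbitrary state capacity chosen for illustration:

```typescript
// Flat typed array of goto transitions instead of per-node Maps.
// For state s and input byte c: next[s * ALPHABET + c]
// (-1 means "no transition: follow the failure link").
const ALPHABET = 256;       // byte-level alphabet
const MAX_STATES = 1 << 12; // capacity chosen for this sketch

const next = new Int32Array(MAX_STATES * ALPHABET).fill(-1);
const fail = new Int32Array(MAX_STATES); // failure links; state 0 is the root

function step(state: number, byte: number): number {
  // Follow failure links until a goto transition exists or we hit the root
  while (state !== 0 && next[state * ALPHABET + byte] === -1) {
    state = fail[state];
  }
  const t = next[state * ALPHABET + byte];
  return t === -1 ? 0 : t;
}
```

The whole automaton lives in two contiguous arrays, so a text scan touches predictable memory addresses instead of chasing pointers.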
SIMD and parallelization:
Modern CPUs support SIMD instructions that can compare multiple characters simultaneously. Libraries like Hyperscan (Intel) use these for massive speedups.
Alphabet handling:
For Unicode text, the naive approach of 143,000+ children per trie node is impractical. Solutions include:
- Per-node hash maps instead of dense child arrays, trading speed for space
- Building the automaton over UTF-8 bytes, capping fan-out at 256 (sketched below)
- Alphabet compression: remapping to only the characters that actually occur in the patterns
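The byte-level idea in one snippet, using the standard TextEncoder (sample patterns are hypothetical):

```typescript
// Encode patterns as UTF-8 before building the trie, so each node needs
// at most 256 children regardless of the source alphabet.
const encoder = new TextEncoder();
const unicodePatterns = ["héllo", "世界"];
const bytePatterns: Uint8Array[] = unicodePatterns.map(p => encoder.encode(p));
console.log(bytePatterns.map(b => b.length)); // [6, 6] — bytes, not characters
```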
Dynamic pattern sets:
Classic Aho-Corasick assumes a fixed pattern set. If patterns are added or removed frequently:
- Rebuild the automaton periodically in the background and swap it in atomically
- Keep a small secondary automaton for recent additions and query both
- Treat deletions as tombstones, filtering their matches out until the next rebuild
| Library | Language | Algorithm | Notable Features |
|---|---|---|---|
| Intel Hyperscan | C/C++ | Hybrid (Aho-Corasick + SIMD) | SIMD optimization, regex support, streaming |
| Aho-Corasick (Python) | Python | Aho-Corasick | Pure Python, easy to use, reasonable speed |
| Flashtext | Python | Aho-Corasick variant | Optimized for word replacement |
| rust-aho-corasick | Rust | Aho-Corasick | High performance, streaming, match kinds |
| pyahocorasick | Python (C ext) | Aho-Corasick | C-based for speed, Python interface |
For production use, prefer well-tested libraries over custom implementations. Aho-Corasick has many edge cases and optimization opportunities that mature libraries handle correctly. Write your own only for learning or when you have very specific requirements not met by existing libraries.
You now understand the fundamental distinction between single and multiple pattern matching, why it matters, and how to choose between them. This knowledge is essential for designing efficient text processing systems. Next, we'll explore the incredible range of real-world applications that depend on these pattern matching techniques.