The algorithms we've formally defined aren't academic curiosities—they're the invisible engines powering systems you use daily. From the moment you type a search query to the instant your editor highlights syntax errors, pattern matching algorithms are at work.
This page explores four major application domains in depth:

- Search engines
- Bioinformatics and genomics
- Text editors and developer tools
- Security systems

Each domain presents unique challenges that have driven innovation in pattern matching algorithms.
By the end of this page, you will understand: (1) How search engines use inverted indexes and pattern matching at scale, (2) Why bioinformatics pushed pattern matching to its limits, (3) How editors achieve real-time responsiveness, (4) The role of pattern matching in security systems. You'll see how theoretical concepts become production reality.
Modern search engines perform pattern matching at a scale that would have seemed magical decades ago. Google processes over 8.5 billion searches per day across hundreds of billions of web pages. How is this possible?
The challenge:
Naive pattern matching is impossible at this scale. Searching 100 billion pages of 50KB each would require examining 5 × 10^15 bytes (about 5 petabytes) per query.
The solution: Inverted Indexes
Instead of searching text directly, search engines preprocess the corpus into inverted indexes:
Word → List of (document_id, position) pairs

Example inverted index:

```
"algorithm" → [(doc1, 45), (doc1, 892), (doc7, 12), ...]
"efficient" → [(doc1, 47), (doc3, 156), (doc7, 14), ...]
"search"    → [(doc2, 0),  (doc7, 16),  (doc9, 234), ...]
```
Querying becomes set intersection over posting lists rather than full-text scanning.
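As a concrete illustration, here is a minimal in-memory sketch of an inverted index with an AND query. The helper names are hypothetical and the tokenization is simplistic; production indexes are compressed, sharded, and disk-resident.

```typescript
// Minimal inverted-index sketch: positions are word offsets within each document.
type Posting = { docId: string; position: number };

function buildInvertedIndex(docs: Record<string, string>): Map<string, Posting[]> {
  const index = new Map<string, Posting[]>();
  for (const [docId, text] of Object.entries(docs)) {
    const words = text.toLowerCase().split(/\W+/).filter(Boolean);
    words.forEach((word, position) => {
      if (!index.has(word)) index.set(word, []);
      index.get(word)!.push({ docId, position });
    });
  }
  return index;
}

// AND query: intersect the document-id sets of each term's posting list
function searchAll(index: Map<string, Posting[]>, terms: string[]): string[] {
  const docSets = terms.map(
    t => new Set((index.get(t.toLowerCase()) ?? []).map(p => p.docId))
  );
  if (docSets.length === 0) return [];
  return [...docSets[0]].filter(docId => docSets.every(s => s.has(docId)));
}

const index = buildInvertedIndex({
  doc1: "An efficient algorithm for search",
  doc2: "Search engines at scale",
  doc7: "Algorithm design for efficient search",
});
console.log(searchAll(index, ["efficient", "algorithm"])); // ["doc1", "doc7"]
```

The query cost now depends on the lengths of the posting lists involved, not on the total size of the corpus.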
| Component | Pattern Matching Role | Algorithms Used |
|---|---|---|
| Index Building | Tokenize documents, extract words | Lexical analysis, word boundary detection |
| Query Processing | Parse and normalize query terms | Tokenization, stemming, spell correction |
| Autocomplete | Prefix matching for suggestions | Tries, prefix trees, Aho-Corasick |
| Phrase Search | Exact sequence matching in positions | Position list intersection |
| Wildcard Queries | Pattern with * and ? wildcards | N-gram indexes, permuterm indexes |
| Fuzzy Search | Approximate string matching | Edit distance, BK-trees, Levenshtein automata |
Search engines exemplify the preprocessing paradigm: invest massively upfront (building indexes takes days) to enable O(log n) or O(1) query time. This trade-off makes sense when the corpus changes slowly but queries arrive constantly.
Pattern matching in search ranking:
Beyond finding matches, modern search engines use pattern matching for relevance:
Technical deep dive: Autocomplete
When you type "algo", Google suggests "algorithm", "algorithmic trading", etc. This is prefix matching at massive scale:
This is single-pattern matching (your input) against millions of stored patterns (popular queries), but optimized with specialized data structures.
```typescript
/**
 * Simplified autocomplete using a Trie
 * Demonstrates prefix matching for search suggestions
 */

interface TrieNode {
  children: Map<string, TrieNode>;
  isEndOfWord: boolean;
  popularity: number; // For ranking
}

class AutocompleteTrie {
  private root: TrieNode;

  constructor() {
    this.root = this.createNode();
  }

  private createNode(): TrieNode {
    return {
      children: new Map(),
      isEndOfWord: false,
      popularity: 0,
    };
  }

  /**
   * Insert a query with its popularity score
   */
  insert(query: string, popularity: number = 1): void {
    let node = this.root;
    for (const char of query.toLowerCase()) {
      if (!node.children.has(char)) {
        node.children.set(char, this.createNode());
      }
      node = node.children.get(char)!;
    }
    node.isEndOfWord = true;
    node.popularity += popularity;
  }

  /**
   * Find all completions for a prefix, ranked by popularity
   */
  autocomplete(prefix: string, limit: number = 10): string[] {
    // Navigate to the prefix node
    let node = this.root;
    for (const char of prefix.toLowerCase()) {
      if (!node.children.has(char)) {
        return []; // Prefix not found
      }
      node = node.children.get(char)!;
    }

    // Collect all completions from this node
    const completions: Array<{ word: string; popularity: number }> = [];
    this.collectWords(node, prefix.toLowerCase(), completions);

    // Sort by popularity and return top results
    return completions
      .sort((a, b) => b.popularity - a.popularity)
      .slice(0, limit)
      .map(c => c.word);
  }

  private collectWords(
    node: TrieNode,
    currentWord: string,
    results: Array<{ word: string; popularity: number }>
  ): void {
    if (node.isEndOfWord) {
      results.push({ word: currentWord, popularity: node.popularity });
    }
    for (const [char, child] of node.children) {
      this.collectWords(child, currentWord + char, results);
    }
  }
}

// Demo: Build autocomplete from sample queries
const autocomplete = new AutocompleteTrie();

// Simulating popular searches
const searches = [
  { query: "algorithm", count: 10000 },
  { query: "algorithmic trading", count: 5000 },
  { query: "algorithms book", count: 3000 },
  { query: "algorithm design", count: 4500 },
  { query: "alpha", count: 2000 },
  { query: "alphabet", count: 8000 },
  { query: "alpine", count: 1500 },
];

for (const { query, count } of searches) {
  autocomplete.insert(query, count);
}

// User types "algo"
console.log("Suggestions for 'algo':");
console.log(autocomplete.autocomplete("algo", 5));
// Output: ["algorithm", "algorithmic trading", "algorithm design", "algorithms book"]

console.log("\nSuggestions for 'alp':");
console.log(autocomplete.autocomplete("alp", 5));
// Output: ["alphabet", "alpha", "alpine"]
```

Bioinformatics represents one of the most demanding applications of pattern matching. The human genome contains 3 billion base pairs (characters), and researchers regularly search for patterns ranging from short gene sequences to entire chromosome regions.
The biological context:
DNA is a string over a simple alphabet: Σ = {A, C, G, T} (adenine, cytosine, guanine, thymine). Despite this simplicity, the challenges are immense: sequences run to billions of characters, experiments produce hundreds of millions of reads, and sequencing errors plus biological variation mean exact matching alone is not enough.
Key pattern matching problems in bioinformatics:
| Problem | Description | Scale |
|---|---|---|
| Sequence Alignment | Find where a DNA/RNA/protein sequence matches in a database | Query: 100-10,000 bp; DB: 10⁹+ bp |
| Short Read Alignment | Map millions of 100-300 bp reads to a genome | 10⁸ queries × 3×10⁹ genome |
| Motif Finding | Identify common patterns in gene promoter regions | Approximate patterns with wildcards |
| Genome Assembly | Reconstruct genome from overlapping fragments | Pattern overlap detection |
| Variant Detection | Find mutations, insertions, deletions compared to reference | Approximate matching |
| Phylogenetic Analysis | Compare sequences across species | Multiple sequence alignment |
Biological sequences mutate. The same gene can differ slightly between individuals and by a few percent across related species. Exact pattern matching would miss these variations. Bioinformatics drove the development of approximate matching algorithms that tolerate mismatches, insertions, and deletions—leading to techniques like edit distance and seed-and-extend.
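To make "tolerated differences" concrete, here is a minimal sketch of Levenshtein edit distance, the standard measure of how many single-character insertions, deletions, and substitutions separate two sequences. It is the textbook dynamic program, not the banded or bit-parallel variants real aligners use.

```typescript
// Levenshtein edit distance: dp[i][j] = cost of transforming a[0..i) into b[0..j)
function editDistance(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                     // delete from a
        dp[i][j - 1] + 1,                     // insert into a
        dp[i - 1][j - 1] + substitutionCost   // match or substitute
      );
    }
  }
  return dp[a.length][b.length];
}

// Two short DNA fragments that differ by one substitution and one deletion
console.log(editDistance("ACGTACGT", "ACCTACG")); // 2
```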
Seed-and-Extend: The Practical Approach
Modern sequence aligners like BLAST, BWA, and Bowtie use a two-phase approach:

Phase 1: Seed (fast approximate filter) — find short exact matches ("seeds") between the query and a precomputed index of the reference, quickly discarding regions with no seed hits.

Phase 2: Extend (accurate verification) — extend each seed hit into a full alignment with dynamic programming, tolerating mismatches, insertions, and deletions.
```
Reference: ...ACGTACGTACGATCGATCG...
              ||||| ||||
Query:        TACGTTCGAT

Seed match: "TACG" found at position X
Extend:     verify full alignment with DP, tolerating mismatches
```
The FM-Index Revolution:
The FM-index (based on Burrows-Wheeler Transform) enables searching a 3 billion character genome for an exact pattern in O(m) time with O(n) space—after O(n) preprocessing. This made whole-genome alignment practical.
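The full FM-index is beyond this page, but the following toy sketch shows the core idea of backward search over a Burrows-Wheeler transform. It is a naive illustration (quadratic BWT construction, linear rank scans); a real FM-index uses suffix-array construction and sampled rank tables so that each step runs in constant time.

```typescript
// Toy BWT construction: sort all rotations of text + sentinel, take the last column
function buildBWT(text: string): string {
  const s = text + "$"; // '$' is a sentinel smaller than every other character
  const rotations: string[] = [];
  for (let i = 0; i < s.length; i++) {
    rotations.push(s.slice(i) + s.slice(0, i));
  }
  rotations.sort();
  return rotations.map(r => r[r.length - 1]).join("");
}

// Count occurrences of `pattern` in the original text using only the BWT string
function countOccurrences(bwt: string, pattern: string): number {
  // C[c] = number of characters in the BWT strictly smaller than c
  const freq = new Map<string, number>();
  for (const ch of bwt) freq.set(ch, (freq.get(ch) ?? 0) + 1);
  const C = new Map<string, number>();
  let running = 0;
  for (const ch of [...freq.keys()].sort()) {
    C.set(ch, running);
    running += freq.get(ch)!;
  }

  // rank(c, i) = occurrences of c in bwt[0..i-1] (linear scan here; sampled in practice)
  const rank = (c: string, i: number): number => {
    let r = 0;
    for (let k = 0; k < i; k++) if (bwt[k] === c) r++;
    return r;
  };

  // Backward search: process the pattern right to left, narrowing a row interval
  let lo = 0;
  let hi = bwt.length;
  for (let i = pattern.length - 1; i >= 0 && lo < hi; i--) {
    const c = pattern[i];
    if (!C.has(c)) return 0;
    lo = C.get(c)! + rank(c, lo);
    hi = C.get(c)! + rank(c, hi);
  }
  return Math.max(hi - lo, 0);
}

const bwt = buildBWT("ACGTACGTACGATCGATCG");
console.log(countOccurrences(bwt, "ACG")); // 3
console.log(countOccurrences(bwt, "GAT")); // 2
```

Note that the number of backward-search steps equals the pattern length m, independent of the text length—this is what makes O(m) queries over a whole genome possible.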
```typescript
/**
 * Simplified seed-and-extend demonstration
 * Illustrates how modern aligners work (conceptually)
 */

interface Alignment {
  position: number;
  score: number;
  matches: number;
  mismatches: number;
  cigar: string; // Compact alignment representation
}

/**
 * Build an index of all k-mers in the reference
 */
function buildSeedIndex(reference: string, k: number): Map<string, number[]> {
  const index = new Map<string, number[]>();
  for (let i = 0; i <= reference.length - k; i++) {
    const kmer = reference.slice(i, i + k);
    if (!index.has(kmer)) {
      index.set(kmer, []);
    }
    index.get(kmer)!.push(i);
  }
  return index;
}

/**
 * Simple scoring for extending a seed
 * Real aligners use complex scoring matrices (BLOSUM, PAM) and allow gaps
 */
function extendAlignment(
  reference: string,
  query: string,
  seedPos: number
): Alignment | null {
  // Simplified extension: score an ungapped alignment at the seed position
  let matches = 0;
  let mismatches = 0;
  let cigar = "";

  for (let i = 0; i < query.length; i++) {
    const refChar = reference[seedPos + i];
    const queryChar = query[i];
    if (refChar === undefined) break;
    if (refChar === queryChar) {
      matches++;
      cigar += "M";
    } else {
      mismatches++;
      cigar += "X";
    }
  }

  // Accept if error rate < 20%
  const errorRate = mismatches / (matches + mismatches);
  if (errorRate < 0.2) {
    return {
      position: seedPos,
      score: matches - mismatches,
      matches,
      mismatches,
      cigar: compressCigar(cigar),
    };
  }
  return null;
}

function compressCigar(cigar: string): string {
  // Convert "MMMMXMM" to "4M1X2M"
  let result = "";
  let count = 1;
  for (let i = 1; i <= cigar.length; i++) {
    if (i < cigar.length && cigar[i] === cigar[i - 1]) {
      count++;
    } else {
      result += count + cigar[i - 1];
      count = 1;
    }
  }
  return result;
}

/**
 * Seed-and-extend search
 */
function seedAndExtend(
  reference: string,
  query: string,
  seedLength: number = 8
): Alignment[] {
  // Phase 1: Build the seed index
  const seedIndex = buildSeedIndex(reference, seedLength);

  // Phase 2: Find seed matches
  const alignments: Alignment[] = [];
  const checkedPositions = new Set<number>();

  // Check each seed in the query
  for (let i = 0; i <= query.length - seedLength; i++) {
    const seed = query.slice(i, i + seedLength);
    const hits = seedIndex.get(seed) || [];

    for (const hitPos of hits) {
      // Adjust position to query start
      const alignStart = hitPos - i;
      if (alignStart >= 0 && !checkedPositions.has(alignStart)) {
        checkedPositions.add(alignStart);

        // Phase 3: Extend and verify
        const alignment = extendAlignment(reference, query, alignStart);
        if (alignment) {
          alignments.push(alignment);
        }
      }
    }
  }

  return alignments.sort((a, b) => b.score - a.score);
}

// Demo: Align reads to a reference
const reference = "ACGTACGTACGATCGATCGATCGATCGACGT";
const query = "GATCGATCGAT";          // Perfect match
const queryMismatch = "GATCGATACAT";  // Two mismatches

console.log("Reference:", reference);

console.log("\nQuery 1 (exact):", query);
console.log("Alignments:", seedAndExtend(reference, query));

console.log("\nQuery 2 (with mismatches):", queryMismatch);
// Use shorter seeds so at least one seed window avoids both mismatches
console.log("Alignments:", seedAndExtend(reference, queryMismatch, 6));
```

Every keystroke in a modern text editor triggers pattern matching operations: syntax highlighting, auto-completion, error detection, and search. The challenge is maintaining millisecond-scale response times on files with hundreds of thousands of lines.
Pattern matching in editors powers syntax highlighting, search and replace, autocomplete, and code navigation; the table below summarizes the main features and their latency budgets.
The responsiveness requirement:
Users perceive delays > 100ms as "lag". For features triggered by every keystroke, response must be < 10ms to feel instantaneous. On a 100,000-line file, this means processing ~5MB of text in 10ms—requiring throughput of 500MB/second.
| Feature | Trigger | Pattern Type | Latency Budget |
|---|---|---|---|
| Syntax Highlighting | Every keystroke | Language keywords (multi-pattern) | < 10ms |
| Find (Ctrl+F) | User-initiated | User query (single pattern) | < 50ms (first results) |
| Find All | User-initiated | User query (all occurrences) | < 500ms |
| Replace All | User-initiated | User query + replacement | < 1s |
| Regex Search | User-initiated | Regular expression | < 500ms |
| Autocomplete | Typing trigger (e.g., '.') | Prefix matching | < 50ms |
| Go to Definition | Ctrl+Click / F12 | Exact symbol lookup | < 100ms |
| Find References | Shift+F12 | Exact symbol across files | < 500ms |
Incremental vs Full Processing:
Key insight: most edits are small (a few characters), but reprocessing the entire document is expensive. Editors therefore use incremental approaches: re-tokenize only the edited line (carrying lexer state forward when tokens span lines), track dirty regions, and reuse unchanged results. A minimal sketch of the per-line idea follows.
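This is a deliberately simplified sketch, assuming tokens never span lines (real highlighters must propagate lexer state across lines for constructs such as multi-line comments).

```typescript
// Per-line incremental re-highlighting sketch: cache tokens per line,
// recompute only the line that changed.
type HighlightedLine = { text: string; tokens: string[] };

// Stand-in tokenizer; a real editor would call its lexer here
function tokenize(text: string): string[] {
  return text.split(/\s+/).filter(Boolean);
}

function openDocument(text: string): HighlightedLine[] {
  // The full pass happens once, when the file is opened
  return text.split("\n").map(line => ({ text: line, tokens: tokenize(line) }));
}

function editLine(doc: HighlightedLine[], lineNumber: number, newText: string): void {
  // Only the edited line is re-tokenized; all other lines keep their cached tokens
  doc[lineNumber] = { text: newText, tokens: tokenize(newText) };
}

const doc = openDocument("const x = 1;\nreturn x;");
editLine(doc, 0, "const x = 42;");
console.log(doc[0].tokens); // ["const", "x", "=", "42;"]
```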
Data structures for text:
Storing document text in a simple string makes edits O(n) (every character after the insertion point must shift). Editors instead use specialized structures such as gap buffers, ropes, and piece tables (see the sketch below).
These structures also enable efficient substring extraction for search operations.
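Here is a minimal piece-table sketch (a simplified illustration without deletion or undo support, not VS Code's implementation): the document is a sequence of "pieces" pointing into a read-only original buffer and an append-only add buffer, so inserting text never shifts existing characters.

```typescript
interface Piece { buffer: "original" | "add"; start: number; length: number; }

class PieceTable {
  private original: string;
  private add = "";
  private pieces: Piece[];

  constructor(original: string) {
    this.original = original;
    this.pieces = original.length
      ? [{ buffer: "original", start: 0, length: original.length }]
      : [];
  }

  // Insert text at a document offset by splitting one piece; no characters move
  insert(offset: number, text: string): void {
    const newPiece: Piece = { buffer: "add", start: this.add.length, length: text.length };
    this.add += text;

    let remaining = offset;
    for (let i = 0; i <= this.pieces.length; i++) {
      if (i === this.pieces.length) { this.pieces.push(newPiece); return; }
      const piece = this.pieces[i];
      if (remaining <= piece.length) {
        const before: Piece = { ...piece, length: remaining };
        const after: Piece = {
          buffer: piece.buffer,
          start: piece.start + remaining,
          length: piece.length - remaining,
        };
        const replacement = [before, newPiece, after].filter(p => p.length > 0);
        this.pieces.splice(i, 1, ...replacement);
        return;
      }
      remaining -= piece.length;
    }
  }

  getText(): string {
    return this.pieces
      .map(p => (p.buffer === "original" ? this.original : this.add)
        .slice(p.start, p.start + p.length))
      .join("");
  }
}

// Usage
const table = new PieceTable("function main() {}");
table.insert(17, " return 0; ");
console.log(table.getText()); // "function main() { return 0; }"
```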
VS Code uses a piece table for text storage, integrated with ripgrep (a Rust-based grep) for file searching. Ripgrep uses SIMD-accelerated pattern matching and can search entire codebases in seconds. Language features come from Language Server Protocol (LSP) servers that maintain their own symbol indexes.
```typescript
/**
 * Simplified syntax highlighter using regex-based tokenization
 * Real highlighters use more sophisticated lexer generators
 */

interface Token {
  type: 'keyword' | 'string' | 'comment' | 'number' | 'identifier' | 'operator' | 'default';
  value: string;
  start: number;
  end: number;
}

// Keywords for a simple language (like a JavaScript subset)
const KEYWORDS = new Set([
  'function', 'return', 'if', 'else', 'for', 'while',
  'const', 'let', 'var', 'true', 'false', 'null'
]);

/**
 * Tokenize a line of code
 * Real implementations use Aho-Corasick for keywords, DFAs for complex tokens
 */
function tokenizeLine(line: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;

  while (pos < line.length) {
    // Skip whitespace
    if (/\s/.test(line[pos])) {
      pos++;
      continue;
    }

    // Comment
    if (line.slice(pos, pos + 2) === '//') {
      tokens.push({ type: 'comment', value: line.slice(pos), start: pos, end: line.length });
      break;
    }

    // String
    if (line[pos] === '"' || line[pos] === "'") {
      const quote = line[pos];
      let end = pos + 1;
      while (end < line.length && line[end] !== quote) {
        if (line[end] === '\\') end++; // Skip escaped char
        end++;
      }
      end++; // Include closing quote
      tokens.push({ type: 'string', value: line.slice(pos, end), start: pos, end });
      pos = end;
      continue;
    }

    // Number
    const numMatch = line.slice(pos).match(/^[0-9]+(\.[0-9]+)?/);
    if (numMatch) {
      tokens.push({ type: 'number', value: numMatch[0], start: pos, end: pos + numMatch[0].length });
      pos += numMatch[0].length;
      continue;
    }

    // Identifier or keyword
    const idMatch = line.slice(pos).match(/^[a-zA-Z_][a-zA-Z0-9_]*/);
    if (idMatch) {
      const value = idMatch[0];
      const type = KEYWORDS.has(value) ? 'keyword' : 'identifier';
      tokens.push({ type, value, start: pos, end: pos + value.length });
      pos += value.length;
      continue;
    }

    // Operator
    const opMatch = line.slice(pos).match(/^[+\-*/%=<>!&|]+/);
    if (opMatch) {
      tokens.push({ type: 'operator', value: opMatch[0], start: pos, end: pos + opMatch[0].length });
      pos += opMatch[0].length;
      continue;
    }

    // Default: single character
    tokens.push({ type: 'default', value: line[pos], start: pos, end: pos + 1 });
    pos++;
  }

  return tokens;
}

/**
 * Apply syntax highlighting to a line
 */
function highlightLine(line: string): string {
  const tokens = tokenizeLine(line);
  const colors: Record<Token['type'], string> = {
    keyword: '\x1b[35m',    // Magenta
    string: '\x1b[32m',     // Green
    comment: '\x1b[90m',    // Gray
    number: '\x1b[33m',     // Yellow
    identifier: '\x1b[0m',  // Default
    operator: '\x1b[36m',   // Cyan
    default: '\x1b[0m',     // Default
  };
  const reset = '\x1b[0m';

  let result = '';
  let lastEnd = 0;
  for (const token of tokens) {
    result += line.slice(lastEnd, token.start);
    result += colors[token.type] + token.value + reset;
    lastEnd = token.end;
  }
  result += line.slice(lastEnd);
  return result;
}

// Demo
const code = `function fibonacci(n) {
  if (n <= 1) return n; // Base case
  let result = 0;
  const message = "computing...";
  return fibonacci(n - 1) + fibonacci(n - 2);
}`;

console.log("Highlighted code:\n");
for (const line of code.split('\n')) {
  console.log(highlightLine(line));
}
```

Pattern matching is fundamental to computer security. From detecting malware to filtering spam, security systems must rapidly match content against large databases of known threats.
Security applications:
1. Antivirus/Anti-Malware — scan files and memory against large databases of byte-sequence signatures.
2. Intrusion Detection Systems (IDS) — inspect network traffic for attack signatures at line rate.
3. Web Application Firewalls (WAF) — match incoming HTTP requests against injection and cross-site scripting patterns before they reach the application.
4. Spam and Content Filtering — match message text against known spam phrases, malicious URLs, and policy terms.
| System | Input Volume | Pattern Count | Speed Requirement |
|---|---|---|---|
| Desktop Antivirus | GB/hour (file scans) | ~10,000,000 | 50 MB/s |
| Network IDS | Gbps (traffic) | ~10,000 | Line rate (1-100 Gbps) |
| WAF | 1000s req/sec | ~1,000 | < 1ms per request |
| Email Filter | Millions emails/day | ~100,000 | < 100ms per email |
| Real-time Chat Filter | 1000s msg/sec | ~10,000 | < 10ms per message |
Attackers know security systems use pattern matching. They develop evasion techniques: encoding payloads, splitting across packets, using polymorphic code. Security systems respond with normalization (decoding before matching), reassembly (reconstructing split data), and behavior-based detection. This arms race drives continuous algorithm innovation.
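As one small illustration of the normalization step, the sketch below URL-decodes input before matching. It is a toy example assuming URL encoding only; real systems normalize many encodings (HTML entities, Unicode, base64) and reassemble fragmented payloads.

```typescript
// Normalization sketch: "%3Cscript%3E" would evade a raw "<script>" signature,
// so decode (possibly several layers) and lowercase before matching.
function normalize(input: string): string {
  let current = input;
  for (let i = 0; i < 3; i++) {
    try {
      const decoded = decodeURIComponent(current);
      if (decoded === current) break; // Nothing left to decode
      current = decoded;
    } catch {
      break; // Malformed escape sequence; stop decoding
    }
  }
  return current.toLowerCase();
}

console.log(normalize("%3Cscript%3Ealert(1)%3C%2Fscript%3E"));
// "<script>alert(1)</script>" — now a plain "<script>" signature will match
```

The scanner below skips this step for brevity and matches the raw input directly.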
```typescript
/**
 * Simplified content filter demonstration
 * Real systems use compiled automata and SIMD acceleration
 */

interface ThreatMatch {
  pattern: string;
  position: number;
  category: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
}

interface ThreatSignature {
  pattern: string;
  category: string;
  severity: ThreatMatch['severity'];
  caseSensitive: boolean;
}

// Sample threat signatures
const THREAT_SIGNATURES: ThreatSignature[] = [
  // SQL injection patterns
  { pattern: "' OR '1'='1", category: "sql_injection", severity: "critical", caseSensitive: false },
  { pattern: "'; DROP TABLE", category: "sql_injection", severity: "critical", caseSensitive: false },
  { pattern: "UNION SELECT", category: "sql_injection", severity: "high", caseSensitive: false },

  // XSS patterns
  { pattern: "<script>", category: "xss", severity: "high", caseSensitive: false },
  { pattern: "javascript:", category: "xss", severity: "high", caseSensitive: false },
  { pattern: "onerror=", category: "xss", severity: "medium", caseSensitive: false },

  // Command injection
  { pattern: "; rm -rf", category: "cmd_injection", severity: "critical", caseSensitive: true },
  { pattern: "| cat /etc/passwd", category: "cmd_injection", severity: "critical", caseSensitive: true },
];

/**
 * Simple multi-pattern matcher for security scanning
 * Production systems would use Aho-Corasick or Hyperscan
 */
class SecurityScanner {
  private signatures: ThreatSignature[];

  constructor(signatures: ThreatSignature[]) {
    this.signatures = signatures;
  }

  scan(input: string): ThreatMatch[] {
    const matches: ThreatMatch[] = [];

    for (const sig of this.signatures) {
      const searchPattern = sig.caseSensitive ? sig.pattern : sig.pattern.toLowerCase();
      const searchInput = sig.caseSensitive ? input : input.toLowerCase();

      let pos = 0;
      while (true) {
        const foundAt = searchInput.indexOf(searchPattern, pos);
        if (foundAt === -1) break;
        matches.push({
          pattern: sig.pattern,
          position: foundAt,
          category: sig.category,
          severity: sig.severity,
        });
        pos = foundAt + 1;
      }
    }

    return matches.sort((a, b) =>
      a.position - b.position ||
      (a.severity === 'critical' ? -1 : b.severity === 'critical' ? 1 : 0)
    );
  }

  isThreatening(input: string): boolean {
    return this.scan(input).some(m =>
      m.severity === 'critical' || m.severity === 'high'
    );
  }
}

// Demo
const scanner = new SecurityScanner(THREAT_SIGNATURES);

const testInputs = [
  "SELECT name FROM users WHERE id = 1",             // Safe
  "SELECT * FROM users WHERE name = '' OR '1'='1'",  // SQL injection
  "Hello <script>alert('XSS')</script> World",       // XSS
  "Normal comment with no threats",                  // Safe
  "Enter name: Alice; rm -rf /",                     // Command injection
];

console.log("=== Security Scan Results ===\n");

for (const input of testInputs) {
  const matches = scanner.scan(input);
  const status = matches.length > 0
    ? `⚠️ ${matches.length} threat(s) detected`
    : "✓ Clean";

  console.log(`Input: "${input.slice(0, 50)}${input.length > 50 ? '...' : ''}"`);
  console.log(`Status: ${status}`);
  for (const match of matches) {
    console.log(`  - [${match.severity.toUpperCase()}] ${match.category}: "${match.pattern}" at pos ${match.position}`);
  }
  console.log("");
}
```

Pattern matching appears in countless other domains:
Data Compression:
Compression algorithms like LZ77 (used in gzip, zip, PNG) work by finding repeated patterns: when a substring has already appeared within a recent window of the text, it is replaced by a back-reference (distance, length) to that earlier occurrence. A minimal sketch follows.
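This is a minimal LZ77-style sketch, assuming a tiny search window and no entropy coding (real gzip adds Huffman coding and hash-chain match finding).

```typescript
// Emits (distance, length, nextChar) triples: repeated substrings become back-references.
type LZToken = { distance: number; length: number; next: string };

function lz77Compress(input: string, windowSize = 32): LZToken[] {
  const tokens: LZToken[] = [];
  let pos = 0;
  while (pos < input.length) {
    let bestLen = 0;
    let bestDist = 0;
    const windowStart = Math.max(0, pos - windowSize);
    // Find the longest match starting in the window (it may overlap the current position)
    for (let start = windowStart; start < pos; start++) {
      let len = 0;
      while (pos + len < input.length - 1 && input[start + len] === input[pos + len]) {
        len++;
      }
      if (len > bestLen) { bestLen = len; bestDist = pos - start; }
    }
    tokens.push({ distance: bestDist, length: bestLen, next: input[pos + bestLen] });
    pos += bestLen + 1;
  }
  return tokens;
}

console.log(lz77Compress("abcabcabcd"));
// Three literal tokens for "abc", then one back-reference covering the repeats
```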
Plagiarism Detection:
Academic integrity systems compare submitted work against previously submitted documents, published sources, and web content.

Typical techniques: hash-based fingerprinting, n-gram matching, and suffix arrays for overlap detection; a small n-gram sketch follows.
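A small sketch of n-gram fingerprinting with Jaccard similarity (assumptions: character 5-grams and full fingerprint sets; winnowing-style systems keep only a sampled subset of hashes to save space).

```typescript
// Build the set of character n-grams of a normalized text
function ngrams(text: string, n = 5): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ");
  const grams = new Set<string>();
  for (let i = 0; i + n <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + n));
  }
  return grams;
}

// Jaccard similarity of the two n-gram sets: |A ∩ B| / |A ∪ B|
function jaccardSimilarity(a: string, b: string): number {
  const ga = ngrams(a);
  const gb = ngrams(b);
  let intersection = 0;
  for (const g of ga) if (gb.has(g)) intersection++;
  const union = ga.size + gb.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Prints a similarity score between 0 (disjoint) and 1 (identical)
console.log(jaccardSimilarity(
  "Pattern matching algorithms power search engines.",
  "Search engines are powered by pattern matching algorithms."
).toFixed(2));
```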
Natural Language Processing:

Tokenization, named-entity recognition, and template-based information extraction all depend on matching patterns over text.

Log Analysis:

Monitoring pipelines apply regular expressions and multi-pattern matchers to extract fields and flag anomalies in high-volume log streams.

Database Systems:

LIKE predicates, full-text indexes, and regular-expression operators bring pattern matching directly into query engines.
We've journeyed through four major application domains, seeing how the formal concepts from earlier pages become production systems serving billions of users.
| Domain | Key Challenge | Primary Approach |
|---|---|---|
| Search Engines | Petabyte-scale search in milliseconds | Inverted indexes + prefix matching |
| Bioinformatics | Billions of approximate matches | Seed-and-extend, FM-indexes |
| Text Editors | Real-time response on every keystroke | Incremental processing, specialized storage |
| Security | Line-rate threat detection | Multi-pattern automata (Aho-Corasick) |
Congratulations! You've completed Module 1: String Matching Problem Formalization. You now understand the formal foundations of pattern matching and have seen how these concepts power real-world systems. The following modules will dive deep into the specific algorithms—naive matching, KMP, Rabin-Karp, and more—building on this foundation.