Loading content...
Every time you press Ctrl+F in a document, every time a search engine processes your query, every time a biologist searches for a gene sequence in DNA data—the same fundamental problem is being solved: pattern matching in strings.
This problem is so ubiquitous that it might seem trivial. After all, how hard can it be to find a word in text? The answer reveals one of the most elegant areas of algorithm design, where clever insights transform a seemingly simple problem from requiring hours of computation to mere milliseconds.
Understanding pattern matching formally—not just intuitively—is the gateway to mastering string algorithms. Every advanced technique, from the KMP algorithm to suffix trees, builds upon the precise definitions we will establish on this page.
By the end of this page, you will understand the formal mathematical definition of the pattern matching problem, the precise terminology used throughout string algorithm literature, why naive approaches fail at scale, and the framework that enables efficient solutions. This formalization is essential for understanding every pattern matching algorithm that follows.
Before we formalize the pattern matching problem, let's establish our intuitive understanding and then refine it.
The everyday concept:
When you search for the word "algorithm" in a textbook, you're looking for every place where that exact sequence of characters appears. You might find it on page 5, page 23, and page 147. Each of these is an occurrence or match.
But this informal description leaves many questions unanswered:
A formal definition answers all these questions unambiguously, providing precision that enables algorithm design and correctness proofs.
Informal descriptions lead to edge case bugs and implementation ambiguity. When Donald Knuth, James Morris, and Vaughan Pratt developed the KMP algorithm, their formal problem definition allowed them to prove their algorithm correct and analyze its complexity precisely. Without formalization, we cannot rigorously compare algorithms or guarantee correctness.
The journey from informal to formal:
We will now build our formal definition piece by piece:
Each definition builds on the previous, creating a rigorous foundation for all string algorithms.
Every string algorithm operates over an alphabet—the set of allowed symbols. This seemingly simple concept has profound implications for algorithm design and complexity analysis.
Definition (Alphabet):
An alphabet, denoted Σ (sigma), is a finite, non-empty set of distinct symbols (also called characters or letters).
Examples of alphabets:
| Alphabet Name | Σ | Size |Σ| | |---------------|---|----------| | Binary | {0, 1} | 2 | | DNA | {A, C, G, T} | 4 | | Lowercase English | {a, b, c, ..., z} | 26 | | ASCII | {all ASCII characters} | 128 | | Unicode | {all Unicode code points} | >143,000 |
The size of the alphabet, denoted |Σ|, directly impacts algorithm complexity for many string algorithms.
Many string algorithms have complexity that depends on |Σ|. For DNA (|Σ| = 4), a brute-force approach might be acceptable. For Unicode (|Σ| > 143,000), the same approach becomes impractical. This is why algorithms like KMP, which are alphabet-independent, are so valuable.
Definition (String):
A string over alphabet Σ is a finite sequence of symbols from Σ. The set of all strings over Σ is denoted Σ* (Σ-star).
Formal notation:
A string S of length n is written as:
S = S[0]S[1]S[2]...S[n-1]
where each S[i] ∈ Σ for 0 ≤ i < n.
Key string properties:
123456789101112131415161718192021
Alphabet: Σ = {a, b, c, ..., z} String: S = "algorithm"Length: |S| = 9 Indexing (0-based): S[0] = 'a' S[1] = 'l' S[2] = 'g' ... S[8] = 'm' Substrings of "algorithm": "algo" - prefix (positions 0-3) "rithm" - suffix (positions 4-8) "gori" - general substring (positions 2-5) "algorithm" - the string itself is both prefix and suffix "" - empty string (also valid prefix/suffix) Concatenation: "algo" · "rithm" = "algorithm"These three concepts form the vocabulary of pattern matching. Understanding them precisely is essential for grasping why certain algorithms work.
Definition (Substring):
String P is a substring of string S if there exist strings x and y such that:
S = xPy
where x and y can be empty. Equivalently, P = S[i..j] for some 0 ≤ i ≤ j ≤ |S|.
The substring starting at position i with length m is:
S[i..i+m-1] = S[i]S[i+1]...S[i+m-1]
Definition (Prefix):
String P is a prefix of string S if there exists string y such that:
S = Py
Equivalently, P = S[0..k-1] for some 0 ≤ k ≤ |S|. A prefix is a substring that starts at position 0.
Definition (Suffix):
String P is a suffix of string S if there exists string x such that:
S = xP
Equivalently, P = S[k..|S|-1] for some 0 ≤ k ≤ |S|. A suffix is a substring that ends at the last position.
| Length | Prefix | Suffix |
|---|---|---|
| 0 | ε (empty) | ε (empty) |
| 1 | a | m |
| 2 | al | hm |
| 3 | alg | thm |
| 4 | algo | ithm |
| 5 | algor | rithm |
| 6 | algori | orithm |
| 7 | algorit | gorithm |
| 8 | algorith | lgorithm |
| 9 | algorithm | algorithm |
A proper prefix (or suffix) of S is one that is not equal to S itself. Similarly, a proper substring excludes S as a whole. These distinctions become crucial when analyzing algorithms like KMP, where we specifically examine proper prefixes that are also suffixes.
Counting substrings:
For a string of length n:
This quadratic growth in substrings explains why brute-force substring enumeration doesn't scale, and why data structures like suffix arrays (which store all n suffixes) are so powerful.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071
/** * Comprehensive prefix, suffix, and substring utilities * These operations are fundamental to pattern matching algorithms */ // Check if 'prefix' is a prefix of 'str'function isPrefix(str: string, prefix: string): boolean { if (prefix.length > str.length) return false; for (let i = 0; i < prefix.length; i++) { if (str[i] !== prefix[i]) return false; } return true; // Alternatively: return str.startsWith(prefix);} // Check if 'suffix' is a suffix of 'str'function isSuffix(str: string, suffix: string): boolean { if (suffix.length > str.length) return false; const offset = str.length - suffix.length; for (let i = 0; i < suffix.length; i++) { if (str[offset + i] !== suffix[i]) return false; } return true; // Alternatively: return str.endsWith(suffix);} // Check if 'sub' is a substring of 'str'function isSubstring(str: string, sub: string): boolean { if (sub.length === 0) return true; // Empty is always substring if (sub.length > str.length) return false; // Check all possible starting positions for (let i = 0; i <= str.length - sub.length; i++) { let match = true; for (let j = 0; j < sub.length; j++) { if (str[i + j] !== sub[j]) { match = false; break; } } if (match) return true; } return false; // Alternatively: return str.includes(sub);} // Generate all prefixes of a stringfunction allPrefixes(str: string): string[] { const prefixes: string[] = ['']; // Empty prefix for (let i = 1; i <= str.length; i++) { prefixes.push(str.slice(0, i)); } return prefixes;} // Generate all suffixes of a stringfunction allSuffixes(str: string): string[] { const suffixes: string[] = []; for (let i = 0; i <= str.length; i++) { suffixes.push(str.slice(i)); } return suffixes; // Includes empty suffix at end} // Example usageconst S = "algorithm";console.log("Prefixes:", allPrefixes(S));console.log("Suffixes:", allSuffixes(S));console.log("Is 'algo' a prefix?", isPrefix(S, "algo")); // trueconsole.log("Is 'rithm' a suffix?", isSuffix(S, "rithm")); // trueconsole.log("Is 'gori' a substring?", isSubstring(S, "gori")); // trueWith our vocabulary established, we can now state the pattern matching problem with mathematical precision.
Definition (Pattern Matching Problem):
Given:
Find:
Definition (Occurrence/Match):
Pattern P occurs (or matches) at position i in text T if and only if:
T[i + j] = P[j] for all j where 0 ≤ j < m
In other words, the substring T[i..i+m-1] equals P.
Position i is called a valid shift or match position.
We use 0-based indexing throughout (positions start at 0). Some textbooks use 1-based indexing (positions start at 1). When reading algorithm papers, always verify which convention is used. The core algorithms are identical; only the index arithmetic differs slightly.
Visual representation:
Let's visualize pattern matching with a concrete example:
Text T: "abracadabra"
01234567890
Pattern P: "abra"
0123
Question: At which positions does P occur in T?
Checking each position:
| Position i | Substring T[i..i+3] | Matches P? |
|---|---|---|
| 0 | "abra" | ✓ Yes |
| 1 | "brac" | ✗ No |
| 2 | "raca" | ✗ No |
| 3 | "acad" | ✗ No |
| 4 | "cada" | ✗ No |
| 5 | "adab" | ✗ No |
| 6 | "dabr" | ✗ No |
| 7 | "abra" | ✓ Yes |
| 8+ | (too short) | N/A |
Answer: P occurs at positions 0 and 7.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172
/** * Formal implementation of the pattern matching problem * Returns all positions where pattern P occurs in text T */ interface PatternMatchResult { positions: number[]; // All valid shift positions comparisons: number; // Total character comparisons (for analysis)} /** * Naive pattern matching: directly implements the formal definition * Time Complexity: O(n × m) where n = |T| and m = |P| * Space Complexity: O(k) where k = number of matches */function findPatternOccurrences(T: string, P: string): PatternMatchResult { const n = T.length; const m = P.length; const positions: number[] = []; let comparisons = 0; // Edge cases if (m === 0) { // Empty pattern matches at every position (including after last char) return { positions: Array.from({ length: n + 1 }, (_, i) => i), comparisons: 0 }; } if (m > n) { // Pattern longer than text: no matches possible return { positions: [], comparisons: 0 }; } // Check each valid starting position // Position i is valid if 0 ≤ i ≤ n - m (enough room for pattern) for (let i = 0; i <= n - m; i++) { // Check if P occurs at position i let isMatch = true; for (let j = 0; j < m; j++) { comparisons++; if (T[i + j] !== P[j]) { isMatch = false; break; } } if (isMatch) { positions.push(i); } } return { positions, comparisons };} // Example usageconst T = "abracadabra";const P = "abra"; const result = findPatternOccurrences(T, P);console.log(`Text: "${T}" (length ${T.length})`);console.log(`Pattern: "${P}" (length ${P.length})`);console.log(`Occurrences at positions: [${result.positions.join(', ')}]`);console.log(`Total comparisons: ${result.comparisons}`); // Verify each matchresult.positions.forEach(pos => { const match = T.slice(pos, pos + P.length); console.log(` Position ${pos}: "${match}" === "${P}" ? ${match === P}`);});The basic pattern matching problem has several variants depending on what output is required. Understanding these variants is crucial because different algorithms optimize for different outputs.
Variant 1: Decision Problem
Question: Does pattern P occur in text T at all?
Output: Boolean (yes/no)
Complexity goal: Often can short-circuit after finding first match, making average case faster.
Variant 2: First Occurrence
Question: What is the first (leftmost) position where P occurs in T?
Output: Single position, or -1 if no occurrence
Complexity goal: Can stop immediately upon finding first match.
Variant 3: Count Occurrences
Question: How many times does P occur in T?
Output: Non-negative integer
Note: Overlapping occurrences may or may not be counted depending on requirements.
Variant 4: All Occurrences
Question: List all positions where P occurs in T.
Output: List of positions (may be empty)
Complexity goal: Must examine entire text; output can be O(n) in worst case.
Overlapping occurrences occur when the pattern can match starting at positions that are closer together than the pattern length. For example, pattern 'aa' in text 'aaa' has overlapping occurrences at positions 0 and 1. Most formal definitions count overlapping occurrences separately, but some applications (like word counting in natural language) may want non-overlapping matches only.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990
/** * All variants of the pattern matching problem */ // Variant 1: Decision - Does pattern occur?function containsPattern(T: string, P: string): boolean { const n = T.length; const m = P.length; if (m === 0) return true; if (m > n) return false; for (let i = 0; i <= n - m; i++) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) return true; // Early exit! } return false;} // Variant 2: First occurrence position (-1 if not found)function findFirstOccurrence(T: string, P: string): number { const n = T.length; const m = P.length; if (m === 0) return 0; // Empty pattern at start if (m > n) return -1; for (let i = 0; i <= n - m; i++) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) return i; // Early exit! } return -1;} // Variant 3: Count occurrences (including overlapping)function countOccurrences(T: string, P: string): number { const n = T.length; const m = P.length; if (m === 0) return n + 1; // Empty matches everywhere if (m > n) return 0; let count = 0; for (let i = 0; i <= n - m; i++) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) count++; } return count;} // Variant 3b: Count non-overlapping occurrencesfunction countNonOverlapping(T: string, P: string): number { const n = T.length; const m = P.length; if (m === 0) return n + 1; if (m > n) return 0; let count = 0; let i = 0; while (i <= n - m) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) { count++; i += m; // Skip past this match } else { i++; } } return count;} // Example demonstrating overlappingconst T = "aaaa";const P = "aa";console.log(`Text: "${T}", Pattern: "${P}"`);console.log(`Overlapping count: ${countOccurrences(T, P)}`); // 3console.log(`Non-overlapping count: ${countNonOverlapping(T, P)}`); // 2The naive algorithm directly implements our formal definition, but its performance characteristics reveal why we need sophisticated pattern matching algorithms.
Time Complexity Analysis:
For n = 1,000,000 and m = 1,000, that's up to 1 billion comparisons.
Worst-case input:
Text T: "aaaaaaaaaaaa...aaaab" (n-1 a's followed by b)
Pattern P: "aaa...aaab" (m-1 a's followed by b)
For each of the n - m + 1 positions, the algorithm compares m - 1 characters (all matching 'a's) before the final mismatch at 'b'. This is O(n × m) in practice.
Average case:
For random text and pattern over a non-trivial alphabet, the expected comparisons per position is much smaller (mismatches happen early). For random strings, average case is O(n).
But we can't rely on average case:
| Text Length (n) | Pattern Length (m) | Worst-Case Comparisons | Time at 10⁹ ops/sec |
|---|---|---|---|
| 1,000 | 10 | ~10,000 | ~0.01 ms |
| 10,000 | 100 | ~1,000,000 | ~1 ms |
| 100,000 | 1,000 | ~100,000,000 | ~100 ms |
| 1,000,000 | 10,000 | ~10,000,000,000 | ~10 sec |
| 10,000,000 | 100,000 | ~1,000,000,000,000 | ~17 min |
A search engine processes billions of queries daily. If each search in a 1MB document took 10 seconds, the system would be unusable. This is why O(n + m) algorithms like KMP, Rabin-Karp, and Boyer-Moore are essential—they provide the massive speedup required for real-world applications.
Space Complexity:
The fundamental insight:
The naive algorithm wastes work by forgetting what it learned. When a mismatch occurs at position j, we discard all information about the j characters we already matched and start fresh. Efficient algorithms exploit the structure of the pattern to avoid redundant comparisons.
This insight—that we can preprocess the pattern to avoid redundant text examination—is the key to all efficient pattern matching algorithms.
Pattern matching can also be viewed through the lens of shifts. This perspective is particularly useful for understanding algorithms like Boyer-Moore.
Definition (Shift):
A shift s is the number of positions the pattern is shifted right relative to the text. Pattern P with shift s is aligned so that P[0] is compared with T[s].
Valid shift: A shift s is valid if P matches T starting at position s, i.e., T[s..s+m-1] = P.
The pattern matching problem restated:
Find all valid shifts s where 0 ≤ s ≤ n - m.
Visual representation of shifts:
Text: a b r a c a d a b r a
0 1 2 3 4 5 6 7 8 9 10
Shift 0: a b r a ✓ Valid
↑ ↑ ↑ ↑
Shift 1: b r a c ✗ Invalid (P[0]='a' ≠ T[1]='b')
↑
Shift 7: a b r a ✓ Valid
↑ ↑ ↑ ↑
Why the shift perspective matters:
The naive algorithm increments shift by 1 after each comparison (regardless of what was learned). Advanced algorithms compute larger shifts when a mismatch occurs, skipping positions that cannot possibly be valid.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364
/** * Pattern matching from the shift perspective * This formulation directly expresses "find all valid shifts" */ function findValidShifts(T: string, P: string): number[] { const n = T.length; const m = P.length; const validShifts: number[] = []; // Range of possible shifts: 0 to n-m inclusive const maxShift = n - m; if (maxShift < 0) { return []; // Pattern longer than text } // Check each shift for (let s = 0; s <= maxShift; s++) { if (isValidShift(T, P, s)) { validShifts.push(s); } } return validShifts;} /** * Check if shift s is valid (pattern matches at position s) */function isValidShift(T: string, P: string, s: number): boolean { const m = P.length; for (let j = 0; j < m; j++) { if (T[s + j] !== P[j]) { return false; } } return true;} /** * Visualize the matching process with shifts */function visualizeShifts(T: string, P: string): void { const n = T.length; const m = P.length; const maxShift = n - m; console.log(`Text: ${T}`); console.log(` ${Array.from({length: n}, (_, i) => i % 10).join('')}`); console.log(''); for (let s = 0; s <= maxShift; s++) { const valid = isValidShift(T, P, s); const padding = ' '.repeat(s); const status = valid ? '✓ Valid' : '✗'; console.log(`Shift ${s}: ${padding}${P} ${status}`); }} // Exampleconsole.log("=== Pattern Matching via Shifts ===\n");visualizeShifts("abracadabra", "abra");We have established the precise mathematical foundation for pattern matching. This formalization is not mere academic exercise—it enables rigorous algorithm design, complexity analysis, and correctness proofs.
You now possess the formal vocabulary and definitions required to understand all pattern matching algorithms. The concepts of text, pattern, occurrence, and shift will recur throughout this chapter. Next, we'll examine the specific components—text, pattern, and occurrence—in greater depth to understand the nuances that efficient algorithms exploit.
Looking ahead:
With our formal foundation established, the next pages will explore:
Each page builds on this formal groundwork, showing how precise definitions enable elegant and efficient solutions.