Loading content...
In the formal framework of pattern matching, three elements reign supreme: the text, the pattern, and the occurrence. While we introduced these concepts on the previous page, their depth warrants thorough exploration.
Think of it this way: the text is the haystack, the pattern is the needle, and an occurrence is a precise location where that needle can be found. But unlike physical haystacks, our textual haystack may contain thousands of needles in overlapping configurations—and understanding these configurations is key to efficient searching.
On this page, we dissect each element, examining properties that matter for algorithm design and revealing relationships that efficient algorithms exploit.
By the end of this page, you will deeply understand: (1) How text characteristics affect algorithm choice, (2) Pattern properties that enable preprocessing optimizations, (3) The precise mechanics of occurrence detection, (4) Edge cases and boundary conditions that trip up implementations.
The text (T) is the string being searched—the search space within which we look for pattern occurrences. Its properties profoundly influence algorithm performance and choice.
Formal Definition Revisited:
Text T is a string of length n over alphabet Σ:
T = T[0]T[1]T[2]...T[n-1]
where each T[i] ∈ Σ and n = |T|.
Critical text properties:
Text access patterns:
How we access the text significantly impacts algorithm design:
| Access Pattern | Description | Algorithm Implications |
|---|---|---|
| Sequential | Read characters left-to-right | Most algorithms (KMP, Rabin-Karp) |
| Random access | Jump to any position O(1) | Boyer-Moore, suffix arrays |
| Streaming | Characters arrive one-by-one | Online algorithms, Aho-Corasick |
| Indexed | Pre-built index (suffix array/tree) | Single O(m) query after O(n) preprocessing |
If you search the same text for many different patterns (like a search engine indexing documents), preprocessing the TEXT makes sense. If you search many different texts for the same pattern (like grep searching files), preprocessing the PATTERN makes sense. The choice fundamentally shapes algorithm selection.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970
/** * Comprehensive text analysis for pattern matching preparation * Understanding text characteristics helps choose optimal algorithms */ interface TextAnalysis { length: number; alphabetSize: number; uniqueChars: Set<string>; charFrequency: Map<string, number>; repetitiveness: number; // 0-1 scale isHighlyRepetitive: boolean;} function analyzeText(T: string): TextAnalysis { const charFrequency = new Map<string, number>(); // Count character frequencies for (const char of T) { charFrequency.set(char, (charFrequency.get(char) || 0) + 1); } const uniqueChars = new Set(T); const alphabetSize = uniqueChars.size; // Calculate repetitiveness metric // Low ratio of unique chars to length = high repetitiveness const repetitiveness = 1 - (alphabetSize / Math.max(T.length, 1)); return { length: T.length, alphabetSize, uniqueChars, charFrequency, repetitiveness, isHighlyRepetitive: repetitiveness > 0.9, };} /** * Recommend algorithm based on text characteristics */function recommendAlgorithm(analysis: TextAnalysis, willSearchMultipleTimes: boolean): string { if (willSearchMultipleTimes) { return "Use suffix array/tree - preprocess text once, query many times in O(m)"; } if (analysis.isHighlyRepetitive) { return "Use KMP - handles repetitive text without worst-case degradation"; } if (analysis.alphabetSize > 100) { return "Use Boyer-Moore - benefits from large alphabet (more bad character shifts)"; } if (analysis.alphabetSize <= 4) { return "Use KMP or Rabin-Karp - small alphabet reduces Boyer-Moore's advantage"; } return "Boyer-Moore or KMP - both perform well on typical text";} // Example analysisconst dnaSample = "ACGTACGTACGTACGT";const englishSample = "The quick brown fox jumps over the lazy dog";const repetitiveSample = "aaaaaaaaaaaaaaaa"; console.log("DNA Text Analysis:", analyzeText(dnaSample));console.log("English Text Analysis:", analyzeText(englishSample));console.log("Repetitive Text Analysis:", analyzeText(repetitiveSample));The pattern (P) is the string we seek within the text. While typically shorter than the text, the pattern's structure is the key to efficient matching algorithms.
Formal Definition Revisited:
Pattern P is a string of length m over alphabet Σ:
P = P[0]P[1]P[2]...P[m-1]
where each P[j] ∈ Σ and m = |P|.
Pattern properties that matter:
The Prefix-Suffix Relationship:
The most important pattern property for efficient matching is the relationship between prefixes and suffixes. Consider pattern P = "abcabd":
Pattern: a b c a b d
0 1 2 3 4 5
Proper prefixes: Proper suffixes:
"" ""
"a" "d"
"ab" "bd"
"abc" "abd"
"abca" "cabd"
"abcab" "bcabd"
Prefixes that are also suffixes:
"" (empty string) - trivially true
"ab" = P[0..1] - matches P[3..4]? No: "ab" ≠ "bd"
For pattern "abab":
Pattern: a b a b
0 1 2 3
"ab" (prefix P[0..1]) = "ab" (suffix P[2..3])? Yes!
This means: if we match "abab" partially and then hit a mismatch,
we don't start over—we can use the "ab" we already matched.
This insight is the foundation of the KMP algorithm.
When we preprocess the pattern, we're essentially asking: 'If I've matched k characters and then encounter a mismatch, how much of what I matched can I reuse?' The answer depends only on the pattern's internal structure, not on the text. This is why O(m) preprocessing enables O(n) matching—we do the pattern analysis once and apply it n times.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111
/** * Deep pattern analysis for understanding pattern structure * These properties are exploited by efficient matching algorithms */ interface PatternAnalysis { length: number; uniqueChars: number; isPeriodic: boolean; smallestPeriod: number | null; prefixSuffixMatches: Array<{ length: number; prefix: string; suffix: string }>; lpsArray: number[]; // Longest Proper Prefix which is also Suffix hasInternalRepetition: boolean;} /** * Compute the LPS (Longest Proper Prefix which is also Suffix) array * This is the foundation of the KMP algorithm */function computeLPS(P: string): number[] { const m = P.length; const lps = new Array(m).fill(0); let len = 0; // Length of previous longest prefix suffix let i = 1; while (i < m) { if (P[i] === P[len]) { len++; lps[i] = len; i++; } else { if (len !== 0) { len = lps[len - 1]; } else { lps[i] = 0; i++; } } } return lps;} /** * Find the smallest period of a pattern * Period p means P[i] = P[i+p] for all 0 ≤ i < m-p */function findSmallestPeriod(P: string): number | null { const m = P.length; for (let period = 1; period < m; period++) { let isPeriod = true; for (let i = 0; i < m - period; i++) { if (P[i] !== P[i + period]) { isPeriod = false; break; } } if (isPeriod) return period; } return null; // No proper period (pattern is aperiodic)} /** * Find all proper prefixes that are also proper suffixes */function findPrefixSuffixMatches(P: string): Array<{ length: number; prefix: string; suffix: string }> { const matches: Array<{ length: number; prefix: string; suffix: string }> = []; const m = P.length; for (let len = 1; len < m; len++) { const prefix = P.slice(0, len); const suffix = P.slice(m - len); if (prefix === suffix) { matches.push({ length: len, prefix, suffix }); } } return matches;} function analyzePattern(P: string): PatternAnalysis { const lps = computeLPS(P); const prefixSuffixMatches = findPrefixSuffixMatches(P); const smallestPeriod = findSmallestPeriod(P); return { length: P.length, uniqueChars: new Set(P).size, isPeriodic: smallestPeriod !== null, smallestPeriod, prefixSuffixMatches, lpsArray: lps, hasInternalRepetition: lps.some(v => v > 0), };} // Demonstrate with examplesconst patterns = ["abcdef", "abab", "aaaa", "abcabc", "abcabcd"]; for (const P of patterns) { console.log(`\n=== Pattern: "${P}" ===`); const analysis = analyzePattern(P); console.log(`LPS Array: [${analysis.lpsArray.join(', ')}]`); console.log(`Prefix-Suffix Matches: ${analysis.prefixSuffixMatches.length > 0 ? analysis.prefixSuffixMatches.map(m => `"${m.prefix}"`).join(', ') : 'none'}`); console.log(`Periodic: ${analysis.isPeriodic ? `Yes (period = ${analysis.smallestPeriod})` : 'No'}`);}An occurrence is the core concept we're trying to find. Let's examine it with complete rigor.
Formal Definition:
Pattern P occurs at position (or index) i in text T if and only if:
∀j ∈ [0, m): T[i + j] = P[j]
In plain language: for every position j in the pattern (from 0 to m-1), the character at position i+j in the text equals the character at position j in the pattern.
Equivalent formulations:
The occurrence position range:
For pattern of length m in text of length n:
The maximum valid position is n - m, NOT n - m + 1. If n = 10 and m = 3, positions 0 through 7 are valid (8 positions total). Position 8 would require T[8..10], but T only goes to T[9]. This boundary condition is the source of countless bugs in pattern matching implementations.
Visualizing occurrences:
Consider T = "mississippi" and P = "issi":
T: m i s s i s s i p p i
0 1 2 3 4 5 6 7 8 9 10
P: i s s i
0 1 2 3
Checking position 1:
T[1] = 'i', P[0] = 'i' ✓
T[2] = 's', P[1] = 's' ✓
T[3] = 's', P[2] = 's' ✓
T[4] = 'i', P[3] = 'i' ✓
→ MATCH at position 1
Checking position 4:
T[4] = 'i', P[0] = 'i' ✓
T[5] = 's', P[1] = 's' ✓
T[6] = 's', P[2] = 's' ✓
T[7] = 'i', P[3] = 'i' ✓
→ MATCH at position 4
Validpositions: 0 to 10-4 = 6
Occurrences found: [1, 4]
Note: Occurrences at 1 and 4 overlap!
T[4] = 'i' is part of both matches.
| Property | Description | Example |
|---|---|---|
| Position | Index where match starts (0-based) | P at position 3 means T[3..3+m-1] = P |
| Non-overlapping | Next match starts after current ends | Matches at 0, 5 for m=5 are non-overlapping |
| Overlapping | Matches share some text positions | Matches at 1, 4 for m=4 overlap |
| Contiguous | Adjacent non-overlapping matches | Matches at 0, 4 for m=4 are contiguous |
| Leftmost | First occurrence in text | min(all occurrence positions) |
| Rightmost | Last occurrence in text | max(all occurrence positions) |
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115
/** * Complete analysis of pattern occurrences in text * Including overlap detection and occurrence relationships */ interface Occurrence { position: number; matchedText: string; startIndex: number; endIndex: number; // Exclusive} interface OccurrenceAnalysis { occurrences: Occurrence[]; count: number; hasOverlap: boolean; overlappingPairs: Array<[number, number]>; coverage: number; // Percentage of text covered by matches gaps: Array<{ start: number; end: number; length: number }>;} function analyzeOccurrences(T: string, P: string): OccurrenceAnalysis { const n = T.length; const m = P.length; const occurrences: Occurrence[] = []; // Find all occurrences if (m > 0 && m <= n) { for (let i = 0; i <= n - m; i++) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) { occurrences.push({ position: i, matchedText: T.slice(i, i + m), startIndex: i, endIndex: i + m, }); } } } // Detect overlapping pairs const overlappingPairs: Array<[number, number]> = []; for (let i = 0; i < occurrences.length - 1; i++) { // Two occurrences overlap if the second starts before the first ends if (occurrences[i + 1].startIndex < occurrences[i].endIndex) { overlappingPairs.push([ occurrences[i].position, occurrences[i + 1].position ]); } } // Calculate coverage (unique text positions covered by matches) const covered = new Set<number>(); for (const occ of occurrences) { for (let i = occ.startIndex; i < occ.endIndex; i++) { covered.add(i); } } const coverage = n > 0 ? (covered.size / n) * 100 : 0; // Find gaps between matches const gaps: Array<{ start: number; end: number; length: number }> = []; if (occurrences.length > 0) { // Gap before first occurrence if (occurrences[0].startIndex > 0) { gaps.push({ start: 0, end: occurrences[0].startIndex, length: occurrences[0].startIndex }); } // Gaps between occurrences for (let i = 0; i < occurrences.length - 1; i++) { const gapStart = occurrences[i].endIndex; const gapEnd = occurrences[i + 1].startIndex; if (gapEnd > gapStart) { gaps.push({ start: gapStart, end: gapEnd, length: gapEnd - gapStart }); } } // Gap after last occurrence const lastEnd = occurrences[occurrences.length - 1].endIndex; if (lastEnd < n) { gaps.push({ start: lastEnd, end: n, length: n - lastEnd }); } } return { occurrences, count: occurrences.length, hasOverlap: overlappingPairs.length > 0, overlappingPairs, coverage, gaps, };} // Demonstrateconst T = "mississippi";const P = "issi"; const analysis = analyzeOccurrences(T, P);console.log(`Text: "${T}" (length ${T.length})`);console.log(`Pattern: "${P}" (length ${P.length})`);console.log(`Occurrences: ${analysis.count}`);console.log(`Positions: [${analysis.occurrences.map(o => o.position).join(', ')}]`);console.log(`Has overlapping matches: ${analysis.hasOverlap}`);if (analysis.hasOverlap) { console.log(`Overlapping pairs: ${analysis.overlappingPairs.map(p => `(${p[0]}, ${p[1]})`).join(', ')}`);}console.log(`Text coverage: ${analysis.coverage.toFixed(1)}%`);Robust pattern matching implementations must handle numerous edge cases correctly. Let's catalog them exhaustively.
Empty pattern (m = 0):
By mathematical convention, the empty string is a prefix and suffix of every string. Therefore:
Empty text (n = 0):
Pattern longer than text (m > n):
Pattern equals text (m = n):
Pattern equals entire text (P = T):
| Case | Text T | Pattern P | Result |
|---|---|---|---|
| Empty pattern | "abc" | "" | Positions: [0, 1, 2, 3] |
| Empty text | "" | "a" | Positions: [] (no match) |
| Both empty | "" | "" | Positions: [0] |
| Pattern > text | "ab" | "abcd" | Positions: [] (no match) |
| Pattern = text | "abc" | "abc" | Positions: [0] |
| Single char repeated | "aaa" | "a" | Positions: [0, 1, 2] |
| Pattern at end only | "xyzabc" | "abc" | Positions: [3] |
| Pattern at start only | "abcxyz" | "abc" | Positions: [0] |
| No match exists | "abcdef" | "xyz" | Positions: [] |
| Full overlap | "aaaa" | "aa" | Positions: [0, 1, 2] |
When implementing or testing pattern matching algorithms, always verify behavior with: empty strings, single-character strings, identical text and pattern, fully repetitive strings ('aaaa'), and patterns at the exact boundaries (start and end of text). These cases expose off-by-one errors and boundary condition bugs.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879
/** * Comprehensive edge case test suite for pattern matching */ type MatchResult = number[]; function findAllOccurrences(T: string, P: string): MatchResult { const n = T.length; const m = P.length; const positions: number[] = []; // Handle empty pattern: matches at every position if (m === 0) { for (let i = 0; i <= n; i++) positions.push(i); return positions; } // Handle pattern longer than text if (m > n) return []; // Standard matching for (let i = 0; i <= n - m; i++) { let match = true; for (let j = 0; j < m && match; j++) { if (T[i + j] !== P[j]) match = false; } if (match) positions.push(i); } return positions;} // Edge case test harnessinterface TestCase { name: string; text: string; pattern: string; expected: number[];} const testCases: TestCase[] = [ { name: "Empty pattern in non-empty text", text: "abc", pattern: "", expected: [0, 1, 2, 3] }, { name: "Empty pattern in empty text", text: "", pattern: "", expected: [0] }, { name: "Non-empty pattern in empty text", text: "", pattern: "a", expected: [] }, { name: "Pattern longer than text", text: "ab", pattern: "abcd", expected: [] }, { name: "Pattern equals text", text: "abc", pattern: "abc", expected: [0] }, { name: "Single char in repetitive text", text: "aaa", pattern: "a", expected: [0, 1, 2] }, { name: "Pattern at start only", text: "abcxyz", pattern: "abc", expected: [0] }, { name: "Pattern at end only", text: "xyzabc", pattern: "abc", expected: [3] }, { name: "Pattern in middle", text: "xabcy", pattern: "abc", expected: [1] }, { name: "Multiple non-overlapping", text: "abcxxabc", pattern: "abc", expected: [0, 5] }, { name: "Full overlapping", text: "aaaa", pattern: "aa", expected: [0, 1, 2] }, { name: "No match", text: "abcdef", pattern: "xyz", expected: [] }, { name: "Pattern at both ends", text: "abcxxabc", pattern: "abc", expected: [0, 5] }, { name: "Single character match", text: "a", pattern: "a", expected: [0] }, { name: "Single character no match", text: "a", pattern: "b", expected: [] },]; // Run all testsconsole.log("=== Pattern Matching Edge Case Tests ===\n");let passed = 0;let failed = 0; for (const tc of testCases) { const result = findAllOccurrences(tc.text, tc.pattern); const success = JSON.stringify(result) === JSON.stringify(tc.expected); if (success) { passed++; console.log(`✓ ${tc.name}`); } else { failed++; console.log(`✗ ${tc.name}`); console.log(` Text: "${tc.text}", Pattern: "${tc.pattern}"`); console.log(` Expected: [${tc.expected}], Got: [${result}]`); }} console.log(`\nResults: ${passed} passed, ${failed} failed`);The three elements of pattern matching are intimately connected. Understanding their relationships reveals opportunities for optimization.
Key relationships:
1. Pattern Length vs. Occurrence Count
For a given text T of length n:
2. Text Repetitiveness vs. Algorithm Performance
When text has high repetitiveness (many repeated substrings):
3. Pattern Structure vs. Shift Distance
Patterns with no internal repetition (all distinct characters):
Patterns with high internal repetition:
The preprocessing trade-off:
| Preprocess | Time | Enables | Best For |
|---|---|---|---|
| Neither | O(1) | O(nm) matching | One-off searches, very short patterns |
| Pattern | O(m) | O(n) matching | Repeated searches for same pattern |
| Text | O(n) to O(n log n) | O(m) to O(m log n) matching | Many patterns searched in same text |
| Both | O(n) + O(m) | Specialized optimizations | Known pattern sets with known texts |
The decision of what to preprocess fundamentally determines algorithm choice. This is why understanding the relationship between text and pattern is not merely academic—it directly impacts system design.
We've deeply explored the three fundamental elements of pattern matching. This knowledge forms the conceptual foundation for all advanced algorithms.
You now have a thorough understanding of text, pattern, and occurrence—the core elements of pattern matching. This deep knowledge will illuminate every algorithm we study, from naive matching through KMP, Rabin-Karp, Boyer-Moore, and beyond. Next, we explore the distinction between single pattern and multiple pattern matching, revealing different problem classes with different optimal solutions.