String Matching Problem - Learning Module

Loading content...

0/276

Text, Pattern, and Occurrence — The Core Elements

The Trinity of Pattern Matching

In the formal framework of pattern matching, three elements reign supreme: the text, the pattern, and the occurrence. While we introduced these concepts on the previous page, their depth warrants thorough exploration.

Think of it this way: the text is the haystack, the pattern is the needle, and an occurrence is a precise location where that needle can be found. But unlike physical haystacks, our textual haystack may contain thousands of needles in overlapping configurations—and understanding these configurations is key to efficient searching.

On this page, we dissect each element, examining properties that matter for algorithm design and revealing relationships that efficient algorithms exploit.

Learning Objectives

By the end of this page, you will deeply understand: (1) How text characteristics affect algorithm choice, (2) Pattern properties that enable preprocessing optimizations, (3) The precise mechanics of occurrence detection, (4) Edge cases and boundary conditions that trip up implementations.

The Text: Properties and Considerations

The text (T) is the string being searched—the search space within which we look for pattern occurrences. Its properties profoundly influence algorithm performance and choice.

Formal Definition Revisited:

Text T is a string of length n over alphabet Σ:

T = T[0]T[1]T[2]...T[n-1]

where each T[i] ∈ Σ and n = |T|.

Critical text properties:

Text Characteristics That Matter

•Length (n): Determines overall algorithm running time. Text length typically dominates complexity expressions as O(n) terms.
•Alphabet size (|Σ|): Affects both algorithm choice and expected performance. Small alphabets (DNA, binary) have more repetition; large alphabets (Unicode) have more distinct characters.
•Character distribution: Uniform random vs. skewed distributions affect expected algorithm behavior. Natural language text has specific letter frequencies.
•Repetitiveness: Highly repetitive text (e.g., 'aaaaaa...') triggers worst-case behavior in some algorithms but enables compression-based approaches.
•Structure: Some texts have exploitable structure (e.g., words separated by spaces, line breaks at regular intervals) that specialized algorithms can leverage.
•Mutability: Does the text change between queries? Immutable text enables preprocessing; mutable text requires online algorithms.

Text access patterns:

How we access the text significantly impacts algorithm design:

Access Pattern	Description	Algorithm Implications
Sequential	Read characters left-to-right	Most algorithms (KMP, Rabin-Karp)
Random access	Jump to any position O(1)	Boyer-Moore, suffix arrays
Streaming	Characters arrive one-by-one	Online algorithms, Aho-Corasick
Indexed	Pre-built index (suffix array/tree)	Single O(m) query after O(n) preprocessing

When to Preprocess the Text

If you search the same text for many different patterns (like a search engine indexing documents), preprocessing the TEXT makes sense. If you search many different texts for the same pattern (like grep searching files), preprocessing the PATTERN makes sense. The choice fundamentally shapes algorithm selection.

text-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
/**
 * Comprehensive text analysis for pattern matching preparation
 * Understanding text characteristics helps choose optimal algorithms
 */
 
interface TextAnalysis {
    length: number;
    alphabetSize: number;
    uniqueChars: Set<string>;
    charFrequency: Map<string, number>;
    repetitiveness: number;  // 0-1 scale
    isHighlyRepetitive: boolean;
}
 
function analyzeText(T: string): TextAnalysis {
    const charFrequency = new Map<string, number>();
    
    // Count character frequencies
    for (const char of T) {
        charFrequency.set(char, (charFrequency.get(char) || 0) + 1);
    }
    
    const uniqueChars = new Set(T);
    const alphabetSize = uniqueChars.size;
    
    // Calculate repetitiveness metric
    // Low ratio of unique chars to length = high repetitiveness
    const repetitiveness = 1 - (alphabetSize / Math.max(T.length, 1));
    
    return {
        length: T.length,
        alphabetSize,
        uniqueChars,
        charFrequency,
        repetitiveness,
        isHighlyRepetitive: repetitiveness > 0.9,
    };
}
 
/**
 * Recommend algorithm based on text characteristics
 */
function recommendAlgorithm(analysis: TextAnalysis, willSearchMultipleTimes: boolean): string {
    if (willSearchMultipleTimes) {
        return "Use suffix array/tree - preprocess text once, query many times in O(m)";
    }
    
    if (analysis.isHighlyRepetitive) {
        return "Use KMP - handles repetitive text without worst-case degradation";
    }
    
    if (analysis.alphabetSize > 100) {
        return "Use Boyer-Moore - benefits from large alphabet (more bad character shifts)";
    }
    
    if (analysis.alphabetSize <= 4) {
        return "Use KMP or Rabin-Karp - small alphabet reduces Boyer-Moore's advantage";
    }
    
    return "Boyer-Moore or KMP - both perform well on typical text";
}
 
// Example analysis
const dnaSample = "ACGTACGTACGTACGT";
const englishSample = "The quick brown fox jumps over the lazy dog";
const repetitiveSample = "aaaaaaaaaaaaaaaa";
 
console.log("DNA Text Analysis:", analyzeText(dnaSample));
console.log("English Text Analysis:", analyzeText(englishSample));
console.log("Repetitive Text Analysis:", analyzeText(repetitiveSample));

The Pattern: Structure and Preprocessing

The pattern (P) is the string we seek within the text. While typically shorter than the text, the pattern's structure is the key to efficient matching algorithms.

Formal Definition Revisited:

Pattern P is a string of length m over alphabet Σ:

P = P[0]P[1]P[2]...P[m-1]

where each P[j] ∈ Σ and m = |P|.

Pattern properties that matter:

Critical Pattern Properties

•Length (m): Affects preprocessing time and matching complexity. Short patterns may not benefit from sophisticated algorithms.
•Self-repetition: Patterns like 'abab' have internal structure. The prefix 'ab' is also a suffix. This property is central to KMP's failure function.
•Character diversity: Patterns with all distinct characters allow simpler optimizations; repetitive patterns require careful handling.
•Periodicity: A pattern P has period p if P[i] = P[i+p] for all valid i. Periodic patterns behave specially in many algorithms.
•Prefix-suffix overlaps: Which proper prefixes are also proper suffixes? This determines how far we can safely shift after a mismatch.

The Prefix-Suffix Relationship:

The most important pattern property for efficient matching is the relationship between prefixes and suffixes. Consider pattern P = "abcabd":

Pattern: a b c a b d
         0 1 2 3 4 5

Proper prefixes:     Proper suffixes:
""                   ""
"a"                  "d"
"ab"                 "bd"
"abc"                "abd"
"abca"               "cabd"
"abcab"              "bcabd"

Prefixes that are also suffixes:
"" (empty string)     - trivially true
"ab" = P[0..1]        - matches P[3..4]? No: "ab" ≠ "bd"

For pattern "abab":

Pattern: a b a b
         0 1 2 3

"ab" (prefix P[0..1]) = "ab" (suffix P[2..3])? Yes!

This means: if we match "abab" partially and then hit a mismatch,
we don't start over—we can use the "ab" we already matched.

This insight is the foundation of the KMP algorithm.

Why Pattern Preprocessing Works

When we preprocess the pattern, we're essentially asking: 'If I've matched k characters and then encounter a mismatch, how much of what I matched can I reuse?' The answer depends only on the pattern's internal structure, not on the text. This is why O(m) preprocessing enables O(n) matching—we do the pattern analysis once and apply it n times.

pattern-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
/**
 * Deep pattern analysis for understanding pattern structure
 * These properties are exploited by efficient matching algorithms
 */
 
interface PatternAnalysis {
    length: number;
    uniqueChars: number;
    isPeriodic: boolean;
    smallestPeriod: number | null;
    prefixSuffixMatches: Array<{ length: number; prefix: string; suffix: string }>;
    lpsArray: number[];  // Longest Proper Prefix which is also Suffix
    hasInternalRepetition: boolean;
}
 
/**
 * Compute the LPS (Longest Proper Prefix which is also Suffix) array
 * This is the foundation of the KMP algorithm
 */
function computeLPS(P: string): number[] {
    const m = P.length;
    const lps = new Array(m).fill(0);
    
    let len = 0;  // Length of previous longest prefix suffix
    let i = 1;
    
    while (i < m) {
        if (P[i] === P[len]) {
            len++;
            lps[i] = len;
            i++;
        } else {
            if (len !== 0) {
                len = lps[len - 1];
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
    
    return lps;
}
 
/**
 * Find the smallest period of a pattern
 * Period p means P[i] = P[i+p] for all 0 ≤ i < m-p
 */
function findSmallestPeriod(P: string): number | null {
    const m = P.length;
    
    for (let period = 1; period < m; period++) {
        let isPeriod = true;
        for (let i = 0; i < m - period; i++) {
            if (P[i] !== P[i + period]) {
                isPeriod = false;
                break;
            }
        }
        if (isPeriod) return period;
    }
    
    return null;  // No proper period (pattern is aperiodic)
}
 
/**
 * Find all proper prefixes that are also proper suffixes
 */
function findPrefixSuffixMatches(P: string): Array<{ length: number; prefix: string; suffix: string }> {
    const matches: Array<{ length: number; prefix: string; suffix: string }> = [];
    const m = P.length;
    
    for (let len = 1; len < m; len++) {
        const prefix = P.slice(0, len);
        const suffix = P.slice(m - len);
        if (prefix === suffix) {
            matches.push({ length: len, prefix, suffix });
        }
    }
    
    return matches;
}
 
function analyzePattern(P: string): PatternAnalysis {
    const lps = computeLPS(P);
    const prefixSuffixMatches = findPrefixSuffixMatches(P);
    const smallestPeriod = findSmallestPeriod(P);
    
    return {
        length: P.length,
        uniqueChars: new Set(P).size,
        isPeriodic: smallestPeriod !== null,
        smallestPeriod,
        prefixSuffixMatches,
        lpsArray: lps,
        hasInternalRepetition: lps.some(v => v > 0),
    };
}
 
// Demonstrate with examples
const patterns = ["abcdef", "abab", "aaaa", "abcabc", "abcabcd"];
 
for (const P of patterns) {
    console.log(`\n=== Pattern: "${P}" ===`);
    const analysis = analyzePattern(P);
    console.log(`LPS Array: [${analysis.lpsArray.join(', ')}]`);
    console.log(`Prefix-Suffix Matches: ${analysis.prefixSuffixMatches.length > 0 
        ? analysis.prefixSuffixMatches.map(m => `"${m.prefix}"`).join(', ')
        : 'none'}`);
    console.log(`Periodic: ${analysis.isPeriodic ? `Yes (period = ${analysis.smallestPeriod})` : 'No'}`);
}

Occurrence: The Precise Definition of a Match

An occurrence is the core concept we're trying to find. Let's examine it with complete rigor.

Formal Definition:

Pattern P occurs at position (or index) i in text T if and only if:

∀j ∈ [0, m): T[i + j] = P[j]

In plain language: for every position j in the pattern (from 0 to m-1), the character at position i+j in the text equals the character at position j in the pattern.

Equivalent formulations:

Substring equality: T[i..i+m-1] = P
Prefix condition: P is a prefix of T[i..n-1] (the suffix of T starting at i)
Shift validity: Shift s = i is valid

The occurrence position range:

For pattern of length m in text of length n:

Minimum valid position: 0
Maximum valid position: n - m
Total possible positions: n - m + 1
If m > n: No valid positions exist

Off-by-One Errors: The Bane of String Algorithms

The maximum valid position is n - m, NOT n - m + 1. If n = 10 and m = 3, positions 0 through 7 are valid (8 positions total). Position 8 would require T[8..10], but T only goes to T[9]. This boundary condition is the source of countless bugs in pattern matching implementations.

Visualizing occurrences:

Consider T = "mississippi" and P = "issi":

T:  m  i  s  s  i  s  s  i  p  p  i
    0  1  2  3  4  5  6  7  8  9  10

P:     i  s  s  i
       0  1  2  3

Checking position 1:
  T[1] = 'i', P[0] = 'i' ✓
  T[2] = 's', P[1] = 's' ✓
  T[3] = 's', P[2] = 's' ✓
  T[4] = 'i', P[3] = 'i' ✓
  → MATCH at position 1

Checking position 4:
  T[4] = 'i', P[0] = 'i' ✓
  T[5] = 's', P[1] = 's' ✓
  T[6] = 's', P[2] = 's' ✓
  T[7] = 'i', P[3] = 'i' ✓
  → MATCH at position 4

Validpositions: 0 to 10-4 = 6
Occurrences found: [1, 4]

Note: Occurrences at 1 and 4 overlap!
T[4] = 'i' is part of both matches.

Occurrence Properties
Property	Description	Example
Position	Index where match starts (0-based)	P at position 3 means T[3..3+m-1] = P
Non-overlapping	Next match starts after current ends	Matches at 0, 5 for m=5 are non-overlapping
Overlapping	Matches share some text positions	Matches at 1, 4 for m=4 overlap
Contiguous	Adjacent non-overlapping matches	Matches at 0, 4 for m=4 are contiguous
Leftmost	First occurrence in text	min(all occurrence positions)
Rightmost	Last occurrence in text	max(all occurrence positions)

occurrence-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
/**
 * Complete analysis of pattern occurrences in text
 * Including overlap detection and occurrence relationships
 */
 
interface Occurrence {
    position: number;
    matchedText: string;
    startIndex: number;
    endIndex: number;  // Exclusive
}
 
interface OccurrenceAnalysis {
    occurrences: Occurrence[];
    count: number;
    hasOverlap: boolean;
    overlappingPairs: Array<[number, number]>;
    coverage: number;  // Percentage of text covered by matches
    gaps: Array<{ start: number; end: number; length: number }>;
}
 
function analyzeOccurrences(T: string, P: string): OccurrenceAnalysis {
    const n = T.length;
    const m = P.length;
    const occurrences: Occurrence[] = [];
    
    // Find all occurrences
    if (m > 0 && m <= n) {
        for (let i = 0; i <= n - m; i++) {
            let match = true;
            for (let j = 0; j < m && match; j++) {
                if (T[i + j] !== P[j]) match = false;
            }
            if (match) {
                occurrences.push({
                    position: i,
                    matchedText: T.slice(i, i + m),
                    startIndex: i,
                    endIndex: i + m,
                });
            }
        }
    }
    
    // Detect overlapping pairs
    const overlappingPairs: Array<[number, number]> = [];
    for (let i = 0; i < occurrences.length - 1; i++) {
        // Two occurrences overlap if the second starts before the first ends
        if (occurrences[i + 1].startIndex < occurrences[i].endIndex) {
            overlappingPairs.push([
                occurrences[i].position,
                occurrences[i + 1].position
            ]);
        }
    }
    
    // Calculate coverage (unique text positions covered by matches)
    const covered = new Set<number>();
    for (const occ of occurrences) {
        for (let i = occ.startIndex; i < occ.endIndex; i++) {
            covered.add(i);
        }
    }
    const coverage = n > 0 ? (covered.size / n) * 100 : 0;
    
    // Find gaps between matches
    const gaps: Array<{ start: number; end: number; length: number }> = [];
    if (occurrences.length > 0) {
        // Gap before first occurrence
        if (occurrences[0].startIndex > 0) {
            gaps.push({
                start: 0,
                end: occurrences[0].startIndex,
                length: occurrences[0].startIndex
            });
        }
        // Gaps between occurrences
        for (let i = 0; i < occurrences.length - 1; i++) {
            const gapStart = occurrences[i].endIndex;
            const gapEnd = occurrences[i + 1].startIndex;
            if (gapEnd > gapStart) {
                gaps.push({ start: gapStart, end: gapEnd, length: gapEnd - gapStart });
            }
        }
        // Gap after last occurrence
        const lastEnd = occurrences[occurrences.length - 1].endIndex;
        if (lastEnd < n) {
            gaps.push({ start: lastEnd, end: n, length: n - lastEnd });
        }
    }
    
    return {
        occurrences,
        count: occurrences.length,
        hasOverlap: overlappingPairs.length > 0,
        overlappingPairs,
        coverage,
        gaps,
    };
}
 
// Demonstrate
const T = "mississippi";
const P = "issi";
 
const analysis = analyzeOccurrences(T, P);
console.log(`Text: "${T}" (length ${T.length})`);
console.log(`Pattern: "${P}" (length ${P.length})`);
console.log(`Occurrences: ${analysis.count}`);
console.log(`Positions: [${analysis.occurrences.map(o => o.position).join(', ')}]`);
console.log(`Has overlapping matches: ${analysis.hasOverlap}`);
if (analysis.hasOverlap) {
    console.log(`Overlapping pairs: ${analysis.overlappingPairs.map(p => `(${p[0]}, ${p[1]})`).join(', ')}`);
}
console.log(`Text coverage: ${analysis.coverage.toFixed(1)}%`);

Edge Cases and Boundary Conditions

Robust pattern matching implementations must handle numerous edge cases correctly. Let's catalog them exhaustively.

Empty pattern (m = 0):

By mathematical convention, the empty string is a prefix and suffix of every string. Therefore:

Empty pattern occurs at every position from 0 to n
Total occurrences: n + 1
This may seem counterintuitive but is mathematically consistent

Empty text (n = 0):

If P = "": One occurrence at position 0
If P ≠ "": Zero occurrences (no room for pattern)

Pattern longer than text (m > n):

Zero occurrences possible
Maximum valid position is n - m < 0, so no valid positions exist

Pattern equals text (m = n):

At most one occurrence (at position 0)
Occurs only if P = T exactly

Pattern equals entire text (P = T):

Exactly one occurrence at position 0
This is the case where pattern and text are identical

Comprehensive Edge Case Reference
Case	Text T	Pattern P	Result
Empty pattern	"abc"	""	Positions: [0, 1, 2, 3]
Empty text	""	"a"	Positions: [] (no match)
Both empty	""	""	Positions: [0]
Pattern > text	"ab"	"abcd"	Positions: [] (no match)
Pattern = text	"abc"	"abc"	Positions: [0]
Single char repeated	"aaa"	"a"	Positions: [0, 1, 2]
Pattern at end only	"xyzabc"	"abc"	Positions: [3]
Pattern at start only	"abcxyz"	"abc"	Positions: [0]
No match exists	"abcdef"	"xyz"	Positions: []
Full overlap	"aaaa"	"aa"	Positions: [0, 1, 2]

Testing Edge Cases

When implementing or testing pattern matching algorithms, always verify behavior with: empty strings, single-character strings, identical text and pattern, fully repetitive strings ('aaaa'), and patterns at the exact boundaries (start and end of text). These cases expose off-by-one errors and boundary condition bugs.

edge-case-tests
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
/**
 * Comprehensive edge case test suite for pattern matching
 */
 
type MatchResult = number[];
 
function findAllOccurrences(T: string, P: string): MatchResult {
    const n = T.length;
    const m = P.length;
    const positions: number[] = [];
    
    // Handle empty pattern: matches at every position
    if (m === 0) {
        for (let i = 0; i <= n; i++) positions.push(i);
        return positions;
    }
    
    // Handle pattern longer than text
    if (m > n) return [];
    
    // Standard matching
    for (let i = 0; i <= n - m; i++) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) positions.push(i);
    }
    
    return positions;
}
 
// Edge case test harness
interface TestCase {
    name: string;
    text: string;
    pattern: string;
    expected: number[];
}
 
const testCases: TestCase[] = [
    { name: "Empty pattern in non-empty text", text: "abc", pattern: "", expected: [0, 1, 2, 3] },
    { name: "Empty pattern in empty text", text: "", pattern: "", expected: [0] },
    { name: "Non-empty pattern in empty text", text: "", pattern: "a", expected: [] },
    { name: "Pattern longer than text", text: "ab", pattern: "abcd", expected: [] },
    { name: "Pattern equals text", text: "abc", pattern: "abc", expected: [0] },
    { name: "Single char in repetitive text", text: "aaa", pattern: "a", expected: [0, 1, 2] },
    { name: "Pattern at start only", text: "abcxyz", pattern: "abc", expected: [0] },
    { name: "Pattern at end only", text: "xyzabc", pattern: "abc", expected: [3] },
    { name: "Pattern in middle", text: "xabcy", pattern: "abc", expected: [1] },
    { name: "Multiple non-overlapping", text: "abcxxabc", pattern: "abc", expected: [0, 5] },
    { name: "Full overlapping", text: "aaaa", pattern: "aa", expected: [0, 1, 2] },
    { name: "No match", text: "abcdef", pattern: "xyz", expected: [] },
    { name: "Pattern at both ends", text: "abcxxabc", pattern: "abc", expected: [0, 5] },
    { name: "Single character match", text: "a", pattern: "a", expected: [0] },
    { name: "Single character no match", text: "a", pattern: "b", expected: [] },
];
 
// Run all tests
console.log("=== Pattern Matching Edge Case Tests ===\n");
let passed = 0;
let failed = 0;
 
for (const tc of testCases) {
    const result = findAllOccurrences(tc.text, tc.pattern);
    const success = JSON.stringify(result) === JSON.stringify(tc.expected);
    
    if (success) {
        passed++;
        console.log(`✓ ${tc.name}`);
    } else {
        failed++;
        console.log(`✗ ${tc.name}`);
        console.log(`  Text: "${tc.text}", Pattern: "${tc.pattern}"`);
        console.log(`  Expected: [${tc.expected}], Got: [${result}]`);
    }
}
 
console.log(`\nResults: ${passed} passed, ${failed} failed`);

The Relationship Between Text, Pattern, and Occurrence

The three elements of pattern matching are intimately connected. Understanding their relationships reveals opportunities for optimization.

Key relationships:

1. Pattern Length vs. Occurrence Count

For a given text T of length n:

Shorter patterns tend to have more occurrences
Maximum possible occurrences for pattern of length m: n - m + 1
As m increases, maximum occurrences decreases

2. Text Repetitiveness vs. Algorithm Performance

When text has high repetitiveness (many repeated substrings):

Naive algorithm suffers worst-case behavior
KMP maintains O(n) guarantee
Boyer-Moore's bad character rule becomes less effective
Preprocessing-based approaches (suffix array) become more valuable

3. Pattern Structure vs. Shift Distance

Patterns with no internal repetition (all distinct characters):

Allow maximum shift distance after mismatch
Enable simple real-time online matching

Patterns with high internal repetition:

Require more careful shift calculation
Benefit most from KMP's failure function

When Pattern Dominates

•Pattern is searched in many different texts
•Pattern is relatively long (m > 10)
•Pattern has interesting internal structure
•Same pattern reused frequently
•→ Preprocess PATTERN (KMP, Boyer-Moore)

When Text Dominates

•Many different patterns searched in same text
•Text is static or rarely changes
•Text is very large (database-scale)
•Query latency is critical
•→ Preprocess TEXT (suffix array/tree, inverted index)

The preprocessing trade-off:

Preprocess	Time	Enables	Best For
Neither	O(1)	O(nm) matching	One-off searches, very short patterns
Pattern	O(m)	O(n) matching	Repeated searches for same pattern
Text	O(n) to O(n log n)	O(m) to O(m log n) matching	Many patterns searched in same text
Both	O(n) + O(m)	Specialized optimizations	Known pattern sets with known texts

The decision of what to preprocess fundamentally determines algorithm choice. This is why understanding the relationship between text and pattern is not merely academic—it directly impacts system design.

Summary: Mastering the Core Elements

We've deeply explored the three fundamental elements of pattern matching. This knowledge forms the conceptual foundation for all advanced algorithms.

Key Takeaways

•Text properties (length, alphabet size, repetitiveness) determine which algorithm performs best
•Pattern structure (internal repetition, prefix-suffix relationships) enables preprocessing optimizations
•Occurrence definition is mathematically precise: P occurs at i iff T[i..i+m-1] = P
•Overlapping occurrences are permitted by default; applications may require non-overlapping variants
•Edge cases (empty strings, m > n, m = n) must be handled explicitly and correctly
•Preprocessing choice (pattern vs. text) depends on query patterns and performance requirements

Foundation Strengthened

You now have a thorough understanding of text, pattern, and occurrence—the core elements of pattern matching. This deep knowledge will illuminate every algorithm we study, from naive matching through KMP, Rabin-Karp, Boyer-Moore, and beyond. Next, we explore the distinction between single pattern and multiple pattern matching, revealing different problem classes with different optimal solutions.

Text, Pattern, and Occurrence — The Core Elements

The Trinity of Pattern Matching

On this page, we dissect each element, examining properties that matter for algorithm design and revealing relationships that efficient algorithms exploit.

Learning Objectives

The Text: Properties and Considerations

The text (T) is the string being searched—the search space within which we look for pattern occurrences. Its properties profoundly influence algorithm performance and choice.

Formal Definition Revisited:

Text T is a string of length n over alphabet Σ:

T = T[0]T[1]T[2]...T[n-1]

where each T[i] ∈ Σ and n = |T|.

Critical text properties:

Text Characteristics That Matter

•Length (n): Determines overall algorithm running time. Text length typically dominates complexity expressions as O(n) terms.
•Alphabet size (|Σ|): Affects both algorithm choice and expected performance. Small alphabets (DNA, binary) have more repetition; large alphabets (Unicode) have more distinct characters.
•Character distribution: Uniform random vs. skewed distributions affect expected algorithm behavior. Natural language text has specific letter frequencies.
•Repetitiveness: Highly repetitive text (e.g., 'aaaaaa...') triggers worst-case behavior in some algorithms but enables compression-based approaches.
•Structure: Some texts have exploitable structure (e.g., words separated by spaces, line breaks at regular intervals) that specialized algorithms can leverage.
•Mutability: Does the text change between queries? Immutable text enables preprocessing; mutable text requires online algorithms.

Text access patterns:

How we access the text significantly impacts algorithm design:

Access Pattern	Description	Algorithm Implications
Sequential	Read characters left-to-right	Most algorithms (KMP, Rabin-Karp)
Random access	Jump to any position O(1)	Boyer-Moore, suffix arrays
Streaming	Characters arrive one-by-one	Online algorithms, Aho-Corasick
Indexed	Pre-built index (suffix array/tree)	Single O(m) query after O(n) preprocessing

When to Preprocess the Text

text-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
/**
 * Comprehensive text analysis for pattern matching preparation
 * Understanding text characteristics helps choose optimal algorithms
 */
 
interface TextAnalysis {
    length: number;
    alphabetSize: number;
    uniqueChars: Set<string>;
    charFrequency: Map<string, number>;
    repetitiveness: number;  // 0-1 scale
    isHighlyRepetitive: boolean;
}
 
function analyzeText(T: string): TextAnalysis {
    const charFrequency = new Map<string, number>();
    
    // Count character frequencies
    for (const char of T) {
        charFrequency.set(char, (charFrequency.get(char) || 0) + 1);
    }
    
    const uniqueChars = new Set(T);
    const alphabetSize = uniqueChars.size;
    
    // Calculate repetitiveness metric
    // Low ratio of unique chars to length = high repetitiveness
    const repetitiveness = 1 - (alphabetSize / Math.max(T.length, 1));
    
    return {
        length: T.length,
        alphabetSize,
        uniqueChars,
        charFrequency,
        repetitiveness,
        isHighlyRepetitive: repetitiveness > 0.9,
    };
}
 
/**
 * Recommend algorithm based on text characteristics
 */
function recommendAlgorithm(analysis: TextAnalysis, willSearchMultipleTimes: boolean): string {
    if (willSearchMultipleTimes) {
        return "Use suffix array/tree - preprocess text once, query many times in O(m)";
    }
    
    if (analysis.isHighlyRepetitive) {
        return "Use KMP - handles repetitive text without worst-case degradation";
    }
    
    if (analysis.alphabetSize > 100) {
        return "Use Boyer-Moore - benefits from large alphabet (more bad character shifts)";
    }
    
    if (analysis.alphabetSize <= 4) {
        return "Use KMP or Rabin-Karp - small alphabet reduces Boyer-Moore's advantage";
    }
    
    return "Boyer-Moore or KMP - both perform well on typical text";
}
 
// Example analysis
const dnaSample = "ACGTACGTACGTACGT";
const englishSample = "The quick brown fox jumps over the lazy dog";
const repetitiveSample = "aaaaaaaaaaaaaaaa";
 
console.log("DNA Text Analysis:", analyzeText(dnaSample));
console.log("English Text Analysis:", analyzeText(englishSample));
console.log("Repetitive Text Analysis:", analyzeText(repetitiveSample));

The Pattern: Structure and Preprocessing

The pattern (P) is the string we seek within the text. While typically shorter than the text, the pattern's structure is the key to efficient matching algorithms.

Formal Definition Revisited:

Pattern P is a string of length m over alphabet Σ:

P = P[0]P[1]P[2]...P[m-1]

where each P[j] ∈ Σ and m = |P|.

Pattern properties that matter:

Critical Pattern Properties

•Length (m): Affects preprocessing time and matching complexity. Short patterns may not benefit from sophisticated algorithms.
•Self-repetition: Patterns like 'abab' have internal structure. The prefix 'ab' is also a suffix. This property is central to KMP's failure function.
•Character diversity: Patterns with all distinct characters allow simpler optimizations; repetitive patterns require careful handling.
•Periodicity: A pattern P has period p if P[i] = P[i+p] for all valid i. Periodic patterns behave specially in many algorithms.
•Prefix-suffix overlaps: Which proper prefixes are also proper suffixes? This determines how far we can safely shift after a mismatch.

The Prefix-Suffix Relationship:

The most important pattern property for efficient matching is the relationship between prefixes and suffixes. Consider pattern P = "abcabd":

Pattern: a b c a b d
         0 1 2 3 4 5

Proper prefixes:     Proper suffixes:
""                   ""
"a"                  "d"
"ab"                 "bd"
"abc"                "abd"
"abca"               "cabd"
"abcab"              "bcabd"

Prefixes that are also suffixes:
"" (empty string)     - trivially true
"ab" = P[0..1]        - matches P[3..4]? No: "ab" ≠ "bd"

For pattern "abab":

Pattern: a b a b
         0 1 2 3

"ab" (prefix P[0..1]) = "ab" (suffix P[2..3])? Yes!

This means: if we match "abab" partially and then hit a mismatch,
we don't start over—we can use the "ab" we already matched.

This insight is the foundation of the KMP algorithm.

Why Pattern Preprocessing Works

pattern-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
/**
 * Deep pattern analysis for understanding pattern structure
 * These properties are exploited by efficient matching algorithms
 */
 
interface PatternAnalysis {
    length: number;
    uniqueChars: number;
    isPeriodic: boolean;
    smallestPeriod: number | null;
    prefixSuffixMatches: Array<{ length: number; prefix: string; suffix: string }>;
    lpsArray: number[];  // Longest Proper Prefix which is also Suffix
    hasInternalRepetition: boolean;
}
 
/**
 * Compute the LPS (Longest Proper Prefix which is also Suffix) array
 * This is the foundation of the KMP algorithm
 */
function computeLPS(P: string): number[] {
    const m = P.length;
    const lps = new Array(m).fill(0);
    
    let len = 0;  // Length of previous longest prefix suffix
    let i = 1;
    
    while (i < m) {
        if (P[i] === P[len]) {
            len++;
            lps[i] = len;
            i++;
        } else {
            if (len !== 0) {
                len = lps[len - 1];
            } else {
                lps[i] = 0;
                i++;
            }
        }
    }
    
    return lps;
}
 
/**
 * Find the smallest period of a pattern
 * Period p means P[i] = P[i+p] for all 0 ≤ i < m-p
 */
function findSmallestPeriod(P: string): number | null {
    const m = P.length;
    
    for (let period = 1; period < m; period++) {
        let isPeriod = true;
        for (let i = 0; i < m - period; i++) {
            if (P[i] !== P[i + period]) {
                isPeriod = false;
                break;
            }
        }
        if (isPeriod) return period;
    }
    
    return null;  // No proper period (pattern is aperiodic)
}
 
/**
 * Find all proper prefixes that are also proper suffixes
 */
function findPrefixSuffixMatches(P: string): Array<{ length: number; prefix: string; suffix: string }> {
    const matches: Array<{ length: number; prefix: string; suffix: string }> = [];
    const m = P.length;
    
    for (let len = 1; len < m; len++) {
        const prefix = P.slice(0, len);
        const suffix = P.slice(m - len);
        if (prefix === suffix) {
            matches.push({ length: len, prefix, suffix });
        }
    }
    
    return matches;
}
 
function analyzePattern(P: string): PatternAnalysis {
    const lps = computeLPS(P);
    const prefixSuffixMatches = findPrefixSuffixMatches(P);
    const smallestPeriod = findSmallestPeriod(P);
    
    return {
        length: P.length,
        uniqueChars: new Set(P).size,
        isPeriodic: smallestPeriod !== null,
        smallestPeriod,
        prefixSuffixMatches,
        lpsArray: lps,
        hasInternalRepetition: lps.some(v => v > 0),
    };
}
 
// Demonstrate with examples
const patterns = ["abcdef", "abab", "aaaa", "abcabc", "abcabcd"];
 
for (const P of patterns) {
    console.log(`\n=== Pattern: "${P}" ===`);
    const analysis = analyzePattern(P);
    console.log(`LPS Array: [${analysis.lpsArray.join(', ')}]`);
    console.log(`Prefix-Suffix Matches: ${analysis.prefixSuffixMatches.length > 0 
        ? analysis.prefixSuffixMatches.map(m => `"${m.prefix}"`).join(', ')
        : 'none'}`);
    console.log(`Periodic: ${analysis.isPeriodic ? `Yes (period = ${analysis.smallestPeriod})` : 'No'}`);
}

Occurrence: The Precise Definition of a Match

An occurrence is the core concept we're trying to find. Let's examine it with complete rigor.

Formal Definition:

Pattern P occurs at position (or index) i in text T if and only if:

∀j ∈ [0, m): T[i + j] = P[j]

In plain language: for every position j in the pattern (from 0 to m-1), the character at position i+j in the text equals the character at position j in the pattern.

Equivalent formulations:

Substring equality: T[i..i+m-1] = P
Prefix condition: P is a prefix of T[i..n-1] (the suffix of T starting at i)
Shift validity: Shift s = i is valid

The occurrence position range:

For pattern of length m in text of length n:

Minimum valid position: 0
Maximum valid position: n - m
Total possible positions: n - m + 1
If m > n: No valid positions exist

Off-by-One Errors: The Bane of String Algorithms

Visualizing occurrences:

Consider T = "mississippi" and P = "issi":

T:  m  i  s  s  i  s  s  i  p  p  i
    0  1  2  3  4  5  6  7  8  9  10

P:     i  s  s  i
       0  1  2  3

Checking position 1:
  T[1] = 'i', P[0] = 'i' ✓
  T[2] = 's', P[1] = 's' ✓
  T[3] = 's', P[2] = 's' ✓
  T[4] = 'i', P[3] = 'i' ✓
  → MATCH at position 1

Checking position 4:
  T[4] = 'i', P[0] = 'i' ✓
  T[5] = 's', P[1] = 's' ✓
  T[6] = 's', P[2] = 's' ✓
  T[7] = 'i', P[3] = 'i' ✓
  → MATCH at position 4

Validpositions: 0 to 10-4 = 6
Occurrences found: [1, 4]

Note: Occurrences at 1 and 4 overlap!
T[4] = 'i' is part of both matches.

Occurrence Properties
Property	Description	Example
Position	Index where match starts (0-based)	P at position 3 means T[3..3+m-1] = P
Non-overlapping	Next match starts after current ends	Matches at 0, 5 for m=5 are non-overlapping
Overlapping	Matches share some text positions	Matches at 1, 4 for m=4 overlap
Contiguous	Adjacent non-overlapping matches	Matches at 0, 4 for m=4 are contiguous
Leftmost	First occurrence in text	min(all occurrence positions)
Rightmost	Last occurrence in text	max(all occurrence positions)

occurrence-analysis
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
/**
 * Complete analysis of pattern occurrences in text
 * Including overlap detection and occurrence relationships
 */
 
interface Occurrence {
    position: number;
    matchedText: string;
    startIndex: number;
    endIndex: number;  // Exclusive
}
 
interface OccurrenceAnalysis {
    occurrences: Occurrence[];
    count: number;
    hasOverlap: boolean;
    overlappingPairs: Array<[number, number]>;
    coverage: number;  // Percentage of text covered by matches
    gaps: Array<{ start: number; end: number; length: number }>;
}
 
function analyzeOccurrences(T: string, P: string): OccurrenceAnalysis {
    const n = T.length;
    const m = P.length;
    const occurrences: Occurrence[] = [];
    
    // Find all occurrences
    if (m > 0 && m <= n) {
        for (let i = 0; i <= n - m; i++) {
            let match = true;
            for (let j = 0; j < m && match; j++) {
                if (T[i + j] !== P[j]) match = false;
            }
            if (match) {
                occurrences.push({
                    position: i,
                    matchedText: T.slice(i, i + m),
                    startIndex: i,
                    endIndex: i + m,
                });
            }
        }
    }
    
    // Detect overlapping pairs
    const overlappingPairs: Array<[number, number]> = [];
    for (let i = 0; i < occurrences.length - 1; i++) {
        // Two occurrences overlap if the second starts before the first ends
        if (occurrences[i + 1].startIndex < occurrences[i].endIndex) {
            overlappingPairs.push([
                occurrences[i].position,
                occurrences[i + 1].position
            ]);
        }
    }
    
    // Calculate coverage (unique text positions covered by matches)
    const covered = new Set<number>();
    for (const occ of occurrences) {
        for (let i = occ.startIndex; i < occ.endIndex; i++) {
            covered.add(i);
        }
    }
    const coverage = n > 0 ? (covered.size / n) * 100 : 0;
    
    // Find gaps between matches
    const gaps: Array<{ start: number; end: number; length: number }> = [];
    if (occurrences.length > 0) {
        // Gap before first occurrence
        if (occurrences[0].startIndex > 0) {
            gaps.push({
                start: 0,
                end: occurrences[0].startIndex,
                length: occurrences[0].startIndex
            });
        }
        // Gaps between occurrences
        for (let i = 0; i < occurrences.length - 1; i++) {
            const gapStart = occurrences[i].endIndex;
            const gapEnd = occurrences[i + 1].startIndex;
            if (gapEnd > gapStart) {
                gaps.push({ start: gapStart, end: gapEnd, length: gapEnd - gapStart });
            }
        }
        // Gap after last occurrence
        const lastEnd = occurrences[occurrences.length - 1].endIndex;
        if (lastEnd < n) {
            gaps.push({ start: lastEnd, end: n, length: n - lastEnd });
        }
    }
    
    return {
        occurrences,
        count: occurrences.length,
        hasOverlap: overlappingPairs.length > 0,
        overlappingPairs,
        coverage,
        gaps,
    };
}
 
// Demonstrate
const T = "mississippi";
const P = "issi";
 
const analysis = analyzeOccurrences(T, P);
console.log(`Text: "${T}" (length ${T.length})`);
console.log(`Pattern: "${P}" (length ${P.length})`);
console.log(`Occurrences: ${analysis.count}`);
console.log(`Positions: [${analysis.occurrences.map(o => o.position).join(', ')}]`);
console.log(`Has overlapping matches: ${analysis.hasOverlap}`);
if (analysis.hasOverlap) {
    console.log(`Overlapping pairs: ${analysis.overlappingPairs.map(p => `(${p[0]}, ${p[1]})`).join(', ')}`);
}
console.log(`Text coverage: ${analysis.coverage.toFixed(1)}%`);

Edge Cases and Boundary Conditions

Robust pattern matching implementations must handle numerous edge cases correctly. Let's catalog them exhaustively.

Empty pattern (m = 0):

By mathematical convention, the empty string is a prefix and suffix of every string. Therefore:

Empty pattern occurs at every position from 0 to n
Total occurrences: n + 1
This may seem counterintuitive but is mathematically consistent

Empty text (n = 0):

If P = "": One occurrence at position 0
If P ≠ "": Zero occurrences (no room for pattern)

Pattern longer than text (m > n):

Zero occurrences possible
Maximum valid position is n - m < 0, so no valid positions exist

Pattern equals text (m = n):

At most one occurrence (at position 0)
Occurs only if P = T exactly

Pattern equals entire text (P = T):

Exactly one occurrence at position 0
This is the case where pattern and text are identical

Comprehensive Edge Case Reference
Case	Text T	Pattern P	Result
Empty pattern	"abc"	""	Positions: [0, 1, 2, 3]
Empty text	""	"a"	Positions: [] (no match)
Both empty	""	""	Positions: [0]
Pattern > text	"ab"	"abcd"	Positions: [] (no match)
Pattern = text	"abc"	"abc"	Positions: [0]
Single char repeated	"aaa"	"a"	Positions: [0, 1, 2]
Pattern at end only	"xyzabc"	"abc"	Positions: [3]
Pattern at start only	"abcxyz"	"abc"	Positions: [0]
No match exists	"abcdef"	"xyz"	Positions: []
Full overlap	"aaaa"	"aa"	Positions: [0, 1, 2]

Testing Edge Cases

edge-case-tests
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
/**
 * Comprehensive edge case test suite for pattern matching
 */
 
type MatchResult = number[];
 
function findAllOccurrences(T: string, P: string): MatchResult {
    const n = T.length;
    const m = P.length;
    const positions: number[] = [];
    
    // Handle empty pattern: matches at every position
    if (m === 0) {
        for (let i = 0; i <= n; i++) positions.push(i);
        return positions;
    }
    
    // Handle pattern longer than text
    if (m > n) return [];
    
    // Standard matching
    for (let i = 0; i <= n - m; i++) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) positions.push(i);
    }
    
    return positions;
}
 
// Edge case test harness
interface TestCase {
    name: string;
    text: string;
    pattern: string;
    expected: number[];
}
 
const testCases: TestCase[] = [
    { name: "Empty pattern in non-empty text", text: "abc", pattern: "", expected: [0, 1, 2, 3] },
    { name: "Empty pattern in empty text", text: "", pattern: "", expected: [0] },
    { name: "Non-empty pattern in empty text", text: "", pattern: "a", expected: [] },
    { name: "Pattern longer than text", text: "ab", pattern: "abcd", expected: [] },
    { name: "Pattern equals text", text: "abc", pattern: "abc", expected: [0] },
    { name: "Single char in repetitive text", text: "aaa", pattern: "a", expected: [0, 1, 2] },
    { name: "Pattern at start only", text: "abcxyz", pattern: "abc", expected: [0] },
    { name: "Pattern at end only", text: "xyzabc", pattern: "abc", expected: [3] },
    { name: "Pattern in middle", text: "xabcy", pattern: "abc", expected: [1] },
    { name: "Multiple non-overlapping", text: "abcxxabc", pattern: "abc", expected: [0, 5] },
    { name: "Full overlapping", text: "aaaa", pattern: "aa", expected: [0, 1, 2] },
    { name: "No match", text: "abcdef", pattern: "xyz", expected: [] },
    { name: "Pattern at both ends", text: "abcxxabc", pattern: "abc", expected: [0, 5] },
    { name: "Single character match", text: "a", pattern: "a", expected: [0] },
    { name: "Single character no match", text: "a", pattern: "b", expected: [] },
];
 
// Run all tests
console.log("=== Pattern Matching Edge Case Tests ===\n");
let passed = 0;
let failed = 0;
 
for (const tc of testCases) {
    const result = findAllOccurrences(tc.text, tc.pattern);
    const success = JSON.stringify(result) === JSON.stringify(tc.expected);
    
    if (success) {
        passed++;
        console.log(`✓ ${tc.name}`);
    } else {
        failed++;
        console.log(`✗ ${tc.name}`);
        console.log(`  Text: "${tc.text}", Pattern: "${tc.pattern}"`);
        console.log(`  Expected: [${tc.expected}], Got: [${result}]`);
    }
}
 
console.log(`\nResults: ${passed} passed, ${failed} failed`);

The Relationship Between Text, Pattern, and Occurrence

The three elements of pattern matching are intimately connected. Understanding their relationships reveals opportunities for optimization.

Key relationships:

1. Pattern Length vs. Occurrence Count

For a given text T of length n:

Shorter patterns tend to have more occurrences
Maximum possible occurrences for pattern of length m: n - m + 1
As m increases, maximum occurrences decreases

2. Text Repetitiveness vs. Algorithm Performance

When text has high repetitiveness (many repeated substrings):

Naive algorithm suffers worst-case behavior
KMP maintains O(n) guarantee
Boyer-Moore's bad character rule becomes less effective
Preprocessing-based approaches (suffix array) become more valuable

3. Pattern Structure vs. Shift Distance

Patterns with no internal repetition (all distinct characters):

Allow maximum shift distance after mismatch
Enable simple real-time online matching

Patterns with high internal repetition:

Require more careful shift calculation
Benefit most from KMP's failure function

When Pattern Dominates

•Pattern is searched in many different texts
•Pattern is relatively long (m > 10)
•Pattern has interesting internal structure
•Same pattern reused frequently
•→ Preprocess PATTERN (KMP, Boyer-Moore)

When Text Dominates

•Many different patterns searched in same text
•Text is static or rarely changes
•Text is very large (database-scale)
•Query latency is critical
•→ Preprocess TEXT (suffix array/tree, inverted index)

The preprocessing trade-off:

Preprocess	Time	Enables	Best For
Neither	O(1)	O(nm) matching	One-off searches, very short patterns
Pattern	O(m)	O(n) matching	Repeated searches for same pattern
Text	O(n) to O(n log n)	O(m) to O(m log n) matching	Many patterns searched in same text
Both	O(n) + O(m)	Specialized optimizations	Known pattern sets with known texts

Summary: Mastering the Core Elements

We've deeply explored the three fundamental elements of pattern matching. This knowledge forms the conceptual foundation for all advanced algorithms.

Key Takeaways

•Text properties (length, alphabet size, repetitiveness) determine which algorithm performs best
•Pattern structure (internal repetition, prefix-suffix relationships) enables preprocessing optimizations
•Occurrence definition is mathematically precise: P occurs at i iff T[i..i+m-1] = P
•Overlapping occurrences are permitted by default; applications may require non-overlapping variants
•Edge cases (empty strings, m > n, m = n) must be handled explicitly and correctly
•Preprocessing choice (pattern vs. text) depends on query patterns and performance requirements

Foundation Strengthened