String Matching Problem - Learning Module

Loading content...

0/276

The Pattern Matching Problem Defined

The Universal Search Problem

Every time you press Ctrl+F in a document, every time a search engine processes your query, every time a biologist searches for a gene sequence in DNA data—the same fundamental problem is being solved: pattern matching in strings.

This problem is so ubiquitous that it might seem trivial. After all, how hard can it be to find a word in text? The answer reveals one of the most elegant areas of algorithm design, where clever insights transform a seemingly simple problem from requiring hours of computation to mere milliseconds.

Understanding pattern matching formally—not just intuitively—is the gateway to mastering string algorithms. Every advanced technique, from the KMP algorithm to suffix trees, builds upon the precise definitions we will establish on this page.

What You Will Learn

By the end of this page, you will understand the formal mathematical definition of the pattern matching problem, the precise terminology used throughout string algorithm literature, why naive approaches fail at scale, and the framework that enables efficient solutions. This formalization is essential for understanding every pattern matching algorithm that follows.

From Intuition to Precision

Before we formalize the pattern matching problem, let's establish our intuitive understanding and then refine it.

The everyday concept:

When you search for the word "algorithm" in a textbook, you're looking for every place where that exact sequence of characters appears. You might find it on page 5, page 23, and page 147. Each of these is an occurrence or match.

But this informal description leaves many questions unanswered:

What exactly constitutes a "match"? Must characters be consecutive?
Do overlapping occurrences count separately?
How do we describe positions within the text?
What if we're searching in binary data rather than English text?

A formal definition answers all these questions unambiguously, providing precision that enables algorithm design and correctness proofs.

Why Formalization Matters

Informal descriptions lead to edge case bugs and implementation ambiguity. When Donald Knuth, James Morris, and Vaughan Pratt developed the KMP algorithm, their formal problem definition allowed them to prove their algorithm correct and analyze its complexity precisely. Without formalization, we cannot rigorously compare algorithms or guarantee correctness.

The journey from informal to formal:

We will now build our formal definition piece by piece:

Alphabets — The set of symbols we can use
Strings — Finite sequences over an alphabet
Text and Pattern — The specific strings in our problem
Position and Index — How we reference locations
Occurrence/Match — The precise condition for a match
The Pattern Matching Problem — The complete formal statement

Each definition builds on the previous, creating a rigorous foundation for all string algorithms.

Alphabets and Strings: The Building Blocks

Every string algorithm operates over an alphabet—the set of allowed symbols. This seemingly simple concept has profound implications for algorithm design and complexity analysis.

Definition (Alphabet):

An alphabet, denoted Σ (sigma), is a finite, non-empty set of distinct symbols (also called characters or letters).

Examples of alphabets:

| Alphabet Name | Σ | Size |Σ| | |---------------|---|----------| | Binary | {0, 1} | 2 | | DNA | {A, C, G, T} | 4 | | Lowercase English | {a, b, c, ..., z} | 26 | | ASCII | {all ASCII characters} | 128 | | Unicode | {all Unicode code points} | >143,000 |

The size of the alphabet, denoted |Σ|, directly impacts algorithm complexity for many string algorithms.

Alphabet Size Matters

Many string algorithms have complexity that depends on |Σ|. For DNA (|Σ| = 4), a brute-force approach might be acceptable. For Unicode (|Σ| > 143,000), the same approach becomes impractical. This is why algorithms like KMP, which are alphabet-independent, are so valuable.

Definition (String):

A string over alphabet Σ is a finite sequence of symbols from Σ. The set of all strings over Σ is denoted Σ* (Σ-star).

Formal notation:

A string S of length n is written as:

S = S[0]S[1]S[2]...S[n-1]

where each S[i] ∈ Σ for 0 ≤ i < n.

Key string properties:

Length: The number of symbols in S, denoted |S| or len(S)
Empty string: The unique string of length 0, denoted ε (epsilon) or ""
Concatenation: Joining strings S and T produces ST (or S·T)
Substring: Any contiguous subsequence of S
Prefix: A substring starting at position 0
Suffix: A substring ending at the last position

string-definitions-example

Alphabet: Σ = {a, b, c, ..., z}
 
String: S = "algorithm"
Length: |S| = 9
 
Indexing (0-based):
  S[0] = 'a'
  S[1] = 'l'
  S[2] = 'g'
  ...
  S[8] = 'm'
 
Substrings of "algorithm":
  "algo"      - prefix (positions 0-3)
  "rithm"     - suffix (positions 4-8)
  "gori"      - general substring (positions 2-5)
  "algorithm" - the string itself is both prefix and suffix
  ""          - empty string (also valid prefix/suffix)
 
Concatenation:
  "algo" · "rithm" = "algorithm"

Substrings, Prefixes, and Suffixes

These three concepts form the vocabulary of pattern matching. Understanding them precisely is essential for grasping why certain algorithms work.

Definition (Substring):

String P is a substring of string S if there exist strings x and y such that:

S = xPy

where x and y can be empty. Equivalently, P = S[i..j] for some 0 ≤ i ≤ j ≤ |S|.

The substring starting at position i with length m is:

S[i..i+m-1] = S[i]S[i+1]...S[i+m-1]

Definition (Prefix):

String P is a prefix of string S if there exists string y such that:

S = Py

Equivalently, P = S[0..k-1] for some 0 ≤ k ≤ |S|. A prefix is a substring that starts at position 0.

Definition (Suffix):

String P is a suffix of string S if there exists string x such that:

S = xP

Equivalently, P = S[k..|S|-1] for some 0 ≤ k ≤ |S|. A suffix is a substring that ends at the last position.

All Prefixes and Suffixes of 'algorithm'
Length	Prefix	Suffix
0	ε (empty)	ε (empty)
1	a	m
2	al	hm
3	alg	thm
4	algo	ithm
5	algor	rithm
6	algori	orithm
7	algorit	gorithm
8	algorith	lgorithm
9	algorithm	algorithm

Proper Substrings, Prefixes, and Suffixes

A proper prefix (or suffix) of S is one that is not equal to S itself. Similarly, a proper substring excludes S as a whole. These distinctions become crucial when analyzing algorithms like KMP, where we specifically examine proper prefixes that are also suffixes.

Counting substrings:

For a string of length n:

Number of substrings (including empty): 1 + 2 + ... + n + 1 = (n+1)(n+2)/2 = O(n²)
Number of distinct substrings: Between n+1 (e.g., "aaa") and n(n+1)/2 + 1 (all unique)
Number of prefixes: Exactly n + 1 (including empty)
Number of suffixes: Exactly n + 1 (including empty)

This quadratic growth in substrings explains why brute-force substring enumeration doesn't scale, and why data structures like suffix arrays (which store all n suffixes) are so powerful.

prefix-suffix-operations
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
/**
 * Comprehensive prefix, suffix, and substring utilities
 * These operations are fundamental to pattern matching algorithms
 */
 
// Check if 'prefix' is a prefix of 'str'
function isPrefix(str: string, prefix: string): boolean {
    if (prefix.length > str.length) return false;
    for (let i = 0; i < prefix.length; i++) {
        if (str[i] !== prefix[i]) return false;
    }
    return true;
    // Alternatively: return str.startsWith(prefix);
}
 
// Check if 'suffix' is a suffix of 'str'
function isSuffix(str: string, suffix: string): boolean {
    if (suffix.length > str.length) return false;
    const offset = str.length - suffix.length;
    for (let i = 0; i < suffix.length; i++) {
        if (str[offset + i] !== suffix[i]) return false;
    }
    return true;
    // Alternatively: return str.endsWith(suffix);
}
 
// Check if 'sub' is a substring of 'str'
function isSubstring(str: string, sub: string): boolean {
    if (sub.length === 0) return true;  // Empty is always substring
    if (sub.length > str.length) return false;
    
    // Check all possible starting positions
    for (let i = 0; i <= str.length - sub.length; i++) {
        let match = true;
        for (let j = 0; j < sub.length; j++) {
            if (str[i + j] !== sub[j]) {
                match = false;
                break;
            }
        }
        if (match) return true;
    }
    return false;
    // Alternatively: return str.includes(sub);
}
 
// Generate all prefixes of a string
function allPrefixes(str: string): string[] {
    const prefixes: string[] = [''];  // Empty prefix
    for (let i = 1; i <= str.length; i++) {
        prefixes.push(str.slice(0, i));
    }
    return prefixes;
}
 
// Generate all suffixes of a string
function allSuffixes(str: string): string[] {
    const suffixes: string[] = [];
    for (let i = 0; i <= str.length; i++) {
        suffixes.push(str.slice(i));
    }
    return suffixes;  // Includes empty suffix at end
}
 
// Example usage
const S = "algorithm";
console.log("Prefixes:", allPrefixes(S));
console.log("Suffixes:", allSuffixes(S));
console.log("Is 'algo' a prefix?", isPrefix(S, "algo"));     // true
console.log("Is 'rithm' a suffix?", isSuffix(S, "rithm"));   // true
console.log("Is 'gori' a substring?", isSubstring(S, "gori")); // true

The Formal Pattern Matching Problem

With our vocabulary established, we can now state the pattern matching problem with mathematical precision.

Definition (Pattern Matching Problem):

Given:

A text T of length n over alphabet Σ: T = T[0]T[1]...T[n-1]
A pattern P of length m over the same alphabet Σ: P = P[0]P[1]...P[m-1]
Typically, m ≤ n (the pattern is not longer than the text)

Find:

All positions i (where 0 ≤ i ≤ n - m) such that P occurs in T starting at position i

Definition (Occurrence/Match):

Pattern P occurs (or matches) at position i in text T if and only if:

T[i + j] = P[j] for all j where 0 ≤ j < m

In other words, the substring T[i..i+m-1] equals P.

Position i is called a valid shift or match position.

Index Convention

We use 0-based indexing throughout (positions start at 0). Some textbooks use 1-based indexing (positions start at 1). When reading algorithm papers, always verify which convention is used. The core algorithms are identical; only the index arithmetic differs slightly.

Visual representation:

Let's visualize pattern matching with a concrete example:

Text T:    "abracadabra"
           01234567890
           
Pattern P: "abra"
           0123

Question: At which positions does P occur in T?

Checking each position:

Position i	Substring T[i..i+3]	Matches P?
0	"abra"	✓ Yes
1	"brac"	✗ No
2	"raca"	✗ No
3	"acad"	✗ No
4	"cada"	✗ No
5	"adab"	✗ No
6	"dabr"	✗ No
7	"abra"	✓ Yes
8+	(too short)	N/A

Answer: P occurs at positions 0 and 7.

formal-pattern-matching
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
/**
 * Formal implementation of the pattern matching problem
 * Returns all positions where pattern P occurs in text T
 */
 
interface PatternMatchResult {
    positions: number[];  // All valid shift positions
    comparisons: number;  // Total character comparisons (for analysis)
}
 
/**
 * Naive pattern matching: directly implements the formal definition
 * Time Complexity: O(n × m) where n = |T| and m = |P|
 * Space Complexity: O(k) where k = number of matches
 */
function findPatternOccurrences(T: string, P: string): PatternMatchResult {
    const n = T.length;
    const m = P.length;
    const positions: number[] = [];
    let comparisons = 0;
    
    // Edge cases
    if (m === 0) {
        // Empty pattern matches at every position (including after last char)
        return { 
            positions: Array.from({ length: n + 1 }, (_, i) => i), 
            comparisons: 0 
        };
    }
    
    if (m > n) {
        // Pattern longer than text: no matches possible
        return { positions: [], comparisons: 0 };
    }
    
    // Check each valid starting position
    // Position i is valid if 0 ≤ i ≤ n - m (enough room for pattern)
    for (let i = 0; i <= n - m; i++) {
        // Check if P occurs at position i
        let isMatch = true;
        
        for (let j = 0; j < m; j++) {
            comparisons++;
            if (T[i + j] !== P[j]) {
                isMatch = false;
                break;
            }
        }
        
        if (isMatch) {
            positions.push(i);
        }
    }
    
    return { positions, comparisons };
}
 
// Example usage
const T = "abracadabra";
const P = "abra";
 
const result = findPatternOccurrences(T, P);
console.log(`Text: "${T}" (length ${T.length})`);
console.log(`Pattern: "${P}" (length ${P.length})`);
console.log(`Occurrences at positions: [${result.positions.join(', ')}]`);
console.log(`Total comparisons: ${result.comparisons}`);
 
// Verify each match
result.positions.forEach(pos => {
    const match = T.slice(pos, pos + P.length);
    console.log(`  Position ${pos}: "${match}" === "${P}" ? ${match === P}`);
});

Problem Variants and Output Specifications

The basic pattern matching problem has several variants depending on what output is required. Understanding these variants is crucial because different algorithms optimize for different outputs.

Variant 1: Decision Problem

Question: Does pattern P occur in text T at all?

Output: Boolean (yes/no)

Complexity goal: Often can short-circuit after finding first match, making average case faster.

Variant 2: First Occurrence

Question: What is the first (leftmost) position where P occurs in T?

Output: Single position, or -1 if no occurrence

Complexity goal: Can stop immediately upon finding first match.

Variant 3: Count Occurrences

Question: How many times does P occur in T?

Output: Non-negative integer

Note: Overlapping occurrences may or may not be counted depending on requirements.

Variant 4: All Occurrences

Question: List all positions where P occurs in T.

Output: List of positions (may be empty)

Complexity goal: Must examine entire text; output can be O(n) in worst case.

Overlapping Occurrences

Overlapping occurrences occur when the pattern can match starting at positions that are closer together than the pattern length. For example, pattern 'aa' in text 'aaa' has overlapping occurrences at positions 0 and 1. Most formal definitions count overlapping occurrences separately, but some applications (like word counting in natural language) may want non-overlapping matches only.

pattern-matching-variants
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
/**
 * All variants of the pattern matching problem
 */
 
// Variant 1: Decision - Does pattern occur?
function containsPattern(T: string, P: string): boolean {
    const n = T.length;
    const m = P.length;
    
    if (m === 0) return true;
    if (m > n) return false;
    
    for (let i = 0; i <= n - m; i++) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) return true;  // Early exit!
    }
    return false;
}
 
// Variant 2: First occurrence position (-1 if not found)
function findFirstOccurrence(T: string, P: string): number {
    const n = T.length;
    const m = P.length;
    
    if (m === 0) return 0;  // Empty pattern at start
    if (m > n) return -1;
    
    for (let i = 0; i <= n - m; i++) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) return i;  // Early exit!
    }
    return -1;
}
 
// Variant 3: Count occurrences (including overlapping)
function countOccurrences(T: string, P: string): number {
    const n = T.length;
    const m = P.length;
    
    if (m === 0) return n + 1;  // Empty matches everywhere
    if (m > n) return 0;
    
    let count = 0;
    for (let i = 0; i <= n - m; i++) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) count++;
    }
    return count;
}
 
// Variant 3b: Count non-overlapping occurrences
function countNonOverlapping(T: string, P: string): number {
    const n = T.length;
    const m = P.length;
    
    if (m === 0) return n + 1;
    if (m > n) return 0;
    
    let count = 0;
    let i = 0;
    while (i <= n - m) {
        let match = true;
        for (let j = 0; j < m && match; j++) {
            if (T[i + j] !== P[j]) match = false;
        }
        if (match) {
            count++;
            i += m;  // Skip past this match
        } else {
            i++;
        }
    }
    return count;
}
 
// Example demonstrating overlapping
const T = "aaaa";
const P = "aa";
console.log(`Text: "${T}", Pattern: "${P}"`);
console.log(`Overlapping count: ${countOccurrences(T, P)}`);      // 3
console.log(`Non-overlapping count: ${countNonOverlapping(T, P)}`); // 2

Complexity Analysis: Why Naive Isn't Enough

The naive algorithm directly implements our formal definition, but its performance characteristics reveal why we need sophisticated pattern matching algorithms.

Time Complexity Analysis:

Outer loop: Runs at most n - m + 1 times (each valid starting position)
Inner loop: Runs at most m times per outer iteration
Worst case: O((n - m + 1) × m) = O(n × m)

For n = 1,000,000 and m = 1,000, that's up to 1 billion comparisons.

Worst-case input:

Text T:    "aaaaaaaaaaaa...aaaab" (n-1 a's followed by b)
Pattern P: "aaa...aaab"          (m-1 a's followed by b)

For each of the n - m + 1 positions, the algorithm compares m - 1 characters (all matching 'a's) before the final mismatch at 'b'. This is O(n × m) in practice.

Average case:

For random text and pattern over a non-trivial alphabet, the expected comparisons per position is much smaller (mismatches happen early). For random strings, average case is O(n).

But we can't rely on average case:

DNA sequences have only 4 characters (A, C, G, T)
Natural language has repetitive patterns
Adversarial inputs (attacks) can exploit worst-case behavior

Naive Algorithm: Concrete Performance Examples
Text Length (n)	Pattern Length (m)	Worst-Case Comparisons	Time at 10⁹ ops/sec
1,000	10	~10,000	~0.01 ms
10,000	100	~1,000,000	~1 ms
100,000	1,000	~100,000,000	~100 ms
1,000,000	10,000	~10,000,000,000	~10 sec
10,000,000	100,000	~1,000,000,000,000	~17 min

The Scale Problem

A search engine processes billions of queries daily. If each search in a 1MB document took 10 seconds, the system would be unusable. This is why O(n + m) algorithms like KMP, Rabin-Karp, and Boyer-Moore are essential—they provide the massive speedup required for real-world applications.

Space Complexity:

Naive algorithm: O(1) extra space (just loop variables and output)
Advanced algorithms: Trade space for time
- KMP: O(m) for failure function
- Rabin-Karp: O(1) for hash computation
- Suffix array: O(n) for preprocessed data structure
- Suffix tree: O(n) but with larger constants

The fundamental insight:

The naive algorithm wastes work by forgetting what it learned. When a mismatch occurs at position j, we discard all information about the j characters we already matched and start fresh. Efficient algorithms exploit the structure of the pattern to avoid redundant comparisons.

This insight—that we can preprocess the pattern to avoid redundant text examination—is the key to all efficient pattern matching algorithms.

The Shift Perspective: Another Way to View Pattern Matching

Pattern matching can also be viewed through the lens of shifts. This perspective is particularly useful for understanding algorithms like Boyer-Moore.

Definition (Shift):

A shift s is the number of positions the pattern is shifted right relative to the text. Pattern P with shift s is aligned so that P[0] is compared with T[s].

Valid shift: A shift s is valid if P matches T starting at position s, i.e., T[s..s+m-1] = P.

The pattern matching problem restated:

Find all valid shifts s where 0 ≤ s ≤ n - m.

Visual representation of shifts:

Text:    a b r a c a d a b r a
         0 1 2 3 4 5 6 7 8 9 10

Shift 0: a b r a          ✓ Valid
         ↑ ↑ ↑ ↑

Shift 1:   b r a c        ✗ Invalid (P[0]='a' ≠ T[1]='b')
           ↑

Shift 7:               a b r a    ✓ Valid
                       ↑ ↑ ↑ ↑

Why the shift perspective matters:

The naive algorithm increments shift by 1 after each comparison (regardless of what was learned). Advanced algorithms compute larger shifts when a mismatch occurs, skipping positions that cannot possibly be valid.

KMP: Uses the structure of the pattern to determine safe shifts
Boyer-Moore: Uses mismatch character and good suffix information
Rabin-Karp: Uses hashing to quickly skip non-matching positions

shift-based-matching
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
/**
 * Pattern matching from the shift perspective
 * This formulation directly expresses "find all valid shifts"
 */
 
function findValidShifts(T: string, P: string): number[] {
    const n = T.length;
    const m = P.length;
    const validShifts: number[] = [];
    
    // Range of possible shifts: 0 to n-m inclusive
    const maxShift = n - m;
    
    if (maxShift < 0) {
        return [];  // Pattern longer than text
    }
    
    // Check each shift
    for (let s = 0; s <= maxShift; s++) {
        if (isValidShift(T, P, s)) {
            validShifts.push(s);
        }
    }
    
    return validShifts;
}
 
/**
 * Check if shift s is valid (pattern matches at position s)
 */
function isValidShift(T: string, P: string, s: number): boolean {
    const m = P.length;
    
    for (let j = 0; j < m; j++) {
        if (T[s + j] !== P[j]) {
            return false;
        }
    }
    return true;
}
 
/**
 * Visualize the matching process with shifts
 */
function visualizeShifts(T: string, P: string): void {
    const n = T.length;
    const m = P.length;
    const maxShift = n - m;
    
    console.log(`Text:    ${T}`);
    console.log(`         ${Array.from({length: n}, (_, i) => i % 10).join('')}`);
    console.log('');
    
    for (let s = 0; s <= maxShift; s++) {
        const valid = isValidShift(T, P, s);
        const padding = ' '.repeat(s);
        const status = valid ? '✓ Valid' : '✗';
        console.log(`Shift ${s}: ${padding}${P}  ${status}`);
    }
}
 
// Example
console.log("=== Pattern Matching via Shifts ===\n");
visualizeShifts("abracadabra", "abra");

Summary: The Formal Foundation

We have established the precise mathematical foundation for pattern matching. This formalization is not mere academic exercise—it enables rigorous algorithm design, complexity analysis, and correctness proofs.

Key Definitions Established

•Alphabet (Σ): A finite set of symbols; its size affects algorithm complexity
•String: A finite sequence of symbols from an alphabet
•Substring, Prefix, Suffix: Contiguous parts of a string with specific positions
•Text (T) and Pattern (P): The specific strings in our problem; |T| = n, |P| = m
•Occurrence/Match: Pattern P occurs at position i if T[i..i+m-1] = P
•Valid Shift: Position s where pattern matches text

Critical Insights

•The naive algorithm is O(n × m) — acceptable for small inputs but catastrophic at scale
•Overlapping occurrences matter — the same text position can participate in multiple matches
•Problem variants affect optimization — first-match can exit early; all-matches cannot
•The shift perspective reveals optimization opportunities that lead to efficient algorithms
•Preprocessing is key — efficient algorithms preprocess the pattern to enable larger shifts

Foundation Complete

You now possess the formal vocabulary and definitions required to understand all pattern matching algorithms. The concepts of text, pattern, occurrence, and shift will recur throughout this chapter. Next, we'll examine the specific components—text, pattern, and occurrence—in greater depth to understand the nuances that efficient algorithms exploit.

Looking ahead:

With our formal foundation established, the next pages will explore:

Text, Pattern, and Occurrence (Page 2): Deep dive into the relationship between these elements
Single vs Multiple Pattern Matching (Page 3): When we search for one pattern vs. many patterns simultaneously
Applications (Page 4): Real-world systems built on pattern matching—from search engines to genome analysis

Each page builds on this formal groundwork, showing how precise definitions enable elegant and efficient solutions.