Z-Algorithm - Learning Module

Loading content...

0/276

Application to Pattern Matching

From Theory to Practice

We've built a powerful tool—the Z-array computed in linear time. But how do we wield it? The most fundamental application of the Z-algorithm is pattern matching: given a text T and a pattern P, find all positions where P occurs in T.

This is the problem that powers your Ctrl+F, enables compilers to tokenize source code, allows DNA sequencing to identify gene markers, and lets search engines locate relevant documents. It's perhaps the most important problem in string algorithmics.

The Z-algorithm solves this problem with an elegant insight: by constructing a special combined string, we transform pattern matching into a Z-array query problem.

What You Will Learn

By the end of this page, you will understand the concatenation technique for Z-based pattern matching, see complete working implementations, understand why the approach is correct, and know how to handle edge cases and special characters. You'll be able to find all pattern occurrences in optimal O(n + m) time.

The Pattern Matching Problem

Let's formally define the problem we're solving.

Problem Statement:

Given:

A text string T of length n
A pattern string P of length m

Find:

All starting positions i (0 ≤ i ≤ n-m) where T[i..i+m-1] = P

Output: A list of all such positions i.

Example:

Text T = "ababcabab"
Pattern P = "abab"
Occurrences: positions 0 and 5
- T[0..3] = "abab" = P ✓
- T[5..8] = "abab" = P ✓

Complexity Requirements
Approach	Time Complexity	Practical for n, m = 10⁶?
Naive (brute force)	O(n × m)	No: ~10¹² comparisons
KMP algorithm	O(n + m)	Yes: ~2 × 10⁶ comparisons
Z-algorithm	O(n + m)	Yes: ~2 × 10⁶ comparisons
Rabin-Karp (average)	O(n + m)	Yes: with hash collisions

The Z-algorithm achieves the optimal O(n + m) complexity, matching KMP but with a different approach that many find more intuitive.

The Concatenation Technique

Here's the key insight that connects Z-arrays to pattern matching:

If we compute the Z-array of the combined string P$T (pattern + separator + text), then any position i where Z[i] = m indicates a pattern occurrence.

Why does this work?

The combined string S = P$T has P as its prefix
Z[i] measures how many characters starting at position i match the prefix
If Z[i] = m (the pattern length), then S[i..i+m-1] = S[0..m-1] = P
The positions where Z[i] = m, minus the offset of the separator, give us all occurrences in T

The Role of the Separator

The '$' (or any character not in P or T) serves as a boundary. It ensures that Z-values computed for positions in T won't artificially extend into the pattern prefix. Without a separator, matches could incorrectly "bridge" across P and T.

Visual Breakdown:

P = "aba", T = "abacaba"

Combined string S = P + '$' + T:

Index:  0 1 2 3 4 5 6 7 8 9 10
S:      a b a $ a b a c a b a
        └─P─┘ ↑ └────T────────┘
              separator

Z-array:
Z[0] = 0 (by convention)
Z[1] = 0 (b ≠ a)
Z[2] = 1 (a matches, b ≠ b... wait, let's check: S[2]='a'=S[0], S[3]='$'≠S[1]='b', so Z[2]=1)
Z[3] = 0 ($ ≠ a)
Z[4] = 3 (aba matches!) ← Pattern found at T[0]
Z[5] = 0 (b ≠ a)
Z[6] = 1 (a matches, c ≠ b)
Z[7] = 0 (c ≠ a)
Z[8] = 3 (aba matches!) ← Pattern found at T[4]
Z[9] = 0 (b ≠ a)
Z[10] = 1 (a matches, but we're at end)

Positions where Z[i] = m = 3: indices 4 and 8
Corresponding positions in T: 4 - (3+1) = 0, and 8 - (3+1) = 4

Pattern "aba" occurs in text "abacaba" at positions 0 and 4.

Complete Implementation

Let's implement the Z-based pattern matching algorithm in full detail.

z_pattern_matching
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
/**
 * Finds all occurrences of pattern in text using the Z-algorithm.
 * Time complexity: O(n + m) where n = text length, m = pattern length.
 * 
 * @param text - The text to search in
 * @param pattern - The pattern to search for
 * @returns Array of starting positions where pattern occurs in text
 */
function findPattern(text: string, pattern: string): number[] {
    if (pattern.length === 0) return [];
    if (pattern.length > text.length) return [];
    
    const m = pattern.length;
    const n = text.length;
    
    // Create combined string: pattern + separator + text
    // The separator must not appear in either string
    const combined = pattern + '$' + text;
    
    // Compute Z-array of the combined string
    const Z = computeZArray(combined);
    
    // Find all positions where Z[i] equals pattern length
    const result: number[] = [];
    for (let i = m + 1; i < combined.length; i++) {
        if (Z[i] === m) {
            // Position in text is i - (m + 1)
            result.push(i - m - 1);
        }
    }
    
    return result;
}
 
/**
 * Computes the Z-array for a given string in O(n) time.
 */
function computeZArray(s: string): number[] {
    const n = s.length;
    const Z: number[] = new Array(n).fill(0);
    
    if (n === 0) return Z;
    
    let l = 0, r = 0;
    
    for (let i = 1; i < n; i++) {
        if (i > r) {
            l = r = i;
            while (r < n && s[r - l] === s[r]) {
                r++;
            }
            Z[i] = r - l;
            r--;
        } else {
            const k = i - l;
            if (Z[k] < r - i + 1) {
                Z[i] = Z[k];
            } else {
                l = i;
                while (r < n && s[r - l] === s[r]) {
                    r++;
                }
                Z[i] = r - l;
                r--;
            }
        }
    }
    
    return Z;
}
 
// Example usage
const text = "ababcabababd";
const pattern = "abab";
const occurrences = findPattern(text, pattern);
console.log(`Text: "${text}"`);
console.log(`Pattern: "${pattern}"`);
console.log(`Occurrences at positions: [${occurrences.join(', ')}]`);
// Output: Occurrences at positions: [0, 5, 7]

Proof of Correctness

Why does this technique find exactly the pattern occurrences? Let's prove it rigorously.

Theorem: For the combined string S = P$T, position i (where i > m) has Z[i] = m if and only if P occurs in T starting at position i - m - 1.

Proof:

Part 1: If Z[i] = m, then P occurs at position i - m - 1 in T

Suppose Z[i] = m for some i > m.

By definition of Z-array: S[i..i+m-1] = S[0..m-1]
S[0..m-1] = P (the pattern)
Since i > m, position i is in the T portion of S
Specifically, S[i..i+m-1] corresponds to T[(i-m-1)..(i-m-1+m-1)] = T[(i-m-1)..(i-2)]
Wait, let's be more careful: S = P$T means S[m+1..n+m] = T[0..n-1]
So S[i] = T[i - m - 1] for i > m
Therefore S[i..i+m-1] = T[(i-m-1)..(i-m-1+m-1)] = T[(i-m-1)..(i-2)]
Since S[i..i+m-1] = P, we have T[(i-m-1)..(i-2)] = P
This means P occurs at position i - m - 1 in T ✓

Part 2: If P occurs at position j in T, then Z[m+1+j] = m

Suppose T[j..j+m-1] = P for some valid j.

Consider position i = m + 1 + j in S
S[i..i+m-1] = T[j..j+m-1] = P = S[0..m-1]
By definition, Z[i] ≥ m
But can Z[i] > m? No! Because S[m] = '$' ≠ any character in P or T
So Z[i] = m exactly ✓

QED

Why Z[i] Cannot Exceed m

The separator character '$' ensures that no Z-value in the text portion can exceed m. Even if T[j..] looks like it continues matching P, the match would have to "cross" the '$' to extend beyond m characters, which is impossible since '$' only appears once and doesn't match itself.

Handling Special Cases

Real-world pattern matching requires handling several special cases correctly.

Special Cases to Handle

•Empty pattern: Return empty list (or consider returning all positions, depending on semantics)
•Pattern longer than text: Return empty list immediately
•Pattern equals text: Return [0] as the single occurrence
•No occurrences: Return empty list (algorithm handles this naturally)
•Overlapping occurrences: All are found (e.g., "aa" in "aaa" occurs at 0 and 1)
•Separator character appears in input: Must choose a different separator

The Separator Problem:

If the pattern or text might contain the '$' character, you have two options:

Option 1: Choose a different separator

Find a character not in either string. This requires O(n + m) time to scan for character frequencies, but is straightforward.

Option 2: Compute Z-array directly on P$T without storing the combined string

Modify the Z-algorithm to work on an implicit concatenation: when accessing position i, return P[i] for i < m, '$' for i = m, and T[i-m-1] for i > m. This avoids the separator issue entirely.

pattern_matching_robust
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
/**
 * Robust pattern matching that handles all edge cases,
 * including when the separator character exists in the input.
 */
function findPatternRobust(text: string, pattern: string): number[] {
    if (pattern.length === 0) return [];
    if (pattern.length > text.length) return [];
    
    const m = pattern.length;
    const n = text.length;
    const total = m + 1 + n;
    
    // Virtual accessor: access the combined string without building it
    // This avoids the separator character problem
    function charAt(i: number): string {
        if (i < m) return pattern[i];
        if (i === m) return '\0';  // Use null character as separator
        return text[i - m - 1];
    }
    
    // Z-algorithm on virtual string
    const Z: number[] = new Array(total).fill(0);
    let l = 0, r = 0;
    
    for (let i = 1; i < total; i++) {
        if (i > r) {
            l = r = i;
            while (r < total && charAt(r - l) === charAt(r)) {
                r++;
            }
            Z[i] = r - l;
            r--;
        } else {
            const k = i - l;
            if (Z[k] < r - i + 1) {
                Z[i] = Z[k];
            } else {
                l = i;
                while (r < total && charAt(r - l) === charAt(r)) {
                    r++;
                }
                Z[i] = r - l;
                r--;
            }
        }
    }
    
    // Collect results
    const result: number[] = [];
    for (let i = m + 1; i < total; i++) {
        if (Z[i] === m) {
            result.push(i - m - 1);
        }
    }
    
    return result;
}

Space Optimization

The virtual accessor approach has another benefit: it avoids creating a new string of length m+1+n. For very large inputs, this can save significant memory. The cost is slightly more complex per-character access logic.

Complexity Analysis

Let's verify that Z-based pattern matching achieves optimal O(n + m) complexity.

Complexity Breakdown
Operation	Time	Space
Construct combined string	O(n + m)	O(n + m)
Compute Z-array	O(n + m)	O(n + m)
Scan for occurrences	O(n)	O(k) where k = number of occurrences
Total	O(n + m)	O(n + m)

Time Complexity: O(n + m)

The combined string has length n + m + 1
Z-array computation is O(length) = O(n + m)
Scanning for matches is O(n)
Total: O(n + m)

Space Complexity: O(n + m)

Combined string: O(n + m) characters
Z-array: O(n + m) integers
Output: O(k) where k ≤ n is the number of occurrences

Space Optimization:

Using the virtual accessor technique, we can reduce space to O(m + n) for the Z-array alone, without creating the combined string. This is optimal since we must at least store the Z-array (or compute it on-the-fly).

Optimality

O(n + m) is the best possible for this problem. We must at least read the entire pattern (m) and text (n) to guarantee correctness, establishing Ω(n + m) as a lower bound. The Z-algorithm matches this lower bound.

Worked Examples

Let's work through several examples to build intuition and verify our understanding.

Example 1: Overlapping Matches

Pattern P = "aa"
Text T = "aaaa"

Combined S = "aa$aaaa"
          Index: 0 1 2 3 4 5 6

Z-array: [0, 1, 0, 2, 1, 2, 1]

Positions where Z[i] = 2 (= m):
- i = 3: text position 3 - 2 - 1 = 0
- i = 5: text position 5 - 2 - 1 = 2

Wait, that gives only positions 0 and 2. But "aa" occurs at 0, 1, and 2!

Let me re-check:
Z[3] = length of match between "aaaa" and "aa" = 2 ("aa" matches) → occurrence at 0 ✓
Z[4] = length of match between "aaa" and "aa" = 2? Let's see: S[4]='a'=S[0], S[5]='a'=S[1], S[6]='a'≠S[2]='$'... wait, S[2]='$', so Z[4]=2 → occurrence at 1 ✓
Z[5] = length of match between "aa" and "aa" = 2 → occurrence at 2 ✓
Z[6] = length of match between "a" and "aa" = 1 (only 1 char left)

Correct Z-array: [0, 1, 0, 2, 2, 2, 1]

Occurrences at positions: 0, 1, 2 ✓

Example 2: No Matches

Pattern P = "xyz"
Text T = "abcdefg"

Combined S = "xyz$abcdefg"
          Index: 0 1 2 3 4 5 6 7 8 9 10

Z-array: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

No position has Z[i] = 3 (= m)

Occurrences: [] (empty list)

Example 3: Pattern Equals Text

Pattern P = "abcd"
Text T = "abcd"

Combined S = "abcd$abcd"
          Index: 0 1 2 3 4 5 6 7 8

Z-array: [0, 0, 0, 0, 0, 4, 0, 0, 0]

Z[5] = 4 = m → occurrence at position 5 - 4 - 1 = 0

Occurrences: [0] ✓

Practice Exercise

Try computing the Z-array and finding occurrences for: Pattern = "aba", Text = "abababa". The answer should be [0, 2, 4] corresponding to the three overlapping occurrences of "aba".

Beyond Basic Pattern Matching

The Z-algorithm's power extends beyond simple pattern matching. Here are some advanced applications that use the same concatenation technique.

Extended Applications

•Longest Prefix That Is Also Suffix: Compute Z-array of string S. For each i, if i + Z[i] = n, then S[0..Z[i]-1] is both a prefix and suffix.
•String Periodicity: A string S has period p if S[i] = S[i mod p] for all i. The Z-array helps identify the minimal period.
•Counting Distinct Substrings: Combined with suffix arrays, Z-values help count unique substrings efficiently.
•Pattern Matching with Wildcards: With modifications, Z-based techniques can handle patterns with wildcard characters.
•Circular String Matching: Finding occurrences of a pattern in a circular string uses concatenation like P$T$T.

prefix_equals_suffix
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
/**
 * Finds all lengths l such that S[0..l-1] equals S[n-l..n-1]
 * i.e., all prefixes that are also suffixes.
 */
function prefixEqualsSuffix(s: string): number[] {
    const n = s.length;
    if (n === 0) return [];
    
    const Z = computeZArray(s);
    const result: number[] = [];
    
    for (let i = 1; i < n; i++) {
        // If the Z-box [i, i + Z[i] - 1] reaches the end of the string
        if (i + Z[i] === n) {
            result.push(Z[i]);
        }
    }
    
    // The entire string is trivially a prefix and suffix of itself
    result.push(n);
    
    return result.sort((a, b) => a - b);
}
 
// Example
console.log(prefixEqualsSuffix("abacaba")); 
// Output: [1, 3, 7] → "a", "aba", "abacaba"

Summary

We've seen how the Z-algorithm solves the fundamental pattern matching problem with elegant efficiency.

Key Takeaways

•Concatenation Technique: Create P$T and compute the Z-array. Positions where Z[i] = m indicate pattern occurrences.
•Separator Role: The '$' character prevents Z-values from exceeding m, making detection unambiguous.
•Optimal Complexity: O(n + m) time matches the theoretical lower bound for pattern matching.
•Correctness: Proven rigorously—Z[i] = m if and only if the pattern occurs at the corresponding text position.
•Robustness: Edge cases (empty pattern, overlapping matches, separator in input) are all handleable.
•Extensibility: The same technique powers many advanced string algorithms beyond basic pattern matching.

What's Next:

We've now seen the Z-algorithm in action. The final page compares it with KMP—another O(n + m) pattern matching algorithm—exploring their similarities, differences, and when to prefer one over the other.

Page Complete

You can now use the Z-algorithm for pattern matching. This is a practical skill—pattern matching appears in countless real-world scenarios from text editors to bioinformatics to log analysis. The next page contextualizes Z against its famous sibling, KMP.