Loading learning content...
Strings are everywhere in computing—from the DNA sequences analyzed by bioinformaticians to the log files parsed by system administrators, from the code indexed by search engines to the text processed by natural language models. At the heart of working with strings lies a fundamental question: how do parts of a string relate to each other?
The Z-array is a deceptively simple data structure that answers this question with remarkable elegance. For any string, the Z-array captures precisely how the prefix of that string matches substrings starting at each position. This seemingly modest information unlocks a world of efficient algorithms—from pattern matching to string compression to finding the longest repeated prefix.
By the end of this page, you will understand exactly what the Z-array is, how it encodes prefix-matching information, and why this structure forms the foundation for the Z-algorithm. You'll be able to compute Z-arrays by hand and reason about their properties with confidence.
Let's establish the precise definition that will guide our entire discussion.
Definition (Z-array): Given a string S of length n (indexed from 0 to n-1), the Z-array Z[0..n-1] is defined such that:
Z[0] is typically defined as 0, undefined, or equal to n (conventions vary)i > 0, Z[i] equals the length of the longest substring starting at position i that matches a prefix of the string SIn other words, Z[i] answers the question: "Starting at position i, how many characters match the beginning of the string?"
This is a prefix-matching measure. It tells us exactly how much of the prefix is "repeated" starting at position i.
Think of Z[i] as asking: "If I overlay the string on top of itself, starting from position i, how far can I go before the characters stop matching?" This overlay perspective will help you visualize Z-values throughout this module.
Formal Expression:
For position i where 1 ≤ i < n:
Z[i] = max { k : S[0..k-1] = S[i..i+k-1] }
Where S[a..b] denotes the substring from index a to index b (inclusive). The value Z[i] is the largest k such that the first k characters of S match the k characters starting at position i.
If no match exists at all (i.e., S[0] ≠ S[i]), then Z[i] = 0.
Abstract definitions become clear through concrete examples. Let's compute Z-arrays for several strings of increasing complexity.
Example 1: S = "aabcaab"
Let's trace through each position:
| Index | Character | Substring starting here | Prefix Match | Z[i] |
|---|---|---|---|---|
| 0 | a | (S itself or undefined) | — | 0* |
| 1 | a | "abcaab" | S[0]="a" matches, S[1]="a" vs S[1]="b" ≠ | 1 |
| 2 | b | "bcaab" | S[0]="a" vs "b" ≠ | 0 |
| 3 | c | "caab" | S[0]="a" vs "c" ≠ | 0 |
| 4 | a | "aab" | S[0..2]="aab" matches! | 3 |
| 5 | a | "ab" | S[0]="a" matches, S[1]="a" vs "b" ≠ | 1 |
| 6 | b | "b" | S[0]="a" vs "b" ≠ | 0 |
Z-array: [0, 1, 0, 0, 3, 1, 0]
Note: Z[0] is often set to 0 by convention since comparing the entire string to itself is trivially n.
Notice at position 4, Z[4] = 3, indicating that "aab" appears again starting at index 4. This is precisely the kind of repetition information that makes the Z-array so powerful for pattern matching.
Example 2: S = "aaaaa"
A string of all identical characters:
| Index | Substring | Prefix Match | Z[i] |
|---|---|---|---|
| 0 | "aaaaa" | — | 0 |
| 1 | "aaaa" | Matches S[0..3]="aaaa" | 4 |
| 2 | "aaa" | Matches S[0..2]="aaa" | 3 |
| 3 | "aa" | Matches S[0..1]="aa" | 2 |
| 4 | "a" | Matches S[0]="a" | 1 |
Z-array: [0, 4, 3, 2, 1]
This reveals a beautiful pattern: in a string of identical characters, Z[i] = n - i for i > 0. Each position matches as much as possible until reaching the end.
Example 3: S = "abcdefg" (all distinct)
When no characters repeat:
| Index | Substring | Prefix Match | Z[i] |
|---|---|---|---|
| 0 | "abcdefg" | — | 0 |
| 1 | "bcdefg" | "a" vs "b" ≠ | 0 |
| 2 | "cdefg" | "a" vs "c" ≠ | 0 |
| 3 | "defg" | "a" vs "d" ≠ | 0 |
| 4 | "efg" | "a" vs "e" ≠ | 0 |
| 5 | "fg" | "a" vs "f" ≠ | 0 |
| 6 | "g" | "a" vs "g" ≠ | 0 |
Z-array: [0, 0, 0, 0, 0, 0, 0]
For strings with no character repetition, the Z-array is all zeros (except potentially Z[0]). No position matches even the first character of the prefix.
Understanding the Z-array leads naturally to a related concept that becomes crucial for the efficient algorithm: the Z-box.
Definition (Z-box): If Z[i] > 0, then the interval [i, i + Z[i] - 1] is called a Z-box. This interval represents the span of characters that match the prefix.
Visually, a Z-box is a "window" into the string where we know the content exactly matches the beginning of the string.
Example: In S = "aabcaab", at position 4, Z[4] = 3.
Z-boxes are not just theoretical constructs—they're the key to computing the Z-array in linear time. When we're inside a known Z-box, we can potentially reuse previously computed Z-values rather than comparing character by character. This insight transforms a naive O(n²) algorithm into an elegant O(n) solution.
The Rightmost Z-Box:
As we compute the Z-array from left to right, we track the rightmost Z-box seen so far. Let's denote:
l (left): the starting index of the rightmost Z-boxr (right): the ending index of the rightmost Z-boxThis rightmost Z-box represents our "window of knowledge"—within this window, we know exactly how the characters relate to the prefix, and this information can accelerate our computation.
Visual Representation:
String S: a a b c a a b
Index: 0 1 2 3 4 5 6
Z-array: - 1 0 0 3 1 0
Z-boxes:
Position 1: [1, 1] (length 1)
Position 4: [4, 6] (length 3) ← This becomes the rightmost Z-box
Position 5: [5, 5] (length 1, but inside [4,6])
Before we dive into algorithms, let's establish some important properties of Z-values that inform both correctness proofs and practical applications.
Property: Immediate Mismatch Detection
If we know Z[i] = k, we immediately know:
This "mismatch boundary" information is exactly what makes the Z-array useful for pattern matching. We don't just know that a match exists—we know precisely where it ends and why.
The Z-array and the failure function (prefix function) from KMP encode related but subtly different information. The prefix function answers 'what's the longest proper prefix that's also a suffix up to this point?' while the Z-array answers 'how much of the prefix matches starting here?' These dual perspectives on prefix structure are both powerful and complementary.
Before learning the efficient algorithm, every practitioner should be able to compute Z-arrays by hand. This builds the intuition necessary to understand and debug more complex implementations.
Step-by-Step Manual Process:
Detailed Walkthrough: S = "aabxaabxcaab"
Let's compute this step by step:
Index: 0 1 2 3 4 5 6 7 8 9 10 11
String: a a b x a a b x c a a b
Position 1: Compare S[1..] with S[0..]
Position 2: Compare S[2..] with S[0..]
Position 3: Compare S[3..] with S[0..]
Position 4: Compare S[4..] with S[0..]
Position 5: Compare S[5..] with S[0..]
Continue for remaining positions...
Final Z-array: [0, 1, 0, 0, 4, 1, 0, 0, 0, 3, 1, 0]
This manual method has O(n²) worst-case complexity. Consider S = "aaaa...a" (n a's): Position 1 requires n-1 comparisons, position 2 requires n-2, and so on. Total: O(n²) comparisons. The efficient Z-algorithm achieves O(n) by cleverly reusing Z-box information.
The Z-array is one member of a family of string preprocessing structures. Understanding how it compares to alternatives helps you choose the right tool for each problem.
| Structure | What It Computes | Key Applications | Relation to Z-Array |
|---|---|---|---|
| Z-Array | Prefix matches at each position | Pattern matching, longest common prefix | The subject of this module |
| Prefix Function (KMP) | Longest proper prefix-suffix at each position | Pattern matching, string periodicity | Can be converted to/from Z-array in O(n) |
| Suffix Array | Sorted order of all suffixes | Pattern matching, LCP queries, compression | Different perspective; heavier preprocessing |
| Rolling Hash | Hash values for substrings | Pattern matching, similarity detection | Probabilistic; no structural information |
The Z-Array ↔ Prefix Function Equivalence:
A profound result in string algorithmics is that the Z-array and the KMP prefix function contain equivalent information. Given one, you can compute the other in O(n) time:
From Z to Prefix Function:
for i = 1 to n-1:
if Z[i] > 0:
for j = Z[i]-1 down to 0:
if prefix[i+j] > 0: break
prefix[i+j] = j + 1
From Prefix Function to Z:
for i = 1 to n-1:
if prefix[i] > 0:
Z[i - prefix[i] + 1] = prefix[i]
(Note: These conversions require careful handling of overlapping cases, but the fundamental equivalence holds.)
This equivalence means that any problem solvable with the Z-array is also solvable with the prefix function, and vice versa. The choice often comes down to which is more intuitive for a given problem.
Many practitioners find the Z-array more intuitive than the prefix function because Z[i] directly answers 'how much matches starting here?' In contrast, the prefix function answers 'what is the longest proper prefix that is also a suffix?'—a more indirect question. However, both are worth mastering as different problems naturally suit different perspectives.
A powerful way to understand the Z-array is through visualization. Let's see how Z-values manifest as "match bars" extending from each position.
Visual Representation for S = "aabcaab":
Index: 0 1 2 3 4 5 6
String: a a b c a a b
─────────────────────────────
Z[0]: - (not computed / convention)
Z[1]: │ • (length 1: matches 'a')
Z[2]: - (length 0: 'b' ≠ 'a')
Z[3]: - (length 0: 'c' ≠ 'a')
Z[4]: •───•───• (length 3: matches 'aab')
Z[5]: • (length 1: matches 'a')
Z[6]: - (length 0: 'b' ≠ 'a')
Each "•" represents a matching character, and the "───" shows the extent of the match. The Z-box for position 4 spans indices [4, 5, 6].
Overlay Perspective:
Another way to visualize: imagine sliding a copy of the string starting at each position and counting matching characters:
Original: a a b c a a b
Slide to 1: a a b c a a b
↓
a ✓ (1 match)
a ✗
Slide to 4: a a b c a a b
↓ ↓ ↓
a a b ✓ ✓ ✓ (3 matches)
c ✗ (position 7 would mismatch)
This overlay model makes it viscerally clear what Z[i] measures: the "overlap" between the prefix and the substring starting at i.
Spend time with these visualizations until the Z-array feels intuitive. When you can look at a string and roughly estimate its Z-values, you'll find the algorithm and its applications much easier to grasp. This visual intuition is what separates proficient from fluent string algorithm practitioners.
We've established the conceptual foundation for the Z-algorithm. Let's consolidate the key insights before moving to the efficient computation method.
What's Next:
We now understand what the Z-array is. The next page addresses the critical question: how do we compute it efficiently? We'll see how the Z-box concept enables a beautiful linear-time algorithm that computes all Z-values in just O(n) time, regardless of the string's structure.
You now have a solid understanding of the Z-array definition, its properties, and its relationship to other string structures. This conceptual foundation is essential for understanding both the efficient algorithm and its applications in pattern matching.