Loading learning content...
Imagine you're building a search engine for a massive text corpus—millions of documents, billions of characters. A user types a query, and you need to find every occurrence in milliseconds. The naive approach of scanning through all that text for each query would take hours. You need something radically more efficient.
Enter the suffix array—a data structure so elegantly simple that its power might not be immediately apparent. At its core, a suffix array is just an array of integers. Yet this humble sequence enables some of the most sophisticated string operations in computer science: pattern matching in O(m log n) time, finding the longest repeated substring, determining the longest common substring of multiple strings, and much more.
In this page, we'll demystify the suffix array, understand precisely what it represents, and see why sorting suffixes lexicographically unlocks a treasure trove of string algorithmic capabilities.
By the end of this page, you will understand what a suffix array is, how it encodes all suffixes of a string in sorted order, why lexicographic sorting of suffixes is the key insight, and how this simple representation enables powerful substring queries. You'll develop the conceptual foundation needed to understand suffix array construction and applications.
Before we define the suffix array, we must first deeply understand what we mean by a suffix of a string.
Formal Definition:
Given a string S of length n (indexed from 0 to n-1), a suffix is any substring that starts at some position i and extends to the end of the string. We denote the suffix starting at position i as S[i..n-1] or simply suffix(i).
For a string of length n, there are exactly n suffixes (one starting at each position). Each suffix is uniquely identified by its starting index.
Key Observation:
Every substring of the original string is a prefix of some suffix. This seemingly simple observation is profound:
This means that finding a pattern in a string is equivalent to finding which suffixes have that pattern as their prefix. This insight is the foundation of suffix-based string searching.
Any occurrence of a pattern P in string S corresponds to a suffix of S that begins with P. If we can efficiently find all suffixes that start with P, we've found all occurrences of P in S. This is the conceptual bridge that makes suffix arrays powerful for pattern matching.
Now we arrive at the central definition:
The Suffix Array:
A suffix array for a string S of length n is an array SA of n integers, where SA[k] gives the starting index of the suffix that is k-th in lexicographic (alphabetical) order among all suffixes of S.
In other words:
The suffix array doesn't store the suffixes themselves—it stores only their starting positions. This is crucial: an array of n integers takes O(n) space, whereas storing all suffixes explicitly would require O(n²) space.
Lexicographic order is the generalization of alphabetical order to strings. String A is lexicographically smaller than B if, at the first position where they differ, A has a smaller character. If A is a prefix of B, then A is smaller. Sorting suffixes lexicographically groups suffixes with common prefixes together—a property we'll exploit heavily.
| Rank (position in SA) | SA[rank] | Suffix | Length |
|---|---|---|---|
| 0 | 5 | a | 1 |
| 1 | 3 | ana | 3 |
| 2 | 1 | anana | 5 |
| 3 | 0 | banana | 6 |
| 4 | 4 | na | 2 |
| 5 | 2 | nana | 4 |
Sorting suffixes lexicographically creates a remarkably useful property: suffixes with common prefixes are grouped together in the suffix array.
Consider our 'banana' example. Look at how the suffixes are ordered:
"a" ← starts with 'a'
"ana" ← starts with 'a'
"anana" ← starts with 'a'
"banana" ← starts with 'b'
"na" ← starts with 'n'
"nana" ← starts with 'n'
All suffixes starting with 'a' are consecutive. All suffixes starting with 'n' are consecutive. This isn't coincidence—it's a direct consequence of lexicographic ordering.
This grouping property is transformative for pattern matching.
Pattern Matching via Binary Search:
Because suffixes with common prefixes are contiguous in the suffix array, finding all occurrences of a pattern P becomes a range query:
With a suffix array, we can find all k occurrences of a pattern P of length m in a string of length n in O(m log n + k) time—logarithmic in the string length!
When binary searching for a pattern P, each comparison involves comparing P against a suffix of S. This comparison takes O(|P|) = O(m) time in the worst case. Therefore, total search time is O(m log n). Advanced techniques using LCP arrays can reduce this to O(m + log n).
Let's formalize the key properties of suffix arrays:
Property 1: Bijection
The suffix array SA is a permutation of [0, 1, 2, ..., n-1]. Every position appears exactly once because there's exactly one suffix for each starting position, and sorting is a bijection.
Property 2: Inverse Suffix Array
The inverse suffix array (or rank array) ISA is defined as:
ISA[i] = k such that SA[k] = i
In other words, ISA[i] gives the rank (sorted position) of the suffix starting at position i.
They are inverses: SA[ISA[i]] = i and ISA[SA[k]] = k
Property 3: Order Preservation
For any two positions i and j in the original string:
suffix(i) < suffix(j) lexicographically ⟺ ISA[i] < ISA[j]
The inverse suffix array encodes the relative lexicographic ordering of all suffixes. This is extremely useful for algorithms that need to compare suffixes without actually comparing character-by-character.
Property 4: Space Efficiency
A suffix array for a string of length n requires:
Compare this to explicitly storing all suffixes, which would need O(n²) space!
In many implementations, a special 'sentinel' character (often '$' or a null byte) is appended to the string. This sentinel is defined to be lexicographically smaller than all other characters. It ensures that no suffix is a prefix of another suffix, simplifying comparisons and avoiding edge cases.
Developing the right mental model for suffix arrays is crucial. Here are several ways to visualize them:
Model 1: The Sorted Suffix List
Imagine writing out all suffixes on separate cards, then sorting them alphabetically. The suffix array is just the list of original positions on those sorted cards.
Cards after sorting:
┌─────────┬─────────┐
│ Card 1 │ Pos: 5 │ "a"
├─────────┼─────────┤
│ Card 2 │ Pos: 3 │ "ana"
├─────────┼─────────┤
│ Card 3 │ Pos: 1 │ "anana"
├─────────┼─────────┤
│ Card 4 │ Pos: 0 │ "banana"
├─────────┼─────────┤
│ Card 5 │ Pos: 4 │ "na"
├─────────┼─────────┤
│ Card 6 │ Pos: 2 │ "nana"
└─────────┴─────────┘
The suffix array is: [5, 3, 1, 0, 4, 2]
Model 2: The Rotations View
Alternatively, think of the string as circular, with each rotation representing a different "view". If you append a sentinel character, each suffix corresponds to a unique rotation:
banana$ → "banana$"
anana$ → "anana$b" (but we only care about "anana$")
nana$ → "nana$ba" (but we only care about "nana$")
...
This view connects suffix arrays to the Burrows-Wheeler Transform (BWT) used in data compression.
Model 3: The Index Pointer View
Picture an array of n pointers, each pointing into the original string at a different position. Initially unordered, we sort these pointers by what they "see" (the suffix starting at that position). The sorted pointer positions form the suffix array.
A crucial insight: we never explicitly store the suffixes themselves. We only store starting positions (indices). When we need to compare two suffixes, we compare characters directly in the original string starting from those positions. This is why suffix arrays are O(n) space despite representing O(n²) total characters across all suffixes.
To appreciate suffix arrays, let's compare them to alternative approaches for pattern matching:
Naive String Search:
KMP / Z-Algorithm:
Suffix Array:
| Approach | Build Time | Query Time | Space | Best For |
|---|---|---|---|---|
| Naive | O(1) | O(n × m) | O(1) | One-off search, small strings |
| KMP | O(m) per pattern | O(n + m) | O(m) | Single pattern, one-time search |
| Rabin-Karp | O(m) per pattern | O(n + m) avg | O(m) | Multiple pattern matching |
| Suffix Array | O(n log n) or O(n) | O(m log n + k) | O(n) | Many queries on same text |
| Suffix Tree | O(n) | O(m + k) | O(n) but large constant | Speed-critical applications |
When Suffix Arrays Shine:
Think of a suffix array like the index at the back of a book. Building the index takes effort, but once built, you can find any topic (pattern) quickly. Without an index, you'd have to scan every page for each lookup. The upfront investment pays dividends with repeated queries.
Many suffix array implementations append a sentinel character to the string. This character (often denoted '$' or '\0') is defined to be:
Why use a sentinel?
Problem Without Sentinel:
Consider the string "aa". Its suffixes are:
When comparing "a" and "aa", we need a rule for when one string is a prefix of another. Typically, the shorter string is considered smaller ("a" < "aa").
With Sentinel:
The string becomes "aa$". Its suffixes are:
Now no suffix is a prefix of another—they all end at '$'. Comparisons are unambiguous:
Sorted: SA = [2, 1, 0]
Practical Benefits:
When to Include the Sentinel:
When using a sentinel, remember: (1) The alphabet size increases by 1; (2) The string length increases by 1, so your suffix array has n+1 elements; (3) If binary data might contain null bytes, ensure your sentinel is truly unique. Some implementations use -1 or a value outside the valid character range.
We've established the conceptual foundation of suffix arrays. Let's consolidate the key insights:
What's Next:
Now that you understand what a suffix array is, the next page explores how to build one efficiently. We'll examine the naive O(n² log n) construction approach and introduce the key ideas behind the O(n log n) and even O(n) algorithms that make suffix arrays practical for massive strings.
You now possess a solid understanding of what suffix arrays are and why sorting suffixes lexicographically is such a powerful idea. The suffix array is the foundation; building upon it opens the door to a vast array of efficient string algorithms. Next, we'll learn how to construct suffix arrays efficiently.