Understanding what a suffix array is marks only the beginning of the journey. The real question is: how do we build one efficiently?
At first glance, the problem seems straightforward—sort the suffixes. But this naive approach hides a computational pitfall. When sorting n suffixes using standard comparison-based sorting, each comparison involves comparing two suffixes character by character, potentially taking O(n) time in the worst case. With O(n log n) comparisons in a sorting algorithm, the total time becomes O(n² log n)—prohibitively slow for million-character strings.
Yet suffix arrays are routinely built for genome sequences with billions of characters. How is this possible? The answer lies in clever algorithms that exploit the shared structure of suffixes. On this page, we'll journey from the naive approach through progressively more sophisticated algorithms, building intuition for how suffix array construction evolved from O(n² log n) to O(n).
By the end of this page, you will understand the naive construction approach and its complexity, the key insight of prefix doubling that enables O(n log n) construction, conceptual overviews of advanced O(n) algorithms (DC3/Skew, SA-IS), and how to choose a construction algorithm for your use case.
The most direct way to build a suffix array mirrors its definition: generate all suffixes, sort them, and record the original positions.
Naive Algorithm:
1. Create pairs: [(suffix starting at 0, 0), (suffix starting at 1, 1), ..., (suffix starting at n-1, n-1)]
2. Sort these pairs by the suffix component (lexicographically)
3. Extract the position components in sorted order → this is the suffix array
This approach is conceptually clean but computationally expensive. Let's analyze why.
```python
def naive_suffix_array(s):
    """
    Build suffix array using naive sorting.
    Time: O(n² log n), Space: O(n²)
    """
    n = len(s)

    # Create list of (suffix, starting_position) pairs
    # Note: We're explicitly storing suffixes - this takes O(n²) space!
    suffixes = [(s[i:], i) for i in range(n)]

    # Sort by suffix (lexicographic order)
    # Python's sort performs O(n log n) comparisons,
    # but each comparison is O(n) in the worst case (comparing two suffixes)
    # Total: O(n² log n)
    suffixes.sort()

    # Extract positions in sorted order
    suffix_array = [pos for (suffix, pos) in suffixes]
    return suffix_array

# Example usage
text = "banana"
sa = naive_suffix_array(text)
print(f"Suffix Array: {sa}")  # Output: [5, 3, 1, 0, 4, 2]

# Verify by printing sorted suffixes
for i, pos in enumerate(sa):
    print(f"SA[{i}] = {pos}: '{text[pos:]}'")
```

Complexity Analysis:
Why is comparison O(n)?
When comparing suffixes like "banana" and "anana", we compare character by character until we find a difference. In the worst case (e.g., comparing "aaaa...aab" with "aaaa...aaa"), we might compare almost all characters before finding the difference.
For a string of 1 million characters, naive construction requires roughly n² log₂ n ≈ 10^12 × 20 = 2 × 10^13 character comparisons. At 10^9 operations per second, that is about 20,000 seconds, or more than five hours for a single build. Clearly, we need something better.
The first optimization eliminates explicit suffix storage. Instead of creating substring copies, we store only starting positions and compare by reading from the original string.
Key Insight:
We don't need to store "banana", "anana", "nana", etc. as separate strings. We can represent each suffix by its starting index and read characters from the original string when comparing.
Suffix at position i = original_string[i:]
This reduces space from O(n²) to O(n), but time remains O(n² log n) because comparisons still take O(n) in the worst case.
```python
from functools import cmp_to_key

def space_efficient_suffix_array(s):
    """
    Build suffix array without storing explicit suffixes.
    Time: O(n² log n), Space: O(n)
    """
    n = len(s)

    # Store only indices
    positions = list(range(n))

    # Custom comparison function that reads from the original string
    def compare_suffixes(i, j):
        """Compare suffix starting at i with suffix starting at j"""
        while i < n and j < n:
            if s[i] < s[j]:
                return -1
            elif s[i] > s[j]:
                return 1
            i += 1
            j += 1
        # Shorter suffix is smaller
        if i == n and j == n:
            return 0
        return -1 if i == n else 1

    # Python 3 requires functools.cmp_to_key for comparator-based sorting
    positions.sort(key=cmp_to_key(compare_suffixes))

    return positions

# Verify it works
text = "banana"
sa = space_efficient_suffix_array(text)
print(f"Suffix Array: {sa}")  # [5, 3, 1, 0, 4, 2]
```

Progress So Far:
| Approach | Time | Space |
|---|---|---|
| Naive (explicit suffixes) | O(n² log n) | O(n²) |
| Naive (indices only) | O(n² log n) | O(n) |
We've solved the space problem, but the time bound is still O(n² log n). The bottleneck is that each comparison takes O(n) work. To break through, we need to make comparisons faster.
The breakthrough to O(n log n) construction comes from a beautiful observation: we can reuse earlier work to speed up later comparisons.
The Prefix Doubling Strategy (Manber-Myers Algorithm, 1990):
Instead of sorting suffixes all at once, we sort them incrementally: first by their first character, then by their first 2 characters, then 4, 8, and so on, doubling the prefix length each round.
After ⌈log₂ n⌉ iterations, we've effectively sorted by the entire suffix (since a suffix has at most n characters).
The Key Insight:
When sorting by 2k characters, we can use the results from sorting by k characters!
Here's why: the first 2k characters of the suffix starting at position i are:
first k characters of suffix(i) + first k characters of suffix(i + k)
For example, in "banana" with k = 2, the first 4 characters of suffix(1) = "anana" are "an" (from suffix(1)) followed by "an" (the first 2 characters of suffix(3) = "ana").
In other words, comparing two suffixes by their first 2k characters is equivalent to comparing their rank pairs: first compare the k-character ranks of suffix(i) and suffix(j), and break ties by comparing the k-character ranks of suffix(i + k) and suffix(j + k).
This transforms an O(n) comparison into an O(1) lookup!
By the end of iteration k (sorting by 2^k characters), we have a rank for each suffix based on its first 2^k characters. These ranks let us compare 2^(k+1) characters using just two rank lookups (O(1) time each). This is the key to achieving O(n log n) total time.
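To make this concrete, here is a minimal sketch of the constant-time comparison, assuming a `rank` array that holds each suffix's rank by its first k characters (the function name and the -1 convention for out-of-range positions are illustrative, not from a specific implementation):

```python
def compare_by_2k(rank, i, j, k, n):
    """Compare suffixes i and j by their first 2k characters in O(1)."""
    # Rank of the "second half"; positions past the end rank smallest
    left = (rank[i], rank[i + k] if i + k < n else -1)
    right = (rank[j], rank[j + k] if j + k < n else -1)
    return (left > right) - (left < right)  # -1, 0, or 1, like a comparator
```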
Let's rigorously analyze the prefix doubling algorithm:
Number of Iterations: the prefix length doubles each round (1, 2, 4, 8, ...), so after ⌈log₂ n⌉ = O(log n) iterations every suffix is ranked by its full length.
Work Per Iteration: each iteration sorts n rank pairs and then recomputes ranks in a single O(n) pass. The sort costs O(n log n) with a comparison sort, or O(n) with radix sort.
Total Time: O(log n) iterations × O(n log n) per iteration = O(n log² n) with comparison sorting; O(log n) × O(n) = O(n log n) with radix sort.
Achieving O(n log n) with Radix Sort:
The pairs we're sorting have the form (rank₁, rank₂) where each rank is in range [0, n-1]. We can sort these pairs in O(n) using two passes of stable counting sort: first by rank₂, then by rank₁. Each pass is O(n) because the keys are bounded by n.
This gives O(n) per iteration, and with O(log n) iterations, total time is O(n log n).
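As a sketch, one stable counting-sort pass might look like this (the helper name is ours; the full implementation later on this page inlines the same idea):

```python
def counting_sort_by(order, key, max_key):
    """Stably reorder `order` by key(i), keys in [-1, max_key]. O(n + max_key)."""
    cnt = [0] * (max_key + 2)            # shift keys by +1 so -1 fits at index 0
    for i in order:
        cnt[key(i) + 1] += 1
    start = [0] * (max_key + 2)          # first slot of each key's bucket
    for r in range(1, max_key + 2):
        start[r] = start[r - 1] + cnt[r - 1]
    result = [0] * len(order)
    for i in order:                      # forward scan preserves stability
        result[start[key(i) + 1]] = i
        start[key(i) + 1] += 1
    return result
```

Calling it twice, first keyed on rank₂ and then on rank₁, yields the full (rank₁, rank₂) order, because the stable second pass never reorders elements that tie on rank₁.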
Space Complexity: O(n). We keep only the rank array, the suffix array, and one temporary array, each of length n.
| Algorithm | Time | Space | Key Idea |
|---|---|---|---|
| Naive | O(n² log n) | O(n) or O(n²) | Direct suffix sorting |
| Prefix Doubling (comparison sort) | O(n log² n) | O(n) | Reuse ranks, double prefix length |
| Prefix Doubling (radix sort) | O(n log n) | O(n) | Radix sort on rank pairs |
In practice, the O(n log² n) version with standard sorting is often competitive due to cache efficiency and simpler code. The O(n log n) radix sort version has lower asymptotic complexity but may be slower for moderate n due to constant factors. Profile before optimizing.
Can we do better than O(n log n)? Surprisingly, yes! Several O(n) algorithms exist for suffix array construction. We'll focus on the conceptual understanding of the DC3 algorithm (also known as the Skew algorithm), introduced by Kärkkäinen and Sanders in 2003.
The Core Idea: Divide and Conquer
The DC3 algorithm cleverly divides suffixes into groups based on their starting position modulo 3: Group 0 contains suffixes starting at positions with i mod 3 = 0, Group 1 those with i mod 3 = 1, and Group 2 those with i mod 3 = 2.
Groups 1 and 2 together contain ⌈2n/3⌉ suffixes (about two-thirds of all suffixes).
Algorithm Outline:
Step 1: Recursively sort suffixes in Groups 1 and 2
Encode the ~2n/3 suffixes from groups 1 and 2 as triples of characters, creating a new string of length ~2n/3. Recursively compute the suffix array of this new string.
Why triples? Because looking at 3 characters starting at position i (where i mod 3 ≠ 0) gives us enough information to create a unique new 'character' for the recursive problem.
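Here is a minimal sketch of that setup, assuming the string is padded with sentinel characters smaller than any real one (function and variable names are illustrative):

```python
def dc3_sample_setup(s):
    """Collect the group 1/2 positions and their character triples."""
    n = len(s)
    pad = s + '\0\0'                     # sentinels so triples never overrun
    # Group 1 positions first, then group 2 (the order DC3 uses
    # when concatenating them into the reduced string)
    sample = [i for i in range(n) if i % 3 == 1] + \
             [i for i in range(n) if i % 3 == 2]
    # Each triple acts as one "character" of the reduced string whose
    # suffix array DC3 computes recursively
    triples = [pad[i:i + 3] for i in sample]
    return sample, triples
```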
Step 2: Sort Group 0 suffixes using results from Step 1
For each suffix in group 0 (starting at position i where i mod 3 = 0), we can compare it with any suffix from groups 1 or 2 in O(1) time using at most two characters plus a rank from Step 1: against a group 1 suffix, compare (s[i], rank of suffix(i+1)); against a group 2 suffix, compare (s[i], s[i+1], rank of suffix(i+2)). In both cases the fallback suffixes lie in groups 1 or 2, so their ranks are already known (see the sketch below).
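A minimal sketch of those two comparisons, assuming `rank[p]` gives the order of every group 1/2 suffix from the recursive call and the string is sentinel-padded so the indexing never overruns (names are illustrative):

```python
def less_g0_g1(s, rank, i, j):
    """suffix(i) < suffix(j), where i % 3 == 0 and j % 3 == 1."""
    # One character, then suffixes i+1 (group 1) and j+1 (group 2),
    # both of which already have ranks
    return (s[i], rank[i + 1]) < (s[j], rank[j + 1])

def less_g0_g2(s, rank, i, j):
    """suffix(i) < suffix(j), where i % 3 == 0 and j % 3 == 2."""
    # Two characters, then suffixes i+2 (group 2) and j+2 (group 1)
    return (s[i], s[i + 1], rank[i + 2]) < (s[j], s[j + 1], rank[j + 2])
```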
Step 3: Merge the sorted groups
Merge the sorted group 0 suffixes with the sorted groups 1+2 suffixes. Since we can compare any two suffixes in O(1) using the recursive result, merging takes O(n) time.
Why is this O(n)?
Let T(n) be the time to process a string of length n.
Recurrence: T(n) = T(2n/3) + O(n)
By the Master Theorem: T(n) = O(n)
The key insight is that the recursive subproblem shrinks by a constant factor (2/3 < 1), so the total work forms a geometrically decreasing series: n + (2/3)n + (2/3)²n + ⋯ ≤ 3n = O(n).
The choice of modulo 3 is deliberate. It ensures that: (1) Groups 1+2 contain only 2/3 of suffixes, making recursion decrease; (2) We can compare a group 0 suffix with groups 1/2 suffixes in O(1) by looking at just 1 or 2 characters plus a recursive comparison. The DC3 name comes from 'Difference Cover' of size 3: {1, 2} covers all differences mod 3.
Another O(n) algorithm, often preferred in practice, is the SA-IS algorithm (Suffix Array by Induced Sorting) by Nong, Zhang, and Chan (2009). It typically outperforms DC3 due to better cache behavior and simpler operations.
Key Concepts:
1. S-type and L-type Suffixes:
Classify each suffix as: S-type if suffix(i) is lexicographically smaller than suffix(i+1), or L-type if it is larger. When s[i] = s[i+1], position i inherits the type of position i+1 (the suffixes compare exactly as their tails do).
The last suffix (length 1, the sentinel) is S-type by convention.
2. LMS (Leftmost S-type) Suffixes:
An LMS suffix starts at position i if: position i is S-type, and position i−1 is L-type (which requires i > 0).
LMS suffixes are the 'boundary' points where the type changes from L to S.
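To ground these definitions, here is a minimal sketch of the classification and LMS scan, assuming the input already ends with a unique smallest sentinel (the function name is illustrative):

```python
def classify_and_find_lms(s):
    """Mark each position S-type/L-type, then collect LMS positions."""
    n = len(s)
    is_s = [False] * n
    is_s[n - 1] = True                      # sentinel is S-type by convention
    for i in range(n - 2, -1, -1):          # single right-to-left scan: O(n)
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])
    # LMS positions: S-type immediately preceded by an L-type
    lms = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    return is_s, lms
```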
Algorithm Overview:
Step 1: Classify all positions as S-type or L-type (O(n) scan from right to left)
Step 2: Identify LMS positions (positions where L-type is immediately followed by S-type)
Step 3: Place LMS suffixes into buckets and induce sort
Step 4: Recursively sort LMS substrings if needed
If there are collisions (same bucket, same LMS substring), create a reduced problem and recurse. The number of LMS positions is at most n/2, so recursion depth is O(log n), but total work remains O(n).
Step 5: Final induced sort using correct LMS order
Why SA-IS is Practical: its passes are simple left-to-right and right-to-left scans with excellent cache behavior, the recursive subproblem never exceeds n/2 elements, and its constant factors are small.
SA-IS is widely used in bioinformatics (e.g., the BWA-MEM DNA aligner) and compression tools.
The 'induced sorting' idea is powerful: once we know the correct positions of some suffixes (the LMS ones), we can deduce the positions of all others. Because suffix(i) > suffix(i+1) whenever suffix(i) is L-type, a left-to-right scan that has already placed suffix(i+1) can immediately drop suffix(i) into the next free slot at the front of its character's bucket; a symmetric scan handles S-type suffixes. This propagation of information is what makes the algorithm linear.
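As a sketch of one induction pass (the L-type pass), assume `sa` has been pre-seeded with the sorted LMS suffixes at the tails of their buckets and -1 everywhere else, and `is_s` comes from the classification sketch above; a symmetric right-to-left pass then induces the S-type suffixes:

```python
def induce_l_pass(s, sa, is_s):
    """Left-to-right scan placing each L-type suffix at its bucket head."""
    n = len(s)
    # Compute the head (first free slot) of each character's bucket
    cnt = {}
    for c in s:
        cnt[c] = cnt.get(c, 0) + 1
    head, total = {}, 0
    for c in sorted(cnt):
        head[c] = total
        total += cnt[c]
    for i in range(n):
        j = sa[i] - 1                      # predecessor suffix, if any
        if sa[i] > 0 and not is_s[j]:      # only L-type predecessors
            sa[head[s[j]]] = j             # place right after the suffixes
            head[s[j]] += 1                # already in its bucket
```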
With multiple algorithms available, how do you choose? Here's a decision framework:
For Learning and Understanding: implement the naive approach first, then prefix doubling; the progression builds the right intuitions.
For Production Code: use a library implementation of SA-IS or another O(n) algorithm rather than writing your own.
For Competitive Programming: prefix doubling offers the best balance of coding speed and runtime.
| Scenario | Recommended Algorithm | Reasoning |
|---|---|---|
| Learning/Education | Naive → Prefix Doubling | Build conceptual understanding progressively |
| Quick prototype | Prefix Doubling (comparison sort) | Simple to implement, O(n log² n) is fine for development |
| Medium-scale production (n ≤ 10^6) | Prefix Doubling (radix sort) | O(n log n), straightforward, reliable |
| Large-scale production (n > 10^6) | SA-IS (library) | O(n), cache-friendly, battle-tested |
| Memory-constrained environment | SA-IS (in-place variant) | O(n) time with minimal extra space |
| Competitive programming | Prefix Doubling | Easy to code quickly, adequate performance |
In practice, use battle-tested libraries: C++ has the 'divsufsort' library, Python has 'pysais', and many bioinformatics tools include SA construction. Don't reinvent the wheel for production unless you have specific requirements.
Let's implement the O(n log n) prefix doubling algorithm with counting sort for a concrete reference:
```python
def build_suffix_array(s):
    """
    Build suffix array using prefix doubling with counting sort.
    Time: O(n log n), Space: O(n)
    """
    s = s + '$'  # Append sentinel (smallest character)
    n = len(s)

    # Initial ranks based on character values
    rank = [ord(c) for c in s]
    sa = list(range(n))
    tmp = [0] * n
    k = 1  # Current comparison length

    while k < n:
        # Positions where i + k >= n are treated as rank -1 (smallest)
        def get_second_rank(i):
            return rank[i + k] if i + k < n else -1

        # --- Stable counting sort by second rank (rank[i + k]) ---
        cnt = {}
        for i in range(n):
            r = get_second_rank(i)
            cnt[r] = cnt.get(r, 0) + 1

        # Starting position of each rank's bucket
        pos = {}
        p = 0
        for key in sorted(cnt):
            pos[key] = p
            p += cnt[key]

        # Place into temporary array; scanning in reverse and filling each
        # bucket from its end keeps the sort stable
        temp_sa = [0] * n
        for i in range(n - 1, -1, -1):
            r = get_second_rank(sa[i])
            cnt[r] -= 1
            temp_sa[pos[r] + cnt[r]] = sa[i]

        # --- Stable counting sort by first rank (rank[i]) ---
        cnt = {}
        for i in range(n):
            cnt[rank[i]] = cnt.get(rank[i], 0) + 1

        pos = {}
        p = 0
        for key in sorted(cnt):
            pos[key] = p
            p += cnt[key]

        for i in range(n - 1, -1, -1):
            r = rank[temp_sa[i]]
            cnt[r] -= 1
            sa[pos[r] + cnt[r]] = temp_sa[i]

        # Recompute ranks based on sorted order
        tmp[sa[0]] = 0
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            prev_pair = (rank[prev], get_second_rank(prev))
            curr_pair = (rank[curr], get_second_rank(curr))
            tmp[curr] = tmp[prev] + (1 if curr_pair != prev_pair else 0)
        rank = tmp[:]

        # Early termination if all ranks are unique
        if rank[sa[n - 1]] == n - 1:
            break

        k *= 2

    # Remove the sentinel position from the result
    return [x for x in sa if x < n - 1]

# Test
text = "banana"
sa = build_suffix_array(text)
print(f"Suffix Array: {sa}")
for i, pos in enumerate(sa):
    print(f"SA[{i}] = {pos}: '{text[pos:]}'")
```

This implementation is educational, prioritizing clarity over maximum efficiency. Production implementations use plain arrays instead of dictionaries, operate on integer alphabets throughout, and may use pointer arrays for better cache performance. For competitive programming or production, consider specialized libraries.
We've journeyed from naive construction to sophisticated O(n) algorithms. Here are the key takeaways: naive sorting costs O(n² log n) because every comparison may scan O(n) characters; prefix doubling reuses ranks to make comparisons O(1), reaching O(n log n); DC3 and SA-IS achieve O(n) via divide-and-conquer and induced sorting respectively; and for production work, a well-tested library is usually the right choice.
What's Next:
With suffix arrays constructed, we can now explore their applications. The next page covers two fundamental use cases: pattern searching with binary search, and the LCP array that enables even more powerful queries.
You now understand how suffix arrays are constructed, from intuitive O(n² log n) methods to sophisticated O(n) algorithms. The key insight—reusing computed rankings to speed up comparisons—is a powerful algorithmic technique applicable beyond suffix arrays. Next, we'll see these structures in action for pattern matching and LCP queries.