Understanding what a suffix array is marks only the beginning of the journey. The real question is: how do we build one efficiently?
At first glance, the problem seems straightforward—sort the suffixes. But this naive approach hides a computational pitfall. When sorting n suffixes using standard comparison-based sorting, each comparison involves comparing two suffixes character by character, potentially taking O(n) time in the worst case. With O(n log n) comparisons in a sorting algorithm, the total time becomes O(n² log n)—prohibitively slow for million-character strings.
Yet suffix arrays are routinely built for genome sequences with billions of characters. How is this possible? The answer lies in clever algorithms that exploit the shared structure of suffixes. On this page, we'll journey from the naive approach through progressively more sophisticated algorithms, building intuition for how suffix array construction evolved from O(n² log n) to O(n).
By the end of this page, you will understand the naive construction approach and its complexity, the key insight of prefix doubling that enables O(n log n) construction, conceptual overviews of advanced O(n) algorithms (DC3/Skew, SA-IS), and how to choose a construction algorithm for your use case.
The most direct way to build a suffix array mirrors its definition: generate all suffixes, sort them, and record the original positions.
Naive Algorithm:
1. Create pairs: [(suffix starting at 0, 0), (suffix starting at 1, 1), ..., (suffix starting at n-1, n-1)]
2. Sort these pairs by the suffix component (lexicographically)
3. Extract the position components in sorted order → this is the suffix array
This approach is conceptually clean but computationally expensive. Let's analyze why.
```python
def naive_suffix_array(s):
    """
    Build suffix array using naive sorting.
    Time: O(n² log n), Space: O(n²)
    """
    n = len(s)

    # Create list of (suffix, starting_position) pairs
    # Note: We're explicitly storing suffixes - this takes O(n²) space!
    suffixes = [(s[i:], i) for i in range(n)]

    # Sort by suffix (lexicographic order)
    # Python's sort performs O(n log n) comparisons,
    # but each comparison is O(n) in the worst case (comparing two suffixes)
    # Total: O(n² log n)
    suffixes.sort()

    # Extract positions in sorted order
    suffix_array = [pos for (suffix, pos) in suffixes]
    return suffix_array

# Example usage
text = "banana"
sa = naive_suffix_array(text)
print(f"Suffix Array: {sa}")  # Output: [5, 3, 1, 0, 4, 2]

# Verify by printing sorted suffixes
for i, pos in enumerate(sa):
    print(f"SA[{i}] = {pos}: '{text[pos:]}'")
```

Complexity Analysis:
Why is comparison O(n)?
When comparing suffixes like "banana" and "anana", we compare character by character until we find a difference. In the worst case (e.g., comparing "aaaa...aab" with "aaaa...aaa"), we might compare almost all characters before finding the difference.
For a string of 1 million characters, naive construction requires roughly n² log₂ n ≈ 10^12 × 20 = 2 × 10^13 character comparisons. At 10^9 operations per second, that is about 20,000 seconds, or more than five hours for a single build. Clearly, we need something better.
The first optimization eliminates explicit suffix storage. Instead of creating substring copies, we store only starting positions and compare by reading from the original string.
Key Insight:
We don't need to store "banana", "anana", "nana", etc. as separate strings. We can represent each suffix by its starting index and read characters from the original string when comparing.
Suffix at position i = original_string[i:]
This reduces space from O(n²) to O(n), but time remains O(n² log n) because comparisons still take O(n) in the worst case.
```python
from functools import cmp_to_key

def space_efficient_suffix_array(s):
    """
    Build suffix array without storing explicit suffixes.
    Time: O(n² log n), Space: O(n)
    """
    n = len(s)

    # Store only indices
    positions = list(range(n))

    # Custom comparison function that reads from the original string
    def compare_suffixes(i, j):
        """Compare suffix starting at i with suffix starting at j"""
        while i < n and j < n:
            if s[i] < s[j]:
                return -1
            elif s[i] > s[j]:
                return 1
            i += 1
            j += 1
        # Shorter suffix is smaller
        if i == n and j == n:
            return 0
        return -1 if i == n else 1

    # Python 3 requires functools.cmp_to_key for comparator-based sorting
    positions.sort(key=cmp_to_key(compare_suffixes))

    return positions

# Verify it works
text = "banana"
sa = space_efficient_suffix_array(text)
print(f"Suffix Array: {sa}")  # [5, 3, 1, 0, 4, 2]
```

Progress So Far:
| Approach | Time | Space |
|---|---|---|
| Naive (explicit suffixes) | O(n² log n) | O(n²) |
| Naive (indices only) | O(n² log n) | O(n) |
We've solved the space problem, but the time bound is still O(n² log n). The bottleneck is that each comparison takes O(n) work. To break through, we need to make comparisons faster.
The breakthrough to O(n log n) construction comes from a beautiful observation: we can reuse earlier work to speed up later comparisons.
The Prefix Doubling Strategy (Manber-Myers Algorithm, 1990):
Instead of sorting suffixes all at once, we sort them incrementally: first by their first character, then by their first 2 characters, then 4, 8, and so on, doubling the prefix length each round.
After ⌈log₂ n⌉ iterations, we've effectively sorted by the entire suffix (since a suffix has at most n characters).
The Key Insight:
When sorting by 2k characters, we can use the results from sorting by k characters!
Here's why: the first 2k characters of the suffix starting at position i are:
first k characters of suffix(i) + first k characters of suffix(i + k)
For example, in "banana" with k = 2, the first 4 characters of suffix(1) = "anana" are "an" (from suffix(1)) followed by "an" (the first 2 characters of suffix(3) = "ana").
In other words, comparing two suffixes by their first 2k characters is equivalent to comparing their rank pairs: first compare the k-character ranks of suffix(i) and suffix(j), and break ties by comparing the k-character ranks of suffix(i + k) and suffix(j + k).
This transforms an O(n) comparison into an O(1) lookup!
By the end of iteration k (sorting by 2^k characters), we have a rank for each suffix based on its first 2^k characters. These ranks let us compare 2^(k+1) characters using just two rank lookups (O(1) time each). This is the key to achieving O(n log n) total time.
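To make this concrete, here is a minimal sketch of the constant-time comparison, assuming a `rank` array that holds each suffix's rank by its first k characters (the function name and the -1 convention for out-of-range positions are illustrative, not from a specific implementation):

```python
def compare_by_2k(rank, i, j, k, n):
    """Compare suffixes i and j by their first 2k characters in O(1)."""
    # Rank of the "second half"; positions past the end rank smallest
    left = (rank[i], rank[i + k] if i + k < n else -1)
    right = (rank[j], rank[j + k] if j + k < n else -1)
    return (left > right) - (left < right)  # -1, 0, or 1, like a comparator
```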
Let's rigorously analyze the prefix doubling algorithm:
Number of Iterations: the prefix length doubles each round (1, 2, 4, 8, ...), so after ⌈log₂ n⌉ = O(log n) iterations every suffix is ranked by its full length.
Work Per Iteration: each iteration sorts n rank pairs and then recomputes ranks in a single O(n) pass. The sort costs O(n log n) with a comparison sort, or O(n) with radix sort.
Total Time: O(log n) iterations × O(n log n) per iteration = O(n log² n) with comparison sorting; O(log n) × O(n) = O(n log n) with radix sort.
Achieving O(n log n) with Radix Sort:
The pairs we're sorting have the form (rank₁, rank₂) where each rank is in range [0, n-1]. We can sort these pairs in O(n) using two passes of stable counting sort: first by rank₂, then by rank₁. Each pass is O(n) because the keys are bounded by n.
This gives O(n) per iteration, and with O(log n) iterations, total time is O(n log n).
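As a sketch, one stable counting-sort pass might look like this (the helper name is ours; the full implementation later on this page inlines the same idea):

```python
def counting_sort_by(order, key, max_key):
    """Stably reorder `order` by key(i), keys in [-1, max_key]. O(n + max_key)."""
    cnt = [0] * (max_key + 2)            # shift keys by +1 so -1 fits at index 0
    for i in order:
        cnt[key(i) + 1] += 1
    start = [0] * (max_key + 2)          # first slot of each key's bucket
    for r in range(1, max_key + 2):
        start[r] = start[r - 1] + cnt[r - 1]
    result = [0] * len(order)
    for i in order:                      # forward scan preserves stability
        result[start[key(i) + 1]] = i
        start[key(i) + 1] += 1
    return result
```

Calling it twice, first keyed on rank₂ and then on rank₁, yields the full (rank₁, rank₂) order, because the stable second pass never reorders elements that tie on rank₁.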
Space Complexity: O(n). We keep only the rank array, the suffix array, and one temporary array, each of length n.
| Algorithm | Time | Space | Key Idea |
|---|---|---|---|
| Naive | O(n² log n) | O(n) or O(n²) | Direct suffix sorting |
| Prefix Doubling (comparison sort) | O(n log² n) | O(n) | Reuse ranks, double prefix length |
| Prefix Doubling (radix sort) | O(n log n) | O(n) | Radix sort on rank pairs |
In practice, the O(n log² n) version with standard sorting is often competitive due to cache efficiency and simpler code. The O(n log n) radix sort version has lower asymptotic complexity but may be slower for moderate n due to constant factors. Profile before optimizing.
Can we do better than O(n log n)? Surprisingly, yes! Several O(n) algorithms exist for suffix array construction. We'll focus on the conceptual understanding of the DC3 algorithm (also known as the Skew algorithm), introduced by Kärkkäinen and Sanders in 2003.
The Core Idea: Divide and Conquer
The DC3 algorithm cleverly divides suffixes into groups based on their starting position modulo 3: Group 0 contains suffixes starting at positions with i mod 3 = 0, Group 1 those with i mod 3 = 1, and Group 2 those with i mod 3 = 2.
Groups 1 and 2 together contain ⌈2n/3⌉ suffixes (about two-thirds of all suffixes).
Algorithm Outline:
Step 1: Recursively sort suffixes in Groups 1 and 2
Encode the ~2n/3 suffixes from groups 1 and 2 as triples of characters, creating a new string of length ~2n/3. Recursively compute the suffix array of this new string.
Why triples? Because looking at 3 characters starting at position i (where i mod 3 ≠ 0) gives us enough information to create a unique new 'character' for the recursive problem.
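Here is a minimal sketch of that setup, assuming the string is padded with sentinel characters smaller than any real one (function and variable names are illustrative):

```python
def dc3_sample_setup(s):
    """Collect the group 1/2 positions and their character triples."""
    n = len(s)
    pad = s + '\0\0'                     # sentinels so triples never overrun
    # Group 1 positions first, then group 2 (the order DC3 uses
    # when concatenating them into the reduced string)
    sample = [i for i in range(n) if i % 3 == 1] + \
             [i for i in range(n) if i % 3 == 2]
    # Each triple acts as one "character" of the reduced string whose
    # suffix array DC3 computes recursively
    triples = [pad[i:i + 3] for i in sample]
    return sample, triples
```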
Step 2: Sort Group 0 suffixes using results from Step 1
For each suffix in group 0 (starting at position i where i mod 3 = 0), we can compare it with any suffix from groups 1 or 2 in O(1) time using at most two characters plus a rank from Step 1: against a group 1 suffix, compare (s[i], rank of suffix(i+1)); against a group 2 suffix, compare (s[i], s[i+1], rank of suffix(i+2)). In both cases the fallback suffixes lie in groups 1 or 2, so their ranks are already known (see the sketch below).
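A minimal sketch of those two comparisons, assuming `rank[p]` gives the order of every group 1/2 suffix from the recursive call and the string is sentinel-padded so the indexing never overruns (names are illustrative):

```python
def less_g0_g1(s, rank, i, j):
    """suffix(i) < suffix(j), where i % 3 == 0 and j % 3 == 1."""
    # One character, then suffixes i+1 (group 1) and j+1 (group 2),
    # both of which already have ranks
    return (s[i], rank[i + 1]) < (s[j], rank[j + 1])

def less_g0_g2(s, rank, i, j):
    """suffix(i) < suffix(j), where i % 3 == 0 and j % 3 == 2."""
    # Two characters, then suffixes i+2 (group 2) and j+2 (group 1)
    return (s[i], s[i + 1], rank[i + 2]) < (s[j], s[j + 1], rank[j + 2])
```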
Step 3: Merge the sorted groups
Merge the sorted group 0 suffixes with the sorted groups 1+2 suffixes. Since we can compare any two suffixes in O(1) using the recursive result, merging takes O(n) time.
Why is this O(n)?
Let T(n) be the time to process a string of length n.
Recurrence: T(n) = T(2n/3) + O(n)
By the Master Theorem: T(n) = O(n)
The key insight is that the recursive subproblem shrinks by a constant factor (2/3 < 1), so the total work forms a geometrically decreasing series: n + (2/3)n + (2/3)²n + ⋯ ≤ 3n = O(n).
The choice of modulo 3 is deliberate. It ensures that: (1) Groups 1+2 contain only 2/3 of suffixes, making recursion decrease; (2) We can compare a group 0 suffix with groups 1/2 suffixes in O(1) by looking at just 1 or 2 characters plus a recursive comparison. The DC3 name comes from 'Difference Cover' of size 3: {1, 2} covers all differences mod 3.
Another O(n) algorithm, often preferred in practice, is the SA-IS algorithm (Suffix Array by Induced Sorting) by Nong, Zhang, and Chan (2009). It typically outperforms DC3 due to better cache behavior and simpler operations.
Key Concepts:
1. S-type and L-type Suffixes:
Classify each suffix as: S-type if suffix(i) is lexicographically smaller than suffix(i+1), or L-type if it is larger. When s[i] = s[i+1], position i inherits the type of position i+1 (the suffixes compare exactly as their tails do).
The last suffix (length 1, the sentinel) is S-type by convention.
2. LMS (Leftmost S-type) Suffixes:
An LMS suffix starts at position i if: position i is S-type, and position i−1 is L-type (which requires i > 0).
LMS suffixes are the 'boundary' points where the type changes from L to S.
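To ground these definitions, here is a minimal sketch of the classification and LMS scan, assuming the input already ends with a unique smallest sentinel (the function name is illustrative):

```python
def classify_and_find_lms(s):
    """Mark each position S-type/L-type, then collect LMS positions."""
    n = len(s)
    is_s = [False] * n
    is_s[n - 1] = True                      # sentinel is S-type by convention
    for i in range(n - 2, -1, -1):          # single right-to-left scan: O(n)
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])
    # LMS positions: S-type immediately preceded by an L-type
    lms = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    return is_s, lms
```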
Algorithm Overview:
Step 1: Classify all positions as S-type or L-type (O(n) scan from right to left)
Step 2: Identify LMS positions (positions where L-type is immediately followed by S-type)
Step 3: Place LMS suffixes into buckets and induce sort
Step 4: Recursively sort LMS substrings if needed
If there are collisions (same bucket, same LMS substring), create a reduced problem and recurse. The number of LMS positions is at most n/2, so recursion depth is O(log n), but total work remains O(n).
Step 5: Final induced sort using correct LMS order
Why SA-IS is Practical: its passes are simple left-to-right and right-to-left scans with excellent cache behavior, the recursive subproblem never exceeds n/2 elements, and its constant factors are small.
SA-IS is widely used in bioinformatics (e.g., the BWA-MEM DNA aligner) and compression tools.
The 'induced sorting' idea is powerful: once we know the correct positions of some suffixes (the LMS ones), we can deduce the positions of all others. Because suffix(i) > suffix(i+1) whenever suffix(i) is L-type, a left-to-right scan that has already placed suffix(i+1) can immediately drop suffix(i) into the next free slot at the front of its character's bucket; a symmetric scan handles S-type suffixes. This propagation of information is what makes the algorithm linear.
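As a sketch of one induction pass (the L-type pass), assume `sa` has been pre-seeded with the sorted LMS suffixes at the tails of their buckets and -1 everywhere else, and `is_s` comes from the classification sketch above; a symmetric right-to-left pass then induces the S-type suffixes:

```python
def induce_l_pass(s, sa, is_s):
    """Left-to-right scan placing each L-type suffix at its bucket head."""
    n = len(s)
    # Compute the head (first free slot) of each character's bucket
    cnt = {}
    for c in s:
        cnt[c] = cnt.get(c, 0) + 1
    head, total = {}, 0
    for c in sorted(cnt):
        head[c] = total
        total += cnt[c]
    for i in range(n):
        j = sa[i] - 1                      # predecessor suffix, if any
        if sa[i] > 0 and not is_s[j]:      # only L-type predecessors
            sa[head[s[j]]] = j             # place right after the suffixes
            head[s[j]] += 1                # already in its bucket
```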
With multiple algorithms available, how do you choose? Here's a decision framework:
For Learning and Understanding: implement the naive approach first, then prefix doubling; the progression builds the right intuitions.
For Production Code: use a library implementation of SA-IS or another O(n) algorithm rather than writing your own.
For Competitive Programming: prefix doubling offers the best balance of coding speed and runtime.
| Scenario | Recommended Algorithm | Reasoning |
|---|---|---|
| Learning/Education | Naive → Prefix Doubling | Build conceptual understanding progressively |
| Quick prototype | Prefix Doubling (comparison sort) | Simple to implement, O(n log² n) is fine for development |
| Medium-scale production (n ≤ 10^6) | Prefix Doubling (radix sort) | O(n log n), straightforward, reliable |
| Large-scale production (n > 10^6) | SA-IS (library) | O(n), cache-friendly, battle-tested |
| Memory-constrained environment | SA-IS (in-place variant) | O(n) time with minimal extra space |
| Competitive programming | Prefix Doubling | Easy to code quickly, adequate performance |
In practice, use battle-tested libraries: C++ has the 'divsufsort' library, Python has 'pysais', and many bioinformatics tools include SA construction. Don't reinvent the wheel for production unless you have specific requirements.
Let's implement the O(n log n) prefix doubling algorithm with counting sort for a concrete reference:
```python
def build_suffix_array(s):
    """
    Build suffix array using prefix doubling with counting sort.
    Time: O(n log n), Space: O(n)
    """
    s = s + '$'  # Append sentinel (smallest character)
    n = len(s)

    # Initial ranks based on character values
    rank = [ord(c) for c in s]
    sa = list(range(n))
    tmp = [0] * n
    k = 1  # Current comparison length

    while k < n:
        # Positions where i + k >= n are treated as rank -1 (smallest)
        def get_second_rank(i):
            return rank[i + k] if i + k < n else -1

        # --- Stable counting sort by second rank (rank[i + k]) ---
        cnt = {}
        for i in range(n):
            r = get_second_rank(i)
            cnt[r] = cnt.get(r, 0) + 1

        # Starting position of each rank's bucket
        pos = {}
        p = 0
        for key in sorted(cnt):
            pos[key] = p
            p += cnt[key]

        # Place into temporary array; scanning in reverse and filling each
        # bucket from its end keeps the sort stable
        temp_sa = [0] * n
        for i in range(n - 1, -1, -1):
            r = get_second_rank(sa[i])
            cnt[r] -= 1
            temp_sa[pos[r] + cnt[r]] = sa[i]

        # --- Stable counting sort by first rank (rank[i]) ---
        cnt = {}
        for i in range(n):
            cnt[rank[i]] = cnt.get(rank[i], 0) + 1

        pos = {}
        p = 0
        for key in sorted(cnt):
            pos[key] = p
            p += cnt[key]

        for i in range(n - 1, -1, -1):
            r = rank[temp_sa[i]]
            cnt[r] -= 1
            sa[pos[r] + cnt[r]] = temp_sa[i]

        # Recompute ranks based on sorted order
        tmp[sa[0]] = 0
        for i in range(1, n):
            prev, curr = sa[i - 1], sa[i]
            prev_pair = (rank[prev], get_second_rank(prev))
            curr_pair = (rank[curr], get_second_rank(curr))
            tmp[curr] = tmp[prev] + (1 if curr_pair != prev_pair else 0)
        rank = tmp[:]

        # Early termination if all ranks are unique
        if rank[sa[n - 1]] == n - 1:
            break

        k *= 2

    # Remove the sentinel position from the result
    return [x for x in sa if x < n - 1]

# Test
text = "banana"
sa = build_suffix_array(text)
print(f"Suffix Array: {sa}")
for i, pos in enumerate(sa):
    print(f"SA[{i}] = {pos}: '{text[pos:]}'")
```

This implementation is educational, prioritizing clarity over maximum efficiency. Production implementations use plain arrays instead of dictionaries, operate on integer alphabets throughout, and may use pointer arrays for better cache performance. For competitive programming or production, consider specialized libraries.
We've journeyed from naive construction to sophisticated O(n) algorithms. Here are the key takeaways: naive sorting costs O(n² log n) because every comparison may scan O(n) characters; prefix doubling reuses ranks to make comparisons O(1), reaching O(n log n); DC3 and SA-IS achieve O(n) via divide-and-conquer and induced sorting respectively; and for production work, a well-tested library is usually the right choice.
What's Next:
With suffix arrays constructed, we can now explore their applications. The next page covers two fundamental use cases: pattern searching with binary search, and the LCP array that enables even more powerful queries.
You now understand how suffix arrays are constructed, from intuitive O(n² log n) methods to sophisticated O(n) algorithms. The key insight—reusing computed rankings to speed up comparisons—is a powerful algorithmic technique applicable beyond suffix arrays. Next, we'll see these structures in action for pattern matching and LCP queries.