We've learned what suffix arrays are and how to construct them. Now comes the payoff: what can we actually do with them?
Suffix arrays unlock a remarkable range of applications in text processing. In this page, we'll explore two foundational capabilities: fast pattern matching via binary search, and the LCP (Longest Common Prefix) array, which powers everything from repeated-substring detection to distinct-substring counting.
These applications form the foundation for building search engines, bioinformatics tools, plagiarism detectors, and data compression algorithms. Understanding them transforms suffix arrays from an academic curiosity into a practical engineering tool.
By the end of this page, you will understand:
1. How to search for patterns using binary search on a suffix array
2. What the LCP (Longest Common Prefix) array is and how to construct it efficiently
3. How the LCP array accelerates pattern matching to O(m + log n)
4. How to solve advanced problems like finding the longest repeated substring
The most fundamental application of suffix arrays is pattern matching. Recall the key insight: suffixes starting with a given pattern are contiguous in the sorted suffix array.
This means finding all occurrences of pattern P in text T reduces to finding the range [left, right] in the suffix array where all suffixes begin with P.
Algorithm:
1. Binary search to find the LEFTMOST suffix that starts with P (or is ≥ P lexicographically)
2. Binary search to find the RIGHTMOST suffix that starts with P (the last suffix whose first |P| characters equal P)
3. All positions SA[left..right] are occurrences of P in T
4. Number of occurrences = right - left + 1
```python
def pattern_search(text, sa, pattern):
    """
    Find all occurrences of pattern in text using suffix array.
    Returns list of starting positions.
    Time: O(m log n + k) where m = |pattern|, n = |text|, k = #occurrences
    """
    n = len(text)
    m = len(pattern)

    def compare(suffix_start, p):
        """Compare suffix at position suffix_start with pattern p"""
        i, j = suffix_start, 0
        while i < n and j < m:
            if text[i] < p[j]:
                return -1  # suffix < pattern
            elif text[i] > p[j]:
                return 1   # suffix > pattern
            i += 1
            j += 1
        # If we've matched all of pattern, suffix starts with pattern
        if j == m:
            return 0
        # Pattern is longer than the remaining suffix: suffix < pattern
        return -1

    # Find left boundary (first suffix >= pattern)
    left = 0
    right = n - 1
    result_left = n  # Initialize to "not found"
    while left <= right:
        mid = (left + right) // 2
        cmp = compare(sa[mid], pattern)
        if cmp >= 0:  # suffix >= pattern
            result_left = mid
            right = mid - 1
        else:
            left = mid + 1

    # Check if we actually found a match
    if result_left == n or compare(sa[result_left], pattern) != 0:
        return []  # Pattern not found

    # Find right boundary (last suffix that starts with pattern)
    left = result_left
    right = n - 1
    result_right = result_left
    while left <= right:
        mid = (left + right) // 2
        cmp = compare(sa[mid], pattern)
        if cmp == 0:  # suffix starts with pattern
            result_right = mid
            left = mid + 1
        else:
            right = mid - 1

    # Return all positions in the range
    return [sa[i] for i in range(result_left, result_right + 1)]


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]  # Precomputed suffix array

print(pattern_search(text, sa, "ana"))  # [3, 1]
print(pattern_search(text, sa, "na"))   # [4, 2]
print(pattern_search(text, sa, "ban"))  # [0]
print(pattern_search(text, sa, "xyz"))  # []
```
Complexity Analysis:
Building the suffix array costs O(n log n) once. Each query then costs O(m log n + k): two binary searches of O(log n) probes each, with every probe comparing up to m characters, plus O(k) to report the k occurrences.
Compare this to naive search (O(nm)) or KMP (O(n + m) but for single pattern). With suffix arrays, we pay O(n log n) upfront to build, then answer unlimited queries in O(m log n + k) each.
The real power shows with multiple queries. If you need to search for Q different patterns in the same text, naive costs O(Q×n×m), KMP costs O(Q×(n+m)), but suffix arrays cost O(n log n) once + O(Q × m log n) for queries. For Q >> 1, suffix arrays win decisively.
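To make this concrete, here is a back-of-the-envelope operation count for one illustrative workload (the numbers n, m, and Q below are assumptions for illustration; constants are ignored, so treat these as orders of magnitude, not benchmarks):

```python
import math

# Illustrative (assumed) workload parameters
n = 10**6   # text length
m = 10      # pattern length
Q = 10**4   # number of queries

naive_ops = Q * n * m    # O(Q * n * m)
kmp_ops = Q * (n + m)    # O(Q * (n + m))
# Suffix array: one-time build plus per-query binary search
sa_ops = n * math.log2(n) + Q * m * math.log2(n)

print(f"naive: {naive_ops:.1e}")  # ~1e11
print(f"KMP:   {kmp_ops:.1e}")    # ~1e10
print(f"SA:    {sa_ops:.1e}")     # ~2e7
```

Even with the one-time build included, the suffix array approach is roughly three orders of magnitude cheaper on this workload.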
The LCP (Longest Common Prefix) array is a companion structure to the suffix array that dramatically increases its power.
Definition:
For a suffix array SA of a string S:
LCP[i] = length of the longest common prefix between
suffix SA[i-1] and suffix SA[i]
In other words, LCP[i] tells us how many leading characters the i-th and (i-1)-th suffixes (in sorted order) share.
By convention, LCP[0] is undefined or set to 0 (there's no previous suffix).
Visual Representation:
SA Index │ SA[i] │ Suffix │ LCP[i]
─────────┼───────┼──────────┼───────
0 │ 5 │ a │ 0 (no predecessor)
1 │ 3 │ ana │ 1 (shares 'a' with prev)
2 │ 1 │ anana │ 3 (shares 'ana' with prev)
3 │ 0 │ banana │ 0 (shares nothing with prev)
4 │ 4 │ na │ 0 (shares nothing with prev)
5 │ 2 │ nana │ 2 (shares 'na' with prev)
Notice how LCP values are high when consecutive suffixes share prefixes (like "ana" and "anana") and zero when they start with different characters (like "anana" and "banana").
The LCP array only stores common prefix lengths between adjacent suffixes in sorted order. This might seem limited, but it's sufficient! The LCP between any two suffixes can be computed as the minimum LCP value in the range between them—a range minimum query (RMQ). With O(n) preprocessing, RMQ answers in O(1), so we get O(1) LCP for any pair.
The naive approach to building the LCP array compares adjacent suffixes character by character—O(n) per pair, O(n²) total. But we can do much better with Kasai's algorithm (2001), which builds the LCP array in O(n) time.
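For reference, the naive construction just described can be written directly from the definition (the function name is ours):

```python
def lcp_naive(text, sa):
    """LCP array straight from the definition: compare each adjacent
    pair of sorted suffixes character by character. O(n) per pair,
    O(n^2) total in the worst case."""
    n = len(text)
    lcp = [0] * n
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        lcp[i] = k
    return lcp


print(lcp_naive("banana", [5, 3, 1, 0, 4, 2]))  # [0, 1, 3, 0, 0, 2]
```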
Kasai's Algorithm - Key Insight:
Suppose we know that suffix(i) shares k characters with its predecessor in sorted order. What can we say about suffix(i+1)?
Crucially: If suffix(i) shares k ≥ 1 characters with its predecessor, then suffix(i+1) shares at least k-1 characters with ITS predecessor (in sorted order).
Why? If suffix(i) = "aXYZ..." has LCP k with its sorted predecessor suffix(j) = "aXYZ'...", then stripping the first character gives suffix(i+1) = "XYZ..." and suffix(j+1) = "XYZ'...", which still share k-1 characters. Since suffix(j+1) still precedes suffix(i+1) in sorted order, the immediate predecessor of suffix(i+1) lies between them and therefore also shares at least those k-1 characters.
This means LCP values can decrease by at most 1 as we process suffixes in text order. They can increase by any amount, but decreases are bounded.
```python
def build_lcp_array(text, sa):
    """
    Build LCP array using Kasai's algorithm.
    Time: O(n), Space: O(n)

    Args:
        text: The original string
        sa: The suffix array for text

    Returns:
        lcp: LCP array where lcp[i] = LCP of suffix sa[i-1] and sa[i]
    """
    n = len(text)
    lcp = [0] * n

    # Build inverse suffix array (rank array)
    # rank[i] = position of suffix(i) in the sorted suffix array
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i

    # k tracks the current LCP length
    k = 0

    # Process suffixes in text order (not sorted order)
    for i in range(n):
        if rank[i] == 0:
            # First suffix in sorted order has no predecessor
            k = 0
            continue

        # j is the text position of the previous suffix in sorted order
        j = sa[rank[i] - 1]

        # Count matching characters starting from current k
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1

        # Store the LCP value
        lcp[rank[i]] = k

        # Decrease k for next iteration (key insight!)
        if k > 0:
            k -= 1

    return lcp


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = build_lcp_array(text, sa)

print(f"SA:  {sa}")
print(f"LCP: {lcp}")
# Output:
# SA:  [5, 3, 1, 0, 4, 2]
# LCP: [0, 1, 3, 0, 0, 2]

# Verify
for i in range(len(sa)):
    suffix = text[sa[i]:]
    print(f"SA[{i}]={sa[i]}: '{suffix}' (LCP with prev: {lcp[i]})")
```
Complexity Analysis of Kasai's Algorithm:
The variable k never exceeds n, increases only inside the while loop, and decreases by at most 1 per outer iteration. Since it decreases at most n times and is bounded by n, it can increase at most 2n times in total—so all the while-loop work across the entire run is O(n).
This elegant analysis uses the "amortized" argument: the while loop might do more work in some iterations, but the total work across all iterations is bounded.
Kasai's algorithm processes suffixes in TEXT order (position 0, 1, 2, ...) not sorted order. This is counterintuitive—we're building an array indexed by sorted order—but it's the key to efficiency. Processing in text order lets us exploit the relationship between suffix(i) and suffix(i+1).
With the LCP array, we can accelerate pattern matching from O(m log n) to O(m + log n). This improvement matters for long patterns.
The Problem with Basic Binary Search:
In standard binary search, each comparison with the middle suffix starts from scratch—we compare the pattern with the suffix character by character from the beginning. If we've already compared prefix 'abc' in a previous iteration, we might redundantly compare 'abc' again.
The Optimization:
We maintain two extra values during the binary search: lcp_left, the LCP of the pattern with the suffix at the current left boundary, and lcp_right, the LCP of the pattern with the suffix at the current right boundary.
When we compare with the middle suffix, we can skip the first min(lcp_left, lcp_right) characters—they're guaranteed to match!
How It Works:
Suppose we know lcp_left = LCP(P, left-boundary suffix) and lcp_right = LCP(P, right-boundary suffix), with lcp_left ≥ lcp_right (the symmetric case mirrors this).
For the middle suffix, we need to find LCP(middle suffix, P). Using the LCP array and RMQ, we can compute LCP(left-boundary suffix, middle suffix) in O(1). If that value differs from lcp_left, the comparison is decided without examining a single character; if it equals lcp_left, we resume comparing characters at position lcp_left, never re-reading characters we have already matched.
Result: Each binary search step does O(1) work plus some new character comparisons. Across all steps, we compare each pattern character at most O(log n) times total, giving O(m + log n).
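As a sketch of the character-skipping idea (without the RMQ machinery, so the worst case remains O(m log n); the helper names here are ours), a left-boundary search might look like this:

```python
def match_len(text, start, pattern, skip):
    """Length of the common prefix of text[start:] and pattern,
    given that the first `skip` characters are known to match."""
    k = skip
    while start + k < len(text) and k < len(pattern) and text[start + k] == pattern[k]:
        k += 1
    return k


def find_first_geq(text, sa, pattern):
    """Index of the first suffix in sa that is >= pattern.
    Each probe skips min(lcp_left, lcp_right) known-equal characters:
    the middle suffix lies lexicographically between the two boundary
    suffixes, so it must agree with the pattern on at least that many."""
    lo, hi = 0, len(sa)
    lcp_left = lcp_right = 0  # LCP(pattern, suffix just outside each end)
    while lo < hi:
        mid = (lo + hi) // 2
        k = match_len(text, sa[mid], pattern, min(lcp_left, lcp_right))
        suffix_exhausted = sa[mid] + k == len(text)
        if k < len(pattern) and (suffix_exhausted or text[sa[mid] + k] < pattern[k]):
            lo = mid + 1       # middle suffix < pattern: search right half
            lcp_left = k
        else:
            hi = mid           # middle suffix >= pattern: search left half
            lcp_right = k
    return lo


# Example on the familiar "banana" arrays
sa = [5, 3, 1, 0, 4, 2]
print(find_first_geq("banana", sa, "ana"))  # 1
```

The same skip applies symmetrically to the right-boundary search.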
The O(m + log n) improvement matters when m (pattern length) is comparable to or larger than log n. For a 1GB text (n = 10^9, log n ≈ 30) and a 1KB pattern (m = 1000), basic search does ~30,000 character comparisons while optimized search does ~1,030. For short patterns (m < log n), the improvement is negligible, and basic O(m log n) search is fine.
One of the most elegant applications of the LCP array is finding the longest repeated substring—the longest substring that appears at least twice in the text.
Key Insight:
The longest repeated substring is a common prefix between two different suffixes. Since the LCP array stores common prefix lengths between adjacent suffixes in sorted order, and adjacent suffixes have the maximum common prefix among 'similar' suffixes:
The longest repeated substring has length = max(LCP[i]) for all i
To find the actual substring, find the index i where LCP[i] is maximum, then extract the first LCP[i] characters of suffix SA[i].
```python
def longest_repeated_substring(text, sa, lcp):
    """
    Find the longest substring that appears at least twice.
    Time: O(n) given SA and LCP
    """
    n = len(text)
    if n == 0:
        return ""

    # Find maximum LCP value
    max_lcp = 0
    max_idx = 0
    for i in range(n):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = i

    if max_lcp == 0:
        return ""  # No repeated substring

    # The longest repeated substring starts at SA[max_idx]
    # and has length max_lcp
    start = sa[max_idx]
    return text[start:start + max_lcp]


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]

result = longest_repeated_substring(text, sa, lcp)
print(f"Longest repeated substring: '{result}'")  # 'ana'

# Verify: 'ana' appears at positions 1 and 3
print(f"Occurrences: {text.find('ana')}, {text.rfind('ana')}")  # 1, 3

# More complex example
text2 = "abracadabra"
# SA and LCP would need to be computed...
# The answer is "abra", which appears at positions 0 and 7
```
Why Does Maximum LCP Work?
Let's reason through this: any repeated substring is, by definition, a common prefix of two distinct suffixes. The LCP of any two suffixes equals the minimum LCP value over the range between them in sorted order, which can never exceed a single adjacent LCP value in that range. So no pair of suffixes can share a longer prefix than the maximum adjacent LCP value—and that maximum is realized by an actual adjacent pair, so it corresponds to a genuine repeated substring.
Complexity: a single O(n) scan of the LCP array, given SA and LCP (themselves buildable in O(n log n) and O(n) respectively).
Compare this to the naive O(n²) or O(n³) approaches!
This technique extends to many variations: finding the longest substring appearing at least k times (aggregate k consecutive LCP values), longest common substring of two strings (concatenate with separator, then max LCP between suffixes from different strings), and detecting plagiarism (common substrings above a threshold).
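For instance, the "at least k times" variation can be sketched as a sliding-window minimum over k-1 consecutive LCP entries (the function name is ours; O(n·k) is shown for clarity, and a monotonic deque brings it to O(n)):

```python
def longest_substring_k_times(text, sa, lcp, k):
    """Longest substring appearing at least k times (k >= 2).
    Suffixes sharing a prefix occupy consecutive slots in the suffix
    array, so k occurrences correspond to a window of k consecutive
    suffixes, whose shared prefix is the minimum of the k-1 LCP
    entries between them."""
    n = len(text)
    best_len, best_start = 0, 0
    for i in range(1, n - k + 2):
        # Suffixes SA[i-1 .. i+k-2] all share min(lcp[i .. i+k-2]) chars
        window_min = min(lcp[i:i + k - 1])
        if window_min > best_len:
            best_len = window_min
            best_start = sa[i]
    return text[best_start:best_start + best_len]


# "banana": "ana" appears twice, "a" three times
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(longest_substring_k_times("banana", sa, lcp, 2))  # ana
print(longest_substring_k_times("banana", sa, lcp, 3))  # a
```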
Another elegant application: counting the number of distinct substrings in a string.
Naive Approach:
Generate all O(n²) substrings, put them in a hash set, count unique ones. This takes O(n²) time and space—impractical for large strings.
Suffix Array + LCP Approach:
Every substring is a prefix of some suffix. Consider the suffix starting at SA[i], which has length (n - SA[i]): it contributes one substring per prefix length, but its first LCP[i] prefixes already appeared as prefixes of the previous suffix in sorted order. So it contributes exactly (n - SA[i]) - LCP[i] new substrings.
Formula:
Distinct substrings = Σ (n - SA[i] - LCP[i]) for i = 0 to n-1
= Σ (n - SA[i]) - Σ LCP[i]
= n(n+1)/2 - Σ LCP[i]
```python
def count_distinct_substrings(text, sa, lcp):
    """
    Count the number of distinct substrings in text.
    Time: O(n) given SA and LCP

    Formula: Total possible - duplicates = n(n+1)/2 - sum(LCP)
    """
    n = len(text)

    # Total possible substrings (all prefix lengths of all suffixes)
    total = n * (n + 1) // 2

    # Substrings shared with previous suffix in sorted order
    duplicates = sum(lcp)

    return total - duplicates


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]

count = count_distinct_substrings(text, sa, lcp)
print(f"Distinct substrings: {count}")  # 15

# Verify by enumeration
substrings = set()
for i in range(len(text)):
    for j in range(i + 1, len(text) + 1):
        substrings.add(text[i:j])
print(f"By enumeration: {len(substrings)}")  # 15

print("Substrings:", sorted(substrings))
# ['a', 'an', 'ana', 'anan', 'anana', 'b', 'ba', 'ban', 'bana', 'banan',
#  'banana', 'n', 'na', 'nan', 'nana']
```
Understanding the Formula:
| Suffix | Length | Contributes | LCP with prev | Net new |
|---|---|---|---|---|
| a | 1 | 1 | 0 | 1 |
| ana | 3 | 3 | 1 | 2 |
| anana | 5 | 5 | 3 | 2 |
| banana | 6 | 6 | 0 | 6 |
| na | 2 | 2 | 0 | 2 |
| nana | 4 | 4 | 2 | 2 |
Total: 1 + 2 + 2 + 6 + 2 + 2 = 15
Or using the formula: 6·7/2 − (0 + 1 + 3 + 0 + 0 + 2) = 21 − 6 = 15.
The number of distinct substrings provides insight into text complexity and compressibility. High repetition (fewer distinct substrings) means better compression potential. This is why suffix arrays and the Burrows-Wheeler Transform (BWT) are fundamental to compression algorithms.
Given two strings A and B, find their longest common substring—a classic problem with important applications in bioinformatics (comparing DNA sequences), plagiarism detection, and diff algorithms.
Suffix Array Approach:
1. Concatenate: S = A + '$' + B, where '$' is a separator appearing in neither string
2. Build the suffix array and LCP array of S
3. Scan adjacent pairs in the suffix array; the answer is the maximum LCP[i] where suffixes SA[i-1] and SA[i] come from different original strings
Why This Works:
Any common substring of A and B is a common prefix of some suffix from A's half and some suffix from B's half of S. The longest such common prefix is always achieved by a pair that is adjacent in sorted order (every suffix between them shares it too), and the unique separator guarantees no match can extend across the boundary between A and B.
```python
def longest_common_substring(a, b):
    """
    Find the longest common substring of strings a and b.
    Time: O((|a| + |b|) log (|a| + |b|)) for SA construction
    """
    # Concatenate with separator
    separator = '$'  # Must not appear in a or b
    s = a + separator + b
    n = len(s)
    m = len(a)  # Position m is the separator

    # Build suffix array (using naive construction for illustration)
    # In practice, use O(n log n) or O(n) construction
    suffixes = [(s[i:], i) for i in range(n)]
    suffixes.sort()
    sa = [pos for (suffix, pos) in suffixes]

    # Build LCP array (Kasai's algorithm)
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i
    lcp = [0] * n
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1

    # Find max LCP where adjacent suffixes come from different strings
    max_len = 0
    result_pos = -1
    for i in range(1, n):
        pos1 = sa[i - 1]
        pos2 = sa[i]
        # Check if they come from different strings
        from_a1 = pos1 < m
        from_a2 = pos2 < m
        if from_a1 != from_a2:  # Different strings
            if lcp[i] > max_len:
                max_len = lcp[i]
                result_pos = min(pos1, pos2)  # Position in original string

    if max_len == 0:
        return ""

    # Extract the substring (from string a, since result_pos < m)
    return s[result_pos:result_pos + max_len]


# Example
a = "abcdefg"
b = "xyzabcpqr"
result = longest_common_substring(a, b)
print(f"Longest common substring: '{result}'")  # 'abc'

# Another example
a2 = "programming"
b2 = "programmer"
result2 = longest_common_substring(a2, b2)
print(f"Longest common substring: '{result2}'")  # 'programm'
```
Complexity:
With a proper suffix array construction this is O(n log n) overall for n = |A| + |B|; the naive sort shown above is O(n² log n) in the worst case but keeps the example self-contained.
Compare to naive O(|A| × |B| × min(|A|, |B|)) or DP O(|A| × |B|)—suffix arrays are dramatically faster for long strings.
This approach extends to finding the longest common substring among K strings: concatenate all with unique separators, build SA and LCP, then find the maximum LCP where the suffixes come from all K different strings. This requires slightly more bookkeeping but remains efficient.
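A sketch of that K-string extension (the function name and the control-character separators are our assumptions; inputs are assumed not to contain those characters, and the naive SA construction is again for illustration only):

```python
from collections import defaultdict


def lcs_k_strings(strings):
    """Longest common substring of every string in `strings`.
    Join with unique separators, build SA + LCP, then slide a window
    over the suffix array until it covers suffixes from all the
    strings, tracking the window's minimum LCP."""
    k = len(strings)
    seps = [chr(1 + i) for i in range(k)]  # assumed unique separators
    s = "".join(t + sep for t, sep in zip(strings, seps))
    n = len(s)

    # owner[p] = index of the string that position p belongs to
    owner = []
    for idx, t in enumerate(strings):
        owner.extend([idx] * (len(t) + 1))

    # Naive suffix array + Kasai LCP (use a real construction in practice)
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for pos, suf in enumerate(sa):
        rank[suf] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h:
            h -= 1

    # Sliding window over SA that covers suffixes from all k strings
    count = defaultdict(int)
    covered = 0
    best_len, best_pos = 0, 0
    lo = 0
    for hi in range(n):
        count[owner[sa[hi]]] += 1
        if count[owner[sa[hi]]] == 1:
            covered += 1
        while covered == k:
            if hi > lo:
                w = min(lcp[lo + 1:hi + 1])  # shared prefix of the window
                if w > best_len:
                    best_len, best_pos = w, sa[lo]
            count[owner[sa[lo]]] -= 1
            if count[owner[sa[lo]]] == 0:
                covered -= 1
            lo += 1
    return s[best_pos:best_pos + best_len]


print(lcs_k_strings(["abcde", "xbcdy", "zbcdw"]))  # bcd
```

The window minimum is recomputed with min() for clarity; a monotonic deque would make the whole scan linear.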
Many suffix array applications need to answer: "What is the LCP of suffix(i) and suffix(j)?" where i and j are not adjacent in the suffix array.
Key Observation:
The LCP of two suffixes equals the minimum of all LCP values between them in sorted order:
LCP(suffix SA[i], suffix SA[j]) = min(LCP[i+1], LCP[i+2], ..., LCP[j]) for i < j
This is because as we go from SA[i] to SA[j], each step can only decrease the common prefix (LCP[k] indicates where the k-th suffix diverges from the (k-1)-th).
This transforms LCP queries into Range Minimum Queries (RMQ)!
Efficient RMQ:
With the right preprocessing, RMQ can be answered in O(1): a sparse table gives O(n log n) preprocessing with O(1) queries, and more advanced schemes (e.g., the Fischer–Heun structure based on Cartesian trees) achieve O(n) preprocessing.
Applications: comparing any two substrings in O(1), answering longest-common-extension queries between arbitrary suffixes, and emulating many suffix-tree algorithms directly on the suffix array.
Example Structure:
Suffix Array SA = [5, 3, 1, 0, 4, 2]
LCP Array LCP = [0, 1, 3, 0, 0, 2]
Rank Array Rank = [3, 2, 5, 1, 4, 0]
Query: LCP(suffix at position 3, suffix at position 5)?
- Rank[3] = 1, Rank[5] = 0
- Need min LCP[1..1] (since Rank[5] < Rank[3])
- = LCP[1] = 1
- Verify: suffix(3) = "ana", suffix(5) = "a", LCP = 1 ✓
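Using the example arrays above, a minimal sparse-table sketch (function names are ours) makes these O(1) pair queries concrete:

```python
def build_sparse_table(lcp):
    """Sparse table over the LCP array: O(n log n) preprocessing,
    O(1) range-minimum queries."""
    n = len(lcp)
    table = [lcp[:]]
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table


def range_min(table, lo, hi):
    """Minimum of lcp[lo..hi], inclusive, for lo <= hi."""
    j = (hi - lo + 1).bit_length() - 1
    return min(table[j][lo], table[j][hi - (1 << j) + 1])


def lcp_of_suffixes(rank, table, i, j):
    """LCP of suffix(i) and suffix(j), i != j, in O(1)."""
    ri, rj = sorted((rank[i], rank[j]))
    return range_min(table, ri + 1, rj)


# The "banana" arrays from the example above
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
rank = [3, 2, 5, 1, 4, 0]
table = build_sparse_table(lcp)
print(lcp_of_suffixes(rank, table, 3, 5))  # 1  ("ana" vs "a")
print(lcp_of_suffixes(rank, table, 1, 3))  # 3  ("anana" vs "ana")
```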
The combination of Suffix Array + LCP Array + RMQ preprocessing is incredibly powerful. With O(n) or O(n log n) total preprocessing, you can: find any pattern in O(m + log n), compare any two substrings in O(1), find LCP of any two suffixes in O(1), and solve numerous string problems that naively require O(n²) or worse.
We've explored the practical power of suffix arrays. Here are the key takeaways:
1. Pattern matching reduces to binary search over sorted suffixes: O(m log n) per query, improvable to O(m + log n) with the LCP array
2. Kasai's algorithm builds the LCP array in O(n) by processing suffixes in text order
3. The longest repeated substring is simply the maximum LCP value
4. Distinct substrings can be counted with the formula n(n+1)/2 − Σ LCP[i]
5. The longest common substring of two (or K) strings: concatenate with separators, then scan the LCP array
6. The LCP of any two suffixes becomes a range minimum query, answerable in O(1) after preprocessing
What's Next:
Suffix arrays are the practical, space-efficient choice for most applications. But there's an even more powerful (if more complex) structure: the suffix tree. The next page introduces suffix trees—understanding what they are, their advantages over suffix arrays, and when the extra complexity is justified.
You now understand the core applications of suffix arrays—from pattern matching to substring analysis. The combination of suffix array + LCP + RMQ forms a powerful foundation for solving a vast array of string problems efficiently. Next, we'll explore suffix trees, the conceptual ancestor of suffix arrays with its own unique strengths.