We've learned what suffix arrays are and how to construct them. Now comes the payoff: what can we actually do with them?
Suffix arrays unlock a remarkable range of applications in text processing. In this page, we'll explore two foundational capabilities: fast pattern matching via binary search, and the LCP (Longest Common Prefix) array, which powers everything from repeated-substring detection to distinct-substring counting.
These applications form the foundation for building search engines, bioinformatics tools, plagiarism detectors, and data compression algorithms. Understanding them transforms suffix arrays from an academic curiosity into a practical engineering tool.
By the end of this page, you will understand:
1. How to search for patterns using binary search on a suffix array
2. What the LCP (Longest Common Prefix) array is and how to construct it efficiently
3. How the LCP array accelerates pattern matching to O(m + log n)
4. How to solve advanced problems like finding the longest repeated substring
The most fundamental application of suffix arrays is pattern matching. Recall the key insight: suffixes starting with a given pattern are contiguous in the sorted suffix array.
This means finding all occurrences of pattern P in text T reduces to finding the range [left, right] in the suffix array where all suffixes begin with P.
Algorithm:
1. Binary search to find the LEFTMOST suffix that starts with P (or is ≥ P lexicographically)
2. Binary search to find the RIGHTMOST suffix that starts with P (the last suffix whose first |P| characters equal P)
3. All positions SA[left..right] are occurrences of P in T
4. Number of occurrences = right - left + 1
```python
def pattern_search(text, sa, pattern):
    """
    Find all occurrences of pattern in text using suffix array.
    Returns list of starting positions.
    Time: O(m log n + k) where m = |pattern|, n = |text|, k = #occurrences
    """
    n = len(text)
    m = len(pattern)

    def compare(suffix_start, p):
        """Compare suffix at position suffix_start with pattern p"""
        i, j = suffix_start, 0
        while i < n and j < m:
            if text[i] < p[j]:
                return -1  # suffix < pattern
            elif text[i] > p[j]:
                return 1   # suffix > pattern
            i += 1
            j += 1
        # If we've matched all of pattern, suffix starts with pattern
        if j == m:
            return 0
        # Pattern is longer than the remaining suffix: suffix < pattern
        return -1

    # Find left boundary (first suffix >= pattern)
    left = 0
    right = n - 1
    result_left = n  # Initialize to "not found"
    while left <= right:
        mid = (left + right) // 2
        cmp = compare(sa[mid], pattern)
        if cmp >= 0:  # suffix >= pattern
            result_left = mid
            right = mid - 1
        else:
            left = mid + 1

    # Check if we actually found a match
    if result_left == n or compare(sa[result_left], pattern) != 0:
        return []  # Pattern not found

    # Find right boundary (last suffix that starts with pattern)
    left = result_left
    right = n - 1
    result_right = result_left
    while left <= right:
        mid = (left + right) // 2
        cmp = compare(sa[mid], pattern)
        if cmp == 0:  # suffix starts with pattern
            result_right = mid
            left = mid + 1
        else:
            right = mid - 1

    # Return all positions in the range
    return [sa[i] for i in range(result_left, result_right + 1)]


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]  # Precomputed suffix array

print(pattern_search(text, sa, "ana"))  # [3, 1]
print(pattern_search(text, sa, "na"))   # [4, 2]
print(pattern_search(text, sa, "ban"))  # [0]
print(pattern_search(text, sa, "xyz"))  # []
```
Complexity Analysis:
Building the suffix array costs O(n log n) once. Each query then costs O(m log n + k): two binary searches of O(log n) probes each, with every probe comparing up to m characters, plus O(k) to report the k occurrences.
Compare this to naive search (O(nm)) or KMP (O(n + m) but for single pattern). With suffix arrays, we pay O(n log n) upfront to build, then answer unlimited queries in O(m log n + k) each.
The real power shows with multiple queries. If you need to search for Q different patterns in the same text, naive costs O(Q×n×m), KMP costs O(Q×(n+m)), but suffix arrays cost O(n log n) once + O(Q × m log n) for queries. For Q >> 1, suffix arrays win decisively.
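To make this concrete, here is a back-of-the-envelope operation count for one illustrative workload (the numbers n, m, and Q below are assumptions for illustration; constants are ignored, so treat these as orders of magnitude, not benchmarks):

```python
import math

# Illustrative (assumed) workload parameters
n = 10**6   # text length
m = 10      # pattern length
Q = 10**4   # number of queries

naive_ops = Q * n * m    # O(Q * n * m)
kmp_ops = Q * (n + m)    # O(Q * (n + m))
# Suffix array: one-time build plus per-query binary search
sa_ops = n * math.log2(n) + Q * m * math.log2(n)

print(f"naive: {naive_ops:.1e}")  # ~1e11
print(f"KMP:   {kmp_ops:.1e}")    # ~1e10
print(f"SA:    {sa_ops:.1e}")     # ~2e7
```

Even with the one-time build included, the suffix array approach is roughly three orders of magnitude cheaper on this workload.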
The LCP (Longest Common Prefix) array is a companion structure to the suffix array that dramatically increases its power.
Definition:
For a suffix array SA of a string S:
LCP[i] = length of the longest common prefix between
suffix SA[i-1] and suffix SA[i]
In other words, LCP[i] tells us how many leading characters the i-th and (i-1)-th suffixes (in sorted order) share.
By convention, LCP[0] is undefined or set to 0 (there's no previous suffix).
Visual Representation:
SA Index │ SA[i] │ Suffix │ LCP[i]
─────────┼───────┼──────────┼───────
0 │ 5 │ a │ 0 (no predecessor)
1 │ 3 │ ana │ 1 (shares 'a' with prev)
2 │ 1 │ anana │ 3 (shares 'ana' with prev)
3 │ 0 │ banana │ 0 (shares nothing with prev)
4 │ 4 │ na │ 0 (shares nothing with prev)
5 │ 2 │ nana │ 2 (shares 'na' with prev)
Notice how LCP values are high when consecutive suffixes share prefixes (like "ana" and "anana") and zero when they start with different characters (like "anana" and "banana").
The LCP array only stores common prefix lengths between adjacent suffixes in sorted order. This might seem limited, but it's sufficient! The LCP between any two suffixes can be computed as the minimum LCP value in the range between them—a range minimum query (RMQ). With O(n) preprocessing, RMQ answers in O(1), so we get O(1) LCP for any pair.
The naive approach to building the LCP array compares adjacent suffixes character by character—O(n) per pair, O(n²) total. But we can do much better with Kasai's algorithm (2001), which builds the LCP array in O(n) time.
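For reference, the naive construction just described can be written directly from the definition (the function name is ours):

```python
def lcp_naive(text, sa):
    """LCP array straight from the definition: compare each adjacent
    pair of sorted suffixes character by character. O(n) per pair,
    O(n^2) total in the worst case."""
    n = len(text)
    lcp = [0] * n
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        lcp[i] = k
    return lcp


print(lcp_naive("banana", [5, 3, 1, 0, 4, 2]))  # [0, 1, 3, 0, 0, 2]
```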
Kasai's Algorithm - Key Insight:
Suppose we know that suffix(i) shares k characters with its predecessor in sorted order. What can we say about suffix(i+1)?
Crucially: If suffix(i) shares k ≥ 1 characters with its predecessor, then suffix(i+1) shares at least k-1 characters with ITS predecessor (in sorted order).
Why? If suffix(i) = "aXYZ..." has LCP k with its sorted predecessor suffix(j) = "aXYZ'...", then stripping the first character gives suffix(i+1) = "XYZ..." and suffix(j+1) = "XYZ'...", which still share k-1 characters. Since suffix(j+1) still precedes suffix(i+1) in sorted order, the immediate predecessor of suffix(i+1) lies between them and therefore also shares at least those k-1 characters.
This means LCP values can decrease by at most 1 as we process suffixes in text order. They can increase by any amount, but decreases are bounded.
```python
def build_lcp_array(text, sa):
    """
    Build LCP array using Kasai's algorithm.
    Time: O(n), Space: O(n)

    Args:
        text: The original string
        sa: The suffix array for text

    Returns:
        lcp: LCP array where lcp[i] = LCP of suffix sa[i-1] and sa[i]
    """
    n = len(text)
    lcp = [0] * n

    # Build inverse suffix array (rank array)
    # rank[i] = position of suffix(i) in the sorted suffix array
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i

    # k tracks the current LCP length
    k = 0

    # Process suffixes in text order (not sorted order)
    for i in range(n):
        if rank[i] == 0:
            # First suffix in sorted order has no predecessor
            k = 0
            continue

        # j is the text position of the previous suffix in sorted order
        j = sa[rank[i] - 1]

        # Count matching characters starting from current k
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1

        # Store the LCP value
        lcp[rank[i]] = k

        # Decrease k for next iteration (key insight!)
        if k > 0:
            k -= 1

    return lcp


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = build_lcp_array(text, sa)

print(f"SA:  {sa}")
print(f"LCP: {lcp}")
# Output:
# SA:  [5, 3, 1, 0, 4, 2]
# LCP: [0, 1, 3, 0, 0, 2]

# Verify
for i in range(len(sa)):
    suffix = text[sa[i]:]
    print(f"SA[{i}]={sa[i]}: '{suffix}' (LCP with prev: {lcp[i]})")
```
Complexity Analysis of Kasai's Algorithm:
The variable k never exceeds n, increases only inside the while loop, and decreases by at most 1 per outer iteration. Since it decreases at most n times and is bounded by n, it can increase at most 2n times in total—so all the while-loop work across the entire run is O(n).
This elegant analysis uses the "amortized" argument: the while loop might do more work in some iterations, but the total work across all iterations is bounded.
Kasai's algorithm processes suffixes in TEXT order (position 0, 1, 2, ...) not sorted order. This is counterintuitive—we're building an array indexed by sorted order—but it's the key to efficiency. Processing in text order lets us exploit the relationship between suffix(i) and suffix(i+1).
With the LCP array, we can accelerate pattern matching from O(m log n) to O(m + log n). This improvement matters for long patterns.
The Problem with Basic Binary Search:
In standard binary search, each comparison with the middle suffix starts from scratch—we compare the pattern with the suffix character by character from the beginning. If we've already compared prefix 'abc' in a previous iteration, we might redundantly compare 'abc' again.
The Optimization:
We maintain two extra values during the binary search: lcp_left, the LCP of the pattern with the suffix at the current left boundary, and lcp_right, the LCP of the pattern with the suffix at the current right boundary.
When we compare with the middle suffix, we can skip the first min(lcp_left, lcp_right) characters—they're guaranteed to match!
How It Works:
Suppose we know lcp_left = LCP(P, left-boundary suffix) and lcp_right = LCP(P, right-boundary suffix), with lcp_left ≥ lcp_right (the symmetric case mirrors this).
For the middle suffix, we need to find LCP(middle suffix, P). Using the LCP array and RMQ, we can compute LCP(left-boundary suffix, middle suffix) in O(1). If that value differs from lcp_left, the comparison is decided without examining a single character; if it equals lcp_left, we resume comparing characters at position lcp_left, never re-reading characters we have already matched.
Result: Each binary search step does O(1) work plus some new character comparisons. Across all steps, we compare each pattern character at most O(log n) times total, giving O(m + log n).
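As a sketch of the character-skipping idea (without the RMQ machinery, so the worst case remains O(m log n); the helper names here are ours), a left-boundary search might look like this:

```python
def match_len(text, start, pattern, skip):
    """Length of the common prefix of text[start:] and pattern,
    given that the first `skip` characters are known to match."""
    k = skip
    while start + k < len(text) and k < len(pattern) and text[start + k] == pattern[k]:
        k += 1
    return k


def find_first_geq(text, sa, pattern):
    """Index of the first suffix in sa that is >= pattern.
    Each probe skips min(lcp_left, lcp_right) known-equal characters:
    the middle suffix lies lexicographically between the two boundary
    suffixes, so it must agree with the pattern on at least that many."""
    lo, hi = 0, len(sa)
    lcp_left = lcp_right = 0  # LCP(pattern, suffix just outside each end)
    while lo < hi:
        mid = (lo + hi) // 2
        k = match_len(text, sa[mid], pattern, min(lcp_left, lcp_right))
        suffix_exhausted = sa[mid] + k == len(text)
        if k < len(pattern) and (suffix_exhausted or text[sa[mid] + k] < pattern[k]):
            lo = mid + 1       # middle suffix < pattern: search right half
            lcp_left = k
        else:
            hi = mid           # middle suffix >= pattern: search left half
            lcp_right = k
    return lo


# Example on the familiar "banana" arrays
sa = [5, 3, 1, 0, 4, 2]
print(find_first_geq("banana", sa, "ana"))  # 1
```

The same skip applies symmetrically to the right-boundary search.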
The O(m + log n) improvement matters when m (pattern length) is comparable to or larger than log n. For a 1GB text (n = 10^9, log n ≈ 30) and a 1KB pattern (m = 1000), basic search does ~30,000 character comparisons while optimized search does ~1,030. For short patterns (m < log n), the improvement is negligible, and basic O(m log n) search is fine.
One of the most elegant applications of the LCP array is finding the longest repeated substring—the longest substring that appears at least twice in the text.
Key Insight:
The longest repeated substring is a common prefix between two different suffixes. Since the LCP array stores common prefix lengths between adjacent suffixes in sorted order, and adjacent suffixes have the maximum common prefix among 'similar' suffixes:
The longest repeated substring has length = max(LCP[i]) for all i
To find the actual substring, find the index i where LCP[i] is maximum, then extract the first LCP[i] characters of suffix SA[i].
```python
def longest_repeated_substring(text, sa, lcp):
    """
    Find the longest substring that appears at least twice.
    Time: O(n) given SA and LCP
    """
    n = len(text)
    if n == 0:
        return ""

    # Find maximum LCP value
    max_lcp = 0
    max_idx = 0
    for i in range(n):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_idx = i

    if max_lcp == 0:
        return ""  # No repeated substring

    # The longest repeated substring starts at SA[max_idx]
    # and has length max_lcp
    start = sa[max_idx]
    return text[start:start + max_lcp]


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]

result = longest_repeated_substring(text, sa, lcp)
print(f"Longest repeated substring: '{result}'")  # 'ana'

# Verify: 'ana' appears at positions 1 and 3
print(f"Occurrences: {text.find('ana')}, {text.rfind('ana')}")  # 1, 3

# More complex example
text2 = "abracadabra"
# SA and LCP would need to be computed...
# The answer is "abra", which appears at positions 0 and 7
```
Why Does Maximum LCP Work?
Let's reason through this: any repeated substring is, by definition, a common prefix of two distinct suffixes. The LCP of any two suffixes equals the minimum LCP value over the range between them in sorted order, which can never exceed a single adjacent LCP value in that range. So no pair of suffixes can share a longer prefix than the maximum adjacent LCP value—and that maximum is realized by an actual adjacent pair, so it corresponds to a genuine repeated substring.
Complexity: a single O(n) scan of the LCP array, given SA and LCP (themselves buildable in O(n log n) and O(n) respectively).
Compare this to the naive O(n²) or O(n³) approaches!
This technique extends to many variations: finding the longest substring appearing at least k times (aggregate k consecutive LCP values), longest common substring of two strings (concatenate with separator, then max LCP between suffixes from different strings), and detecting plagiarism (common substrings above a threshold).
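For instance, the "at least k times" variation can be sketched as a sliding-window minimum over k-1 consecutive LCP entries (the function name is ours; O(n·k) is shown for clarity, and a monotonic deque brings it to O(n)):

```python
def longest_substring_k_times(text, sa, lcp, k):
    """Longest substring appearing at least k times (k >= 2).
    Suffixes sharing a prefix occupy consecutive slots in the suffix
    array, so k occurrences correspond to a window of k consecutive
    suffixes, whose shared prefix is the minimum of the k-1 LCP
    entries between them."""
    n = len(text)
    best_len, best_start = 0, 0
    for i in range(1, n - k + 2):
        # Suffixes SA[i-1 .. i+k-2] all share min(lcp[i .. i+k-2]) chars
        window_min = min(lcp[i:i + k - 1])
        if window_min > best_len:
            best_len = window_min
            best_start = sa[i]
    return text[best_start:best_start + best_len]


# "banana": "ana" appears twice, "a" three times
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(longest_substring_k_times("banana", sa, lcp, 2))  # ana
print(longest_substring_k_times("banana", sa, lcp, 3))  # a
```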
Another elegant application: counting the number of distinct substrings in a string.
Naive Approach:
Generate all O(n²) substrings, put them in a hash set, count unique ones. This takes O(n²) time and space—impractical for large strings.
Suffix Array + LCP Approach:
Every substring is a prefix of some suffix. Consider the suffix starting at SA[i], which has length (n - SA[i]): it contributes one substring per prefix length, but its first LCP[i] prefixes already appeared as prefixes of the previous suffix in sorted order. So it contributes exactly (n - SA[i]) - LCP[i] new substrings.
Formula:
Distinct substrings = Σ (n - SA[i] - LCP[i]) for i = 0 to n-1
= Σ (n - SA[i]) - Σ LCP[i]
= n(n+1)/2 - Σ LCP[i]
```python
def count_distinct_substrings(text, sa, lcp):
    """
    Count the number of distinct substrings in text.
    Time: O(n) given SA and LCP

    Formula: Total possible - duplicates = n(n+1)/2 - sum(LCP)
    """
    n = len(text)

    # Total possible substrings (all prefix lengths of all suffixes)
    total = n * (n + 1) // 2

    # Substrings shared with previous suffix in sorted order
    duplicates = sum(lcp)

    return total - duplicates


# Example
text = "banana"
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]

count = count_distinct_substrings(text, sa, lcp)
print(f"Distinct substrings: {count}")  # 15

# Verify by enumeration
substrings = set()
for i in range(len(text)):
    for j in range(i + 1, len(text) + 1):
        substrings.add(text[i:j])
print(f"By enumeration: {len(substrings)}")  # 15

print("Substrings:", sorted(substrings))
# ['a', 'an', 'ana', 'anan', 'anana', 'b', 'ba', 'ban', 'bana', 'banan',
#  'banana', 'n', 'na', 'nan', 'nana']
```
Understanding the Formula:
| Suffix | Length | Contributes | LCP with prev | Net new |
|---|---|---|---|---|
| a | 1 | 1 | 0 | 1 |
| ana | 3 | 3 | 1 | 2 |
| anana | 5 | 5 | 3 | 2 |
| banana | 6 | 6 | 0 | 6 |
| na | 2 | 2 | 0 | 2 |
| nana | 4 | 4 | 2 | 2 |
Total: 1 + 2 + 2 + 6 + 2 + 2 = 15
Or using the formula: 6·7/2 − (0 + 1 + 3 + 0 + 0 + 2) = 21 − 6 = 15.
The number of distinct substrings provides insight into text complexity and compressibility. High repetition (fewer distinct substrings) means better compression potential. This is why suffix arrays and the Burrows-Wheeler Transform (BWT) are fundamental to compression algorithms.
Given two strings A and B, find their longest common substring—a classic problem with important applications in bioinformatics (comparing DNA sequences), plagiarism detection, and diff algorithms.
Suffix Array Approach:
1. Concatenate: S = A + '$' + B, where '$' is a separator appearing in neither string
2. Build the suffix array and LCP array of S
3. Scan adjacent pairs in the suffix array; the answer is the maximum LCP[i] where suffixes SA[i-1] and SA[i] come from different original strings
Why This Works:
Any common substring of A and B is a common prefix of some suffix from A's half and some suffix from B's half of S. The longest such common prefix is always achieved by a pair that is adjacent in sorted order (every suffix between them shares it too), and the unique separator guarantees no match can extend across the boundary between A and B.
```python
def longest_common_substring(a, b):
    """
    Find the longest common substring of strings a and b.
    Time: O((|a| + |b|) log (|a| + |b|)) for SA construction
    """
    # Concatenate with separator
    separator = '$'  # Must not appear in a or b
    s = a + separator + b
    n = len(s)
    m = len(a)  # Position m is the separator

    # Build suffix array (using naive construction for illustration)
    # In practice, use O(n log n) or O(n) construction
    suffixes = [(s[i:], i) for i in range(n)]
    suffixes.sort()
    sa = [pos for (suffix, pos) in suffixes]

    # Build LCP array (Kasai's algorithm)
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i
    lcp = [0] * n
    k = 0
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        j = sa[rank[i] - 1]
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1

    # Find max LCP where adjacent suffixes come from different strings
    max_len = 0
    result_pos = -1
    for i in range(1, n):
        pos1 = sa[i - 1]
        pos2 = sa[i]
        # Check if they come from different strings
        from_a1 = pos1 < m
        from_a2 = pos2 < m
        if from_a1 != from_a2:  # Different strings
            if lcp[i] > max_len:
                max_len = lcp[i]
                result_pos = min(pos1, pos2)  # Position in original string

    if max_len == 0:
        return ""

    # Extract the substring (from string a, since result_pos < m)
    return s[result_pos:result_pos + max_len]


# Example
a = "abcdefg"
b = "xyzabcpqr"
result = longest_common_substring(a, b)
print(f"Longest common substring: '{result}'")  # 'abc'

# Another example
a2 = "programming"
b2 = "programmer"
result2 = longest_common_substring(a2, b2)
print(f"Longest common substring: '{result2}'")  # 'programm'
```
Complexity:
With a proper suffix array construction this is O(n log n) overall for n = |A| + |B|; the naive sort shown above is O(n² log n) in the worst case but keeps the example self-contained.
Compare to naive O(|A| × |B| × min(|A|, |B|)) or DP O(|A| × |B|)—suffix arrays are dramatically faster for long strings.
This approach extends to finding the longest common substring among K strings: concatenate all with unique separators, build SA and LCP, then find the maximum LCP where the suffixes come from all K different strings. This requires slightly more bookkeeping but remains efficient.
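A sketch of that K-string extension (the function name and the control-character separators are our assumptions; inputs are assumed not to contain those characters, and the naive SA construction is again for illustration only):

```python
from collections import defaultdict


def lcs_k_strings(strings):
    """Longest common substring of every string in `strings`.
    Join with unique separators, build SA + LCP, then slide a window
    over the suffix array until it covers suffixes from all the
    strings, tracking the window's minimum LCP."""
    k = len(strings)
    seps = [chr(1 + i) for i in range(k)]  # assumed unique separators
    s = "".join(t + sep for t, sep in zip(strings, seps))
    n = len(s)

    # owner[p] = index of the string that position p belongs to
    owner = []
    for idx, t in enumerate(strings):
        owner.extend([idx] * (len(t) + 1))

    # Naive suffix array + Kasai LCP (use a real construction in practice)
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for pos, suf in enumerate(sa):
        rank[suf] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h:
            h -= 1

    # Sliding window over SA that covers suffixes from all k strings
    count = defaultdict(int)
    covered = 0
    best_len, best_pos = 0, 0
    lo = 0
    for hi in range(n):
        count[owner[sa[hi]]] += 1
        if count[owner[sa[hi]]] == 1:
            covered += 1
        while covered == k:
            if hi > lo:
                w = min(lcp[lo + 1:hi + 1])  # shared prefix of the window
                if w > best_len:
                    best_len, best_pos = w, sa[lo]
            count[owner[sa[lo]]] -= 1
            if count[owner[sa[lo]]] == 0:
                covered -= 1
            lo += 1
    return s[best_pos:best_pos + best_len]


print(lcs_k_strings(["abcde", "xbcdy", "zbcdw"]))  # bcd
```

The window minimum is recomputed with min() for clarity; a monotonic deque would make the whole scan linear.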
Many suffix array applications need to answer: "What is the LCP of suffix(i) and suffix(j)?" where i and j are not adjacent in the suffix array.
Key Observation:
The LCP of two suffixes equals the minimum of all LCP values between them in sorted order:
LCP(suffix SA[i], suffix SA[j]) = min(LCP[i+1], LCP[i+2], ..., LCP[j]) for i < j
This is because as we go from SA[i] to SA[j], each step can only decrease the common prefix (LCP[k] indicates where the k-th suffix diverges from the (k-1)-th).
This transforms LCP queries into Range Minimum Queries (RMQ)!
Efficient RMQ:
With the right preprocessing, RMQ can be answered in O(1): a sparse table gives O(n log n) preprocessing with O(1) queries, and more advanced schemes (e.g., the Fischer–Heun structure based on Cartesian trees) achieve O(n) preprocessing.
Applications: comparing any two substrings in O(1), answering longest-common-extension queries between arbitrary suffixes, and emulating many suffix-tree algorithms directly on the suffix array.
Example Structure:
Suffix Array SA = [5, 3, 1, 0, 4, 2]
LCP Array LCP = [0, 1, 3, 0, 0, 2]
Rank Array Rank = [3, 2, 5, 1, 4, 0]
Query: LCP(suffix at position 3, suffix at position 5)?
- Rank[3] = 1, Rank[5] = 0
- Need min LCP[1..1] (since Rank[5] < Rank[3])
- = LCP[1] = 1
- Verify: suffix(3) = "ana", suffix(5) = "a", LCP = 1 ✓
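Using the example arrays above, a minimal sparse-table sketch (function names are ours) makes these O(1) pair queries concrete:

```python
def build_sparse_table(lcp):
    """Sparse table over the LCP array: O(n log n) preprocessing,
    O(1) range-minimum queries."""
    n = len(lcp)
    table = [lcp[:]]
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table


def range_min(table, lo, hi):
    """Minimum of lcp[lo..hi], inclusive, for lo <= hi."""
    j = (hi - lo + 1).bit_length() - 1
    return min(table[j][lo], table[j][hi - (1 << j) + 1])


def lcp_of_suffixes(rank, table, i, j):
    """LCP of suffix(i) and suffix(j), i != j, in O(1)."""
    ri, rj = sorted((rank[i], rank[j]))
    return range_min(table, ri + 1, rj)


# The "banana" arrays from the example above
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
rank = [3, 2, 5, 1, 4, 0]
table = build_sparse_table(lcp)
print(lcp_of_suffixes(rank, table, 3, 5))  # 1  ("ana" vs "a")
print(lcp_of_suffixes(rank, table, 1, 3))  # 3  ("anana" vs "ana")
```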
The combination of Suffix Array + LCP Array + RMQ preprocessing is incredibly powerful. With O(n) or O(n log n) total preprocessing, you can: find any pattern in O(m + log n), compare any two substrings in O(1), find LCP of any two suffixes in O(1), and solve numerous string problems that naively require O(n²) or worse.
We've explored the practical power of suffix arrays. Here are the key takeaways:
1. Pattern matching reduces to binary search over sorted suffixes: O(m log n) per query, improvable to O(m + log n) with the LCP array
2. Kasai's algorithm builds the LCP array in O(n) by processing suffixes in text order
3. The longest repeated substring is simply the maximum LCP value
4. Distinct substrings can be counted with the formula n(n+1)/2 − Σ LCP[i]
5. The longest common substring of two (or K) strings: concatenate with separators, then scan the LCP array
6. The LCP of any two suffixes becomes a range minimum query, answerable in O(1) after preprocessing
What's Next:
Suffix arrays are the practical, space-efficient choice for most applications. But there's an even more powerful (if more complex) structure: the suffix tree. The next page introduces suffix trees—understanding what they are, their advantages over suffix arrays, and when the extra complexity is justified.
You now understand the core applications of suffix arrays—from pattern matching to substring analysis. The combination of suffix array + LCP + RMQ forms a powerful foundation for solving a vast array of string problems efficiently. Next, we'll explore suffix trees, the conceptual ancestor of suffix arrays with its own unique strengths.