Data Structures & AlgorithmsSuffix Arrays & Suffix Trees

Suffix Arrays & Suffix Trees (Conceptual)

LevelAdvanced

Duration90 mins

TopicSuffix Arrays & Suffix Trees

1 / 5

What is a Suffix Array

The Suffix Array: A Deceptively Simple Powerhouse

Imagine you're building a search engine for a massive text corpus—millions of documents, billions of characters. A user types a query, and you need to find every occurrence in milliseconds. The naive approach of scanning through all that text for each query would take hours. You need something radically more efficient.

Enter the suffix array—a data structure so elegantly simple that its power might not be immediately apparent. At its core, a suffix array is just an array of integers. Yet this humble sequence enables some of the most sophisticated string operations in computer science: pattern matching in O(m log n) time, finding the longest repeated substring, determining the longest common substring of multiple strings, and much more.

In this page, we'll demystify the suffix array, understand precisely what it represents, and see why sorting suffixes lexicographically unlocks a treasure trove of string algorithmic capabilities.

What You Will Learn

By the end of this page, you will understand what a suffix array is, how it encodes all suffixes of a string in sorted order, why lexicographic sorting of suffixes is the key insight, and how this simple representation enables powerful substring queries. You'll develop the conceptual foundation needed to understand suffix array construction and applications.

The Concept of Suffixes

Before we define the suffix array, we must first deeply understand what we mean by a suffix of a string.

Formal Definition:

Given a string S of length n (indexed from 0 to n-1), a suffix is any substring that starts at some position i and extends to the end of the string. We denote the suffix starting at position i as S[i..n-1] or simply suffix(i).

For a string of length n, there are exactly n suffixes (one starting at each position). Each suffix is uniquely identified by its starting index.

All Suffixes of 'banana'Let's enumerate every suffix of the classic example string 'banana' (length 6):

Input

Output

Key Observation:

Every substring of the original string is a prefix of some suffix. This seemingly simple observation is profound:

If you're searching for pattern "nan" in "banana"
"nan" appears at position 2 in the original string
"nan" is the prefix of suffix "nana" (which starts at position 2)

This means that finding a pattern in a string is equivalent to finding which suffixes have that pattern as their prefix. This insight is the foundation of suffix-based string searching.

The Substring-Suffix Connection

Any occurrence of a pattern P in string S corresponds to a suffix of S that begins with P. If we can efficiently find all suffixes that start with P, we've found all occurrences of P in S. This is the conceptual bridge that makes suffix arrays powerful for pattern matching.

Defining the Suffix Array

Now we arrive at the central definition:

The Suffix Array:

A suffix array for a string S of length n is an array SA of n integers, where SA[k] gives the starting index of the suffix that is k-th in lexicographic (alphabetical) order among all suffixes of S.

In other words:

SA[0] = starting index of the lexicographically smallest suffix
SA[1] = starting index of the second smallest suffix
...
SA[n-1] = starting index of the lexicographically largest suffix

The suffix array doesn't store the suffixes themselves—it stores only their starting positions. This is crucial: an array of n integers takes O(n) space, whereas storing all suffixes explicitly would require O(n²) space.

Why Lexicographic Order?

Lexicographic order is the generalization of alphabetical order to strings. String A is lexicographically smaller than B if, at the first position where they differ, A has a smaller character. If A is a prefix of B, then A is smaller. Sorting suffixes lexicographically groups suffixes with common prefixes together—a property we'll exploit heavily.

Building the Suffix Array for 'banana'Let's construct the suffix array step-by-step:

Input

Output

Suffix Array for 'banana'
Rank (position in SA)	SA[rank]	Suffix	Length
0	5	a	1
1	3	ana	3
2	1	anana	5
3	0	banana	6
4	4	na	2
5	2	nana	4

The Power of Sorted Suffixes

Sorting suffixes lexicographically creates a remarkably useful property: suffixes with common prefixes are grouped together in the suffix array.

Consider our 'banana' example. Look at how the suffixes are ordered:

"a"       ← starts with 'a'
"ana"     ← starts with 'a'
"anana"   ← starts with 'a'
"banana"  ← starts with 'b'
"na"      ← starts with 'n'
"nana"    ← starts with 'n'

All suffixes starting with 'a' are consecutive. All suffixes starting with 'n' are consecutive. This isn't coincidence—it's a direct consequence of lexicographic ordering.

This grouping property is transformative for pattern matching.

Pattern Matching via Binary Search:

Because suffixes with common prefixes are contiguous in the suffix array, finding all occurrences of a pattern P becomes a range query:

Binary search to find the first suffix in SA that starts with P (or is ≥ P lexicographically)
Binary search to find the last suffix in SA that starts with P (or is ≤ P lexicographically)
All suffixes in this range start with P, meaning P occurs at each of those positions in the original string

With a suffix array, we can find all k occurrences of a pattern P of length m in a string of length n in O(m log n + k) time—logarithmic in the string length!

Searching for 'ana' in 'banana'Let's trace how we'd find all occurrences of 'ana':

Input

Output

Binary Search Subtlety

When binary searching for a pattern P, each comparison involves comparing P against a suffix of S. This comparison takes O(|P|) = O(m) time in the worst case. Therefore, total search time is O(m log n). Advanced techniques using LCP arrays can reduce this to O(m + log n).

Formal Properties and Representation

Let's formalize the key properties of suffix arrays:

Property 1: Bijection

The suffix array SA is a permutation of [0, 1, 2, ..., n-1]. Every position appears exactly once because there's exactly one suffix for each starting position, and sorting is a bijection.

Property 2: Inverse Suffix Array

The inverse suffix array (or rank array) ISA is defined as:

ISA[i] = k  such that  SA[k] = i

In other words, ISA[i] gives the rank (sorted position) of the suffix starting at position i.

SA: "What suffix is at rank k?"
ISA: "What is the rank of the suffix at position i?"

They are inverses: SA[ISA[i]] = i and ISA[SA[k]] = k

Suffix Array and Inverse for 'banana'Computing the inverse suffix array:

Input

Output

Property 3: Order Preservation

For any two positions i and j in the original string:

suffix(i) < suffix(j) lexicographically  ⟺  ISA[i] < ISA[j]

The inverse suffix array encodes the relative lexicographic ordering of all suffixes. This is extremely useful for algorithms that need to compare suffixes without actually comparing character-by-character.

Property 4: Space Efficiency

A suffix array for a string of length n requires:

O(n) integers for SA (typically 4n bytes for 32-bit integers)
O(n) integers for ISA if computed (another 4n bytes)
Total: O(n) space, independent of alphabet size

Compare this to explicitly storing all suffixes, which would need O(n²) space!

The Sentinel Character

In many implementations, a special 'sentinel' character (often '$' or a null byte) is appended to the string. This sentinel is defined to be lexicographically smaller than all other characters. It ensures that no suffix is a prefix of another suffix, simplifying comparisons and avoiding edge cases.

Visual Mental Model

Developing the right mental model for suffix arrays is crucial. Here are several ways to visualize them:

Model 1: The Sorted Suffix List

Imagine writing out all suffixes on separate cards, then sorting them alphabetically. The suffix array is just the list of original positions on those sorted cards.

Cards after sorting:
┌─────────┬─────────┐
│ Card 1  │ Pos: 5  │  "a"
├─────────┼─────────┤
│ Card 2  │ Pos: 3  │  "ana"
├─────────┼─────────┤
│ Card 3  │ Pos: 1  │  "anana"
├─────────┼─────────┤
│ Card 4  │ Pos: 0  │  "banana"
├─────────┼─────────┤
│ Card 5  │ Pos: 4  │  "na"
├─────────┼─────────┤
│ Card 6  │ Pos: 2  │  "nana"
└─────────┴─────────┘

The suffix array is: [5, 3, 1, 0, 4, 2]

Model 2: The Rotations View

Alternatively, think of the string as circular, with each rotation representing a different "view". If you append a sentinel character, each suffix corresponds to a unique rotation:

banana$  → "banana$"
anana$   → "anana$b"  (but we only care about "anana$")
nana$    → "nana$ba"  (but we only care about "nana$")
...

This view connects suffix arrays to the Burrows-Wheeler Transform (BWT) used in data compression.

Model 3: The Index Pointer View

Picture an array of n pointers, each pointing into the original string at a different position. Initially unordered, we sort these pointers by what they "see" (the suffix starting at that position). The sorted pointer positions form the suffix array.

No Suffix Storage Needed

A crucial insight: we never explicitly store the suffixes themselves. We only store starting positions (indices). When we need to compare two suffixes, we compare characters directly in the original string starting from those positions. This is why suffix arrays are O(n) space despite representing O(n²) total characters across all suffixes.

Comparison with Other Approaches

To appreciate suffix arrays, let's compare them to alternative approaches for pattern matching:

Naive String Search:

Time: O(n × m) per query (slide pattern across text)
No preprocessing
Simple but slow for repeated queries

KMP / Z-Algorithm:

Time: O(n + m) per query
Preprocessing: O(m) per pattern
Excellent for single pattern matching, but must reprocess for each new pattern

Suffix Array:

Preprocessing: O(n log n) or O(n) to build
Query: O(m log n) per pattern
Space: O(n)
Build once, query many patterns efficiently

Pattern Matching Approach Comparison
Approach	Build Time	Query Time	Space	Best For
Naive	O(1)	O(n × m)	O(1)	One-off search, small strings
KMP	O(m) per pattern	O(n + m)	O(m)	Single pattern, one-time search
Rabin-Karp	O(m) per pattern	O(n + m) avg	O(m)	Multiple pattern matching
Suffix Array	O(n log n) or O(n)	O(m log n + k)	O(n)	Many queries on same text
Suffix Tree	O(n)	O(m + k)	O(n) but large constant	Speed-critical applications

When Suffix Arrays Shine:

Text indexing for databases: Build suffix array once, answer unlimited pattern queries
Bioinformatics: Searching DNA/protein sequences for motifs, with billions of queries against the same genome
Plagiarism detection: Finding common substrings between documents
Data compression: The Burrows-Wheeler Transform, fundamental to bzip2 and bioinformatics tools, is intimately related to suffix arrays
Substring statistics: Finding longest repeated substrings, counting distinct substrings, etc.

The Index Analogy

Think of a suffix array like the index at the back of a book. Building the index takes effort, but once built, you can find any topic (pattern) quickly. Without an index, you'd have to scan every page for each lookup. The upfront investment pays dividends with repeated queries.

The Sentinel Character in Detail

Many suffix array implementations append a sentinel character to the string. This character (often denoted '$' or '\0') is defined to be:

Lexicographically smaller than all other characters in the alphabet
Unique—it appears only once, at the end

Why use a sentinel?

Problem Without Sentinel:

Consider the string "aa". Its suffixes are:

Position 0: "aa"
Position 1: "a"

When comparing "a" and "aa", we need a rule for when one string is a prefix of another. Typically, the shorter string is considered smaller ("a" < "aa").

With Sentinel:

The string becomes "aa$". Its suffixes are:

Position 0: "aa$"
Position 1: "a$"
Position 2: "$"

Now no suffix is a prefix of another—they all end at '$'. Comparisons are unambiguous:

"$" < "a$" < "aa$"

Sorted: SA = [2, 1, 0]

Practical Benefits:

Simplified comparisons: No special-case logic for prefix relationships
Unique smallest suffix: The sentinel-only suffix (at position n) is always the smallest
Algorithmic convenience: Many suffix array construction algorithms assume or require a sentinel
Boundary handling: Prevents out-of-bounds access when comparing suffix characters

When to Include the Sentinel:

Include it when using algorithms that expect it (DC3, SA-IS, etc.)
Include it when building suffix trees (sentinel ensures each suffix leads to a unique leaf)
Omit it for space-critical applications where you handle prefix comparisons manually

Sentinel Considerations

When using a sentinel, remember: (1) The alphabet size increases by 1; (2) The string length increases by 1, so your suffix array has n+1 elements; (3) If binary data might contain null bytes, ensure your sentinel is truly unique. Some implementations use -1 or a value outside the valid character range.

Summary: The Foundation of Suffix Arrays

We've established the conceptual foundation of suffix arrays. Let's consolidate the key insights:

Key Takeaways

•A suffix is any substring extending from some position to the end of the string. A string of length n has exactly n suffixes.
•The suffix array is the array of starting positions of suffixes, sorted by the suffixes' lexicographic order. It's a permutation of [0, n-1].
•Pattern matching becomes range finding: Because suffixes with common prefixes are contiguous in the sorted order, finding pattern occurrences reduces to binary search.
•Space efficiency: The suffix array stores only n integers (O(n) space), not the O(n²) characters of all suffixes explicitly.
•The inverse suffix array (ISA) maps positions to ranks, enabling O(1) suffix comparison by rank.
•A sentinel character simplifies implementation by ensuring no suffix is a prefix of another.

What's Next:

Now that you understand what a suffix array is, the next page explores how to build one efficiently. We'll examine the naive O(n² log n) construction approach and introduce the key ideas behind the O(n log n) and even O(n) algorithms that make suffix arrays practical for massive strings.

Page Complete

You now possess a solid understanding of what suffix arrays are and why sorting suffixes lexicographically is such a powerful idea. The suffix array is the foundation; building upon it opens the door to a vast array of efficient string algorithms. Next, we'll learn how to construct suffix arrays efficiently.

1 / 5

Loading learning content...

Data Structures & AlgorithmsSuffix Arrays & Suffix Trees

Suffix Arrays & Suffix Trees (Conceptual)

LevelAdvanced

Duration90 mins

TopicSuffix Arrays & Suffix Trees

1 / 5

What is a Suffix Array

The Suffix Array: A Deceptively Simple Powerhouse

In this page, we'll demystify the suffix array, understand precisely what it represents, and see why sorting suffixes lexicographically unlocks a treasure trove of string algorithmic capabilities.

What You Will Learn

The Concept of Suffixes

Before we define the suffix array, we must first deeply understand what we mean by a suffix of a string.

Formal Definition:

For a string of length n, there are exactly n suffixes (one starting at each position). Each suffix is uniquely identified by its starting index.

All Suffixes of 'banana'Let's enumerate every suffix of the classic example string 'banana' (length 6):

Input

Output

Key Observation:

Every substring of the original string is a prefix of some suffix. This seemingly simple observation is profound:

If you're searching for pattern "nan" in "banana"
"nan" appears at position 2 in the original string
"nan" is the prefix of suffix "nana" (which starts at position 2)

This means that finding a pattern in a string is equivalent to finding which suffixes have that pattern as their prefix. This insight is the foundation of suffix-based string searching.

The Substring-Suffix Connection

Defining the Suffix Array

Now we arrive at the central definition:

The Suffix Array:

In other words:

SA[0] = starting index of the lexicographically smallest suffix
SA[1] = starting index of the second smallest suffix
...
SA[n-1] = starting index of the lexicographically largest suffix

Why Lexicographic Order?

Building the Suffix Array for 'banana'Let's construct the suffix array step-by-step:

Input

Output

Suffix Array for 'banana'
Rank (position in SA)	SA[rank]	Suffix	Length
0	5	a	1
1	3	ana	3
2	1	anana	5
3	0	banana	6
4	4	na	2
5	2	nana	4

The Power of Sorted Suffixes

Sorting suffixes lexicographically creates a remarkably useful property: suffixes with common prefixes are grouped together in the suffix array.

Consider our 'banana' example. Look at how the suffixes are ordered:

"a"       ← starts with 'a'
"ana"     ← starts with 'a'
"anana"   ← starts with 'a'
"banana"  ← starts with 'b'
"na"      ← starts with 'n'
"nana"    ← starts with 'n'

All suffixes starting with 'a' are consecutive. All suffixes starting with 'n' are consecutive. This isn't coincidence—it's a direct consequence of lexicographic ordering.

This grouping property is transformative for pattern matching.

Pattern Matching via Binary Search:

Because suffixes with common prefixes are contiguous in the suffix array, finding all occurrences of a pattern P becomes a range query:

Binary search to find the first suffix in SA that starts with P (or is ≥ P lexicographically)
Binary search to find the last suffix in SA that starts with P (or is ≤ P lexicographically)
All suffixes in this range start with P, meaning P occurs at each of those positions in the original string

With a suffix array, we can find all k occurrences of a pattern P of length m in a string of length n in O(m log n + k) time—logarithmic in the string length!

Searching for 'ana' in 'banana'Let's trace how we'd find all occurrences of 'ana':

Input

Output

Binary Search Subtlety

Formal Properties and Representation

Let's formalize the key properties of suffix arrays:

Property 1: Bijection

The suffix array SA is a permutation of [0, 1, 2, ..., n-1]. Every position appears exactly once because there's exactly one suffix for each starting position, and sorting is a bijection.

Property 2: Inverse Suffix Array

The inverse suffix array (or rank array) ISA is defined as:

ISA[i] = k  such that  SA[k] = i

In other words, ISA[i] gives the rank (sorted position) of the suffix starting at position i.

SA: "What suffix is at rank k?"
ISA: "What is the rank of the suffix at position i?"

They are inverses: SA[ISA[i]] = i and ISA[SA[k]] = k

Suffix Array and Inverse for 'banana'Computing the inverse suffix array:

Input

Output

Property 3: Order Preservation

For any two positions i and j in the original string:

suffix(i) < suffix(j) lexicographically  ⟺  ISA[i] < ISA[j]

Property 4: Space Efficiency

A suffix array for a string of length n requires:

O(n) integers for SA (typically 4n bytes for 32-bit integers)
O(n) integers for ISA if computed (another 4n bytes)
Total: O(n) space, independent of alphabet size

Compare this to explicitly storing all suffixes, which would need O(n²) space!

The Sentinel Character

Visual Mental Model

Developing the right mental model for suffix arrays is crucial. Here are several ways to visualize them:

Model 1: The Sorted Suffix List

Imagine writing out all suffixes on separate cards, then sorting them alphabetically. The suffix array is just the list of original positions on those sorted cards.

Cards after sorting:
┌─────────┬─────────┐
│ Card 1  │ Pos: 5  │  "a"
├─────────┼─────────┤
│ Card 2  │ Pos: 3  │  "ana"
├─────────┼─────────┤
│ Card 3  │ Pos: 1  │  "anana"
├─────────┼─────────┤
│ Card 4  │ Pos: 0  │  "banana"
├─────────┼─────────┤
│ Card 5  │ Pos: 4  │  "na"
├─────────┼─────────┤
│ Card 6  │ Pos: 2  │  "nana"
└─────────┴─────────┘

The suffix array is: [5, 3, 1, 0, 4, 2]

Model 2: The Rotations View

Alternatively, think of the string as circular, with each rotation representing a different "view". If you append a sentinel character, each suffix corresponds to a unique rotation:

banana$  → "banana$"
anana$   → "anana$b"  (but we only care about "anana$")
nana$    → "nana$ba"  (but we only care about "nana$")
...

This view connects suffix arrays to the Burrows-Wheeler Transform (BWT) used in data compression.

Model 3: The Index Pointer View

No Suffix Storage Needed

Comparison with Other Approaches

To appreciate suffix arrays, let's compare them to alternative approaches for pattern matching:

Naive String Search:

Time: O(n × m) per query (slide pattern across text)
No preprocessing
Simple but slow for repeated queries

KMP / Z-Algorithm:

Time: O(n + m) per query
Preprocessing: O(m) per pattern
Excellent for single pattern matching, but must reprocess for each new pattern

Suffix Array:

Preprocessing: O(n log n) or O(n) to build
Query: O(m log n) per pattern
Space: O(n)
Build once, query many patterns efficiently

Pattern Matching Approach Comparison
Approach	Build Time	Query Time	Space	Best For
Naive	O(1)	O(n × m)	O(1)	One-off search, small strings
KMP	O(m) per pattern	O(n + m)	O(m)	Single pattern, one-time search
Rabin-Karp	O(m) per pattern	O(n + m) avg	O(m)	Multiple pattern matching
Suffix Array	O(n log n) or O(n)	O(m log n + k)	O(n)	Many queries on same text
Suffix Tree	O(n)	O(m + k)	O(n) but large constant	Speed-critical applications

When Suffix Arrays Shine:

Text indexing for databases: Build suffix array once, answer unlimited pattern queries
Bioinformatics: Searching DNA/protein sequences for motifs, with billions of queries against the same genome
Plagiarism detection: Finding common substrings between documents
Data compression: The Burrows-Wheeler Transform, fundamental to bzip2 and bioinformatics tools, is intimately related to suffix arrays
Substring statistics: Finding longest repeated substrings, counting distinct substrings, etc.

The Index Analogy

The Sentinel Character in Detail

Many suffix array implementations append a sentinel character to the string. This character (often denoted '$' or '\0') is defined to be:

Lexicographically smaller than all other characters in the alphabet
Unique—it appears only once, at the end

Why use a sentinel?

Problem Without Sentinel:

Consider the string "aa". Its suffixes are:

Position 0: "aa"
Position 1: "a"

When comparing "a" and "aa", we need a rule for when one string is a prefix of another. Typically, the shorter string is considered smaller ("a" < "aa").

With Sentinel:

The string becomes "aa$". Its suffixes are:

Position 0: "aa$"
Position 1: "a$"
Position 2: "$"

Now no suffix is a prefix of another—they all end at '$'. Comparisons are unambiguous:

"$" < "a$" < "aa$"

Sorted: SA = [2, 1, 0]

Practical Benefits:

Simplified comparisons: No special-case logic for prefix relationships
Unique smallest suffix: The sentinel-only suffix (at position n) is always the smallest
Algorithmic convenience: Many suffix array construction algorithms assume or require a sentinel
Boundary handling: Prevents out-of-bounds access when comparing suffix characters

When to Include the Sentinel:

Include it when using algorithms that expect it (DC3, SA-IS, etc.)
Include it when building suffix trees (sentinel ensures each suffix leads to a unique leaf)
Omit it for space-critical applications where you handle prefix comparisons manually

Sentinel Considerations

Summary: The Foundation of Suffix Arrays

We've established the conceptual foundation of suffix arrays. Let's consolidate the key insights:

Key Takeaways

•A suffix is any substring extending from some position to the end of the string. A string of length n has exactly n suffixes.
•The suffix array is the array of starting positions of suffixes, sorted by the suffixes' lexicographic order. It's a permutation of [0, n-1].
•Pattern matching becomes range finding: Because suffixes with common prefixes are contiguous in the sorted order, finding pattern occurrences reduces to binary search.
•Space efficiency: The suffix array stores only n integers (O(n) space), not the O(n²) characters of all suffixes explicitly.
•The inverse suffix array (ISA) maps positions to ranks, enabling O(1) suffix comparison by rank.
•A sentinel character simplifies implementation by ensuring no suffix is a prefix of another.

What's Next:

Page Complete

1 / 5