Imagine having a data structure that, once built, allows you to:
- Test whether any pattern of length m occurs in the text in O(m) time
- Count and report every occurrence of that pattern
- Answer many other substring questions in optimal time
This isn't hypothetical—this is the suffix tree, one of the most elegant and powerful data structures in the history of computer science.
The suffix tree is, in essence, a compressed trie containing all suffixes of a string. While conceptually more complex than suffix arrays and requiring more memory, suffix trees offer theoretical optimality for many string operations. In this page, we'll understand what suffix trees are, how they work, and why they remain relevant even when suffix arrays often suffice in practice.
By the end of this page, you will understand the structure of suffix trees (compressed tries of suffixes), how suffix trees enable O(m) pattern matching, the key ideas behind Ukkonen's O(n) construction algorithm, the relationship between suffix trees and suffix arrays, and when to choose suffix trees over suffix arrays.
To understand suffix trees, we must first understand the simpler suffix trie—an uncompressed version.
Recall: The Trie Data Structure
A trie (pronounced "try") is a tree-based structure for storing strings where:
- Each edge is labeled with a single character
- Each root-to-node path spells a prefix of some stored string
- Strings that share a prefix share the corresponding path
The Suffix Trie:
A suffix trie for string S is simply a trie containing all suffixes of S. For "banana$":
Visualizing the Suffix Trie:
(root)
├── $
├── a
│   ├── $
│   └── n
│       └── a
│           ├── $
│           └── n
│               └── a
│                   └── $
├── b
│   └── a
│       └── n
│           └── a
│               └── n
│                   └── a
│                       └── $
└── n
    └── a
        ├── $
        └── n
            └── a
                └── $
(Each edge carries a single character; note the long chains of single-child nodes.)
Problem: Space Explosion
The suffix trie has serious issues: storing all n suffixes character by character can require Θ(n²) nodes, since the suffixes have total length n(n+1)/2.
For a string of length 1 million, a suffix trie could have up to 10^12 nodes—requiring terabytes of memory. We need a more compact representation.
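To make the blow-up concrete, here is the back-of-envelope arithmetic behind that claim (a worst-case bound; real tries share prefixes and are smaller):

```python
# Worst case for an uncompressed suffix trie: roughly one node per
# character of every suffix, i.e. n + (n-1) + ... + 1 = n(n+1)/2 nodes.
n = 1_000_000
worst_case_nodes = n * (n + 1) // 2
print(worst_case_nodes)  # 500000500000 -- hundreds of billions of nodes
```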
The suffix tree solves the space problem through path compression:
Key Insight:
In a suffix trie, many nodes have only one child—they don't represent "decision points." These chains can be collapsed into single edges labeled with multiple characters.
Definition:
A suffix tree for string S is a compressed trie of all suffixes of S where:
- Each edge is labeled with a nonempty substring of S
- No two edges leaving a node begin with the same character
- Every internal node (other than the root) has at least two children
- Each suffix corresponds to exactly one root-to-leaf path (guaranteed by the $ sentinel)
Suffix Tree for "banana$":
(root)
├── $          [leaf: suffix 6]
├── a
│   ├── $        [leaf: suffix 5]
│   └── na
│       ├── $      [leaf: suffix 3]
│       └── na$    [leaf: suffix 1]
├── banana$    [leaf: suffix 0]
└── na
    ├── $      [leaf: suffix 4]
    └── na$    [leaf: suffix 2]
The single-child chains of the trie have collapsed into edges labeled with whole substrings.
Key Properties:
- At most n leaves: one per suffix (exactly n with the $ sentinel)
- Every internal node has at least two children, so there are at most n - 1 internal nodes
- Therefore at most 2n - 1 nodes and 2n - 2 edges in total
The suffix tree is O(n) space!
A crucial implementation detail: edge labels aren't stored as strings but as pairs (start, end) pointing into the original text. An edge labeled "anana$" is stored as (1, 6) meaning text[1:7]. This ensures O(n) total space for labels, not O(n²).
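In code, this convention looks like the following (inclusive end index, matching the (1, 6) example above):

```python
text = "banana$"

# The edge "anana$" is not stored as a string but as the index pair (1, 6),
# meaning the inclusive range text[1..6] -- text[1:7] in Python slicing.
start, end = 1, 6
label = text[start:end + 1]
print(label)  # anana$
```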
Let's examine the suffix tree structure more precisely using "banana$" as our example:
All suffixes:
0: banana$
1: anana$
2: nana$
3: ana$
4: na$
5: a$
6: $
Building Intuition:
Group suffixes by their first character:
- '$': suffix 6
- 'a': suffixes 1, 3, 5
- 'b': suffix 0
- 'n': suffixes 2, 4
This gives us 4 children of the root. Within each group, we recursively find common prefixes and split.
The 'a' subtree: suffixes 1, 3, and 5 all begin with "a". After that shared "a", suffix 5 ends immediately with "$", while suffixes 1 and 3 both continue with "na", which in turn splits into "$" (suffix 3) and "na$" (suffix 1).
Formal Suffix Tree for "banana$":
Root
├── $ → [leaf: suffix 6]
├── a → (internal node A)
│ ├── $ → [leaf: suffix 5] (path: "a$")
│ └── na → (internal node B)
│ ├── $ → [leaf: suffix 3] (path: "ana$")
│ └── na$ → [leaf: suffix 1] (path: "anana$")
├── banana$ → [leaf: suffix 0]
└── na → (internal node C)
├── $ → [leaf: suffix 4] (path: "na$")
└── na$ → [leaf: suffix 2] (path: "nana$")
Node counts: 7 leaves (one per suffix) and 4 internal nodes (the root plus A, B, and C), for 11 nodes total, within the 2n - 1 bound.
Observations:
- Every internal node has at least two children
- Each internal node marks a repeated substring: "a", "ana", and "na" all occur more than once in "banana$"
- Read left to right, the leaves appear in lexicographic order of their suffixes: 6, 5, 3, 1, 0, 4, 2
Suffix trees often include 'suffix links' connecting internal nodes. If a node represents string ax (where a is a single character and x is a string), its suffix link points to the node representing x. These links are crucial for efficient construction and some advanced algorithms. They exploit the relationship between suffix(i) and suffix(i+1).
The suffix tree's greatest strength is O(m) pattern matching, where m is the pattern length—completely independent of the text length n!
Algorithm:
1. Start at the root
2. For each character in the pattern:
a. Find the edge starting with that character
b. Match characters along the edge
c. If mismatch: pattern not found
d. If edge exhausted before pattern: move to child node
3. If pattern exhausted: pattern found!
4. Count leaves in the subtree = number of occurrences
5. Report leaf labels = occurrence positions
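To make the algorithm concrete, here is a small Python sketch: it builds the tree naively in O(n²) by inserting one suffix at a time (not Ukkonen's algorithm), then runs the O(m) search described above. The class and function names are illustrative, not from any particular library.

```python
class Node:
    """Suffix tree node; leaves record which suffix they represent."""
    def __init__(self):
        self.children = {}        # first char -> (start, end, child); end exclusive
        self.suffix_index = None  # set only on leaves

def build_suffix_tree(text):
    """Naive O(n^2) construction: insert each suffix, splitting edges on mismatch.
    Assumes text ends with a unique sentinel such as '$'."""
    root, n = Node(), len(text)
    for i in range(n):
        node, j = root, i
        while j < n:
            c = text[j]
            if c not in node.children:        # no edge starts with c: new leaf
                leaf = Node()
                leaf.suffix_index = i
                node.children[c] = (j, n, leaf)
                break
            start, end, child = node.children[c]
            k = start
            while k < end and j < n and text[k] == text[j]:
                k += 1                        # walk down the edge
                j += 1
            if k == end:                      # edge fully matched: descend
                node = child
            else:                             # mismatch mid-edge: split it
                mid = Node()
                node.children[c] = (start, k, mid)
                mid.children[text[k]] = (k, end, child)
                leaf = Node()
                leaf.suffix_index = i
                mid.children[text[j]] = (j, n, leaf)
                break
    return root

def find_occurrences(text, root, pattern):
    """O(m) descent; the leaves below the stopping point are the occurrences."""
    node, j, m = root, 0, len(pattern)
    while j < m:
        c = pattern[j]
        if c not in node.children:
            return []
        start, end, child = node.children[c]
        k = start
        while k < end and j < m:
            if text[k] != pattern[j]:
                return []
            k += 1
            j += 1
        node = child   # correct even if the pattern ended mid-edge
    positions, stack = [], [node]
    while stack:                              # collect leaf labels in the subtree
        nd = stack.pop()
        if nd.suffix_index is not None:
            positions.append(nd.suffix_index)
        stack.extend(ch for _, _, ch in nd.children.values())
    return sorted(positions)

tree = build_suffix_tree("banana$")
print(find_occurrences("banana$", tree, "ana"))   # [1, 3]
print(find_occurrences("banana$", tree, "nana"))  # [2]
```

Note how the search never touches more than m text characters: the dictionary lookup picks the edge, and each comparison consumes one pattern character.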
Why O(m)? Every step either follows a child edge or matches one pattern character, and we never back up. After at most m character comparisons (plus an O(1) child lookup per node), we have either found the pattern or proven it absent. The text length n never enters the bound.
Important Subtlety:
When we exhaust the pattern, we might be:
1. Exactly at a node, or
2. Partway along an edge
For case 2, the pattern still occurs as a substring (it is a proper prefix of the edge's label), and the occurrence count is the number of leaves in the subtree hanging below that edge.
Compare with Suffix Array:
| Operation | Suffix Array | Suffix Tree |
|---|---|---|
| Pattern search | O(m log n) | O(m) |
| With LCP + RMQ | O(m + log n) | O(m) |
| Report k occurrences | +O(k) | +O(k) |
| Space | O(n) | O(n) but higher constant |
Suffix trees win on query time but lose on space constant factors.
The suffix tree's O(m) query is strictly better than suffix array's O(m log n)—the dependence on n vanishes entirely! For massive texts (n = billions) with short patterns, this difference is profound. However, the higher space usage often makes suffix arrays preferable in practice.
Building a suffix tree naively (inserting each suffix) takes O(n²) time. Ukkonen's algorithm (1995) constructs the suffix tree in O(n) time—a remarkable achievement.
Key Insight: Online Construction
Ukkonen's algorithm is online: it builds the suffix tree for S[0..i] incrementally from the tree for S[0..i-1]. After reading each character, the tree is complete for the string seen so far.
Three Key Tricks:
Trick 1: Implicit Extensions
When adding character S[i], many suffixes already end at leaves. These leaves represent suffixes ending at the "current end." Instead of updating each leaf, we use a global "end" variable that all leaves reference. Adding a character simply increments this variable—O(1) to extend all leaves!
Trick 2: The Active Point
The algorithm maintains an "active point" (active_node, active_edge, active_length) representing where we are in the tree. This allows us to navigate efficiently without re-traversing from the root.
Trick 3: Suffix Links
Suffix links connect internal nodes, allowing fast transitions from suffix(i)'s location to suffix(i+1)'s location. Following a suffix link is O(1).
Algorithm Sketch:
For each character S[i] (phase i):
Extend the tree to include all suffixes ending at position i
For suffix S[j..i] (from longest to shortest):
We're at the active point in the tree
Rule 1: If active point is at a leaf
→ Just update global end (implicit, O(1))
Rule 2: If active point is where we need to create a new leaf
→ Create internal node (if needed) and new leaf
→ Update suffix links
Rule 3: If the next character already exists
→ Pattern continues, update active point
→ STOP (remaining suffixes are implicit)
Move active point using suffix links
Why O(n)?
The genius is in Rule 3: once we find that character S[i] already exists in the tree, we STOP. All shorter suffixes are guaranteed to also exist (they're prefixes of longer suffixes that exist). Combined with suffix links and implicit leaf extensions, each character requires only O(1) amortized work.
Ukkonen's algorithm is notoriously tricky to implement correctly. The interplay of active points, suffix links, and edge cases requires careful handling. Many experienced programmers use existing implementations rather than writing from scratch. However, understanding the conceptual ideas helps in reasoning about suffix tree capabilities.
Suffix trees and suffix arrays are deeply connected—in fact, they're different views of the same underlying structure.
From Suffix Tree to Suffix Array:
Perform a depth-first traversal of the suffix tree, visiting children in lexicographic order of their edge labels. The leaves, encountered in this order, give the suffix array!
This works because:
- The children of every node are visited in lexicographic order of their edge labels, so the DFS enumerates suffixes in sorted order
- Each leaf stores the starting position of its suffix
The leaf labels, read in DFS order, are therefore exactly the suffix array.
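As a sanity check, here is the "banana$" tree hand-coded as nested dicts (integers are leaves holding suffix numbers; this toy encoding is ours, not a standard representation), with a lexicographic DFS over it:

```python
# Hand-written suffix tree of "banana$": inner dicts are internal nodes,
# integers are leaves carrying the suffix's starting position.
tree = {
    "$": 6,
    "a": {"$": 5, "na": {"$": 3, "na$": 1}},
    "banana$": 0,
    "na": {"$": 4, "na$": 2},
}

def dfs_leaves(node):
    if isinstance(node, int):       # leaf
        return [node]
    order = []
    for label in sorted(node):      # children in lexicographic edge order
        order += dfs_leaves(node[label])
    return order

print(dfs_leaves(tree))  # [6, 5, 3, 1, 0, 4, 2] == the suffix array
```

Since no two edges out of a node share a first character, sorting the full labels is the same as sorting by first character, which is all the DFS needs.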
From Suffix Array + LCP to Suffix Tree:
Conversely, given SA and LCP arrays, we can construct the suffix tree: insert the suffixes in SA order, and use each LCP value to decide how far up the rightmost path to climb before splitting an edge and attaching the new leaf. Since every node is created once and climbed past at most once, the whole construction takes O(n) time.
This means: suffix trees and suffix arrays are interconvertible in linear time. SA + LCP encodes exactly the same branching structure that the tree makes explicit.
Visual Understanding:
Suffix Tree DFS traversal (lexicographic order):
Root → $ → [leaf 6] ✓
→ a → $ → [leaf 5] ✓
→ na → $ → [leaf 3] ✓
→ na$ → [leaf 1] ✓
→ banana$ → [leaf 0] ✓
→ na → $ → [leaf 4] ✓
→ na$ → [leaf 2] ✓
Leaf order encountered: 6, 5, 3, 1, 0, 4, 2
Suffix Array: [6, 5, 3, 1, 0, 4, 2] (with sentinel at position 6)
or [5, 3, 1, 0, 4, 2] without sentinel ✓
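You can confirm this leaf order by sorting the suffixes directly, a one-liner that is quadratic in the worst case but fine for a toy string:

```python
text = "banana$"
# Sorting suffix start positions by the suffixes themselves reproduces
# the leaf order of a lexicographic DFS over the suffix tree.
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
print(suffix_array)  # [6, 5, 3, 1, 0, 4, 2]
```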
This correspondence means: any algorithm stated on the suffix tree can, in principle, be simulated on SA + LCP arrays. That observation is the basis of the "enhanced suffix array."
In current practice, suffix arrays have largely replaced suffix trees for most applications. SA + LCP + RMQ provides the same query power with ~4x less memory. However, suffix trees remain valuable for education (they're more intuitive), certain algorithms that naturally use tree structure, and applications where O(m) vs O(m + log n) matters.
While suffix arrays cover most applications, suffix trees are particularly elegant for certain problems:
1. Generalized Suffix Trees
A generalized suffix tree contains suffixes of multiple strings, enabling:
- Longest common substring of two strings in linear time
- Testing whether a pattern occurs in any document of a collection
- Finding substrings common to at least k of m strings
2. Lowest Common Ancestor (LCA) Queries
The LCA of two leaves in a suffix tree corresponds to the longest common prefix of the two suffixes. With O(n) preprocessing, LCA queries answer in O(1), giving:
- The LCP of any two suffixes in O(1)
- Constant-time "longest common extension" queries, a workhorse of approximate matching
3. Maximal Repeats and Supermaximal Repeats
A substring α is a maximal repeat if:
- α occurs at least twice in the text, and
- extending α by one character to the left or to the right destroys at least one of those occurrences
Suffix trees make finding maximal repeats elegant: internal nodes with leaves from diverse contexts (different left characters) identify maximal repeats.
4. Tandem Repeats
Finding tandem repeats (patterns like "abcabc" where a substring immediately repeats) has elegant O(n log n) solutions using suffix trees that are harder to express with arrays alone.
5. Approximate Pattern Matching
Algorithms for finding occurrences with up to k errors often use suffix tree traversal, though suffix array variants exist.
6. Data Compression (Burrows-Wheeler Transform)
The BWT, fundamental to bzip2 and bioinformatics tools, is directly related to suffix tree structure. While computable via suffix arrays, the tree perspective illuminates the theory.
Problems involving hierarchical substring relationships (common ancestors, extension properties, branching structure) are often more naturally expressed using suffix tree structure. Even when solved via suffix arrays in practice, thinking in terms of the tree aids algorithm design.
While both suffix trees and suffix arrays are O(n) space, the constants differ dramatically—and constants matter in practice.
Suffix Array Space: one integer per suffix, typically 4 bytes each, so about 4n bytes.
Suffix Tree Space:
For each node, we need:
- Access to its children (a pointer array, hash map, or linked list)
- The edge's (start, end) index pair
- A suffix link (needed during construction and by some algorithms)
- Often a parent pointer or, at leaves, the suffix label
With up to 2n nodes, each needing 20-40 bytes, that is roughly 40n-80n bytes overall.
Suffix trees use roughly 5-10x more memory than suffix-array-based structures!
Concrete numbers for a text of n = 10⁹ characters (assuming 4-byte integers):
| Structure | Theoretical | Typical Implementation |
|---|---|---|
| Suffix Array (SA only) | 4 GB | 4-5 GB |
| SA + LCP | 8 GB | 8-10 GB |
| SA + LCP + RMQ | ~12 GB | 12-15 GB |
| Suffix Tree | ~40 GB | 40-80 GB |
| Enhanced Suffix Array (ESA) | ~20 GB | 20-25 GB |
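The table's orders of magnitude follow from simple arithmetic; the per-node byte counts below are rough assumptions, not measurements:

```python
n = 1_000_000_000  # a 1-billion-character text

sa_bytes = 4 * n       # suffix array: one 4-byte index per suffix
st_low = 2 * n * 20    # suffix tree: up to 2n nodes at ~20 bytes each
st_high = 2 * n * 40   # ... or ~40 bytes with child maps and suffix links

print(sa_bytes // 10**9, st_low // 10**9, st_high // 10**9)  # 4 40 80 (GB)
```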
Compressed Representations:
For very large texts (whole genomes, web crawls), even suffix arrays are too large. Compressed suffix arrays and FM-indices offer:
- Space close to the compressed size of the text itself, often well under n bytes
- The same count and locate queries, at a modest constant-factor slowdown
These compressed structures power tools like BWA and Bowtie in bioinformatics.
For large-scale applications (genomes, web search), memory often determines feasibility. The ~5x space factor between suffix trees and suffix arrays can mean the difference between 'fits in RAM' and 'requires distributed computing'. This is why suffix arrays have largely replaced suffix trees in practice.
We've explored the suffix tree—one of the most powerful data structures in string algorithmics. Here are the key takeaways:
- A suffix tree is a compressed trie of all suffixes; storing edge labels as index pairs keeps it in O(n) space
- It answers pattern queries in O(m), independent of the text length
- Ukkonen's algorithm builds it online in O(n) using implicit extensions, the active point, and suffix links
- Suffix trees and suffix arrays (plus LCP) are linear-time interconvertible views of the same structure
- Its larger space constants mean suffix arrays are usually preferred in practice
What's Next:
With both suffix arrays and suffix trees understood, the final page of this module addresses the practical question: when should you use these structures? We'll develop a decision framework for choosing between naive approaches, suffix arrays, suffix trees, and other string algorithms based on your specific requirements.
You now understand suffix trees conceptually—their structure, construction, and relationship to suffix arrays. While implementation is complex, the conceptual understanding enables you to recognize when suffix tree thinking applies and to use libraries confidently. Next, we'll synthesize everything into a practical decision framework.