LCS Revisited - Learning Module

Loading content...

0/276

Reconstructing the LCS

From Length to Sequence

Knowing that the Longest Common Subsequence has length 4 is useful, but in many real applications, we need the actual subsequence itself. Consider these scenarios:

File diff tools: You don't just want to know "these files differ in 47 lines"—you need to show exactly which lines were added, removed, or unchanged.
DNA sequence alignment: Biologists need to see the actual aligned portions of genomes, not just similarity scores.
Plagiarism detection: You need to highlight the exact matching passages, not just report a percentage.

The DP table we constructed contains all the information needed to recover the LCS itself. This page teaches you how to backtrack through the table to extract the sequence—and how to find all possible LCS sequences when multiple exist.

What You Will Master

By the end of this page, you'll understand the backtracking algorithm for LCS reconstruction, be able to prove its correctness, implement both single-LCS and all-LCS extraction, and understand the relationship between table structure and reconstruction paths.

The Backtracking Intuition

The DP table tells us not just the answer, but how we arrived at the answer. Each cell's value was computed from one of its neighbors, and this relationship encodes the structure of the LCS.

Key Insight: When filling the table, we made decisions at each cell:

If characters matched: we added 1 to the diagonal value
If characters didn't match: we took the max of top and left

To reconstruct the LCS, we reverse these decisions. Starting from the final cell dp[m][n], we trace back to the origin dp[0][0], and every time we encounter a character match, we record that character.

The Backtracking Rules:

Start at dp[m][n]
If X[i] == Y[j]: This character is part of the LCS. Record it and move diagonally to dp[i-1][j-1]
If X[i] ≠ Y[j]: Move to the larger of dp[i-1][j] (up) or dp[i][j-1] (left)
If dp[i-1][j] == dp[i][j-1]: Either direction is valid (this creates branches for multiple LCS)
Stop when i == 0 or j == 0

Backtracking WalkthroughUsing the table from X = "ABCD" and Y = "AEBD" (LCS = "ABD")

Input

Output

Reverse Order

Characters are collected in reverse order during backtracking (last to first). Either collect into a stack/list and reverse at the end, or build the string by prepending characters.

Single LCS Reconstruction Algorithm

Here's the complete implementation for extracting one LCS from the DP table. When multiple LCS exist (i.e., when dp[i-1][j] == dp[i][j-1] during a mismatch), this algorithm arbitrarily chooses to go up first.

lcs_reconstruction.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
def lcs_with_reconstruction(X: str, Y: str) -> tuple[int, str]:
    """
    Compute LCS length and reconstruct one LCS sequence.
    
    Returns:
        Tuple of (LCS length, one LCS string)
    
    Time Complexity: O(m × n) for table + O(m + n) for reconstruction
    Space Complexity: O(m × n) for the DP table
    """
    m, n = len(X), len(Y)
    
    # Build the DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to reconstruct the LCS
    lcs_chars = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            # Characters match: this char is part of LCS
            lcs_chars.append(X[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            # Go up (arbitrarily prefer up when equal)
            i -= 1
        else:
            # Go left
            j -= 1
    
    # Reverse to get correct order
    lcs_chars.reverse()
    lcs = ''.join(lcs_chars)
    
    return dp[m][n], lcs
 
 
def lcs_with_positions(X: str, Y: str) -> tuple[int, str, list[tuple[int, int]]]:
    """
    Compute LCS and return positions of matched characters in both strings.
    
    Returns:
        Tuple of (LCS length, LCS string, list of (X_idx, Y_idx) pairs)
    """
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack with position tracking
    lcs_chars = []
    positions = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            lcs_chars.append(X[i - 1])
            positions.append((i - 1, j - 1))  # 0-indexed positions
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    
    lcs_chars.reverse()
    positions.reverse()
    
    return dp[m][n], ''.join(lcs_chars), positions
 
 
# Example usage
X = "ABCBDAB"
Y = "BDCABA"
 
length, lcs = lcs_with_reconstruction(X, Y)
print(f"X = '{X}'")
print(f"Y = '{Y}'")
print(f"LCS length: {length}")
print(f"One LCS: '{lcs}'")
 
# With positions
length, lcs, positions = lcs_with_positions(X, Y)
print(f"\nLCS with positions:")
print(f"LCS: '{lcs}'")
print(f"Positions (X_idx, Y_idx): {positions}")
 
# Visualize matches
print(f"\nVisualization:")
print(f"X: ", end="")
for i, c in enumerate(X):
    if any(p[0] == i for p in positions):
        print(f"[{c}]", end="")
    else:
        print(f" {c} ", end="")
print()
print(f"Y: ", end="")
for j, c in enumerate(Y):
    if any(p[1] == j for p in positions):
        print(f"[{c}]", end="")
    else:
        print(f" {c} ", end="")
print()

Reconstruction Complexity

The backtracking step takes O(m + n) time because we move either up or left (or diagonally) at each step, and we can move at most m times up and n times left before reaching the origin. The total algorithm remains O(m × n) dominated by table construction.

Finding All LCS Sequences

When multiple distinct LCS sequences exist, we may want to find all of them. This occurs when dp[i-1][j] == dp[i][j-1] during a mismatch—we could go either direction and still achieve the optimal length.

The Challenge:

The number of distinct LCS sequences can be exponential in the length of the strings. For example, two strings of length n with no common characters have 2^n possible "empty" LCS paths (all going around the table). More practically, strings with repeated characters often have many valid LCS.

When to Find All LCS:

In most applications, finding one LCS is sufficient. However, you might need all LCS for:

Complete plagiarism analysis (showing all possible matching patterns)
Biological sequence analysis (understanding all evolutionary paths)
Academic/research purposes

all_lcs_reconstruction.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
def all_lcs(X: str, Y: str) -> list[str]:
    """
    Find ALL distinct Longest Common Subsequences.
    
    WARNING: The number of LCS can be exponential!
    Use with caution on large inputs.
    
    Time Complexity: O(m × n) for table + O(k × L) for k LCS of length L
    Space Complexity: O(m × n) + O(k × L) for storing results
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return [""]
    
    # Build the DP table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Memoized recursive backtracking
    memo = {}
    
    def backtrack(i: int, j: int) -> set[str]:
        """Return all LCS of X[0:i] and Y[0:j] as a set."""
        if i == 0 or j == 0:
            return {""}
        
        if (i, j) in memo:
            return memo[(i, j)]
        
        result = set()
        
        if X[i - 1] == Y[j - 1]:
            # Characters match: must include this character
            for lcs in backtrack(i - 1, j - 1):
                result.add(lcs + X[i - 1])
        else:
            # Characters don't match: explore valid directions
            if dp[i - 1][j] >= dp[i][j - 1]:
                result.update(backtrack(i - 1, j))
            if dp[i][j - 1] >= dp[i - 1][j]:
                result.update(backtrack(i, j - 1))
        
        memo[(i, j)] = result
        return result
    
    return list(backtrack(m, n))
 
 
def count_lcs(X: str, Y: str) -> int:
    """
    Count the number of distinct LCS sequences.
    
    This is more efficient than finding all LCS when you just
    need the count. Uses combinatorial DP.
    """
    m, n = len(X), len(Y)
    
    # Build LCS length table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Build count table
    # count[i][j] = number of distinct LCS of X[0:i] and Y[0:j]
    count = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Base cases: empty LCS has exactly 1 way
    for i in range(m + 1):
        count[i][0] = 1
    for j in range(n + 1):
        count[0][j] = 1
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                # Must take the diagonal path
                count[i][j] = count[i - 1][j - 1]
            else:
                # Sum of valid paths
                if dp[i - 1][j] > dp[i][j - 1]:
                    count[i][j] = count[i - 1][j]
                elif dp[i][j - 1] > dp[i - 1][j]:
                    count[i][j] = count[i][j - 1]
                else:
                    # Both paths are valid
                    count[i][j] = count[i - 1][j] + count[i][j - 1]
                    # Avoid double counting if both lead to same cell
                    if dp[i - 1][j - 1] == dp[i - 1][j]:
                        count[i][j] -= count[i - 1][j - 1]
    
    return count[m][n]
 
 
# Example usage
X = "ABCBDAB"
Y = "BDCABA"
 
print(f"X = '{X}'")
print(f"Y = '{Y}'")
print(f"LCS length: {len(all_lcs(X, Y)[0])}")
print(f"Number of distinct LCS: {count_lcs(X, Y)}")
print(f"All LCS sequences: {all_lcs(X, Y)}")
 
# Example with more LCS
X2 = "ABC"
Y2 = "ACB"
print(f"\nX = '{X2}', Y = '{Y2}'")
print(f"All LCS: {all_lcs(X2, Y2)}")  # Should give ["AB", "AC"]

Exponential Danger

The number of distinct LCS can grow exponentially. For example, X = "ABABABAB" and Y = "BABABABA" have many interleaved matches. Always consider whether you truly need all LCS, or if one representative suffices.

Correctness of Backtracking

Let's formally prove that the backtracking algorithm correctly reconstructs an LCS.

Theorem: The backtracking algorithm produces a common subsequence of X and Y with length dp[m][n].

Proof:

We prove two properties:

Property 1: The output is a common subsequence.

By construction, we only add a character c to our output when X[i] == Y[j] == c. At this point:

c appears at position i in X (1-indexed)
c appears at position j in Y (1-indexed)
We then move to (i-1, j-1), ensuring subsequent characters come from earlier positions

Since we process positions in decreasing order and only include matched characters, the output (when reversed) is a valid subsequence of both X and Y.

Property 2: The output has length dp[m][n].

We prove by induction that when backtracking from (i, j), we collect exactly dp[i][j] characters.

Base case: At (0, j) or (i, 0), we collect 0 characters, and dp[0][j] = dp[i][0] = 0. ✓

Inductive step: Assume the property holds for all cells with smaller indices. At cell (i, j):

Case 1: X[i] == Y[j]. We collect 1 character and move to (i-1, j-1). By induction, we collect dp[i-1][j-1] from there. Total: 1 + dp[i-1][j-1] = dp[i][j]. ✓
Case 2: X[i] ≠ Y[j]. We move to whichever of (i-1, j) or (i, j-1) has the larger value. By induction, that cell contributes its dp value. Since dp[i][j] = max(dp[i-1][j], dp[i][j-1]), we collect dp[i][j] characters. ✓

Thus, we collect dp[m][n] characters total, which equals the LCS length. ∎

Why This Matters

The proof establishes that backtracking isn't just a heuristic—it's guaranteed to find an optimal solution. This mathematical rigor is what distinguishes dynamic programming from greedy approaches, which might find good-but-not-optimal solutions.

Application: Diff Algorithm

One of the most important applications of LCS reconstruction is the diff algorithm used by version control systems like Git. The basic idea:

Find the LCS of the two files (treating lines as elements)
Lines in the LCS are unchanged
Lines in the old file but not in LCS are deleted
Lines in the new file but not in LCS are added

simple_diff.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
from typing import List, Tuple
from enum import Enum
 
class DiffType(Enum):
    UNCHANGED = "  "  # Line exists in both
    DELETED = "- "    # Line only in old
    ADDED = "+ "      # Line only in new
 
 
def compute_diff(old_lines: List[str], new_lines: List[str]) -> List[Tuple[DiffType, str]]:
    """
    Compute the diff between two files (represented as lists of lines).
    
    Returns a list of (DiffType, line_content) tuples showing the
    differences between the two files.
    """
    m, n = len(old_lines), len(new_lines)
    
    # Build LCS table for lines
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old_lines[i - 1] == new_lines[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to build diff
    result = []
    i, j = m, n
    
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old_lines[i - 1] == new_lines[j - 1]:
            # Line is unchanged (in LCS)
            result.append((DiffType.UNCHANGED, old_lines[i - 1]))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            # Line was added in new version
            result.append((DiffType.ADDED, new_lines[j - 1]))
            j -= 1
        else:
            # Line was deleted from old version
            result.append((DiffType.DELETED, old_lines[i - 1]))
            i -= 1
    
    # Reverse to get correct order
    result.reverse()
    return result
 
 
def print_diff(diff: List[Tuple[DiffType, str]]) -> None:
    """Pretty print a diff."""
    colors = {
        DiffType.UNCHANGED: "",
        DiffType.DELETED: "\033[91m",  # Red
        DiffType.ADDED: "\033[92m",    # Green
    }
    reset = "\033[0m"
    
    for diff_type, line in diff:
        print(f"{colors[diff_type]}{diff_type.value}{line}{reset}")
 
 
# Example usage
old_file = [
    "def hello():",
    "    print('Hello')",
    "    print('World')",
    "",
    "hello()",
]
 
new_file = [
    "def hello(name):",
    "    print('Hello')",
    "    print(f'Hello, {name}!')",
    "",
    "hello('Alice')",
]
 
print("Diff output:")
print("=" * 40)
diff = compute_diff(old_file, new_file)
print_diff(diff)

Expected Output:

- def hello():
+ def hello(name):
    print('Hello')
- print('World')
+ print(f'Hello, {name}!')
  
- hello()
+ hello('Alice')

This shows:

Function signature changed (deleted old, added new)
First print statement unchanged
Second print statement changed
Empty line unchanged
Function call changed

Production Diff Algorithms

Real diff tools (like GNU diff, Git diff) use optimizations beyond basic LCS: Myers' algorithm (O((M+N)D) where D is the edit distance), patience diff (focuses on unique lines), and histogram diff (groups similar lines). These are more efficient for typical file comparison patterns.

Edit Script Generation

Beyond showing differences, LCS reconstruction can generate edit scripts—sequences of operations that transform one string into another. This is closely related to the concept of edit distance (Levenshtein distance).

Operations:

Keep (or Copy): A character from the LCS—it appears in both strings
Delete: A character in the source but not in the LCS
Insert: A character in the target but not in the LCS

Note: The LCS-based edit script doesn't use substitutions. A character change appears as a delete + insert.

edit_script.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
from typing import List, Tuple
from enum import Enum
 
class EditOp(Enum):
    KEEP = "KEEP"
    DELETE = "DELETE"
    INSERT = "INSERT"
 
 
def lcs_edit_script(source: str, target: str) -> List[Tuple[EditOp, str]]:
    """
    Generate an edit script to transform 'source' into 'target' using LCS.
    
    The edit script is a sequence of operations:
    - KEEP(c): Character c is in both strings (from LCS)
    - DELETE(c): Delete character c from source
    - INSERT(c): Insert character c (from target)
    
    Applying these operations in order transforms source into target.
    """
    m, n = len(source), len(target)
    
    # Build LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Backtrack to generate edit script
    script = []
    i, j = m, n
    
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1]:
            # Characters match (LCS character)
            script.append((EditOp.KEEP, source[i - 1]))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            # Insert character from target
            script.append((EditOp.INSERT, target[j - 1]))
            j -= 1
        else:
            # Delete character from source
            script.append((EditOp.DELETE, source[i - 1]))
            i -= 1
    
    script.reverse()
    return script
 
 
def apply_edit_script(source: str, script: List[Tuple[EditOp, str]]) -> str:
    """
    Apply an edit script to source string, returning the result.
    
    This verifies the script is correct.
    """
    result = []
    source_idx = 0
    
    for op, char in script:
        if op == EditOp.KEEP:
            assert source[source_idx] == char, f"Keep mismatch at {source_idx}"
            result.append(char)
            source_idx += 1
        elif op == EditOp.DELETE:
            assert source[source_idx] == char, f"Delete mismatch at {source_idx}"
            source_idx += 1
            # Don't add to result
        elif op == EditOp.INSERT:
            result.append(char)
            # Don't advance source_idx
    
    return ''.join(result)
 
 
# Example
source = "ALGORITHM"
target = "ALTRUISTIC"
 
script = lcs_edit_script(source, target)
 
print(f"Source: '{source}'")
print(f"Target: '{target}'")
print(f"\nEdit Script:")
for op, char in script:
    print(f"  {op.value:6s} '{char}'")
 
# Verify
result = apply_edit_script(source, script)
print(f"\nApplied script: '{result}'")
print(f"Matches target: {result == target}")
 
# Statistics
keeps = sum(1 for op, _ in script if op == EditOp.KEEP)
deletes = sum(1 for op, _ in script if op == EditOp.DELETE)
inserts = sum(1 for op, _ in script if op == EditOp.INSERT)
print(f"\nStatistics:")
print(f"  LCS length (KEEP operations): {keeps}")
print(f"  Deletions: {deletes}")
print(f"  Insertions: {inserts}")
print(f"  Edit distance (DELETE + INSERT): {deletes + inserts}")

LCS and Edit Distance

There's a beautiful relationship: Edit Distance = m + n - 2 × LCS_length. The LCS represents the characters we can 'keep', so we must delete (m - LCS) characters from the source and insert (n - LCS) characters to reach the target. This connection underlies many sequence alignment algorithms.

Summary: LCS Reconstruction Mastered

We've thoroughly covered the techniques for extracting the actual LCS sequence(s) from the DP table.

Key Takeaways

•Backtracking reverses the DP decisions — We follow the path that led to each cell's value.
•Matches imply diagonal moves — When X[i] == Y[j], we move diagonally and record the character.
•Ties create branches — When dp[i-1][j] == dp[i][j-1], multiple LCS exist.
•Finding all LCS can be exponential — Use memoization but be cautious of output size.
•Diff algorithms are LCS applications — The LCS represents unchanged lines.
•Edit scripts transform strings — KEEP (from LCS), DELETE, INSERT operations.

What's Next:

The final page of this module explores variations of the LCS problem, particularly the closely related Longest Common Substring problem. We'll see how a small change in the problem definition leads to a significantly different DP formulation and solution.

Page Complete

You can now reconstruct LCS sequences from DP tables, find all possible LCS when needed, and apply these techniques to practical problems like diff generation and edit scripting. The connection between LCS and edit distance provides powerful insights for sequence comparison tasks.