Loading content...
Knowing that the Longest Common Subsequence has length 4 is useful, but in many real applications, we need the actual subsequence itself. Consider these scenarios:
The DP table we constructed contains all the information needed to recover the LCS itself. This page teaches you how to backtrack through the table to extract the sequence—and how to find all possible LCS sequences when multiple exist.
By the end of this page, you'll understand the backtracking algorithm for LCS reconstruction, be able to prove its correctness, implement both single-LCS and all-LCS extraction, and understand the relationship between table structure and reconstruction paths.
The DP table tells us not just the answer, but how we arrived at the answer. Each cell's value was computed from one of its neighbors, and this relationship encodes the structure of the LCS.
Key Insight: When filling the table, we made decisions at each cell:
To reconstruct the LCS, we reverse these decisions. Starting from the final cell dp[m][n], we trace back to the origin dp[0][0], and every time we encounter a character match, we record that character.
The Backtracking Rules:
Characters are collected in reverse order during backtracking (last to first). Either collect into a stack/list and reverse at the end, or build the string by prepending characters.
Here's the complete implementation for extracting one LCS from the DP table. When multiple LCS exist (i.e., when dp[i-1][j] == dp[i][j-1] during a mismatch), this algorithm arbitrarily chooses to go up first.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117
def lcs_with_reconstruction(X: str, Y: str) -> tuple[int, str]: """ Compute LCS length and reconstruct one LCS sequence. Returns: Tuple of (LCS length, one LCS string) Time Complexity: O(m × n) for table + O(m + n) for reconstruction Space Complexity: O(m × n) for the DP table """ m, n = len(X), len(Y) # Build the DP table dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Backtrack to reconstruct the LCS lcs_chars = [] i, j = m, n while i > 0 and j > 0: if X[i - 1] == Y[j - 1]: # Characters match: this char is part of LCS lcs_chars.append(X[i - 1]) i -= 1 j -= 1 elif dp[i - 1][j] >= dp[i][j - 1]: # Go up (arbitrarily prefer up when equal) i -= 1 else: # Go left j -= 1 # Reverse to get correct order lcs_chars.reverse() lcs = ''.join(lcs_chars) return dp[m][n], lcs def lcs_with_positions(X: str, Y: str) -> tuple[int, str, list[tuple[int, int]]]: """ Compute LCS and return positions of matched characters in both strings. Returns: Tuple of (LCS length, LCS string, list of (X_idx, Y_idx) pairs) """ m, n = len(X), len(Y) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Backtrack with position tracking lcs_chars = [] positions = [] i, j = m, n while i > 0 and j > 0: if X[i - 1] == Y[j - 1]: lcs_chars.append(X[i - 1]) positions.append((i - 1, j - 1)) # 0-indexed positions i -= 1 j -= 1 elif dp[i - 1][j] >= dp[i][j - 1]: i -= 1 else: j -= 1 lcs_chars.reverse() positions.reverse() return dp[m][n], ''.join(lcs_chars), positions # Example usageX = "ABCBDAB"Y = "BDCABA" length, lcs = lcs_with_reconstruction(X, Y)print(f"X = '{X}'")print(f"Y = '{Y}'")print(f"LCS length: {length}")print(f"One LCS: '{lcs}'") # With positionslength, lcs, positions = lcs_with_positions(X, Y)print(f"\nLCS with positions:")print(f"LCS: '{lcs}'")print(f"Positions (X_idx, Y_idx): {positions}") # Visualize matchesprint(f"\nVisualization:")print(f"X: ", end="")for i, c in enumerate(X): if any(p[0] == i for p in positions): print(f"[{c}]", end="") else: print(f" {c} ", end="")print()print(f"Y: ", end="")for j, c in enumerate(Y): if any(p[1] == j for p in positions): print(f"[{c}]", end="") else: print(f" {c} ", end="")print()The backtracking step takes O(m + n) time because we move either up or left (or diagonally) at each step, and we can move at most m times up and n times left before reaching the origin. The total algorithm remains O(m × n) dominated by table construction.
When multiple distinct LCS sequences exist, we may want to find all of them. This occurs when dp[i-1][j] == dp[i][j-1] during a mismatch—we could go either direction and still achieve the optimal length.
The Challenge:
The number of distinct LCS sequences can be exponential in the length of the strings. For example, two strings of length n with no common characters have 2^n possible "empty" LCS paths (all going around the table). More practically, strings with repeated characters often have many valid LCS.
When to Find All LCS:
In most applications, finding one LCS is sufficient. However, you might need all LCS for:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120
def all_lcs(X: str, Y: str) -> list[str]: """ Find ALL distinct Longest Common Subsequences. WARNING: The number of LCS can be exponential! Use with caution on large inputs. Time Complexity: O(m × n) for table + O(k × L) for k LCS of length L Space Complexity: O(m × n) + O(k × L) for storing results """ m, n = len(X), len(Y) if m == 0 or n == 0: return [""] # Build the DP table dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Memoized recursive backtracking memo = {} def backtrack(i: int, j: int) -> set[str]: """Return all LCS of X[0:i] and Y[0:j] as a set.""" if i == 0 or j == 0: return {""} if (i, j) in memo: return memo[(i, j)] result = set() if X[i - 1] == Y[j - 1]: # Characters match: must include this character for lcs in backtrack(i - 1, j - 1): result.add(lcs + X[i - 1]) else: # Characters don't match: explore valid directions if dp[i - 1][j] >= dp[i][j - 1]: result.update(backtrack(i - 1, j)) if dp[i][j - 1] >= dp[i - 1][j]: result.update(backtrack(i, j - 1)) memo[(i, j)] = result return result return list(backtrack(m, n)) def count_lcs(X: str, Y: str) -> int: """ Count the number of distinct LCS sequences. This is more efficient than finding all LCS when you just need the count. Uses combinatorial DP. """ m, n = len(X), len(Y) # Build LCS length table dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Build count table # count[i][j] = number of distinct LCS of X[0:i] and Y[0:j] count = [[0] * (n + 1) for _ in range(m + 1)] # Base cases: empty LCS has exactly 1 way for i in range(m + 1): count[i][0] = 1 for j in range(n + 1): count[0][j] = 1 for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: # Must take the diagonal path count[i][j] = count[i - 1][j - 1] else: # Sum of valid paths if dp[i - 1][j] > dp[i][j - 1]: count[i][j] = count[i - 1][j] elif dp[i][j - 1] > dp[i - 1][j]: count[i][j] = count[i][j - 1] else: # Both paths are valid count[i][j] = count[i - 1][j] + count[i][j - 1] # Avoid double counting if both lead to same cell if dp[i - 1][j - 1] == dp[i - 1][j]: count[i][j] -= count[i - 1][j - 1] return count[m][n] # Example usageX = "ABCBDAB"Y = "BDCABA" print(f"X = '{X}'")print(f"Y = '{Y}'")print(f"LCS length: {len(all_lcs(X, Y)[0])}")print(f"Number of distinct LCS: {count_lcs(X, Y)}")print(f"All LCS sequences: {all_lcs(X, Y)}") # Example with more LCSX2 = "ABC"Y2 = "ACB"print(f"\nX = '{X2}', Y = '{Y2}'")print(f"All LCS: {all_lcs(X2, Y2)}") # Should give ["AB", "AC"]The number of distinct LCS can grow exponentially. For example, X = "ABABABAB" and Y = "BABABABA" have many interleaved matches. Always consider whether you truly need all LCS, or if one representative suffices.
Let's formally prove that the backtracking algorithm correctly reconstructs an LCS.
Theorem: The backtracking algorithm produces a common subsequence of X and Y with length dp[m][n].
Proof:
We prove two properties:
Property 1: The output is a common subsequence.
By construction, we only add a character c to our output when X[i] == Y[j] == c. At this point:
Since we process positions in decreasing order and only include matched characters, the output (when reversed) is a valid subsequence of both X and Y.
Property 2: The output has length dp[m][n].
We prove by induction that when backtracking from (i, j), we collect exactly dp[i][j] characters.
Base case: At (0, j) or (i, 0), we collect 0 characters, and dp[0][j] = dp[i][0] = 0. ✓
Inductive step: Assume the property holds for all cells with smaller indices. At cell (i, j):
Case 1: X[i] == Y[j]. We collect 1 character and move to (i-1, j-1). By induction, we collect dp[i-1][j-1] from there. Total: 1 + dp[i-1][j-1] = dp[i][j]. ✓
Case 2: X[i] ≠ Y[j]. We move to whichever of (i-1, j) or (i, j-1) has the larger value. By induction, that cell contributes its dp value. Since dp[i][j] = max(dp[i-1][j], dp[i][j-1]), we collect dp[i][j] characters. ✓
Thus, we collect dp[m][n] characters total, which equals the LCS length. ∎
The proof establishes that backtracking isn't just a heuristic—it's guaranteed to find an optimal solution. This mathematical rigor is what distinguishes dynamic programming from greedy approaches, which might find good-but-not-optimal solutions.
One of the most important applications of LCS reconstruction is the diff algorithm used by version control systems like Git. The basic idea:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586
from typing import List, Tuplefrom enum import Enum class DiffType(Enum): UNCHANGED = " " # Line exists in both DELETED = "- " # Line only in old ADDED = "+ " # Line only in new def compute_diff(old_lines: List[str], new_lines: List[str]) -> List[Tuple[DiffType, str]]: """ Compute the diff between two files (represented as lists of lines). Returns a list of (DiffType, line_content) tuples showing the differences between the two files. """ m, n = len(old_lines), len(new_lines) # Build LCS table for lines dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if old_lines[i - 1] == new_lines[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Backtrack to build diff result = [] i, j = m, n while i > 0 or j > 0: if i > 0 and j > 0 and old_lines[i - 1] == new_lines[j - 1]: # Line is unchanged (in LCS) result.append((DiffType.UNCHANGED, old_lines[i - 1])) i -= 1 j -= 1 elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]): # Line was added in new version result.append((DiffType.ADDED, new_lines[j - 1])) j -= 1 else: # Line was deleted from old version result.append((DiffType.DELETED, old_lines[i - 1])) i -= 1 # Reverse to get correct order result.reverse() return result def print_diff(diff: List[Tuple[DiffType, str]]) -> None: """Pretty print a diff.""" colors = { DiffType.UNCHANGED: "", DiffType.DELETED: "\033[91m", # Red DiffType.ADDED: "\033[92m", # Green } reset = "\033[0m" for diff_type, line in diff: print(f"{colors[diff_type]}{diff_type.value}{line}{reset}") # Example usageold_file = [ "def hello():", " print('Hello')", " print('World')", "", "hello()",] new_file = [ "def hello(name):", " print('Hello')", " print(f'Hello, {name}!')", "", "hello('Alice')",] print("Diff output:")print("=" * 40)diff = compute_diff(old_file, new_file)print_diff(diff)Expected Output:
- def hello():
+ def hello(name):
print('Hello')
- print('World')
+ print(f'Hello, {name}!')
- hello()
+ hello('Alice')
This shows:
Real diff tools (like GNU diff, Git diff) use optimizations beyond basic LCS: Myers' algorithm (O((M+N)D) where D is the edit distance), patience diff (focuses on unique lines), and histogram diff (groups similar lines). These are more efficient for typical file comparison patterns.
Beyond showing differences, LCS reconstruction can generate edit scripts—sequences of operations that transform one string into another. This is closely related to the concept of edit distance (Levenshtein distance).
Operations:
Note: The LCS-based edit script doesn't use substitutions. A character change appears as a delete + insert.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106
from typing import List, Tuplefrom enum import Enum class EditOp(Enum): KEEP = "KEEP" DELETE = "DELETE" INSERT = "INSERT" def lcs_edit_script(source: str, target: str) -> List[Tuple[EditOp, str]]: """ Generate an edit script to transform 'source' into 'target' using LCS. The edit script is a sequence of operations: - KEEP(c): Character c is in both strings (from LCS) - DELETE(c): Delete character c from source - INSERT(c): Insert character c (from target) Applying these operations in order transforms source into target. """ m, n = len(source), len(target) # Build LCS table dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if source[i - 1] == target[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Backtrack to generate edit script script = [] i, j = m, n while i > 0 or j > 0: if i > 0 and j > 0 and source[i - 1] == target[j - 1]: # Characters match (LCS character) script.append((EditOp.KEEP, source[i - 1])) i -= 1 j -= 1 elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]): # Insert character from target script.append((EditOp.INSERT, target[j - 1])) j -= 1 else: # Delete character from source script.append((EditOp.DELETE, source[i - 1])) i -= 1 script.reverse() return script def apply_edit_script(source: str, script: List[Tuple[EditOp, str]]) -> str: """ Apply an edit script to source string, returning the result. This verifies the script is correct. """ result = [] source_idx = 0 for op, char in script: if op == EditOp.KEEP: assert source[source_idx] == char, f"Keep mismatch at {source_idx}" result.append(char) source_idx += 1 elif op == EditOp.DELETE: assert source[source_idx] == char, f"Delete mismatch at {source_idx}" source_idx += 1 # Don't add to result elif op == EditOp.INSERT: result.append(char) # Don't advance source_idx return ''.join(result) # Examplesource = "ALGORITHM"target = "ALTRUISTIC" script = lcs_edit_script(source, target) print(f"Source: '{source}'")print(f"Target: '{target}'")print(f"\nEdit Script:")for op, char in script: print(f" {op.value:6s} '{char}'") # Verifyresult = apply_edit_script(source, script)print(f"\nApplied script: '{result}'")print(f"Matches target: {result == target}") # Statisticskeeps = sum(1 for op, _ in script if op == EditOp.KEEP)deletes = sum(1 for op, _ in script if op == EditOp.DELETE)inserts = sum(1 for op, _ in script if op == EditOp.INSERT)print(f"\nStatistics:")print(f" LCS length (KEEP operations): {keeps}")print(f" Deletions: {deletes}")print(f" Insertions: {inserts}")print(f" Edit distance (DELETE + INSERT): {deletes + inserts}")There's a beautiful relationship: Edit Distance = m + n - 2 × LCS_length. The LCS represents the characters we can 'keep', so we must delete (m - LCS) characters from the source and insert (n - LCS) characters to reach the target. This connection underlies many sequence alignment algorithms.
We've thoroughly covered the techniques for extracting the actual LCS sequence(s) from the DP table.
What's Next:
The final page of this module explores variations of the LCS problem, particularly the closely related Longest Common Substring problem. We'll see how a small change in the problem definition leads to a significantly different DP formulation and solution.
You can now reconstruct LCS sequences from DP tables, find all possible LCS when needed, and apply these techniques to practical problems like diff generation and edit scripting. The connection between LCS and edit distance provides powerful insights for sequence comparison tasks.