Loading learning content...
The Longest Common Subsequence (LCS) problem is one of the most celebrated problems in computer science. It asks a deceptively simple question:
Given two sequences, what is the longest sequence that appears as a subsequence in both?
Unlike substrings (which must be contiguous), a subsequence can skip elements while maintaining relative order. For example, "ACE" is a subsequence of "ABCDE" (skip B, D), but "AEC" is not (order violated).
LCS lies at the heart of:
The algorithm we'll develop is elegant, instructive, and foundational to understanding how DP handles sequence problems.
By the end of this page, you will understand the LCS problem deeply: its recursive structure, the 2D DP formulation, time and space complexity, path reconstruction to find the actual subsequence, and connections to related problems like edit distance. This is essential knowledge for any serious algorithm practitioner.
Formal Definition:
Given two strings X of length m and Y of length n, find the length of the longest sequence Z such that:
Z is a subsequence of XZ is a subsequence of YSubsequence vs Substring:
| Concept | Definition | Example from "ALGORITHM" |
|---|---|---|
| Substring | Contiguous characters | "LGOR", "ALGO", "RITHM" |
| Subsequence | Characters in order (can skip) | "AGRM", "LRTM", "AGORITHM" |
Examples:
X = "ABCDGH"
Y = "AEDFHR"
LCS = "ADH" (length 3)
X = "AGGTAB"
Y = "GXTXAYB"
LCS = "GTAB" (length 4)
Multiple LCS of the same length may exist. For X="ABCBDAB" and Y="BDCABA", both "BDAB" and "BCAB" are valid LCS (length 4). The algorithm finds the length; reconstructing one specific LCS requires additional work.
Why LCS is Computationally Interesting:
A naive approach would enumerate all subsequences of X (2^m possibilities) and check which appear in Y. This is exponential—completely impractical for strings of even moderate length.
The breakthrough insight: LCS has optimal substructure and overlapping subproblems, making it perfectly suited for dynamic programming. The same LCS algorithm that seemed impossibly slow becomes O(m×n)—practical for strings with millions of characters.
To develop the DP solution, we first need to understand the problem's recursive structure. Consider strings X = X[0..m-1] and Y = Y[0..n-1].
The Key Insight:
Let LCS(i, j) denote the length of LCS of X[0..i-1] and Y[0..j-1] (the first i characters of X and first j characters of Y).
Case 1: Last characters match
If X[i-1] == Y[j-1], this matching character must be part of some LCS. Why? If we excluded it, we couldn't do better (matching gives us at least one character).
LCS(i, j) = 1 + LCS(i-1, j-1)
Case 2: Last characters don't match
If X[i-1] ≠ Y[j-1], at least one of them is not in the LCS. Either:
X[i-1], so LCS(i, j) = LCS(i-1, j), orY[j-1], so LCS(i, j) = LCS(i, j-1)We take the maximum:
LCS(i, j) = max(LCS(i-1, j), LCS(i, j-1))
Base Cases:
LCS(0, j) = 0 for all j (empty X has no common subsequence)LCS(i, 0) = 0 for all i (empty Y has no common subsequence)123456789101112131415161718192021222324252627
def lcs_recursive(X: str, Y: str) -> int: """ Naive recursive LCS - exponential time complexity. Time Complexity: O(2^(m+n)) in worst case Space Complexity: O(m + n) for recursion stack """ def lcs(i: int, j: int) -> int: # Base case: one string is empty if i == 0 or j == 0: return 0 # Case 1: Last characters match if X[i - 1] == Y[j - 1]: return 1 + lcs(i - 1, j - 1) # Case 2: Last characters don't match else: return max(lcs(i - 1, j), lcs(i, j - 1)) return lcs(len(X), len(Y)) # ExampleX = "AGGTAB"Y = "GXTXAYB"print(f"LCS length: {lcs_recursive(X, Y)}") # Output: 4The naive recursion is disastrously slow. For strings like "AAAA...A" (all same characters), nearly every call branches into two recursive calls. For two 20-character strings, this can require over a million calls. We need memoization.
The recursive structure reveals overlapping subproblems. Consider computing LCS(i, j): it may call LCS(i-1, j) and LCS(i, j-1), both of which may independently call LCS(i-1, j-1). This overlap multiplies as the recursion deepens.
The Solution: Memoization
Cache each (i, j) result in a 2D table. The recursion tree collapses from exponential branches to at most m × n unique computations.
123456789101112131415161718192021222324252627282930313233343536
def lcs_memoization(X: str, Y: str) -> int: """ LCS with memoization - polynomial time complexity. Time Complexity: O(m * n) - each state computed once Space Complexity: O(m * n) for memo table + O(m + n) for stack """ m, n = len(X), len(Y) # Memoization table: -1 means not yet computed memo = [[-1] * (n + 1) for _ in range(m + 1)] def lcs(i: int, j: int) -> int: # Base case if i == 0 or j == 0: return 0 # Return cached result if available if memo[i][j] != -1: return memo[i][j] # Compute based on whether characters match if X[i - 1] == Y[j - 1]: memo[i][j] = 1 + lcs(i - 1, j - 1) else: memo[i][j] = max(lcs(i - 1, j), lcs(i, j - 1)) return memo[i][j] return lcs(m, n) # Now handles large strings efficientlyX = "AGGTAB" * 100 # 600 charactersY = "GXTXAYB" * 100 # 700 charactersprint(f"LCS length: {lcs_memoization(X, Y)}")Complexity Improvement:
| Aspect | Naive Recursion | Memoization |
|---|---|---|
| Time | O(2^(m+n)) | O(m × n) |
| States computed | Exponential duplicates | Each state once |
| 20×20 strings | ~1 million calls | 400 unique states |
| 1000×1000 strings | Infeasible | 1 million states (fast) |
Memoization transforms an unusable algorithm into a practical one. But we can do better by eliminating recursion entirely.
The tabulation approach builds the solution iteratively, avoiding recursion overhead and potential stack overflow. We create an (m+1) × (n+1) table dp where dp[i][j] represents the LCS length of X[0..i-1] and Y[0..j-1].
Table Construction:
X[i-1] == Y[j-1]: dp[i][j] = 1 + dp[i-1][j-1]dp[i][j] = max(dp[i-1][j], dp[i][j-1])dp[m][n]12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849
def lcs_tabulation(X: str, Y: str) -> int: """ LCS using tabulation (bottom-up DP). Time Complexity: O(m * n) Space Complexity: O(m * n) """ m, n = len(X), len(Y) # Create (m+1) x (n+1) table, initialized to 0 dp = [[0] * (n + 1) for _ in range(m + 1)] # Fill the table for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: # Characters match: extend LCS from diagonal dp[i][j] = 1 + dp[i - 1][j - 1] else: # No match: take max of excluding one character dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) return dp[m][n] def lcs_with_table_visualization(X: str, Y: str) -> int: """Show the DP table for understanding.""" m, n = len(X), len(Y) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = 1 + dp[i - 1][j - 1] else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Print table with headers print("\n " + " ".join(f"{c}" for c in " " + Y)) for i, row in enumerate(dp): prefix = X[i - 1] if i > 0 else " " print(f"{prefix} " + " ".join(f"{x}" for x in row)) return dp[m][n] X = "ABCDGH"Y = "AEDFHR"print(f"\nLCS length: {lcs_with_table_visualization(X, Y)}")Visualizing the DP Table for X="ABCDGH", Y="AEDFHR":
A E D F H R
0 0 0 0 0 0 0
A 0 1 1 1 1 1 1
B 0 1 1 1 1 1 1
C 0 1 1 1 1 1 1
D 0 1 1 2 2 2 2
G 0 1 1 2 2 2 2
H 0 1 1 2 2 3 3
Reading the Table:
In the LCS table, increases happen at diagonal moves when characters match. Horizontal and vertical moves maintain the previous maximum. The "staircase" pattern of increasing values corresponds to matched characters in the LCS.
The DP table gives us the length, but often we need the actual subsequence. We can reconstruct it by tracing back through the table, following the decisions that led to each cell's value.
Backtracking Algorithm:
Start at dp[m][n] and trace backward:
X[i-1] == Y[j-1]: This character is in the LCS. Add it, move diagonally to (i-1, j-1).dp[i-1][j] or dp[i][j-1].i == 0 or j == 0.Since we trace backward, the LCS is built in reverse order—reverse it at the end.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253
def lcs_with_reconstruction(X: str, Y: str) -> tuple[int, str]: """ Compute LCS length and reconstruct the actual LCS string. Time Complexity: O(m * n) for DP + O(m + n) for reconstruction Space Complexity: O(m * n) Returns: Tuple of (length, LCS string) """ m, n = len(X), len(Y) # Build the DP table as before dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = 1 + dp[i - 1][j - 1] else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Reconstruct LCS by backtracking lcs_chars = [] i, j = m, n while i > 0 and j > 0: if X[i - 1] == Y[j - 1]: # This character is part of LCS lcs_chars.append(X[i - 1]) i -= 1 j -= 1 elif dp[i - 1][j] > dp[i][j - 1]: # Came from above i -= 1 else: # Came from left (or equal, arbitrary choice) j -= 1 # Reverse since we built it backwards lcs_string = "".join(reversed(lcs_chars)) return dp[m][n], lcs_string # ExampleX = "AGGTAB"Y = "GXTXAYB"length, lcs = lcs_with_reconstruction(X, Y)print(f"X = '{X}'")print(f"Y = '{Y}'")print(f"LCS = '{lcs}' (length {length})")# Output: LCS = 'GTAB' (length 4)When dp[i-1][j] == dp[i][j-1], we have a choice of which way to go, leading to different but equally valid LCS. To find all LCS, use backtracking with branching at these tie points. The number of distinct LCS can be exponential in pathological cases.
Like grid paths, LCS can be space-optimized. Each cell only depends on:
dp[i-1][j]dp[i][j-1]dp[i-1][j-1]We can reduce from O(m×n) to O(min(m, n)) using two rows or even one row with a saved diagonal value.
Important Caveat: Space optimization destroys the backtracking path. If you need to reconstruct the LCS string, keep the full table. If you only need the length, optimize space.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
def lcs_space_optimized(X: str, Y: str) -> int: """ Space-optimized LCS using two rows. Time Complexity: O(m * n) Space Complexity: O(min(m, n)) """ # Ensure Y is the shorter string for optimal space if len(X) < len(Y): X, Y = Y, X m, n = len(X), len(Y) # Two rows of size n+1 prev = [0] * (n + 1) curr = [0] * (n + 1) for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: curr[j] = 1 + prev[j - 1] else: curr[j] = max(prev[j], curr[j - 1]) # Swap rows prev, curr = curr, prev return prev[n] def lcs_single_row(X: str, Y: str) -> int: """ Even more optimized: single row with diagonal tracking. Time Complexity: O(m * n) Space Complexity: O(min(m, n)) """ if len(X) < len(Y): X, Y = Y, X m, n = len(X), len(Y) dp = [0] * (n + 1) for i in range(1, m + 1): prev_diagonal = 0 # dp[i-1][j-1] before update for j in range(1, n + 1): # Save current value before overwriting (it becomes prev_diagonal for next j) temp = dp[j] if X[i - 1] == Y[j - 1]: dp[j] = 1 + prev_diagonal else: dp[j] = max(dp[j], dp[j - 1]) prev_diagonal = temp return dp[n] # Verify both give same resultsX = "ABCDGH"Y = "AEDFHR"print(f"Two rows: {lcs_space_optimized(X, Y)}") # 3print(f"Single row: {lcs_single_row(X, Y)}") # 3| Approach | Space | Can Reconstruct LCS? |
|---|---|---|
| Full 2D Table | O(m × n) | Yes |
| Two Rows | O(min(m, n)) | No |
| Single Row | O(min(m, n)) | No |
| Hirschberg Algorithm | O(min(m, n)) | Yes! (Advanced) |
The Hirschberg algorithm achieves O(min(m,n)) space while still allowing LCS reconstruction. It uses a divide-and-conquer approach: find the "midpoint" of the LCS using forward and backward passes, then recurse. This is the standard algorithm for memory-constrained sequence alignment in bioinformatics.
LCS is the foundation for many practical algorithms. Understanding it unlocks a family of related problems.
1. Diff and Patch Algorithms:
The Unix diff utility computes the LCS of two files (treating lines as characters). The difference between files is everything not in the LCS. This enables minimal change descriptions for version control.
2. Edit Distance (Levenshtein Distance):
Closely related to LCS. If lcs_length is the LCS length, then:
edit_distance = (m - lcs_length) + (n - lcs_length)
Or equivalently: the number of insertions + deletions to transform X into Y.
3. Sequence Alignment in Bioinformatics:
DNA sequence comparison uses weighted LCS variants. The Smith-Waterman algorithm extends LCS with:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354
def longest_palindromic_subsequence(s: str) -> int: """ LCS of s and reverse(s) gives longest palindromic subsequence. Example: "bbbab" -> "bbbb" (length 4) """ return lcs_tabulation(s, s[::-1]) def shortest_common_supersequence_length(X: str, Y: str) -> int: """ Length of shortest string containing both X and Y as subsequences. SCS length = m + n - LCS length Example: X="AGGTAB", Y="GXTXAYB" LCS = "GTAB" (4) SCS = "AGGXTXAYB" (6 + 7 - 4 = 9) """ lcs_len = lcs_tabulation(X, Y) return len(X) + len(Y) - lcs_len def edit_distance_from_lcs(X: str, Y: str) -> int: """ Edit distance (insertions + deletions) from LCS. To transform X to Y: - Delete characters in X not in LCS - Insert characters in Y not in LCS """ lcs_len = lcs_tabulation(X, Y) deletions = len(X) - lcs_len insertions = len(Y) - lcs_len return deletions + insertions # Reference implementation from earlierdef lcs_tabulation(X: str, Y: str) -> int: m, n = len(X), len(Y) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = 1 + dp[i - 1][j - 1] else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) return dp[m][n] # Examplesprint(f"Longest palindromic subseq of 'bbbab': {longest_palindromic_subsequence('bbbab')}")print(f"SCS length of 'AGGTAB' and 'GXTXAYB': {shortest_common_supersequence_length('AGGTAB', 'GXTXAYB')}")print(f"Edit distance: {edit_distance_from_lcs('kitten', 'sitting')}")LCS is a frequent interview topic. Here are common pitfalls and how to avoid them:
Mistake 1: Confusing Subsequence with Substring
Mistake 2: Off-by-One Indexing
The DP table has dimensions (m+1) × (n+1) to accommodate base cases. dp[i][j] refers to prefixes X[0..i-1] and Y[0..j-1]. Confusing 0-indexed strings with 1-indexed table positions is a common bug source.
Mistake 3: Incorrect Base Cases
Both dp[0][j] and dp[i][0] should be 0 (empty string has no common subsequence). Forgetting to initialize or initializing incorrectly breaks the algorithm.
If asked for Longest Common Substring (contiguous), the recurrence changes: dp[i][j] = dp[i-1][j-1] + 1 only when characters match, and dp[i][j] = 0 otherwise. The answer is max(dp), not dp[m][n]. Don't confuse the two problems!
The Longest Common Subsequence problem exemplifies the power of 2D dynamic programming for sequence comparison. Let's consolidate our understanding:
The Bigger Picture:
LCS and grid paths share the same 2D DP structure but solve different problems. Grid paths count possibilities; LCS optimizes sequence similarity. This versatility is the power of 2D DP: a single framework for diverse problems.
What's Next:
We've seen how to structure state with two indices. The next page formalizes this: when do we need 2D state? How do we identify the right indices? Mastering state design is the key to solving novel DP problems.
You now understand the Longest Common Subsequence problem—its structure, algorithms, and applications. Combined with grid paths, you have a strong foundation in 2D DP. Next, we'll explore the art of state design: how to identify when two indices are needed and what they should represent.