Imagine you're building a file comparison tool—like Git's diff command—that needs to show what changed between two versions of a document. Or perhaps you're developing a bioinformatics application that compares DNA sequences to find evolutionary relationships. Or maybe you're creating a plagiarism detection system that identifies shared content between documents despite reordering.
What do all these problems have in common? They all require finding the longest sequence of elements that appears in both inputs, maintaining their relative order but not necessarily their positions. This is precisely the Longest Common Subsequence (LCS) problem—one of the most fundamental and elegantly solvable problems in computer science.
By the end of this page, you will understand the precise formal definition of the LCS problem, why naive approaches fail dramatically, and how dynamic programming transforms an exponential problem into a polynomial one. You'll see the complete derivation of the DP recurrence and understand every detail of the classical implementation.
Before diving into algorithms, we must establish precise definitions. Ambiguity in problem understanding leads to incorrect solutions.
Definition: Subsequence
A subsequence of a sequence X = ⟨x₁, x₂, ..., xₘ⟩ is a sequence Z = ⟨z₁, z₂, ..., zₖ⟩ such that there exists a strictly increasing sequence of indices ⟨i₁, i₂, ..., iₖ⟩ with 1 ≤ i₁ < i₂ < ... < iₖ ≤ m and zⱼ = xᵢⱼ for all j = 1, 2, ..., k.
In simpler terms: A subsequence is obtained by deleting zero or more elements from the original sequence without changing the order of the remaining elements.
Critical Distinction: Subsequence vs. Substring
A substring must be contiguous; a subsequence need not be. For example, "ACE" is a subsequence of "ABCDE" but not a substring, while "BCD" is both. Every substring is a subsequence, but not vice versa.
Definition: Common Subsequence
Given two sequences X and Y, a sequence Z is a common subsequence of X and Y if Z is a subsequence of both X and Y simultaneously.
Definition: Longest Common Subsequence (LCS)
Given two sequences X = ⟨x₁, x₂, ..., xₘ⟩ and Y = ⟨y₁, y₂, ..., yₙ⟩, find a maximum-length common subsequence of X and Y.
Important Note: The LCS may not be unique. Multiple subsequences of the same maximum length may exist. For X = "ABCBDAB" and Y = "BDCABA", both "BCBA" and "BDAB" are common subsequences of maximum length 4.
The LCS problem has two forms: (1) Optimization: Find the length of the LCS, and (2) Construction: Actually produce one or all LCS sequences. The DP table we'll build solves the optimization problem; reconstruction (covered in a later section) addresses the construction problem.
To appreciate dynamic programming, we must first understand why simpler approaches are computationally infeasible.
Naive Approach 1: Enumerate All Subsequences
The most straightforward approach: generate every subsequence of X and, for each one, check whether it is also a subsequence of Y, keeping the longest that passes.
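As a concrete (and deliberately impractical) sketch, the enumeration can be written with `itertools.combinations`; the function name `lcs_brute_force` is mine, not a standard API. Trying candidate lengths from longest to shortest lets us stop at the first hit:

```python
from itertools import combinations


def lcs_brute_force(X: str, Y: str) -> int:
    """Enumerate subsequences of X, longest first; return the first length
    at which some subsequence of X is also a subsequence of Y.
    Worst-case time O(2^m x n): educational use only."""

    def is_subseq(z: str, y: str) -> bool:
        it = iter(y)
        return all(c in it for c in z)  # consumes y left to right

    for k in range(len(X), 0, -1):
        for indices in combinations(range(len(X)), k):
            candidate = "".join(X[i] for i in indices)
            if is_subseq(candidate, Y):
                return k  # longest possible, since k only decreases
    return 0


print(lcs_brute_force("ABCBDAB", "BDCABA"))  # 4
```

Even at these tiny sizes the outer loop already visits up to 2⁷ = 128 candidates; the loop body is what the complexity analysis below charges for.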
Complexity Analysis:
How many subsequences does a sequence of length m have? For each element, we make a binary choice: include it or exclude it. This gives us 2ᵐ possible subsequences (including the empty subsequence).
For each subsequence of length k, checking if it's a subsequence of Y takes O(n) time (linear scan with two pointers).
Total time: O(2ᵐ × n)
This is exponential in m. For two strings of length 1000, we'd need to check approximately 10³⁰¹ subsequences. Even at a trillion operations per second, this would take longer than the age of the universe, by a factor of more than 10²⁷⁰.
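The O(n) membership check mentioned above can be sketched as a two-pointer scan; the helper name `is_subsequence` is my own, not a standard API:

```python
def is_subsequence(Z: str, Y: str) -> bool:
    """Return True if Z is a subsequence of Y, via a two-pointer scan in O(len(Y)) time."""
    j = 0  # pointer into Y
    for c in Z:                          # pointer into Z advances once per character
        while j < len(Y) and Y[j] != c:
            j += 1                       # advance through Y until c is found
        if j == len(Y):
            return False                 # ran out of Y before matching all of Z
        j += 1                           # consume the matched character
    return True


print(is_subsequence("BCBA", "BDCABA"))  # True: B, C, B, A appear in order
print(is_subsequence("CAA", "ABC"))      # False: only one A remains after C
```

The scan never backtracks, so each call costs at most one pass over Y, which is where the O(n) factor in the total O(2ᵐ × n) comes from.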
| String Length (m) | Number of Subsequences (2ᵐ) | Time at 10⁹ ops/sec |
|---|---|---|
| 10 | 1,024 | ~1 microsecond |
| 20 | 1,048,576 | ~1 millisecond |
| 30 | 1,073,741,824 | ~1 second |
| 40 | ~1.1 trillion | ~18 minutes |
| 50 | ~1 quadrillion | ~13 days |
| 100 | ~1.27 × 10³⁰ | ~4 × 10¹³ years |
Naive Approach 2: Recursive Definition
A more structured approach uses recursion based on comparing characters:
LCS(X[1..m], Y[1..n]) =
| 0 if m = 0 or n = 0
| 1 + LCS(X[1..m-1], Y[1..n-1]) if xₘ = yₙ
| max(LCS(X[1..m-1], Y), LCS(X, Y[1..n-1])) if xₘ ≠ yₙ
This recurrence is correct, but direct implementation has a critical flaw: overlapping subproblems lead to exponential redundant computation.
```python
def lcs_naive(X: str, Y: str, m: int, n: int) -> int:
    """
    Naive recursive LCS implementation.

    Time Complexity: O(2^(m+n)) in the worst case
    Space Complexity: O(m + n) for recursion stack

    WARNING: This is for educational purposes only.
    Do NOT use this for strings longer than ~20 characters.
    """
    # Base case: empty string has LCS of 0
    if m == 0 or n == 0:
        return 0

    # If last characters match, they're part of the LCS
    if X[m - 1] == Y[n - 1]:
        return 1 + lcs_naive(X, Y, m - 1, n - 1)

    # If they don't match, try both possibilities
    # and take the maximum
    return max(
        lcs_naive(X, Y, m - 1, n),  # Exclude last char of X
        lcs_naive(X, Y, m, n - 1)   # Exclude last char of Y
    )


# Example usage (keep strings short!)
X = "ABCBDAB"
Y = "BDCABA"
print(f"LCS length: {lcs_naive(X, Y, len(X), len(Y))}")  # Output: 4
```

The recursion tree for this naive approach explodes exponentially. For strings "ABCD" and "EFGH" (worst case: no matches), every call spawns two more, so the tree has depth m + n = 8 and on the order of 2⁸ calls for just 4 characters each. For length-20 strings with no matches, the call count runs into the hundreds of billions.
Visualizing the Problem
Consider computing LCS("ABC", "AC"). The recursion creates:
LCS(3,2)  [C = C: match]
└── LCS(2,1)  [B ≠ A: mismatch]
    ├── LCS(1,1)  [A = A: match]
    │   └── LCS(0,0) → 0
    └── LCS(2,0) → 0
(when the last characters match, only the diagonal branch is taken)
But for non-matching characters, both branches are explored:
LCS("AXBY", "AYBX")  [start: LCS(4,4), Y ≠ X]
├── LCS(3,4) → explores many subproblems
└── LCS(4,3) → SAME subproblems recomputed!
The subproblem LCS(3,3) gets computed by both branches, as do many others. This redundancy is exactly what dynamic programming eliminates.
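The complexity table later on this page lists a memoized variant; a minimal sketch (the name `lcs_memoized` is mine) simply caches the recursive definition, collapsing the exponential tree to O(m × n) distinct calls:

```python
from functools import lru_cache


def lcs_memoized(X: str, Y: str) -> int:
    """Top-down LCS with memoization: O(m x n) time,
    O(m x n) cache plus O(m + n) recursion stack."""

    @lru_cache(maxsize=None)
    def solve(m: int, n: int) -> int:
        if m == 0 or n == 0:
            return 0                          # base case: empty prefix
        if X[m - 1] == Y[n - 1]:
            return 1 + solve(m - 1, n - 1)    # match: extend diagonal subproblem
        return max(solve(m - 1, n), solve(m, n - 1))  # mismatch: drop one tail char

    return solve(len(X), len(Y))


print(lcs_memoized("ABCBDAB", "BDCABA"))  # 4
```

Note that the recursion depth can reach m + n, so for very long inputs Python's recursion limit becomes a concern; the bottom-up formulation developed below avoids this entirely.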
Dynamic programming requires two properties: optimal substructure and overlapping subproblems. The LCS problem exhibits both beautifully.
Theorem: Optimal Substructure of LCS
Let X = ⟨x₁, x₂, ..., xₘ⟩ and Y = ⟨y₁, y₂, ..., yₙ⟩ be sequences, and let Z = ⟨z₁, z₂, ..., zₖ⟩ be any LCS of X and Y.
If xₘ = yₙ, then zₖ = xₘ = yₙ and Z[1..k-1] is an LCS of X[1..m-1] and Y[1..n-1]
If xₘ ≠ yₙ, then: (a) zₖ ≠ xₘ implies Z is an LCS of X[1..m-1] and Y, and (b) zₖ ≠ yₙ implies Z is an LCS of X and Y[1..n-1].
For case 1: if the last characters match, some LCS must end with that common character, because any common subsequence that omits it could be extended by appending it. For case 2: if the last characters differ, at least one of them cannot end the LCS, so that one can be dropped from its string without affecting the optimal solution.
Formal Proof of Case 1
Suppose xₘ = yₙ. We claim that this common character must appear at the end of some LCS.
Proof by contradiction: Assume Z = ⟨z₁, ..., zₖ⟩ is an LCS and zₖ ≠ xₘ (= yₙ). Then we could form Z' = ⟨z₁, ..., zₖ, xₘ⟩. Since xₘ appears at position m in X (after all elements of Z that came from X) and yₙ appears at position n in Y (after all elements of Z that came from Y), Z' is a valid common subsequence. But |Z'| = k + 1 > |Z| = k, contradicting that Z is longest. ∎
Now, if zₖ = xₘ = yₙ, then Z[1..k-1] must be a common subsequence of X[1..m-1] and Y[1..n-1]. If it weren't longest, we could replace it with a longer one and get a longer Z, contradiction.
Formal Proof of Case 2
Suppose xₘ ≠ yₙ. At most one of them can equal zₖ.
If zₖ ≠ xₘ: Z doesn't use xₘ, so Z is a common subsequence of X[1..m-1] and Y. If there were a longer common subsequence of X[1..m-1] and Y, it would also be a common subsequence of X and Y, contradicting that Z is longest.
The argument for zₖ ≠ yₙ is symmetric. ∎
The Recurrence Relation
Based on optimal substructure, we define:
c[i, j] = length of LCS of X[1..i] and Y[1..j]
The recurrence is:
c[i, j] =
| 0 if i = 0 or j = 0
| c[i-1, j-1] + 1 if i, j > 0 and xᵢ = yⱼ
| max(c[i-1, j], c[i, j-1]) if i, j > 0 and xᵢ ≠ yⱼ
This recurrence has only (m + 1) × (n + 1) distinct subproblems, and each is computed in O(1) time from previously solved ones.
This is the key insight: we've reduced exponential work to polynomial work by recognizing and eliminating redundant computation.
The transformation from O(2^(m+n)) to O(m×n) represents an enormous speedup. For two strings of length 1000, we go from ~10^600 operations to just 1,000,000 operations—a reduction by a factor of approximately 10^594. This is why dynamic programming is considered one of the most powerful algorithmic techniques.
Now we translate the recurrence into working code. We'll present both the standard 2D table approach and analyze every aspect of the implementation.
```python
def lcs_length(X: str, Y: str) -> int:
    """
    Compute the length of the Longest Common Subsequence using DP.

    Time Complexity: O(m × n) where m = len(X), n = len(Y)
    Space Complexity: O(m × n) for the DP table

    Args:
        X: First string
        Y: Second string

    Returns:
        Length of the longest common subsequence
    """
    m, n = len(X), len(Y)

    # Create DP table with (m+1) rows and (n+1) columns
    # dp[i][j] = LCS length of X[0:i] and Y[0:j]
    # Extra row/column for base case (empty prefix)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Fill the table bottom-up
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                # Characters match: extend LCS from diagonal
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Characters don't match: take max of excluding either
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Answer is in the bottom-right cell
    return dp[m][n]


def lcs_with_table(X: str, Y: str) -> tuple[int, list[list[int]]]:
    """
    Compute LCS length and return the full DP table for visualization.

    Returns:
        Tuple of (LCS length, DP table)
    """
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    return dp[m][n], dp


# Example with detailed output
X = "ABCBDAB"
Y = "BDCABA"

length, table = lcs_with_table(X, Y)

print(f"X = '{X}'")
print(f"Y = '{Y}'")
print(f"LCS Length = {length}")

# Print the DP table
print("DP Table:")
print("   ", end="")
print(" ε " + " ".join(f" {c} " for c in Y))
print(" ε " + " ".join(f" {table[0][j]:2d}" for j in range(len(Y) + 1)))
for i in range(1, len(X) + 1):
    print(f" {X[i-1]} " + " ".join(f" {table[i][j]:2d}" for j in range(len(Y) + 1)))
```

Understanding the Table Structure
The DP table for X = "ABCBDAB" and Y = "BDCABA" looks like this:
ε B D C A B A
ε 0 0 0 0 0 0 0
A 0 0 0 0 1 1 1
B 0 1 1 1 1 2 2
C 0 1 1 2 2 2 2
B 0 1 1 2 2 3 3
D 0 1 2 2 2 3 3
A 0 1 2 2 3 3 4
B 0 1 2 2 3 4 4
Reading the Table:
- Cell (i, j) holds the LCS length of the first i characters of X and the first j characters of Y.
- Row ε and column ε encode the base case: any LCS involving an empty prefix has length 0.
- The final answer, 4, sits in the bottom-right cell.
Notice how values can only stay the same or increase as we move right or down. This monotonicity comes from the fact that adding characters can only maintain or improve the LCS, never shorten it. Also, values increase by at most 1 per cell—each match contributes exactly one character to the LCS.
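These invariants are easy to check mechanically. The sketch below (the helper name `lcs_table` is mine) rebuilds the table and asserts that every step right or down changes the value by 0 or 1:

```python
def lcs_table(X: str, Y: str) -> list[list[int]]:
    """Build the standard (m+1) x (n+1) LCS DP table."""
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp


dp = lcs_table("ABCBDAB", "BDCABA")
for i in range(1, len(dp)):
    for j in range(1, len(dp[0])):
        assert 0 <= dp[i][j] - dp[i - 1][j] <= 1  # moving down: value grows by 0 or 1
        assert 0 <= dp[i][j] - dp[i][j - 1] <= 1  # moving right: value grows by 0 or 1
print("monotonicity holds")
```

This "adjacent cells differ by at most 1" property is exactly what the space-optimized versions below rely on: each row is a small perturbation of the previous one.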
The standard implementation uses O(m × n) space for the DP table. For very long strings (e.g., genome sequences with millions of characters), this may be prohibitive. We can reduce space significantly.
Observation: When computing row i, we only need values from row i-1 and the current row. Rows 0 through i-2 are never accessed again.
Two-Row Optimization: O(min(m, n)) space
```python
def lcs_length_optimized(X: str, Y: str) -> int:
    """
    Space-optimized LCS using only two rows.

    Time Complexity: O(m × n)
    Space Complexity: O(min(m, n))

    Key insight: Each cell only depends on:
    - The cell directly above (previous row, same column)
    - The cell to the left (current row, previous column)
    - The diagonal cell (previous row, previous column)

    Therefore, we only need to keep track of two rows at a time.
    """
    # Ensure Y is the shorter string to minimize space
    if len(X) < len(Y):
        X, Y = Y, X

    m, n = len(X), len(Y)

    # Only two rows needed: previous and current
    prev = [0] * (n + 1)
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        # Swap rows: current becomes previous for next iteration
        prev, curr = curr, prev

    # Answer is in prev because we swapped after the last iteration
    return prev[n]


def lcs_length_single_row(X: str, Y: str) -> int:
    """
    Further optimized LCS using a single row plus one variable.

    Time Complexity: O(m × n)
    Space Complexity: O(min(m, n))

    The trick: We process left-to-right, so curr[j-1] is already updated.
    We just need to save the diagonal value before overwriting it.
    """
    if len(X) < len(Y):
        X, Y = Y, X

    m, n = len(X), len(Y)
    dp = [0] * (n + 1)

    for i in range(1, m + 1):
        prev_diagonal = 0  # Stores dp[i-1][j-1] before overwrite
        for j in range(1, n + 1):
            temp = dp[j]  # Save current value (will become diagonal)
            if X[i - 1] == Y[j - 1]:
                dp[j] = prev_diagonal + 1
            else:
                dp[j] = max(dp[j], dp[j - 1])
            prev_diagonal = temp

    return dp[n]


# Verify both implementations
X = "ABCBDAB"
Y = "BDCABA"

print(f"Two-row optimized: {lcs_length_optimized(X, Y)}")
print(f"Single-row optimized: {lcs_length_single_row(X, Y)}")
```

The space-optimized versions can only compute the LCS length, not the actual LCS sequence.
To reconstruct the LCS, we need the full table (or use Hirschberg's divide-and-conquer algorithm, which achieves O(n) space while still allowing reconstruction—but that's an advanced topic).
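Reconstruction is covered in detail later, but as a preview of why the full table suffices, here is a minimal backtracking sketch (the function name is mine): it walks from the bottom-right corner, following where each cell's value came from.

```python
def lcs_reconstruct(X: str, Y: str) -> str:
    """Build the full DP table, then walk it backward to recover one LCS."""
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack from the bottom-right corner
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])          # matched character belongs to the LCS
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1                        # value came from the cell above
        else:
            j -= 1                        # value came from the cell to the left
    return "".join(reversed(out))


print(lcs_reconstruct("ABCBDAB", "BDCABA"))  # "BCBA", one LCS of length 4
```

Because the backtrack consults dp[i-1][j] and dp[i][j-1] at arbitrary positions, discarding old rows (as the two-row and single-row versions do) destroys exactly the information this walk needs.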
Let's rigorously analyze the time and space complexity of our DP solution.
| Algorithm | Time Complexity | Space Complexity | Reconstruction? |
|---|---|---|---|
| Naive Recursive | O(2^(m+n)) | O(m + n) stack | Yes |
| Memoized Recursive | O(m × n) | O(m × n) + stack | Yes |
| Bottom-up DP | O(m × n) | O(m × n) | Yes |
| Two-row DP | O(m × n) | O(min(m, n)) | No |
| Single-row DP | O(m × n) | O(min(m, n)) | No |
| Hirschberg's Algorithm | O(m × n) | O(min(m, n)) | Yes |
Detailed Time Analysis: the doubly nested loop fills each of the (m + 1) × (n + 1) cells exactly once, and each cell costs O(1): one character comparison plus at most a two-way max.
Total: Θ(m × n)
Note: This is tight—we must fill every cell, and each cell takes constant time.
Detailed Space Analysis (standard version): the DP table stores (m + 1) × (n + 1) integers; everything else is O(1) loop state.
Total: Θ(m × n) (dominated by the table)
For practical applications, this means two 1,000-character strings need about a million table cells, which is trivial on modern hardware, while two sequences of a few million characters would need trillions of cells. For genome sequences with millions of base pairs, space optimization becomes essential.
For general LCS, quadratic time is believed to be essentially optimal: under the Strong Exponential Time Hypothesis (SETH), no O(n^(2−ε)) algorithm exists for any ε > 0. However, for strings with a bounded alphabet or specific structure, faster algorithms exist; for example, the "Four Russians" technique achieves O(m × n / log n) for constant-sized alphabets, though the constants are large.
We've established a complete, rigorous understanding of the LCS problem and its dynamic programming solution.
What's Next:
Now that we understand how to compute the LCS length, the next page dives deep into DP table construction—examining the table visually, understanding the flow of dependencies, and building intuition for why the algorithm works through detailed worked examples.
You now have a complete understanding of the LCS problem definition, why naive approaches fail, and how dynamic programming provides an efficient O(m × n) solution. You've seen both the standard and space-optimized implementations with full complexity analysis.