Loading learning content...
We've mastered the Longest Common Subsequence, where elements can be scattered throughout the strings as long as their order is preserved. But what if we need consecutive matches?
Consider these scenarios:
The Longest Common Substring problem requires the matched portion to be contiguous in both strings. This seemingly small change—from subsequence to substring—leads to a different DP formulation and different algorithmic insights.
By the end of this page, you'll understand how the contiguity constraint changes the problem, master the DP solution for longest common substring, compare the two problems systematically, and explore related variations including longest repeated substring and longest palindromic substring.
Definition: Substring
A substring (also called a subarray or factor) of a string S is a string formed by a contiguous block of characters from S. Formally, for a string S = s₁s₂...sₙ, a substring is S[i..j] = sᵢsᵢ₊₁...sⱼ where 1 ≤ i ≤ j ≤ n.
Definition: Longest Common Substring (LCSubstr)
Given two strings X and Y, find the longest string Z that is a substring of both X and Y.
Key Difference from LCS:
| Property | Longest Common Subsequence | Longest Common Substring |
|---|---|---|
| Elements | Can be non-contiguous | Must be contiguous |
| Order | Preserved | Preserved |
| Typical result | Scattered through strings | Single block in each string |
| Length | ≥ Substring length | ≤ Subsequence length |
The terms 'subsequence' and 'substring' are often confused, leading to incorrect solutions. Remember: substring = contiguous (like a physical cut from the string), subsequence = can skip elements (like highlighting certain letters).
Naive Algorithm:
Time Complexity: O(m² × n) or O(m² × n) depending on how we check for substring membership.
Can We Do Better?
Yes! The key insight is that substring matching has locality—if S[i..j] matches T[k..l], then S[i-1..j+1] matches T[k-1..l+1] only if the surrounding characters also match. This enables dynamic programming.
Another Approach: Suffix Arrays/Trees
For very long strings, suffix arrays or suffix trees can solve LCSubstr in O(m + n) time, but these are complex data structures. The DP approach we'll cover is O(m × n), which is often good enough and much simpler to implement.
| Algorithm | Time Complexity | Space Complexity | Implementation Difficulty |
|---|---|---|---|
| Naive (all substrings) | O(m² × n) | O(1) | Easy |
| Dynamic Programming | O(m × n) | O(m × n) or O(min(m,n)) | Easy |
| Suffix Array | O((m + n) log(m + n)) | O(m + n) | Moderate |
| Suffix Tree | O(m + n) | O(m + n) | Hard |
The Critical Insight:
For longest common substring, we need to track contiguous matches. Unlike LCS, where we could combine results from subproblems freely, here a mismatch "breaks the chain."
DP State Definition:
Let dp[i][j] = length of the longest common suffix of X[1..i] and Y[1..j] that is also a common substring ending at positions i and j.
In simpler terms: dp[i][j] is the length of the longest common substring that ends at X[i] and Y[j].
Recurrence:
dp[i][j] =
| dp[i-1][j-1] + 1 if X[i] == Y[j]
| 0 if X[i] ≠ Y[j]
Why the Difference from LCS?
In LCS, when characters don't match, we can still inherit the best result from shorter prefixes. In LCSubstr, a mismatch means the contiguous run is broken—we must restart from 0.
The Answer:
Unlike LCS where dp[m][n] gives the answer, for LCSubstr we need:
answer = max(dp[i][j]) for all i, j
The longest common substring could end anywhere in the strings.
Think of the LCSubstr DP table as tracking "how long is the current matching streak ending here?" Like counting consecutive wins in a game—any loss resets the counter to zero.
Let's implement the longest common substring algorithm with full reconstruction capability.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123
def longest_common_substring(X: str, Y: str) -> tuple[int, str]: """ Find the longest common substring of X and Y. Returns: Tuple of (length, the longest common substring) Time Complexity: O(m × n) Space Complexity: O(m × n) """ m, n = len(X), len(Y) if m == 0 or n == 0: return 0, "" # DP table: dp[i][j] = length of LCSubstr ending at X[i-1] and Y[j-1] dp = [[0] * (n + 1) for _ in range(m + 1)] max_length = 0 end_index_x = 0 # End position in X (exclusive) for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 # Track the maximum if dp[i][j] > max_length: max_length = dp[i][j] end_index_x = i else: dp[i][j] = 0 # Chain broken # Extract the substring substring = X[end_index_x - max_length:end_index_x] return max_length, substring def longest_common_substring_all(X: str, Y: str) -> tuple[int, list[str]]: """ Find ALL longest common substrings (may have multiple of same max length). Returns: Tuple of (max length, list of all LCS strings) """ m, n = len(X), len(Y) if m == 0 or n == 0: return 0, [""] dp = [[0] * (n + 1) for _ in range(m + 1)] max_length = 0 end_positions = [] # (end_index_x, end_index_y) for all max-length substrings for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 if dp[i][j] > max_length: max_length = dp[i][j] end_positions = [(i, j)] # Reset with new max elif dp[i][j] == max_length: end_positions.append((i, j)) # Add to list else: dp[i][j] = 0 # Extract all unique substrings substrings = set() for end_x, end_y in end_positions: substrings.add(X[end_x - max_length:end_x]) return max_length, list(substrings) def print_table(X: str, Y: str) -> None: """Visualize the DP table for educational purposes.""" m, n = len(X), len(Y) dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 # Print table print("LCSubstr DP Table:") print(" ε", end="") for c in Y: print(f" {c}", end="") print() print(" ε", end="") for j in range(n + 1): print(f" {dp[0][j]}", end="") print() for i in range(1, m + 1): print(f" {X[i-1]}", end="") for j in range(n + 1): print(f" {dp[i][j]}", end="") print() # Example usageX = "ABABC"Y = "BABCA" print(f"X = '{X}'")print(f"Y = '{Y}'")print() print_table(X, Y)print() length, substr = longest_common_substring(X, Y)print(f"Longest common substring length: {length}")print(f"One longest common substring: '{substr}'") length, all_substrs = longest_common_substring_all(X, Y)print(f"All longest common substrings: {all_substrs}")Understanding the Table:
For X = "ABABC" and Y = "BABCA":
LCSubstr DP Table:
ε B A B C A
ε 0 0 0 0 0 0
A 0 0 1 0 0 1
B 0 1 0 2 0 0
A 0 0 2 0 0 1
B 0 1 0 3 0 0
C 0 0 0 0 4 0
Wait—that's wrong! Let me recalculate:
X = ABABC (indices 0-4)
Y = BABCA (indices 0-4)
Actually:
ε B A B C A
ε 0 0 0 0 0 0
A 0 0 1 0 0 1
B 0 1 0 2 0 0
A 0 0 2 0 0 1
B 0 1 0 3 0 0
C 0 0 0 0 4 0
Hmm, that shows max = 4 at position (5, 4), which would mean substring ending at X[4]='C' and Y[3]='C'. Tracing back 4 characters: X[1:5] = "BABC", but Y doesn't have "BABC" as a substring...
Let me recalculate more carefully:
So dp[2][1]=1, dp[3][2]=2, dp[4][3]=3, dp[5][4]=4? But that means "BABC" is common, which isn't right for the original strings.
Actually, I made an indexing error. Let me be more careful:
X = "ABABC": X[0]='A', X[1]='B', X[2]='A', X[3]='B', X[4]='C' Y = "BABCA": Y[0]='B', Y[1]='A', Y[2]='B', Y[3]='C', Y[4]='A'
The table is:
ε B A B C A
ε 0 0 0 0 0 0
A 0 0 1 0 0 1
B 0 1 0 2 0 0
A 0 0 2 0 0 1
B 0 1 0 3 0 0
C 0 0 0 0 1 0
At (4,3): X[3]='B', Y[2]='B' → check dp[3][2]. dp[3][2]: X[2]='A', Y[1]='A' → 2. So dp[4][3] = 3, corresponding to "BAB" (ending at X[3], Y[2]).
At (5,4): X[4]='C', Y[3]='C' → dp[4][3]=3 → dp[5][4]=4? No wait, dp[i-1][j-1] = dp[4][3] = 3, so dp[5][4] = 4?
But the substring would be X[1:5] = "BABC". Is "BABC" a substring of Y = "BABCA"? Yes! Y[0:4] = "BABC". So the answer IS 4, and the substring is "BABC".
I was wrong earlier—"BABC" IS a substring of both strings!
The longest common substring of "ABABC" and "BABCA" is indeed "BABC" with length 4. In X, it's at positions [1:5]. In Y, it's at positions [0:4]. Always verify by checking that the substring appears contiguously in both strings.
Just like with LCS, we can reduce the space complexity from O(m × n) to O(min(m, n)) by observing that each cell only depends on the diagonal cell from the previous row.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110
def longest_common_substring_optimized(X: str, Y: str) -> tuple[int, str]: """ Space-optimized longest common substring. Time Complexity: O(m × n) Space Complexity: O(min(m, n)) Key insight: Each cell dp[i][j] only depends on dp[i-1][j-1]. We can use a single row, but need to save the diagonal value before it's overwritten. """ # Make Y the shorter string if len(X) < len(Y): X, Y = Y, X m, n = len(X), len(Y) if m == 0 or n == 0: return 0, "" # Single row DP dp = [0] * (n + 1) max_length = 0 end_index_x = 0 for i in range(1, m + 1): # We need to process right-to-left to avoid using updated values # OR we save the diagonal separately prev_diagonal = 0 for j in range(1, n + 1): # Save current value (it will be diagonal for next iteration) old_dp_j = dp[j] if X[i - 1] == Y[j - 1]: dp[j] = prev_diagonal + 1 if dp[j] > max_length: max_length = dp[j] end_index_x = i else: dp[j] = 0 prev_diagonal = old_dp_j return max_length, X[end_index_x - max_length:end_index_x] def longest_common_substring_constant_space(X: str, Y: str) -> int: """ O(1) space solution using a different approach. Idea: For each possible diagonal in the table, compute the longest matching run. Each diagonal can be processed with O(1) extra space. Time Complexity: O(m × n) Space Complexity: O(1) (not counting input/output) Note: This approach is trickier to implement correctly and reconstruction is harder. Use only when space is critical. """ m, n = len(X), len(Y) if m == 0 or n == 0: return 0 max_length = 0 # Process diagonals starting from first column for start_i in range(m): i, j = start_i, 0 current_length = 0 while i < m and j < n: if X[i] == Y[j]: current_length += 1 max_length = max(max_length, current_length) else: current_length = 0 i += 1 j += 1 # Process diagonals starting from first row (except 0,0 already done) for start_j in range(1, n): i, j = 0, start_j current_length = 0 while i < m and j < n: if X[i] == Y[j]: current_length += 1 max_length = max(max_length, current_length) else: current_length = 0 i += 1 j += 1 return max_length # ExampleX = "ABABC"Y = "BABCA" length1, substr1 = longest_common_substring_optimized(X, Y)length2 = longest_common_substring_constant_space(X, Y) print(f"Optimized: length={length1}, substring='{substr1}'")print(f"Constant space: length={length2}")The constant-space solution exploits the fact that in LCSubstr, information only flows along diagonals. Each diagonal is independent! This is unlike LCS where cells can inherit from multiple neighbors.
The LCS and LCSubstr problems belong to a family of related string comparison problems. Understanding their relationships builds intuition for string algorithms.
| Problem | Description | Time Complexity | Key Insight |
|---|---|---|---|
| Longest Common Subsequence | Longest sequence preserving order | O(m × n) | Matches can be scattered |
| Longest Common Substring | Longest contiguous match | O(m × n) | Matches must be consecutive |
| Edit Distance | Min operations to transform | O(m × n) | Insert, delete, substitute |
| Longest Repeated Substring | Longest substring appearing twice in same string | O(n²) or O(n) | LCSubstr of S with itself, excluding trivial match |
| Longest Palindromic Substring | Longest substring that's a palindrome | O(n²) or O(n) | LCSubstr of S with reverse(S), with position constraints |
| Shortest Common Supersequence | Shortest string containing both as subsequences | O(m × n) | Length = m + n - LCS |
Longest Repeated Substring
Given a string S, find the longest substring that appears at least twice. This can be solved by finding the LCSubstr of S with itself, but we must ensure the two occurrences don't overlap trivially.
Example: For S = "banana", the longest repeated substring is "ana" (appears at positions 1 and 3).
Longest Palindromic Substring
While commonly solved with the "expand around center" technique, it can also be approached as finding the LCSubstr of S and reverse(S), with the constraint that the positions correspond (i.e., position i in S corresponds to position n-1-i in reverse(S)).
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130
def longest_repeated_substring(s: str) -> str: """ Find the longest substring that appears at least twice. Uses a modified LCSubstr approach comparing s with itself, ensuring we don't count overlapping matches. Time Complexity: O(n²) Space Complexity: O(n²) """ n = len(s) if n == 0: return "" # dp[i][j] = length of LCSubstr ending at s[i-1] and s[j-1] # where i != j (different positions) dp = [[0] * (n + 1) for _ in range(n + 1)] max_length = 0 end_index = 0 for i in range(1, n + 1): for j in range(1, n + 1): # Skip diagonal (same position) if i != j and s[i - 1] == s[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 if dp[i][j] > max_length: max_length = dp[i][j] end_index = i return s[end_index - max_length:end_index] def longest_palindromic_substring_expand(s: str) -> str: """ Find the longest palindromic substring using expand-around-center. Time Complexity: O(n²) Space Complexity: O(1) """ if len(s) < 2: return s def expand(left: int, right: int) -> tuple[int, int]: """Expand around center while characters match.""" while left >= 0 and right < len(s) and s[left] == s[right]: left -= 1 right += 1 return left + 1, right - 1 # Return valid bounds start, end = 0, 0 for i in range(len(s)): # Odd-length palindromes (center at i) l1, r1 = expand(i, i) if r1 - l1 > end - start: start, end = l1, r1 # Even-length palindromes (center between i and i+1) l2, r2 = expand(i, i + 1) if r2 - l2 > end - start: start, end = l2, r2 return s[start:end + 1] def shortest_common_supersequence(X: str, Y: str) -> str: """ Find the shortest string that contains both X and Y as subsequences. Uses the relationship: SCS_length = len(X) + len(Y) - LCS_length Time Complexity: O(m × n) Space Complexity: O(m × n) """ m, n = len(X), len(Y) # Build LCS table dp = [[0] * (n + 1) for _ in range(m + 1)] for i in range(1, m + 1): for j in range(1, n + 1): if X[i - 1] == Y[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else: dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # Reconstruct SCS by backtracking result = [] i, j = m, n while i > 0 and j > 0: if X[i - 1] == Y[j - 1]: # Character in LCS - add once result.append(X[i - 1]) i -= 1 j -= 1 elif dp[i - 1][j] >= dp[i][j - 1]: # Add character from X result.append(X[i - 1]) i -= 1 else: # Add character from Y result.append(Y[j - 1]) j -= 1 # Add remaining characters while i > 0: result.append(X[i - 1]) i -= 1 while j > 0: result.append(Y[j - 1]) j -= 1 return ''.join(reversed(result)) # Examplesprint("Longest Repeated Substring:")print(f" 'banana' → '{longest_repeated_substring('banana')}'")print(f" 'abcabc' → '{longest_repeated_substring('abcabc')}'") print("\nLongest Palindromic Substring:")print(f" 'babad' → '{longest_palindromic_substring_expand('babad')}'")print(f" 'cbbd' → '{longest_palindromic_substring_expand('cbbd')}'") print("\nShortest Common Supersequence:")print(f" 'AGGTAB', 'GXTXAYB' → '{shortest_common_supersequence('AGGTAB', 'GXTXAYB')}'")Many string problems can be reduced to LCS or LCSubstr with modifications. Understanding these core algorithms provides a foundation for solving an entire class of related problems.
Understanding when to use each algorithm is crucial for solving real-world problems correctly.
Decision Framework:
Ask: Do matches need to be consecutive?
Ask: Can matched elements be interleaved with non-matched elements?
Ask: Is the output a single contiguous block?
Hybrid Approach:
Sometimes you need both. For example, a plagiarism detector might:
We've completed our deep dive into the Longest Common Subsequence problem and its close relative, the Longest Common Substring.
Module Complete:
You've now mastered the Longest Common Subsequence family of problems:
These algorithms are foundational for string processing, appearing in version control systems, bioinformatics tools, plagiarism detectors, and many other applications. The DP techniques you've learned here transfer directly to dozens of other problems.
Congratulations! You've completed the Longest Common Subsequence module. You now have a deep understanding of both LCS and LCSubstr, including their DP formulations, implementations, reconstructions, and applications. These are essential tools in any algorithm engineer's toolkit.