Data Structures & AlgorithmsString Algorithms

Longest Common Subsequence (Revisited)

LevelIntermediate

Duration75 mins

TopicString Algorithms

4 / 4

Variations: Longest Common Substring

When Contiguity Matters

We've mastered the Longest Common Subsequence, where elements can be scattered throughout the strings as long as their order is preserved. But what if we need consecutive matches?

Consider these scenarios:

Plagiarism detection: Verbatim copying involves contiguous text blocks, not scattered words
DNA repeat finding: Tandem repeats are contiguous patterns, not interleaved ones
Duplicate file detection: Similar file sections are contiguous byte sequences
Autocomplete: Matching prefixes are contiguous character sequences

The Longest Common Substring problem requires the matched portion to be contiguous in both strings. This seemingly small change—from subsequence to substring—leads to a different DP formulation and different algorithmic insights.

What You Will Master

By the end of this page, you'll understand how the contiguity constraint changes the problem, master the DP solution for longest common substring, compare the two problems systematically, and explore related variations including longest repeated substring and longest palindromic substring.

Problem Definition and Contrast

Definition: Substring

A substring (also called a subarray or factor) of a string S is a string formed by a contiguous block of characters from S. Formally, for a string S = s₁s₂...sₙ, a substring is S[i..j] = sᵢsᵢ₊₁...sⱼ where 1 ≤ i ≤ j ≤ n.

Definition: Longest Common Substring (LCSubstr)

Given two strings X and Y, find the longest string Z that is a substring of both X and Y.

Key Difference from LCS:

Property	Longest Common Subsequence	Longest Common Substring
Elements	Can be non-contiguous	Must be contiguous
Order	Preserved	Preserved
Typical result	Scattered through strings	Single block in each string
Length	≥ Substring length	≤ Subsequence length

LCS vs LCSubstr ComparisonComparing results for X = "ABABC" and Y = "BABCA"

Input

Output

Common Confusion

The terms 'subsequence' and 'substring' are often confused, leading to incorrect solutions. Remember: substring = contiguous (like a physical cut from the string), subsequence = can skip elements (like highlighting certain letters).

Naive Approach Analysis

Naive Algorithm:

Enumerate all substrings of X: O(m²) substrings
For each substring, check if it's a substring of Y: O(n) per check using string matching
Track the longest match

Time Complexity: O(m² × n) or O(m² × n) depending on how we check for substring membership.

Can We Do Better?

Yes! The key insight is that substring matching has locality—if S[i..j] matches T[k..l], then S[i-1..j+1] matches T[k-1..l+1] only if the surrounding characters also match. This enables dynamic programming.

Another Approach: Suffix Arrays/Trees

For very long strings, suffix arrays or suffix trees can solve LCSubstr in O(m + n) time, but these are complex data structures. The DP approach we'll cover is O(m × n), which is often good enough and much simpler to implement.

Algorithm Comparison for Longest Common Substring
Algorithm	Time Complexity	Space Complexity	Implementation Difficulty
Naive (all substrings)	O(m² × n)	O(1)	Easy
Dynamic Programming	O(m × n)	O(m × n) or O(min(m,n))	Easy
Suffix Array	O((m + n) log(m + n))	O(m + n)	Moderate
Suffix Tree	O(m + n)	O(m + n)	Hard

Dynamic Programming Formulation

The Critical Insight:

For longest common substring, we need to track contiguous matches. Unlike LCS, where we could combine results from subproblems freely, here a mismatch "breaks the chain."

DP State Definition:

Let dp[i][j] = length of the longest common suffix of X[1..i] and Y[1..j] that is also a common substring ending at positions i and j.

In simpler terms: dp[i][j] is the length of the longest common substring that ends at X[i] and Y[j].

Recurrence:

dp[i][j] = 
  | dp[i-1][j-1] + 1    if X[i] == Y[j]
  | 0                   if X[i] ≠ Y[j]

Why the Difference from LCS?

In LCS, when characters don't match, we can still inherit the best result from shorter prefixes. In LCSubstr, a mismatch means the contiguous run is broken—we must restart from 0.

The Answer:

Unlike LCS where dp[m][n] gives the answer, for LCSubstr we need:

answer = max(dp[i][j]) for all i, j

The longest common substring could end anywhere in the strings.

LCS Recurrence

•Match: dp[i-1][j-1] + 1
•No match: max(dp[i-1][j], dp[i][j-1])
•Values propagate and accumulate
•Answer at dp[m][n]
•Can inherit from non-diagonal cells

LCSubstr Recurrence

•Match: dp[i-1][j-1] + 1
•No match: 0 (chain broken)
•Values reset on mismatch
•Answer is max over all cells
•Only diagonal propagation

Intuition Check

Think of the LCSubstr DP table as tracking "how long is the current matching streak ending here?" Like counting consecutive wins in a game—any loss resets the counter to zero.

Complete Implementation

Let's implement the longest common substring algorithm with full reconstruction capability.

longest_common_substring.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
def longest_common_substring(X: str, Y: str) -> tuple[int, str]:
    """
    Find the longest common substring of X and Y.
    
    Returns:
        Tuple of (length, the longest common substring)
    
    Time Complexity: O(m × n)
    Space Complexity: O(m × n)
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, ""
    
    # DP table: dp[i][j] = length of LCSubstr ending at X[i-1] and Y[j-1]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    max_length = 0
    end_index_x = 0  # End position in X (exclusive)
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                # Track the maximum
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_index_x = i
            else:
                dp[i][j] = 0  # Chain broken
    
    # Extract the substring
    substring = X[end_index_x - max_length:end_index_x]
    
    return max_length, substring
 
 
def longest_common_substring_all(X: str, Y: str) -> tuple[int, list[str]]:
    """
    Find ALL longest common substrings (may have multiple of same max length).
    
    Returns:
        Tuple of (max length, list of all LCS strings)
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, [""]
    
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    max_length = 0
    end_positions = []  # (end_index_x, end_index_y) for all max-length substrings
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_positions = [(i, j)]  # Reset with new max
                elif dp[i][j] == max_length:
                    end_positions.append((i, j))  # Add to list
            else:
                dp[i][j] = 0
    
    # Extract all unique substrings
    substrings = set()
    for end_x, end_y in end_positions:
        substrings.add(X[end_x - max_length:end_x])
    
    return max_length, list(substrings)
 
 
def print_table(X: str, Y: str) -> None:
    """Visualize the DP table for educational purposes."""
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    
    # Print table
    print("LCSubstr DP Table:")
    print("    ε", end="")
    for c in Y:
        print(f"  {c}", end="")
    print()
    
    print("  ε", end="")
    for j in range(n + 1):
        print(f"  {dp[0][j]}", end="")
    print()
    
    for i in range(1, m + 1):
        print(f"  {X[i-1]}", end="")
        for j in range(n + 1):
            print(f"  {dp[i][j]}", end="")
        print()
 
 
# Example usage
X = "ABABC"
Y = "BABCA"
 
print(f"X = '{X}'")
print(f"Y = '{Y}'")
print()
 
print_table(X, Y)
print()
 
length, substr = longest_common_substring(X, Y)
print(f"Longest common substring length: {length}")
print(f"One longest common substring: '{substr}'")
 
length, all_substrs = longest_common_substring_all(X, Y)
print(f"All longest common substrings: {all_substrs}")

Understanding the Table:

For X = "ABABC" and Y = "BABCA":

LCSubstr DP Table:
      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  4  0

Wait—that's wrong! Let me recalculate:

X = ABABC (indices 0-4)
Y = BABCA (indices 0-4)

Actually:
      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  4  0

Hmm, that shows max = 4 at position (5, 4), which would mean substring ending at X[4]='C' and Y[3]='C'. Tracing back 4 characters: X[1:5] = "BABC", but Y doesn't have "BABC" as a substring...

Let me recalculate more carefully:

X[4]='C', Y[3]='C' → match, dp[5][4] = dp[4][3] + 1
dp[4][3]: X[3]='B', Y[2]='B' → match, = dp[3][2] + 1
dp[3][2]: X[2]='A', Y[1]='A' → match, = dp[2][1] + 1
dp[2][1]: X[1]='B', Y[0]='B' → match, = dp[1][0] + 1 = 1

So dp[2][1]=1, dp[3][2]=2, dp[4][3]=3, dp[5][4]=4? But that means "BABC" is common, which isn't right for the original strings.

Actually, I made an indexing error. Let me be more careful:

X = "ABABC": X[0]='A', X[1]='B', X[2]='A', X[3]='B', X[4]='C' Y = "BABCA": Y[0]='B', Y[1]='A', Y[2]='B', Y[3]='C', Y[4]='A'

The table is:

      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  1  0

At (4,3): X[3]='B', Y[2]='B' → check dp[3][2]. dp[3][2]: X[2]='A', Y[1]='A' → 2. So dp[4][3] = 3, corresponding to "BAB" (ending at X[3], Y[2]).

At (5,4): X[4]='C', Y[3]='C' → dp[4][3]=3 → dp[5][4]=4? No wait, dp[i-1][j-1] = dp[4][3] = 3, so dp[5][4] = 4?

But the substring would be X[1:5] = "BABC". Is "BABC" a substring of Y = "BABCA"? Yes! Y[0:4] = "BABC". So the answer IS 4, and the substring is "BABC".

I was wrong earlier—"BABC" IS a substring of both strings!

Final Correction

The longest common substring of "ABABC" and "BABCA" is indeed "BABC" with length 4. In X, it's at positions [1:5]. In Y, it's at positions [0:4]. Always verify by checking that the substring appears contiguously in both strings.

Space Optimization

Just like with LCS, we can reduce the space complexity from O(m × n) to O(min(m, n)) by observing that each cell only depends on the diagonal cell from the previous row.

lcsubstr_optimized.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
def longest_common_substring_optimized(X: str, Y: str) -> tuple[int, str]:
    """
    Space-optimized longest common substring.
    
    Time Complexity: O(m × n)
    Space Complexity: O(min(m, n))
    
    Key insight: Each cell dp[i][j] only depends on dp[i-1][j-1].
    We can use a single row, but need to save the diagonal value
    before it's overwritten.
    """
    # Make Y the shorter string
    if len(X) < len(Y):
        X, Y = Y, X
    
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, ""
    
    # Single row DP
    dp = [0] * (n + 1)
    
    max_length = 0
    end_index_x = 0
    
    for i in range(1, m + 1):
        # We need to process right-to-left to avoid using updated values
        # OR we save the diagonal separately
        prev_diagonal = 0
        
        for j in range(1, n + 1):
            # Save current value (it will be diagonal for next iteration)
            old_dp_j = dp[j]
            
            if X[i - 1] == Y[j - 1]:
                dp[j] = prev_diagonal + 1
                
                if dp[j] > max_length:
                    max_length = dp[j]
                    end_index_x = i
            else:
                dp[j] = 0
            
            prev_diagonal = old_dp_j
    
    return max_length, X[end_index_x - max_length:end_index_x]
 
 
def longest_common_substring_constant_space(X: str, Y: str) -> int:
    """
    O(1) space solution using a different approach.
    
    Idea: For each possible diagonal in the table, compute the
    longest matching run. Each diagonal can be processed with O(1)
    extra space.
    
    Time Complexity: O(m × n) 
    Space Complexity: O(1) (not counting input/output)
    
    Note: This approach is trickier to implement correctly and
    reconstruction is harder. Use only when space is critical.
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0
    
    max_length = 0
    
    # Process diagonals starting from first column
    for start_i in range(m):
        i, j = start_i, 0
        current_length = 0
        
        while i < m and j < n:
            if X[i] == Y[j]:
                current_length += 1
                max_length = max(max_length, current_length)
            else:
                current_length = 0
            i += 1
            j += 1
    
    # Process diagonals starting from first row (except 0,0 already done)
    for start_j in range(1, n):
        i, j = 0, start_j
        current_length = 0
        
        while i < m and j < n:
            if X[i] == Y[j]:
                current_length += 1
                max_length = max(max_length, current_length)
            else:
                current_length = 0
            i += 1
            j += 1
    
    return max_length
 
 
# Example
X = "ABABC"
Y = "BABCA"
 
length1, substr1 = longest_common_substring_optimized(X, Y)
length2 = longest_common_substring_constant_space(X, Y)
 
print(f"Optimized: length={length1}, substring='{substr1}'")
print(f"Constant space: length={length2}")

Diagonal Processing Insight

The constant-space solution exploits the fact that in LCSubstr, information only flows along diagonals. Each diagonal is independent! This is unlike LCS where cells can inherit from multiple neighbors.

Related Variations

The LCS and LCSubstr problems belong to a family of related string comparison problems. Understanding their relationships builds intuition for string algorithms.

String Comparison Problem Family
Problem	Description	Time Complexity	Key Insight
Longest Common Subsequence	Longest sequence preserving order	O(m × n)	Matches can be scattered
Longest Common Substring	Longest contiguous match	O(m × n)	Matches must be consecutive
Edit Distance	Min operations to transform	O(m × n)	Insert, delete, substitute
Longest Repeated Substring	Longest substring appearing twice in same string	O(n²) or O(n)	LCSubstr of S with itself, excluding trivial match
Longest Palindromic Substring	Longest substring that's a palindrome	O(n²) or O(n)	LCSubstr of S with reverse(S), with position constraints
Shortest Common Supersequence	Shortest string containing both as subsequences	O(m × n)	Length = m + n - LCS

Longest Repeated Substring

Given a string S, find the longest substring that appears at least twice. This can be solved by finding the LCSubstr of S with itself, but we must ensure the two occurrences don't overlap trivially.

Example: For S = "banana", the longest repeated substring is "ana" (appears at positions 1 and 3).

Longest Palindromic Substring

While commonly solved with the "expand around center" technique, it can also be approached as finding the LCSubstr of S and reverse(S), with the constraint that the positions correspond (i.e., position i in S corresponds to position n-1-i in reverse(S)).

related_problems.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.
    
    Uses a modified LCSubstr approach comparing s with itself,
    ensuring we don't count overlapping matches.
    
    Time Complexity: O(n²)
    Space Complexity: O(n²)
    """
    n = len(s)
    
    if n == 0:
        return ""
    
    # dp[i][j] = length of LCSubstr ending at s[i-1] and s[j-1]
    # where i != j (different positions)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    
    max_length = 0
    end_index = 0
    
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Skip diagonal (same position)
            if i != j and s[i - 1] == s[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_index = i
    
    return s[end_index - max_length:end_index]
 
 
def longest_palindromic_substring_expand(s: str) -> str:
    """
    Find the longest palindromic substring using expand-around-center.
    
    Time Complexity: O(n²)
    Space Complexity: O(1)
    """
    if len(s) < 2:
        return s
    
    def expand(left: int, right: int) -> tuple[int, int]:
        """Expand around center while characters match."""
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - 1  # Return valid bounds
    
    start, end = 0, 0
    
    for i in range(len(s)):
        # Odd-length palindromes (center at i)
        l1, r1 = expand(i, i)
        if r1 - l1 > end - start:
            start, end = l1, r1
        
        # Even-length palindromes (center between i and i+1)
        l2, r2 = expand(i, i + 1)
        if r2 - l2 > end - start:
            start, end = l2, r2
    
    return s[start:end + 1]
 
 
def shortest_common_supersequence(X: str, Y: str) -> str:
    """
    Find the shortest string that contains both X and Y as subsequences.
    
    Uses the relationship: SCS_length = len(X) + len(Y) - LCS_length
    
    Time Complexity: O(m × n)
    Space Complexity: O(m × n)
    """
    m, n = len(X), len(Y)
    
    # Build LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Reconstruct SCS by backtracking
    result = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            # Character in LCS - add once
            result.append(X[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            # Add character from X
            result.append(X[i - 1])
            i -= 1
        else:
            # Add character from Y
            result.append(Y[j - 1])
            j -= 1
    
    # Add remaining characters
    while i > 0:
        result.append(X[i - 1])
        i -= 1
    while j > 0:
        result.append(Y[j - 1])
        j -= 1
    
    return ''.join(reversed(result))
 
 
# Examples
print("Longest Repeated Substring:")
print(f"  'banana' → '{longest_repeated_substring('banana')}'")
print(f"  'abcabc' → '{longest_repeated_substring('abcabc')}'")
 
print("\nLongest Palindromic Substring:")
print(f"  'babad' → '{longest_palindromic_substring_expand('babad')}'")
print(f"  'cbbd' → '{longest_palindromic_substring_expand('cbbd')}'")
 
print("\nShortest Common Supersequence:")
print(f"  'AGGTAB', 'GXTXAYB' → '{shortest_common_supersequence('AGGTAB', 'GXTXAYB')}'")

Problem Reduction

Many string problems can be reduced to LCS or LCSubstr with modifications. Understanding these core algorithms provides a foundation for solving an entire class of related problems.

Choosing Between LCS and LCSubstr

Understanding when to use each algorithm is crucial for solving real-world problems correctly.

Use LCS When...

•File diff / version comparison
•DNA sequence alignment
•Finding similar orderings
•Edit distance calculation
•Merge conflict resolution
•Reordering detection
•Order-preserving similarity

Use LCSubstr When...

•Exact phrase matching
•Plagiarism detection (verbatim copy)
•DNA tandem repeat finding
•Common block identification
•Autocomplete matching
•Exact prefix/suffix matching
•Contiguous similarity

Decision Framework:

Ask: Do matches need to be consecutive?
- Yes → LCSubstr
- No → LCS
Ask: Can matched elements be interleaved with non-matched elements?
- Yes → LCS
- No → LCSubstr
Ask: Is the output a single contiguous block?
- Yes → LCSubstr
- No (can be scattered) → LCS

Hybrid Approach:

Sometimes you need both. For example, a plagiarism detector might:

Use LCSubstr to find verbatim copied blocks
Use LCS to find paraphrased content with preserved structure
Combine both signals for a plagiarism score

Summary: LCS Variations Complete

We've completed our deep dive into the Longest Common Subsequence problem and its close relative, the Longest Common Substring.

Module Takeaways

•LCS allows scattered matches; LCSubstr requires contiguity — A small definition change leads to different recurrences.
•LCS: max(top, left) on mismatch — Values propagate from neighbors.
•LCSubstr: 0 on mismatch — The chain breaks, counter resets.
•LCS answer is at dp[m][n]; LCSubstr answer is max over all cells — Different answer locations.
•Both have O(m×n) time and can be space-optimized — Similar efficiency profiles.
•Related problems form a family — Edit distance, repeated substrings, palindromes, supersequences.

Module Complete:

You've now mastered the Longest Common Subsequence family of problems:

Page 1: Complete LCS implementation with complexity analysis
Page 2: DP table construction and visual understanding
Page 3: Reconstruction algorithms for extracting the actual LCS
Page 4: Variations including Longest Common Substring and related problems

These algorithms are foundational for string processing, appearing in version control systems, bioinformatics tools, plagiarism detectors, and many other applications. The DP techniques you've learned here transfer directly to dozens of other problems.

Module Complete

Congratulations! You've completed the Longest Common Subsequence module. You now have a deep understanding of both LCS and LCSubstr, including their DP formulations, implementations, reconstructions, and applications. These are essential tools in any algorithm engineer's toolkit.

4 / 4

Loading learning content...

Data Structures & AlgorithmsString Algorithms

Longest Common Subsequence (Revisited)

LevelIntermediate

Duration75 mins

TopicString Algorithms

4 / 4

Variations: Longest Common Substring

When Contiguity Matters

We've mastered the Longest Common Subsequence, where elements can be scattered throughout the strings as long as their order is preserved. But what if we need consecutive matches?

Consider these scenarios:

Plagiarism detection: Verbatim copying involves contiguous text blocks, not scattered words
DNA repeat finding: Tandem repeats are contiguous patterns, not interleaved ones
Duplicate file detection: Similar file sections are contiguous byte sequences
Autocomplete: Matching prefixes are contiguous character sequences

What You Will Master

Problem Definition and Contrast

Definition: Substring

Definition: Longest Common Substring (LCSubstr)

Given two strings X and Y, find the longest string Z that is a substring of both X and Y.

Key Difference from LCS:

Property	Longest Common Subsequence	Longest Common Substring
Elements	Can be non-contiguous	Must be contiguous
Order	Preserved	Preserved
Typical result	Scattered through strings	Single block in each string
Length	≥ Substring length	≤ Subsequence length

LCS vs LCSubstr ComparisonComparing results for X = "ABABC" and Y = "BABCA"

Input

Output

Common Confusion

Naive Approach Analysis

Naive Algorithm:

Enumerate all substrings of X: O(m²) substrings
For each substring, check if it's a substring of Y: O(n) per check using string matching
Track the longest match

Time Complexity: O(m² × n) or O(m² × n) depending on how we check for substring membership.

Can We Do Better?

Another Approach: Suffix Arrays/Trees

Algorithm Comparison for Longest Common Substring
Algorithm	Time Complexity	Space Complexity	Implementation Difficulty
Naive (all substrings)	O(m² × n)	O(1)	Easy
Dynamic Programming	O(m × n)	O(m × n) or O(min(m,n))	Easy
Suffix Array	O((m + n) log(m + n))	O(m + n)	Moderate
Suffix Tree	O(m + n)	O(m + n)	Hard

Dynamic Programming Formulation

The Critical Insight:

For longest common substring, we need to track contiguous matches. Unlike LCS, where we could combine results from subproblems freely, here a mismatch "breaks the chain."

DP State Definition:

Let dp[i][j] = length of the longest common suffix of X[1..i] and Y[1..j] that is also a common substring ending at positions i and j.

In simpler terms: dp[i][j] is the length of the longest common substring that ends at X[i] and Y[j].

Recurrence:

dp[i][j] = 
  | dp[i-1][j-1] + 1    if X[i] == Y[j]
  | 0                   if X[i] ≠ Y[j]

Why the Difference from LCS?

In LCS, when characters don't match, we can still inherit the best result from shorter prefixes. In LCSubstr, a mismatch means the contiguous run is broken—we must restart from 0.

The Answer:

Unlike LCS where dp[m][n] gives the answer, for LCSubstr we need:

answer = max(dp[i][j]) for all i, j

The longest common substring could end anywhere in the strings.

LCS Recurrence

•Match: dp[i-1][j-1] + 1
•No match: max(dp[i-1][j], dp[i][j-1])
•Values propagate and accumulate
•Answer at dp[m][n]
•Can inherit from non-diagonal cells

LCSubstr Recurrence

•Match: dp[i-1][j-1] + 1
•No match: 0 (chain broken)
•Values reset on mismatch
•Answer is max over all cells
•Only diagonal propagation

Intuition Check

Think of the LCSubstr DP table as tracking "how long is the current matching streak ending here?" Like counting consecutive wins in a game—any loss resets the counter to zero.

Complete Implementation

Let's implement the longest common substring algorithm with full reconstruction capability.

longest_common_substring.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
def longest_common_substring(X: str, Y: str) -> tuple[int, str]:
    """
    Find the longest common substring of X and Y.
    
    Returns:
        Tuple of (length, the longest common substring)
    
    Time Complexity: O(m × n)
    Space Complexity: O(m × n)
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, ""
    
    # DP table: dp[i][j] = length of LCSubstr ending at X[i-1] and Y[j-1]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    max_length = 0
    end_index_x = 0  # End position in X (exclusive)
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                # Track the maximum
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_index_x = i
            else:
                dp[i][j] = 0  # Chain broken
    
    # Extract the substring
    substring = X[end_index_x - max_length:end_index_x]
    
    return max_length, substring
 
 
def longest_common_substring_all(X: str, Y: str) -> tuple[int, list[str]]:
    """
    Find ALL longest common substrings (may have multiple of same max length).
    
    Returns:
        Tuple of (max length, list of all LCS strings)
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, [""]
    
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    max_length = 0
    end_positions = []  # (end_index_x, end_index_y) for all max-length substrings
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_positions = [(i, j)]  # Reset with new max
                elif dp[i][j] == max_length:
                    end_positions.append((i, j))  # Add to list
            else:
                dp[i][j] = 0
    
    # Extract all unique substrings
    substrings = set()
    for end_x, end_y in end_positions:
        substrings.add(X[end_x - max_length:end_x])
    
    return max_length, list(substrings)
 
 
def print_table(X: str, Y: str) -> None:
    """Visualize the DP table for educational purposes."""
    m, n = len(X), len(Y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    
    # Print table
    print("LCSubstr DP Table:")
    print("    ε", end="")
    for c in Y:
        print(f"  {c}", end="")
    print()
    
    print("  ε", end="")
    for j in range(n + 1):
        print(f"  {dp[0][j]}", end="")
    print()
    
    for i in range(1, m + 1):
        print(f"  {X[i-1]}", end="")
        for j in range(n + 1):
            print(f"  {dp[i][j]}", end="")
        print()
 
 
# Example usage
X = "ABABC"
Y = "BABCA"
 
print(f"X = '{X}'")
print(f"Y = '{Y}'")
print()
 
print_table(X, Y)
print()
 
length, substr = longest_common_substring(X, Y)
print(f"Longest common substring length: {length}")
print(f"One longest common substring: '{substr}'")
 
length, all_substrs = longest_common_substring_all(X, Y)
print(f"All longest common substrings: {all_substrs}")

Understanding the Table:

For X = "ABABC" and Y = "BABCA":

LCSubstr DP Table:
      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  4  0

Wait—that's wrong! Let me recalculate:

X = ABABC (indices 0-4)
Y = BABCA (indices 0-4)

Actually:
      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  4  0

Hmm, that shows max = 4 at position (5, 4), which would mean substring ending at X[4]='C' and Y[3]='C'. Tracing back 4 characters: X[1:5] = "BABC", but Y doesn't have "BABC" as a substring...

Let me recalculate more carefully:

X[4]='C', Y[3]='C' → match, dp[5][4] = dp[4][3] + 1
dp[4][3]: X[3]='B', Y[2]='B' → match, = dp[3][2] + 1
dp[3][2]: X[2]='A', Y[1]='A' → match, = dp[2][1] + 1
dp[2][1]: X[1]='B', Y[0]='B' → match, = dp[1][0] + 1 = 1

So dp[2][1]=1, dp[3][2]=2, dp[4][3]=3, dp[5][4]=4? But that means "BABC" is common, which isn't right for the original strings.

Actually, I made an indexing error. Let me be more careful:

X = "ABABC": X[0]='A', X[1]='B', X[2]='A', X[3]='B', X[4]='C' Y = "BABCA": Y[0]='B', Y[1]='A', Y[2]='B', Y[3]='C', Y[4]='A'

The table is:

      ε  B  A  B  C  A
  ε   0  0  0  0  0  0
  A   0  0  1  0  0  1
  B   0  1  0  2  0  0
  A   0  0  2  0  0  1
  B   0  1  0  3  0  0
  C   0  0  0  0  1  0

At (4,3): X[3]='B', Y[2]='B' → check dp[3][2]. dp[3][2]: X[2]='A', Y[1]='A' → 2. So dp[4][3] = 3, corresponding to "BAB" (ending at X[3], Y[2]).

At (5,4): X[4]='C', Y[3]='C' → dp[4][3]=3 → dp[5][4]=4? No wait, dp[i-1][j-1] = dp[4][3] = 3, so dp[5][4] = 4?

But the substring would be X[1:5] = "BABC". Is "BABC" a substring of Y = "BABCA"? Yes! Y[0:4] = "BABC". So the answer IS 4, and the substring is "BABC".

I was wrong earlier—"BABC" IS a substring of both strings!

Final Correction

Space Optimization

Just like with LCS, we can reduce the space complexity from O(m × n) to O(min(m, n)) by observing that each cell only depends on the diagonal cell from the previous row.

lcsubstr_optimized.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
def longest_common_substring_optimized(X: str, Y: str) -> tuple[int, str]:
    """
    Space-optimized longest common substring.
    
    Time Complexity: O(m × n)
    Space Complexity: O(min(m, n))
    
    Key insight: Each cell dp[i][j] only depends on dp[i-1][j-1].
    We can use a single row, but need to save the diagonal value
    before it's overwritten.
    """
    # Make Y the shorter string
    if len(X) < len(Y):
        X, Y = Y, X
    
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0, ""
    
    # Single row DP
    dp = [0] * (n + 1)
    
    max_length = 0
    end_index_x = 0
    
    for i in range(1, m + 1):
        # We need to process right-to-left to avoid using updated values
        # OR we save the diagonal separately
        prev_diagonal = 0
        
        for j in range(1, n + 1):
            # Save current value (it will be diagonal for next iteration)
            old_dp_j = dp[j]
            
            if X[i - 1] == Y[j - 1]:
                dp[j] = prev_diagonal + 1
                
                if dp[j] > max_length:
                    max_length = dp[j]
                    end_index_x = i
            else:
                dp[j] = 0
            
            prev_diagonal = old_dp_j
    
    return max_length, X[end_index_x - max_length:end_index_x]
 
 
def longest_common_substring_constant_space(X: str, Y: str) -> int:
    """
    O(1) space solution using a different approach.
    
    Idea: For each possible diagonal in the table, compute the
    longest matching run. Each diagonal can be processed with O(1)
    extra space.
    
    Time Complexity: O(m × n) 
    Space Complexity: O(1) (not counting input/output)
    
    Note: This approach is trickier to implement correctly and
    reconstruction is harder. Use only when space is critical.
    """
    m, n = len(X), len(Y)
    
    if m == 0 or n == 0:
        return 0
    
    max_length = 0
    
    # Process diagonals starting from first column
    for start_i in range(m):
        i, j = start_i, 0
        current_length = 0
        
        while i < m and j < n:
            if X[i] == Y[j]:
                current_length += 1
                max_length = max(max_length, current_length)
            else:
                current_length = 0
            i += 1
            j += 1
    
    # Process diagonals starting from first row (except 0,0 already done)
    for start_j in range(1, n):
        i, j = 0, start_j
        current_length = 0
        
        while i < m and j < n:
            if X[i] == Y[j]:
                current_length += 1
                max_length = max(max_length, current_length)
            else:
                current_length = 0
            i += 1
            j += 1
    
    return max_length
 
 
# Example
X = "ABABC"
Y = "BABCA"
 
length1, substr1 = longest_common_substring_optimized(X, Y)
length2 = longest_common_substring_constant_space(X, Y)
 
print(f"Optimized: length={length1}, substring='{substr1}'")
print(f"Constant space: length={length2}")

Diagonal Processing Insight

Related Variations

The LCS and LCSubstr problems belong to a family of related string comparison problems. Understanding their relationships builds intuition for string algorithms.

String Comparison Problem Family
Problem	Description	Time Complexity	Key Insight
Longest Common Subsequence	Longest sequence preserving order	O(m × n)	Matches can be scattered
Longest Common Substring	Longest contiguous match	O(m × n)	Matches must be consecutive
Edit Distance	Min operations to transform	O(m × n)	Insert, delete, substitute
Longest Repeated Substring	Longest substring appearing twice in same string	O(n²) or O(n)	LCSubstr of S with itself, excluding trivial match
Longest Palindromic Substring	Longest substring that's a palindrome	O(n²) or O(n)	LCSubstr of S with reverse(S), with position constraints
Shortest Common Supersequence	Shortest string containing both as subsequences	O(m × n)	Length = m + n - LCS

Longest Repeated Substring

Given a string S, find the longest substring that appears at least twice. This can be solved by finding the LCSubstr of S with itself, but we must ensure the two occurrences don't overlap trivially.

Example: For S = "banana", the longest repeated substring is "ana" (appears at positions 1 and 3).

Longest Palindromic Substring

related_problems.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.
    
    Uses a modified LCSubstr approach comparing s with itself,
    ensuring we don't count overlapping matches.
    
    Time Complexity: O(n²)
    Space Complexity: O(n²)
    """
    n = len(s)
    
    if n == 0:
        return ""
    
    # dp[i][j] = length of LCSubstr ending at s[i-1] and s[j-1]
    # where i != j (different positions)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    
    max_length = 0
    end_index = 0
    
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Skip diagonal (same position)
            if i != j and s[i - 1] == s[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_index = i
    
    return s[end_index - max_length:end_index]
 
 
def longest_palindromic_substring_expand(s: str) -> str:
    """
    Find the longest palindromic substring using expand-around-center.
    
    Time Complexity: O(n²)
    Space Complexity: O(1)
    """
    if len(s) < 2:
        return s
    
    def expand(left: int, right: int) -> tuple[int, int]:
        """Expand around center while characters match."""
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - 1  # Return valid bounds
    
    start, end = 0, 0
    
    for i in range(len(s)):
        # Odd-length palindromes (center at i)
        l1, r1 = expand(i, i)
        if r1 - l1 > end - start:
            start, end = l1, r1
        
        # Even-length palindromes (center between i and i+1)
        l2, r2 = expand(i, i + 1)
        if r2 - l2 > end - start:
            start, end = l2, r2
    
    return s[start:end + 1]
 
 
def shortest_common_supersequence(X: str, Y: str) -> str:
    """
    Find the shortest string that contains both X and Y as subsequences.
    
    Uses the relationship: SCS_length = len(X) + len(Y) - LCS_length
    
    Time Complexity: O(m × n)
    Space Complexity: O(m × n)
    """
    m, n = len(X), len(Y)
    
    # Build LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    
    # Reconstruct SCS by backtracking
    result = []
    i, j = m, n
    
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            # Character in LCS - add once
            result.append(X[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            # Add character from X
            result.append(X[i - 1])
            i -= 1
        else:
            # Add character from Y
            result.append(Y[j - 1])
            j -= 1
    
    # Add remaining characters
    while i > 0:
        result.append(X[i - 1])
        i -= 1
    while j > 0:
        result.append(Y[j - 1])
        j -= 1
    
    return ''.join(reversed(result))
 
 
# Examples
print("Longest Repeated Substring:")
print(f"  'banana' → '{longest_repeated_substring('banana')}'")
print(f"  'abcabc' → '{longest_repeated_substring('abcabc')}'")
 
print("\nLongest Palindromic Substring:")
print(f"  'babad' → '{longest_palindromic_substring_expand('babad')}'")
print(f"  'cbbd' → '{longest_palindromic_substring_expand('cbbd')}'")
 
print("\nShortest Common Supersequence:")
print(f"  'AGGTAB', 'GXTXAYB' → '{shortest_common_supersequence('AGGTAB', 'GXTXAYB')}'")

Problem Reduction

Many string problems can be reduced to LCS or LCSubstr with modifications. Understanding these core algorithms provides a foundation for solving an entire class of related problems.

Choosing Between LCS and LCSubstr

Understanding when to use each algorithm is crucial for solving real-world problems correctly.

Use LCS When...

•File diff / version comparison
•DNA sequence alignment
•Finding similar orderings
•Edit distance calculation
•Merge conflict resolution
•Reordering detection
•Order-preserving similarity

Use LCSubstr When...

•Exact phrase matching
•Plagiarism detection (verbatim copy)
•DNA tandem repeat finding
•Common block identification
•Autocomplete matching
•Exact prefix/suffix matching
•Contiguous similarity

Decision Framework:

Ask: Do matches need to be consecutive?
- Yes → LCSubstr
- No → LCS
Ask: Can matched elements be interleaved with non-matched elements?
- Yes → LCS
- No → LCSubstr
Ask: Is the output a single contiguous block?
- Yes → LCSubstr
- No (can be scattered) → LCS

Hybrid Approach:

Sometimes you need both. For example, a plagiarism detector might:

Use LCSubstr to find verbatim copied blocks
Use LCS to find paraphrased content with preserved structure
Combine both signals for a plagiarism score

Summary: LCS Variations Complete

We've completed our deep dive into the Longest Common Subsequence problem and its close relative, the Longest Common Substring.

Module Takeaways

•LCS allows scattered matches; LCSubstr requires contiguity — A small definition change leads to different recurrences.
•LCS: max(top, left) on mismatch — Values propagate from neighbors.
•LCSubstr: 0 on mismatch — The chain breaks, counter resets.
•LCS answer is at dp[m][n]; LCSubstr answer is max over all cells — Different answer locations.
•Both have O(m×n) time and can be space-optimized — Similar efficiency profiles.
•Related problems form a family — Edit distance, repeated substrings, palindromes, supersequences.

Module Complete:

You've now mastered the Longest Common Subsequence family of problems:

Page 1: Complete LCS implementation with complexity analysis
Page 2: DP table construction and visual understanding
Page 3: Reconstruction algorithms for extracting the actual LCS
Page 4: Variations including Longest Common Substring and related problems

Module Complete

4 / 4