Loading learning content...
We've seen what the LPS array contains and how it enables efficient pattern matching. But how do we build the LPS array efficiently?
The naive approach would be: for each position i, try all possible prefix lengths and check if they match the corresponding suffix. This would be O(m³) or O(m²) with optimization—far too slow for large patterns.
The breakthrough insight is beautiful in its self-reference: we use the same pattern-matching technique to build the LPS array that we'll later use for text matching. The pattern becomes both the "text" and the "pattern" in a brilliant recursive construction.
This is one of the most elegant algorithms in computer science—efficient, correct, and conceptually deep.
By the end of this page, you will understand the complete LPS construction algorithm, why it works, and how to implement it. You'll see the elegant recursion that makes O(m) possible and be able to trace the algorithm by hand.
Before appreciating the efficient algorithm, let's understand why the obvious approach fails.
Naive Algorithm:
def naive_lps(pattern):
m = len(pattern)
lps = [0] * m
for i in range(1, m):
# For each position i, find longest proper prefix = suffix
for length in range(i, 0, -1): # Try longest first
prefix = pattern[0:length]
suffix = pattern[i-length+1:i+1]
if prefix == suffix:
lps[i] = length
break
return lps
Complexity Analysis:
For a pattern of 100,000 characters, this is 10¹⁵ operations—completely impractical.
The naive algorithm repeatedly compares the same substrings. If we've determined that 'ABAB' has LPS of 2 ('AB' matches), why should we re-examine 'AB' when computing the next LPS value? This redundancy is exactly what KMP's construction algorithm eliminates.
The efficient construction uses a powerful observation:
If we know LPS[i-1], we can compute LPS[i] by trying to extend the previous overlap.
Let's say LPS[i-1] = k. This means:
Now we want LPS[i]. There are two cases:
Case 1: pattern[k] == pattern[i]
The overlap extends! We have:
EXTENDING AN OVERLAP Previous state: pattern[0..i-1], LPS[i-1] = k Pattern: ... [0] [1] ... [k-1] [k] ... [i-k] [i-k+1] ... [i-1] [i] └─── prefix (k chars) ───┘ └──── suffix (k chars) ────┘ ↑ ↑ These are equal (LPS[i-1] = k) Now checking if pattern[k] == pattern[i]: Pattern: ... [0] [1] ... [k-1] [k] ... [i-k] [i-k+1] ... [i-1] [i] ↑ ↑ pattern[k] pattern[i] If they're equal:- pattern[0..k] = pattern[i-k..i]- The k-length overlap becomes (k+1)-length- LPS[i] = k + 1Case 2: pattern[k] != pattern[i]
The overlap doesn't extend. But we shouldn't give up—there might be a shorter overlap that we can extend.
Here's the key: the next candidate length to try is LPS[k-1].
Why? Because LPS[k-1] is the longest proper prefix of pattern[0..k-1] that equals a suffix. And since pattern[0..k-1] equals a suffix of what we're building, this "nested" overlap might extend to include pattern[i].
When extension fails, we "fall back" to LPS[k-1] and try again. This is the exact same fallback logic that happens during matching! We're matching the pattern against itself, using previously computed LPS values.
Here's the complete, efficient LPS construction algorithm:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950
def compute_lps(pattern: str) -> list[int]: """ Compute the LPS (Longest Proper Prefix which is also Suffix) array. Time Complexity: O(m) where m = len(pattern) Space Complexity: O(m) for the LPS array Args: pattern: The pattern string to preprocess Returns: LPS array where LPS[i] = length of longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i] """ m = len(pattern) if m == 0: return [] lps = [0] * m # LPS[0] is always 0 (no proper prefix for single char) # 'length' tracks the length of the current longest prefix-suffix # Think of it as: "I have matched 'length' characters of the prefix" length = 0 # Start from index 1 (LPS[0] = 0 is given) i = 1 while i < m: if pattern[i] == pattern[length]: # Extension successful! # The prefix pattern[0..length] matches suffix ending at i length += 1 lps[i] = length i += 1 else: # Extension failed if length > 0: # Fall back to the next candidate: LPS[length-1] # This is the "nested" overlap within our current overlap length = lps[length - 1] # NOTE: we do NOT increment i here - we retry with shorter length else: # length is 0, no prefix matches lps[i] = 0 i += 1 return lpsKey Points in the Algorithm:
length variable: Tracks how many characters of prefix we've matched to suffix.
Extension (line 30-33): When pattern[i] matches pattern[length], we've extended our prefix-suffix match.
Fallback (line 38): When extension fails but length > 0, we use the LPS value at position length-1 to find the next candidate.
Not incrementing i during fallback: This is crucial! We keep trying shorter overlaps at the same position i until we either find one that extends, or exhaust all possibilities (length becomes 0).
Reset (lines 41-43): When length = 0 and we still mismatch, there's no prefix-suffix—LPS[i] = 0.
Let's trace through the algorithm with several patterns to deeply understand how it works.
Example 1: Pattern "ABCAB"
Pattern: A B C A BIndex: 0 1 2 3 4 Initialize: lps = [0, 0, 0, 0, 0], length = 0, i = 1 ───────────────────────────────────────────────────────── i = 1, length = 0:Compare pattern[1]='B' with pattern[0]='A': NOT EQUALlength = 0, so: lps[1] = 0, i++→ lps = [0, 0, 0, 0, 0], length = 0, i = 2 ───────────────────────────────────────────────────────── i = 2, length = 0:Compare pattern[2]='C' with pattern[0]='A': NOT EQUALlength = 0, so: lps[2] = 0, i++→ lps = [0, 0, 0, 0, 0], length = 0, i = 3 ───────────────────────────────────────────────────────── i = 3, length = 0:Compare pattern[3]='A' with pattern[0]='A': EQUAL ✓length++, lps[3] = 1, i++→ lps = [0, 0, 0, 1, 0], length = 1, i = 4 ───────────────────────────────────────────────────────── i = 4, length = 1:Compare pattern[4]='B' with pattern[1]='B': EQUAL ✓length++, lps[4] = 2, i++→ lps = [0, 0, 0, 1, 2], length = 2, i = 5 ───────────────────────────────────────────────────────── i = 5 = m, DONE! FINAL: LPS = [0, 0, 0, 1, 2]Example 2: Pattern "AABAAAB" (with fallbacks)
Pattern: A A B A A A BIndex: 0 1 2 3 4 5 6 Initialize: lps = [0, 0, 0, 0, 0, 0, 0], length = 0, i = 1 ───────────────────────────────────────────────────────── i = 1, length = 0:Compare pattern[1]='A' with pattern[0]='A': EQUAL ✓length = 1, lps[1] = 1, i = 2→ lps = [0, 1, 0, 0, 0, 0, 0] ───────────────────────────────────────────────────────── i = 2, length = 1:Compare pattern[2]='B' with pattern[1]='A': NOT EQUALlength > 0, so fallback: length = lps[0] = 0(i stays at 2) length = 0:Compare pattern[2]='B' with pattern[0]='A': NOT EQUALlength = 0, so: lps[2] = 0, i = 3→ lps = [0, 1, 0, 0, 0, 0, 0] ───────────────────────────────────────────────────────── i = 3, length = 0:Compare pattern[3]='A' with pattern[0]='A': EQUAL ✓length = 1, lps[3] = 1, i = 4→ lps = [0, 1, 0, 1, 0, 0, 0] ───────────────────────────────────────────────────────── i = 4, length = 1:Compare pattern[4]='A' with pattern[1]='A': EQUAL ✓length = 2, lps[4] = 2, i = 5→ lps = [0, 1, 0, 1, 2, 0, 0] ───────────────────────────────────────────────────────── i = 5, length = 2:Compare pattern[5]='A' with pattern[2]='B': NOT EQUALlength > 0, so fallback: length = lps[1] = 1(i stays at 5) length = 1:Compare pattern[5]='A' with pattern[1]='A': EQUAL ✓length = 2, lps[5] = 2, i = 6→ lps = [0, 1, 0, 1, 2, 2, 0] ───────────────────────────────────────────────────────── i = 6, length = 2:Compare pattern[6]='B' with pattern[2]='B': EQUAL ✓length = 3, lps[6] = 3, i = 7→ lps = [0, 1, 0, 1, 2, 2, 3] ───────────────────────────────────────────────────────── i = 7 = m, DONE! FINAL: LPS = [0, 1, 0, 1, 2, 2, 3] NOTE: At i=5, we had a FALLBACK.pattern[5..5] = "A" didn't extend the "AA" overlap (position 2 is 'B').But after falling back, "A" DID extend the "A" overlap.lps[5] = 2 means "AA" (prefix) matches "AA" (the last two A's at positions 4,5).At position 5, we tried to extend the 'AAB' prefix, failed, fell back to extending 'A', and succeeded. This is the power of the recursive fallback—it finds the best possible overlap without restarting from scratch.
The fallback length = lps[length - 1] might seem magical. Let's prove it's correct.
Claim: When we can't extend a length-k overlap to include pattern[i], the next candidate overlap to try is of length lps[k-1].
Proof:
We have:
We're looking for a shorter prefix that could extend. Suppose there's a length-j overlap (j < k) that can extend:
Question: What are the valid candidates for j?
Since pattern[0..j-1] must equal pattern[i-j..i-1], and we know pattern[0..k-1] = pattern[i-k..i-1], the substring pattern[i-j..i-1] is a suffix of both:
So pattern[0..j-1] must be both:
This is exactly the definition of "proper prefix of pattern[0..k-1] that is also a suffix"—which is what lps[k-1] measures!
lps[k-1] is the LONGEST prefix of pattern[0..k-1] that is also a suffix. If this can't extend to include pattern[i], then nothing longer can. We try the longest first; if it fails, we fall back to lps[lps[k-1]-1], and so on.
VISUALIZING THE FALLBACK Pattern: A A B A A A BPositions: [0] [1] [2] [3] [4] [5] [6] A A B A A A B At i=5, trying to extend length=2 overlap:Current overlap: pattern[0..1] = pattern[3..4] = "AA"Try pattern[2]='B' vs pattern[5]='A': FAIL Now we need a shorter valid prefix that could extend. What's inside our overlap pattern[0..1] = "AA"?lps[1] = 1 means pattern[0..0] = "A" is also a suffix of "AA". So pattern[0..0] = "A" matches pattern[4..4] = "A"(the suffix of our known-matching region) Try pattern[1]='A' vs pattern[5]='A': SUCCESS!length = 2, lps[5] = 2 The "A" at position 0 matched the "A" at position 4,Now the "A" at position 1 matches the "A" at position 5.So pattern[0..1] = "AA" matches pattern[4..5] = "AA". This is a DIFFERENT "AA" overlap than before:Before: positions 0-1 matching 3-4After: positions 0-1 matching 4-5Now let's see the complete KMP implementation with both LPS construction and pattern matching:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596
def kmp_search(text: str, pattern: str) -> list[int]: """ Complete KMP algorithm implementation. Finds all occurrences of pattern in text using the Knuth-Morris-Pratt algorithm. Time Complexity: O(n + m) where n = len(text), m = len(pattern) Space Complexity: O(m) for the LPS array Args: text: The text to search in pattern: The pattern to search for Returns: List of starting indices where pattern occurs in text """ n, m = len(text), len(pattern) # Edge cases if m == 0: return list(range(n + 1)) # Empty pattern matches at every position if n < m: return [] # Pattern longer than text # ======================================== # PHASE 1: Build the LPS array - O(m) # ======================================== lps = [0] * m length = 0 # Length of the previous longest prefix-suffix i = 1 while i < m: if pattern[i] == pattern[length]: length += 1 lps[i] = length i += 1 else: if length > 0: length = lps[length - 1] # Fallback else: lps[i] = 0 i += 1 # ======================================== # PHASE 2: Search for pattern - O(n) # ======================================== matches = [] i = 0 # Index in text j = 0 # Index in pattern while i < n: if text[i] == pattern[j]: i += 1 j += 1 if j == m: # Complete match found! matches.append(i - m) # Continue searching for overlapping matches j = lps[j - 1] else: if j > 0: j = lps[j - 1] # Use LPS to skip else: i += 1 # Move to next text position return matches # ========================================# TESTING AND VERIFICATION# ======================================== def verify_lps(pattern: str, lps: list[int]) -> bool: """Verify that an LPS array is correct (for testing).""" for i in range(len(pattern)): # lps[i] should be the longest proper prefix = suffix of pattern[0..i] k = lps[i] if k > i: return False # lps[i] can't exceed i if k > 0: if pattern[0:k] != pattern[i-k+1:i+1]: return False # The prefix-suffix must actually match return True # Example usageif __name__ == "__main__": text = "ABABABABCABAAB" pattern = "ABABC" print(f"Text: {text}") print(f"Pattern: {pattern}") print(f"Matches at positions: {kmp_search(text, pattern)}")| Component | Lines | Complexity | Purpose |
|---|---|---|---|
| LPS Construction | 30-42 | O(m) | Preprocess pattern structure |
| Matching Loop | 48-63 | O(n) | Find all occurrences |
| Match Recording | 57-59 | O(1) each | Store match positions |
| Overlap Continuation | 62 | O(1) | Handle overlapping matches |
The KMP algorithm can be implemented in various languages. Here are optimized versions:
1234567891011121314151617181920212223242526272829
def kmp_search(text: str, pattern: str) -> list[int]: """Memory-efficient KMP with generator option.""" n, m = len(text), len(pattern) if m == 0 or n < m: return [] if n < m else list(range(n + 1)) # Build LPS lps = [0] * m k = 0 for i in range(1, m): while k > 0 and pattern[i] != pattern[k]: k = lps[k - 1] if pattern[i] == pattern[k]: k += 1 lps[i] = k # Search matches = [] j = 0 for i, char in enumerate(text): while j > 0 and char != pattern[j]: j = lps[j - 1] if char == pattern[j]: j += 1 if j == m: matches.append(i - m + 1) j = lps[j - 1] return matchesWe've completed our deep dive into the KMP algorithm's LPS construction. Let's consolidate:
KMP Mastery Complete:
Congratulations! You now have complete mastery of the Knuth-Morris-Pratt algorithm:
The KMP algorithm stands as one of the most elegant achievements in algorithm design—transforming a seemingly simple problem into a showcase for deep insight about information, redundancy, and efficient computation.
You have mastered the KMP algorithm in full. You understand its motivation (avoiding redundant comparisons), its central data structure (the LPS/failure function), its technique (leveraging prefix-suffix relationships), its complexity (O(n+m) optimal), and its implementation. You're now equipped to implement KMP confidently and explain why it works.