In the 1970s, three computer scientists (Donald Knuth, James Morris, and Vaughan Pratt) independently discovered the same profound insight about string matching. Their discovery, now known as the KMP algorithm, doesn't just improve pattern matching by a constant factor or shave off a few operations. It fundamentally transforms how we think about the problem itself.
The KMP algorithm's genius lies in a single, elegant observation: every mismatch carries information we can exploit. When the naive algorithm encounters a mismatch, it discards everything it has learned about the text and starts fresh at the next position. This is spectacularly wasteful. KMP asks: What if we remembered what we've seen?
By the end of this page, you will understand the fundamental insight that powers the KMP algorithm: that mismatches during pattern matching aren't failures to be forgotten—they're information to be leveraged. You'll see exactly why the naive algorithm does redundant work and precisely how KMP avoids it.
Before we can appreciate KMP's insight, we need to deeply understand the problem it solves. The naive pattern matching algorithm is intuitive and easy to implement:
This approach works, but it has a fundamental inefficiency that becomes catastrophic in worst-case scenarios.
```python
def naive_search(text: str, pattern: str) -> list[int]:
    """
    Naive pattern matching algorithm.
    Returns all starting positions where pattern occurs in text.

    Time Complexity: O(n × m) in the worst case
    Space Complexity: O(1) excluding output
    """
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):  # Try each starting position
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # Full match found
            matches.append(i)
        # If mismatch: we just move to the next position,
        # FORGETTING everything we learned about text[i+1..i+j-1]
    return matches
```

**The hidden catastrophe:**
Consider searching for the pattern "AAAAAAB" in the text "AAAAAAAAAAAAAAAAB". Watch what happens:
```
Text:     A A A A A A A A A A A A A A A A B
Pattern:  A A A A A A B
          ✓ ✓ ✓ ✓ ✓ ✓ ✗    → Mismatch at pattern index 6

Shift pattern by 1:
            A A A A A A B
            ✓ ✓ ✓ ✓ ✓ ✓ ✗  → Mismatch at pattern index 6 (again!)

Shift pattern by 1:
              A A A A A A B
              ✓ ✓ ✓ ✓ ✓ ✓ ✗ → Mismatch at pattern index 6 (again!)
```
Do you see the problem? We're comparing many of the same characters over and over. We already know that text[1..5] contains all As—we compared them in the first alignment! Yet we compare them again, and again, and again.
This redundancy leads to O(n × m) time complexity in the worst case, where n is the text length and m is the pattern length.
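To see this blowup concretely, here is a small instrumented sketch (the helper name `count_naive_comparisons` is ours, for illustration) that counts every character comparison the naive algorithm makes on the adversarial input family above:

```python
def count_naive_comparisons(text: str, pattern: str) -> int:
    """Count individual character comparisons made by naive search."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1                 # every text[i+j] vs pattern[j] check
            if text[i + j] != pattern[j]:
                break
    return comparisons

# Highly self-similar pattern and text: the worst case for naive search.
for n in (100, 1_000, 10_000):
    text = "A" * n + "B"
    pattern = "A" * 6 + "B"
    print(n, count_naive_comparisons(text, pattern))
# The count grows roughly as 7 × n: almost every alignment re-checks
# six 'A's it has already seen before hitting the mismatch on 'B'.
```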
This isn't a theoretical concern. DNA sequences, repetitive log files, and network protocol parsing regularly encounter patterns with high self-similarity. A naive search for 'AAAAAAAT' in a megabyte of 'A's would perform millions of redundant character comparisons.
Let's analyze precisely what the naive algorithm forgets. Consider this scenario:
```
Text:     A B C A B C A B D
Pattern:  A B C A B D
          ✓ ✓ ✓ ✓ ✓ ✗    → Mismatch at index 5 (pattern 'D' vs text 'C')
```
At the moment of mismatch, we know the following:
- **We matched "ABCAB"**: the first 5 characters of the pattern matched the text at positions 0-4.
- **Therefore, text[0..4] = "ABCAB"**: this is certain; we just verified it character by character.
Now observe something crucial: The suffix "AB" of what we matched ("ABC**AB**") is also a prefix of the pattern ("**AB**CABD")!
This last observation is the key insight. When we shift the pattern right by one position and start comparing again from scratch, we're ignoring the fact that we already know some characters will match.
When a mismatch occurs after matching k characters, we don't need to start over from scratch. If the matched portion has a suffix that's also a prefix of the pattern, we can 'slide' the pattern forward to align that prefix with the matching suffix—and continue comparing from there.
Let's visualize this:
Step 1: After the mismatch at position 5

```
Text:     A B C A B C A B D
Pattern:  A B C A B D
          └─┬─┘ └─┬─┘ ✗
          prefix suffix    → the suffix "AB" of the matched part matches the prefix "AB"!
```

Step 2: SMART shift - align the matching prefix with the suffix

```
Text:     A B C A B C A B D
Pattern:        A B C A B D
                └─┬─┘
                These are already known to match!
```
Instead of starting comparison from the beginning of the pattern at the new position, we can resume from index 2 of the pattern (right after the prefix "AB" that we know matches). We skip 2 comparisons that we already know would succeed.
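Here is that single smart shift written out as a tiny, self-contained sketch (the variable names are ours, chosen to mirror the example above):

```python
text = "ABCABCABD"
pattern = "ABCABD"

# We matched "ABCAB" and then mismatched at text[5].
# Its suffix "AB" (text[3..4]) equals the pattern prefix "AB",
# so slide the pattern to start at text position 3 and resume
# comparing at pattern index 2 instead of 0.
resume_at = 3             # new alignment of pattern[0] in the text
j = 2                     # pattern index to resume from ("AB" already known)
while j < len(pattern) and text[resume_at + j] == pattern[j]:
    j += 1
print(j == len(pattern))  # True: "ABCABD" matches at text position 3
```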
| Approach | Action After Mismatch | Characters Re-compared | Efficient? |
|---|---|---|---|
| Naive | Shift pattern by 1, restart comparison from index 0 | All previous matches (potentially m-1) | No |
| Smart (KMP) | Shift pattern to align prefix with matching suffix, resume from alignment point | Zero | Yes |
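To make the "Smart (KMP)" row concrete, here is a minimal sketch of the KMP matching loop. It assumes the `lps` array for the pattern is already available; building that array is the subject of the next pages, so treat this as a preview rather than the finished algorithm:

```python
def kmp_search(text: str, pattern: str, lps: list[int]) -> list[int]:
    """
    KMP matching loop (pattern assumed non-empty).
    lps[j] = length of the longest proper prefix of pattern[0..j]
             that is also a suffix of pattern[0..j] (precomputed).
    """
    n, m = len(text), len(pattern)
    matches = []
    i = j = 0                    # i indexes text, j indexes pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:           # full match ends at i-1
                matches.append(i - m)
                j = lps[j - 1]   # keep going: overlapping matches allowed
        elif j > 0:
            j = lps[j - 1]       # smart shift: reuse the matched prefix
        else:
            i += 1               # nothing matched here; advance in text
    return matches
```

Note how the text pointer `i` never decreases: every shift is expressed purely as a change to the pattern pointer `j`.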
The key to KMP is recognizing a relationship that exists purely within the pattern itself. We need to precompute, for each position in the pattern: "If a mismatch occurs after matching up to this position, how far back in the pattern can we safely resume?"
The answer depends on finding proper prefixes that are also suffixes of the matched portion.
Definition:

- A **prefix** of a string `s` is any substring starting at index 0: `s[0..k]` for some `k`.
- A **suffix** of `s` is any substring ending at the last index: `s[j..len-1]` for some `j`.
- A **proper** prefix or suffix is one that is not the entire string.

```
Pattern: "ABCAB"
Prefixes: "", "A", "AB", "ABC", "ABCA", "ABCAB"
Suffixes: "", "B", "AB", "CAB", "BCAB", "ABCAB"

Proper prefixes that are also suffixes:
- "" (empty string, trivial, length 0)
- "AB" (length 2) ✓ This is what we care about!

The LONGEST proper prefix that is also a suffix of "ABCAB" is "AB", length 2.

Pattern: "AAAA"
Proper prefixes that are also suffixes:
- "" (length 0)
- "A" (length 1)
- "AA" (length 2)
- "AAA" (length 3) ✓ Longest

Pattern: "ABCD"
Proper prefixes that are also suffixes:
- "" (length 0) ✓ Only the empty string qualifies

No non-trivial overlap, so if a mismatch occurs, we must restart from 0.
```

Why does this matter?
When we've matched pattern[0..j-1] and encounter a mismatch at pattern[j]:
1. `text[i-j..i-1]` equals `pattern[0..j-1]` (we just matched them).
2. If `pattern[0..j-1]` has a proper prefix of length `k` that equals its suffix of length `k`...
3. ...then `text[i-k..i-1]` (the last `k` characters we matched in the text) equals `pattern[0..k-1]` (the first `k` characters of the pattern).
4. So we can resume comparing at `pattern[k]`!

This length, the longest proper prefix that is also a suffix, is exactly what the 'failure function' (or 'LPS array', for Longest Proper prefix which is also Suffix) computes. For each position `j` in the pattern, it tells us: if you mismatch at position `j`, resume at position `lps[j-1]`. We'll build this array in detail on the next pages.
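The efficient construction comes later; for experimenting now, a brute-force version that follows the definition literally is enough (the name `compute_lps_bruteforce` is our own):

```python
def compute_lps_bruteforce(pattern: str) -> list[int]:
    """
    lps[j] = length of the longest proper prefix of pattern[0..j]
    that is also a suffix of pattern[0..j].
    Brute force, straight from the definition; fine for small inputs.
    """
    m = len(pattern)
    lps = [0] * m
    for j in range(m):
        chunk = pattern[: j + 1]
        for k in range(j, 0, -1):           # try the longest candidate first
            if chunk[:k] == chunk[-k:]:     # prefix of length k == suffix of length k
                lps[j] = k
                break
    return lps

print(compute_lps_bruteforce("ABCAB"))  # [0, 0, 0, 1, 2] -> longest for "ABCAB" is 2
print(compute_lps_bruteforce("AAAA"))   # [0, 1, 2, 3]    -> longest for "AAAA" is 3
print(compute_lps_bruteforce("ABCD"))   # [0, 0, 0, 0]    -> no non-trivial overlap
```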
Let's trace through a complete example to see how the smart shift strategy works. We'll search for "ABABAC" in "ABABABAC".
Pattern Analysis First:
```
Pattern: A B A B A C
Index:   0 1 2 3 4 5
```

For the prefix "ABABA" (indices 0-4), what's the longest proper prefix that is also a suffix?

```
Prefixes: A, AB, ABA, ABAB
Suffixes: A, BA, ABA, BABA
Common:   A (length 1), ABA (length 3)
Longest:  ABA (length 3)
```
So if we mismatch at index 5, we can resume at index 3!
Now the search:
Step 1:

```
Text:     A B A B A B A C
Pattern:  A B A B A C
          ✓ ✓ ✓ ✓ ✓ ✗
```

Mismatch at pattern index 5 (C ≠ B). We matched "ABABA" (indices 0-4). Longest prefix-suffix: "ABA" (length 3).

Step 2: SMART SHIFT

```
Text:     A B A B A B A C
Pattern:      A B A B A C
                    ↑
          Resume comparing from pattern index 3!
```

We DON'T compare indices 0, 1, 2 - we KNOW they match: text[2..4] = "ABA" = pattern[0..2].

Step 3: Continue

```
Text:     A B A B A B A C
Pattern:      A B A B A C
              - - - ✓ ✓ ✓
                    (we start here at j=3)
```

All remaining characters match! Match found at position 2.
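Assuming the `kmp_search` and `compute_lps_bruteforce` sketches from earlier, we can replay this trace in code:

```python
pattern = "ABABAC"
lps = compute_lps_bruteforce(pattern)
print(lps)                                   # [0, 0, 1, 2, 3, 0]
print(kmp_search("ABABABAC", pattern, lps))  # [2] -> match found at position 2
```

The value `lps[4] = 3` is exactly the "resume at index 3" step from the trace above.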
Notice that in KMP, the text pointer NEVER moves backward. Once we've compared text[i], we never compare it again. This single property guarantees O(n) text traversal. Combined with O(m) pattern preprocessing, we achieve O(n + m) total time—a dramatic improvement over O(n × m).
Let's formalize the insight to see why KMP is correct (the shift never skips a valid match position) and why the shift is maximal (we move as far as is safe, and no farther).
Theorem (Correctness): When KMP shifts the pattern after a mismatch, any starting positions skipped over cannot yield a valid match.
Proof Sketch:
Suppose we've matched `pattern[0..j-1]` against `text[i-j..i-1]`, and a mismatch occurs at `pattern[j]` vs `text[i]`. Let `k` be the LPS value of the matched portion, and consider any starting position `s` with `i-j < s < i-k` (the positions we skip). For a match at position `s`:

- `pattern[0]` must match `text[s]`
- `pattern[1]` must match `text[s+1]`
- ...
- `pattern[i-s-1]` must match `text[i-1]`

But this means `pattern[0..i-s-1]` would have to equal `text[s..i-1]`. Since we know `text[s..i-1]` equals `pattern[s-(i-j)..j-1]` (part of our matched substring), we'd need:

`pattern[0..i-s-1] = pattern[j-(i-s)..j-1]`

This says a prefix of the pattern (of length `i-s`) equals a suffix of `pattern[0..j-1]` of the same length. Because `s < i-k`, that length `i-s` exceeds `k`, our computed LPS value, which contradicts the definition of LPS as the longest such prefix-suffix!
The LPS (Longest Proper Prefix which is also Suffix) value is exactly the right amount: shifting any less would try a start position the proof above rules out; shifting any more would risk skipping a valid match position. LPS finds the perfect balance.
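A quick empirical check of the theorem, using the earlier sketches: on many random inputs, KMP reports exactly the same match positions as the naive algorithm, so the shifts never skip a valid start:

```python
import random

random.seed(42)
for _ in range(1_000):
    text = "".join(random.choice("AB") for _ in range(50))
    pattern = "".join(random.choice("AB") for _ in range(random.randint(1, 6)))
    lps = compute_lps_bruteforce(pattern)
    # Both functions return match positions in ascending order.
    assert kmp_search(text, pattern, lps) == naive_search(text, pattern)
print("KMP agreed with naive search on all random trials")
```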
The Key Invariant:
KMP maintains a crucial invariant throughout its execution:
> At every step, if we're comparing `pattern[j]` with `text[i]`, then `pattern[0..j-1]` is guaranteed to match `text[i-j..i-1]`.
This invariant means the text pointer i only moves forward (never backward). For each position of i, the pattern pointer j can decrease (via LPS lookups), but the total decrease is bounded by the total increase. This yields the amortized analysis showing O(n) text processing.
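The amortized bound is easy to observe directly. This instrumented variant of the matching loop (our own helper, mirroring `kmp_search` above) counts character comparisons; on the adversarial input from earlier, the total stays below 2n:

```python
def kmp_comparisons(text: str, pattern: str, lps: list[int]) -> int:
    """Count character comparisons made by the KMP matching loop."""
    n, m = len(text), len(pattern)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1            # one text[i] vs pattern[j] check
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                j = lps[j - 1]
        elif j > 0:
            j = lps[j - 1]          # j falls back; i stays put
        else:
            i += 1
    return comparisons

text = "A" * 10_000 + "B"           # n = 10,001
pattern = "AAAAAAB"
lps = compute_lps_bruteforce(pattern)
print(kmp_comparisons(text, pattern, lps))  # 19995, below the 2n bound of 20002
```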
| Property | What It Means | Why It Matters |
|---|---|---|
| No text backtracking | Text pointer only moves forward | Each text character examined at most once → O(n) |
| LPS maximality | LPS gives longest valid prefix-suffix overlap | We never skip valid match positions |
| LPS minimality | We shift as much as safely possible | No redundant comparisons with same text position |
| Invariant preservation | pattern[0..j-1] always matches text[i-j..i-1] | Algorithm state is always valid for correct matching |
KMP represents a fundamental paradigm shift in how we think about string matching. The naive algorithm treats the pattern and text symmetrically—just characters to compare. KMP recognizes a crucial asymmetry:
The pattern is known in advance. The text is not.
This asymmetry is the key. Since we know the pattern beforehand, we can precompute everything interesting about its structure. We can analyze which prefixes equal which suffixes, and encode this in a lookup table (the failure function / LPS array).
When we then scan the text, we're not just matching characters—we're leveraging our precomputed knowledge to make globally optimal decisions at each mismatch.
This 'precompute table from known data, then use table with unknown data' pattern appears throughout computer science: in dynamic programming, in compiler optimization, in database query planning. KMP is one of the cleanest examples of this principle in action.
Understanding why KMP works helps you make informed decisions about when to use it and how it fits into larger systems.
| Scenario | Pattern Characteristics | KMP Benefit |
|---|---|---|
| DNA sequence search | Long patterns with repetitive motifs (ATAT, CGCG) | Massive reduction in comparisons for similar sequences |
| Log file analysis | Searching for error codes, timestamps | Handles repetitive log line structures efficiently |
| Network protocol parsing | Fixed headers with repeated bytes | O(n) guarantee prevents timeout attacks |
| Text editor find | User-specified patterns, any structure | Consistent performance regardless of pattern type |
| Intrusion detection | Signature matching in traffic | Predictable, bounded time per byte scanned |
When KMP's insight is less impactful:

- **Patterns with no self-similarity**: for a pattern like "ABCDEF", the LPS array is all zeros, so every mismatch restarts from the beginning anyway. KMP degenerates to naive behavior (but never worse!).

However, even in such cases, KMP never performs worse than the naive algorithm asymptotically, and in practice the overhead is minimal. Many real-world implementations use KMP as the default for longer patterns.
In network and security applications, worst-case guarantees matter. An attacker could craft input to trigger naive algorithm's O(n×m) behavior, causing denial-of-service. KMP's O(n+m) guarantee is a security feature, not just a performance optimization.
We've established the fundamental insight that powers the KMP algorithm. Let's consolidate what we've learned:

- Every mismatch carries information: the characters we just matched tell us exactly what the text contains at those positions.
- The naive algorithm discards that information and re-compares characters, which leads to O(n × m) worst-case time.
- The longest proper prefix that is also a suffix (LPS) of the matched portion tells us exactly how far the pattern can safely shift.
- With this shift rule, the text pointer never moves backward, giving O(n + m) total time.
What's next:
Now that we understand why KMP works, we need to understand how it works in detail. The next page explores the failure function (LPS array)—the central data structure that encodes prefix-suffix relationships. We'll see exactly what values it contains, why each value is correct, and how to interpret it during matching.
You now understand the fundamental insight behind the KMP algorithm: that pattern matching can be made efficient by recognizing and exploiting the internal structure of the pattern. This insight—that mismatches provide information rather than just representing failure—is one of the most elegant ideas in algorithm design.