The KMP algorithm's power is concentrated in a single data structure: the failure function, also known as the LPS array (Longest Proper Prefix which is also Suffix). This array, computed during preprocessing, encodes everything the algorithm needs to know about the pattern's internal structure.
Understanding the LPS array is essential: once you truly understand it, the entire KMP algorithm becomes transparent.
By the end of this page, you will understand exactly what the LPS array represents, how to compute its values by hand, and how to interpret its values during matching. You'll see multiple examples and develop intuition for the array's structure.
Let's establish precise definitions before we explore examples.
Definitions:

- A prefix of a string is any leading segment of it; a proper prefix is a prefix that is not the entire string.
- A suffix is any trailing segment; a proper suffix is a suffix that is not the entire string.

We require a 'proper' prefix/suffix (not the whole string) because any string is trivially both a prefix and a suffix of itself. If we allowed the improper match, LPS[i] would always equal i+1, which tells us nothing useful. The proper constraint forces us to find genuine internal repetition.
Formal Definition:
LPS[i] = max { k : 0 ≤ k ≤ i and P[0..k-1] = P[i-k+1..i] }
In words: LPS[i] is the length of the longest string that is both:
- a proper prefix of P[0..i], and
- a suffix of P[0..i].

If no non-trivial match exists, LPS[i] = 0.
| Property | Value | Explanation |
|---|---|---|
| Size | m (pattern length) | One entry for each pattern position |
| LPS[i] range | 0 to i | Can't exceed i (that would be the whole substring) |
| LPS[0] | Always 0 | Single character has no proper prefix-suffix |
| LPS[i] meaning | Resume position after mismatch | If mismatch at i+1, resume comparing at LPS[i] |
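Before the hand-worked examples, here is a minimal brute-force sketch of this definition: a hypothetical `brute_force_lps` helper (not the efficient algorithm, which comes on the next page) that simply tries every candidate border length, longest first.

```python
def brute_force_lps(pattern: str) -> list[int]:
    """Compute LPS values directly from the definition.

    Deliberately naive (roughly O(m^2) length checks, each up to O(m) characters);
    useful only for checking small examples by hand.
    """
    m = len(pattern)
    lps = [0] * m
    for i in range(m):
        # Try every proper border length k of pattern[0..i], longest first.
        for k in range(i, 0, -1):
            if pattern[:k] == pattern[i - k + 1 : i + 1]:
                lps[i] = k
                break
    return lps


print(brute_force_lps("ABCAB"))    # [0, 0, 0, 1, 2]
print(brute_force_lps("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

The two printed arrays match the hand computations worked out below.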
Let's compute LPS arrays for several patterns to build intuition. We'll be thorough and methodical.
Example 1: Pattern "ABCAB"
Index: 0 1 2 3 4
Pattern: A B C A B
LPS[0]: Consider P[0..0] = "A" → a single character has no proper prefix-suffix → LPS[0] = 0
LPS[1]: Consider P[0..1] = "AB" → prefix "A" ≠ suffix "B" → LPS[1] = 0
LPS[2]: Consider P[0..2] = "ABC" → no prefix equals a suffix → LPS[2] = 0
LPS[3]: Consider P[0..3] = "ABCA" → prefix "A" = suffix "A" → LPS[3] = 1
LPS[4]: Consider P[0..4] = "ABCAB" → prefix "AB" = suffix "AB" → LPS[4] = 2
Pattern: A B C A B
Index:   0 1 2 3 4
LPS:     0 0 0 1 2

Interpretation:
- LPS[3] = 1: "ABCA" has "A" as a proper prefix that equals its suffix
- LPS[4] = 2: "ABCAB" has "AB" as a proper prefix that equals its suffix

Example 2: Pattern "AAAA" (highly repetitive)
Index: 0 1 2 3
Pattern: A A A A
LPS[0]: P[0..0] = "A" → No proper prefix-suffix → LPS[0] = 0
LPS[1]: P[0..1] = "AA" → prefix "A" = suffix "A" → LPS[1] = 1
LPS[2]: P[0..2] = "AAA" → prefix "AA" = suffix "AA" → LPS[2] = 2
LPS[3]: P[0..3] = "AAAA" → prefix "AAA" = suffix "AAA" → LPS[3] = 3
Pattern: A A A A
Index:   0 1 2 3
LPS:     0 1 2 3

Interpretation:
This pattern has the maximum possible LPS values (except LPS[0], which is always 0). Each position i has LPS[i] = i, meaning:
- "AA" has 1-overlap
- "AAA" has 2-overlap
- "AAAA" has 3-overlap

This allows aggressive shifting during a mismatch: after matching k A's and hitting a mismatch, we can resume at position k-1 immediately!

Example 3: Pattern "ABCDABD"
Index: 0 1 2 3 4 5 6
Pattern: A B C D A B D
Computing each value:
LPS[0] = 0  "A" - no proper prefix-suffix
LPS[1] = 0  "AB" - prefix "A" ≠ suffix "B"
LPS[2] = 0  "ABC" - no prefix equals a suffix
LPS[3] = 0  "ABCD" - no prefix equals a suffix
LPS[4] = 1  "ABCDA" - prefix "A" = suffix "A"
LPS[5] = 2  "ABCDAB" - prefix "AB" = suffix "AB"
LPS[6] = 0  "ABCDABD" - prefix "ABC" ≠ suffix "ABD", and no shorter prefix works either
Pattern: A B C D A B D
Index:   0 1 2 3 4 5 6
LPS:     0 0 0 0 1 2 0

Interpretation:
- The pattern has no repetition until position 4, where "A" reappears
- Positions 4-5 build up an overlap ("A", then "AB")
- Position 6 breaks the pattern (extending the "AB" overlap would need a 'C', but we see 'D'), so the overlap resets to 0

This means: if we mismatch at pattern position 6, we resume comparing at position 2 (LPS[5] = 2); if we then mismatch again before rebuilding any overlap, we fall back to position 0 and restart from the beginning of the pattern.

The LPS array reveals the internal symmetry of a pattern. Let's visualize what LPS values tell us about pattern structure.
Pattern: "ABABAB" (periodic pattern)
Index: 0 1 2 3 4 5
Pattern: A B A B A B
┌───┬───┬───┬───┬───┬───┐
│ A │ B │ A │ B │ A │ B │
└───┴───┴───┴───┴───┴───┘
  0   1   2   3   4   5

- The A's at positions 0, 2, and 4 are "aligned": the prefix "A" reappears as a suffix.
- The AB's are "aligned" prefixes/suffixes: prefix "AB" (positions 0-1) reappears as suffix "AB" (positions 4-5).
- "ABAB" is aligned at positions 0-3 and 2-5: the prefix "ABAB" reappears as the suffix "ABAB".
LPS: 0 0 1 2 3 4
LPS[2]=1: prefix "A" matches suffix "A" of "ABA"
LPS[3]=2: prefix "AB" matches suffix "AB" of "ABAB"
LPS[4]=3: prefix "ABA" matches suffix "ABA" of "ABABA"
LPS[5]=4: prefix "ABAB" matches suffix "ABAB" of "ABABAB"
A pattern with period p (repeating every p characters) will have LPS[i] = i - p + 1 for i ≥ p. The LPS array effectively encodes the period length! For 'ABABAB', period = 2, and we see LPS[5] = 5 - 2 + 1 = 4 ✓
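As a small illustration of that relationship (reusing the hypothetical `brute_force_lps` helper sketched earlier), the shortest period of a pattern can be read directly off its last LPS entry:

```python
def shortest_period(pattern: str) -> int:
    """Shortest period p: the pattern consists of repetitions of its first p
    characters (the final repetition may be cut off)."""
    lps = brute_force_lps(pattern)   # any correct LPS computation works here
    return len(pattern) - lps[-1]


print(shortest_period("ABABAB"))   # 2  (length 6 minus LPS[5] = 4)
print(shortest_period("ABCDABD"))  # 7  (LPS[6] = 0, so no shorter period exists)
```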
Pattern: "AABAAAB" (mixed repetition)
Index: 0 1 2 3 4 5 6
Pattern: A A B A A A B
LPS[0] = 0 "A" - no proper prefix-suffix
LPS[1] = 1 "AA" - prefix "A" = suffix "A"
LPS[2] = 0 "AAB" - neither "A" nor "AA" equals "B" or "AB"
LPS[3] = 1 "AABA" - prefix "A" = suffix "A"
LPS[4] = 2 "AABAA" - prefix "AA" = suffix "AA"
LPS[5] = 2 "AABAAA" - prefix "AA" = suffix "AA"
(not "AAA" because prefix "AAA" doesn't exist - pattern is AAB...)
LPS[6] = 3 "AABAAAB" - prefix "AAB" = suffix "AAB"
A A B A A A B
└─┬─┘   └─┬─┘
 AAB  =  AAB  →  LPS[6] = 3
Pattern: A A B A A A B
Index:   0 1 2 3 4 5 6
LPS:     0 1 0 1 2 2 3

LPS Visualization (showing what overlaps):

Position 1 (LPS=1): [A|A]           prefix "A" = suffix "A"
Position 4 (LPS=2): [A A|B A A]     prefix "AA" = suffix "AA"
Position 6 (LPS=3): [A A B|A A A B] prefix "AAB" = suffix "AAB"

The LPS values form a kind of "contour" showing how structure builds up:

Index:   0 1 2 3 4 5 6
LPS:     0 1 0 1 2 2 3
         └┬┘ └┬┘ └─┬─┘
The pattern resets and rebuilds its overlap multiple times.

| LPS Pattern | Example Pattern | Interpretation |
|---|---|---|
| All zeros | ABCDEF | No internal repetition - every mismatch restarts at 0 |
| Increasing 0,1,2,3... | AAAA | Highly repetitive - aggressive shifts possible |
| Saw-tooth (up then down) | ABCABD | Partial repetition breaks mid-way |
| Periodic staircase | ABABAB | Regular periodic pattern detected |
| Sparse non-zeros | ABXABY | Limited overlap opportunities |
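To see these shapes concretely, a quick loop (again using the hypothetical `brute_force_lps` helper from the earlier sketch) prints the LPS array for each example pattern in the table:

```python
for p in ["ABCDEF", "AAAA", "ABCABD", "ABABAB", "ABXABY"]:
    print(f"{p:8} -> {brute_force_lps(p)}")

# ABCDEF   -> [0, 0, 0, 0, 0, 0]
# AAAA     -> [0, 1, 2, 3]
# ABCABD   -> [0, 0, 0, 1, 2, 0]
# ABABAB   -> [0, 0, 1, 2, 3, 4]
# ABXABY   -> [0, 0, 0, 1, 2, 0]
```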
Now let's see exactly how the LPS array is used during pattern matching. The rule is simple:
When you mismatch at pattern position j (after matching positions 0 through j-1), look up LPS[j-1]. That's your new pattern position to resume comparison.
Why LPS[j-1]? Because LPS[j-1] tells us the longest prefix of the pattern that we know still matches the text (since it equals a suffix of what we just matched). We resume comparing from position LPS[j-1] in the pattern.
Complete Trace Example:
Pattern: "ABCDABD" with LPS = [0,0,0,0,1,2,0]
Text: "ABCDABCDABD"
Step 1: Align at position 0
Text:    A B C D A B C D A B D
Pattern: A B C D A B D
         ✓ ✓ ✓ ✓ ✓ ✓ ✗   (mismatch at j=6: pattern[6]='D', text[6]='C')
j = 6, so we look up LPS[5] = 2
This means: the prefix pattern[0..1] = "AB" matches what we just saw at text[4..5]
Step 2: Resume with j = 2 (we don't rescan any text; the text pointer stays put and only the pattern position changes)
Text:    A B C D A B C D A B D
         - - - - - - ↑
                     │ we're still at text position 6
Pattern:         A B C D A B D
                     ↑ resume comparing at pattern position j = 2
Compare pattern[2]='C' vs text[6]='C' ✓
then: pattern[3]='D' vs text[7]='D' ✓
then: pattern[4]='A' vs text[8]='A' ✓
then: pattern[5]='B' vs text[9]='B' ✓
then: pattern[6]='D' vs text[10]='D' ✓
j reached 7 = m, MATCH FOUND at position 10 - 7 + 1 = 4
Crucially, notice that we never re-examine text positions 0-5. After the mismatch at text position 6, we stay at position 6 and just change our pattern position. The text pointer only moves forward, guaranteeing O(n) text traversal.
The Matching Algorithm Using LPS:
function kmp_search(text, pattern, lps):
    n = length(text)
    m = length(pattern)
    i = 0                       // text pointer
    j = 0                       // pattern pointer

    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:          // full match!
                report_match(i - m)
                j = lps[j-1]    // look for the next (possibly overlapping) occurrence
        else:
            if j > 0:
                j = lps[j-1]    // use LPS to skip
            else:
                i += 1          // no prefix matches, move the text pointer
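For concreteness, here is a direct Python rendering of that pseudocode, offered as a sketch rather than a reference implementation: it returns a list of match positions instead of calling a `report_match` callback, and it assumes a non-empty pattern and a correct `lps` array supplied by the caller.

```python
def kmp_search(text: str, pattern: str, lps: list[int]) -> list[int]:
    """Return the start indices of all (possibly overlapping) matches."""
    n, m = len(text), len(pattern)
    matches = []
    i = j = 0                      # i: text pointer, j: pattern pointer
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:             # full match
                matches.append(i - m)
                j = lps[j - 1]     # keep looking for overlapping occurrences
        elif j > 0:
            j = lps[j - 1]         # fall back within the pattern, keep i where it is
        else:
            i += 1                 # no prefix matches, advance the text pointer
    return matches


# Reproduces the trace above: "ABCDABD" occurs in "ABCDABCDABD" starting at index 4.
print(kmp_search("ABCDABCDABD", "ABCDABD", [0, 0, 0, 0, 1, 2, 0]))  # [4]
```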
The elegance is in the j = lps[j-1] line. The table below summarizes what happens in each situation:
| When Mismatch At | LPS Value Used | Action |
|---|---|---|
| j = 0 | N/A | Increment text pointer (i++), pattern stays at 0 |
| j > 0 | LPS[j-1] | Set j = LPS[j-1], keep text pointer unchanged |
| After full match (j = m) | LPS[m-1] | Report match, then j = LPS[m-1] to find overlapping matches |
Understanding how LPS behaves in edge cases deepens our understanding and helps debug implementations.
Case 1: Pattern with no repetition ("ABCDEF")
Pattern: A B C D E F
LPS: 0 0 0 0 0 0
Behavior: Any mismatch at position j > 0 sends us to LPS[j-1] = 0.
The pattern always restarts from position 0, much like the naive algorithm, but the text pointer never moves backward, so matching remains O(n).
There is no benefit from the LPS values here, but no penalty either.
Case 2: Single character pattern ("A")
Pattern: A
LPS: 0
When matching: Compare text[i] with 'A'.
- Match: i++, j++ → j=1=m, report match, j = LPS[0] = 0
- Mismatch: j=0, so i++
This is just a linear scan - O(n) as expected.
Case 3: All same character ("AAAA")
Pattern: A A A A
LPS: 0 1 2 3
Searching in "AAAAAAAA":
Step 1: Match A A A A at positions 0-3. j=4=m, match at 0!
j = LPS[3] = 3 (we know text[1..3] = "AAA" = pattern[0..2])
Step 2: Compare pattern[3]='A' with text[4]='A' ✓
j=4=m, match at 1!
j = LPS[3] = 3
Step 3: Compare pattern[3]='A' with text[5]='A' ✓
j=4=m, match at 2!
...and so on
We find all 5 matches (at positions 0,1,2,3,4) in O(n) time!
Without KMP, naive would take O(n×m) to find overlapping matches.
The LPS array naturally handles overlapping matches. After finding a match, j = LPS[m-1] positions us to continue looking for a new match that might overlap with the one we just found. This is a major advantage over simpler algorithms that might miss overlapping occurrences.
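Running the `kmp_search` sketch from earlier makes this concrete (the LPS array for "AAAA" is [0, 1, 2, 3]):

```python
print(kmp_search("AAAAAAAA", "AAAA", [0, 1, 2, 3]))
# [0, 1, 2, 3, 4] - all five overlapping occurrences in one left-to-right pass
```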
Case 4: Pattern longer than text
Pattern: "ABCDEFGH" (length 8)
Text: "ABC" (length 3)
The algorithm simply never finds j = m = 8, so it reports no matches.
No special handling needed - the loop terminates when i reaches n = 3.
Case 5: Nested repetition ("AABAAB")
Pattern: A A B A A B
Index: 0 1 2 3 4 5
LPS[0] = 0 ("A")
LPS[1] = 1 ("AA" → "A" = "A")
LPS[2] = 0 ("AAB" → no match)
LPS[3] = 1 ("AABA" → "A" = "A")
LPS[4] = 2 ("AABAA" → "AA" = "AA")
LPS[5] = 3 ("AABAAB" → "AAB" = "AAB")
LPS: 0 1 0 1 2 3
Notice LPS[5] = 3, meaning half the pattern ("AAB") is a valid prefix-suffix.
After a full match, we can resume at position 3, effectively checking for
patterns that share "AAB" with the previous occurrence.
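With the `kmp_search` sketch from earlier, this overlap behavior is easy to observe: in the text "AABAABAAB", the two occurrences of "AABAAB" share the "AAB" in the middle.

```python
# LPS array for "AABAAB" is [0, 1, 0, 1, 2, 3]
print(kmp_search("AABAABAAB", "AABAAB", [0, 1, 0, 1, 2, 3]))  # [0, 3]
```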
There's another way to understand the LPS array that provides deeper insight: the pattern matcher is a finite automaton, and the LPS array encodes its failure transitions.
State Machine View:
Pattern: "ABABC" with LPS = [0, 0, 1, 2, 0] State Diagram: A B A B C[0] ───→ [1] ───→ [2] ───→ [3] ───→ [4] ───→ [5] (Accept!) │ │ │ │ │ │ │ │ │ └── on anything except C: → state LPS[4-1] = LPS[3] = 2 │ │ │ │ (we know "AB" still matches) │ │ │ │ │ │ │ └─────────── on anything except B: → state LPS[3-1] = LPS[2] = 1 │ │ │ (we know "A" still matches) │ │ │ │ │ └──────────────────── on anything except A: → state LPS[2-1] = LPS[1] = 0 │ │ (no prefix matches) │ │ │ └───────────────────────────── on anything except B: → state LPS[1-1] = LPS[0] = 0 │ └────────────────────────────────────── on anything except A: stay at state 0Why the State Machine View Matters:
Determinism: Once the failure links are expanded into explicit per-character transitions (see the transition-table sketch below), every state has exactly one outgoing transition for each possible input character: the KMP matcher is equivalent to a deterministic finite automaton (DFA).

Amortized O(1) Per Character: With the LPS representation, a single text character may follow several failure links, but each fallback undoes an earlier forward move, so the total work over the whole text is bounded by about 2n character comparisons. This is why KMP matching is O(n).
Streaming Capability: We can process the text as a stream. At any point, our state tells us how much of the pattern we've matched. This makes KMP ideal for real-time applications.
Pattern Independence: The state machine is determined entirely by the pattern. We can precompute it once and use it for multiple texts, or even in parallel.
Production regex engines use similar automaton-based approaches. The KMP automaton is a special case of a DFA for fixed string matching. More complex patterns require NFAs (non-deterministic automata), but the core idea—precompute pattern structure, then stream through text—remains the same.
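To make the automaton view concrete, here is a sketch (a hypothetical `build_dfa` helper with a small fixed alphabet assumed, reusing `brute_force_lps` from the earlier sketch) that expands the LPS array into an explicit transition table, so each text character costs exactly one table lookup:

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Expand the LPS array into explicit per-character transitions.

    State k means "the last k text characters equal pattern[0..k-1]";
    dfa[k][c] is the state reached after reading character c in state k.
    Reaching state m (= len(pattern)) signals a full match.
    """
    lps = brute_force_lps(pattern)
    m = len(pattern)
    dfa = []
    for state in range(m):
        row = {}
        for c in alphabet:
            k = state
            while k > 0 and pattern[k] != c:   # follow failure links until c extends a border
                k = lps[k - 1]
            row[c] = k + 1 if pattern[k] == c else 0
        dfa.append(row)
    return dfa


dfa = build_dfa("ABABC", "ABC")
# From state 4 (having matched "ABAB"): an 'A' falls back to the border "AB" and extends
# it to "ABA" (state 3), while a 'C' completes the match (state 5).
print(dfa[4])   # {'A': 3, 'B': 0, 'C': 5}
```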
The LPS array goes by several names in different textbooks and contexts. Understanding these variations helps when reading other resources.
Common Names for the Same Concept:
| Name | Full Form | Context/Source |
|---|---|---|
| LPS Array | Longest Proper Prefix which is also Suffix | Common in competitive programming |
| Failure Function | f(j) or fail[] | Original KMP paper, academic literature |
| Prefix Function | π (pi) function | CLRS textbook, theoretical CS |
| Partial Match Table | — | Some implementation guides |
| Border Array | — | String algorithmics literature |
| Next Array | next[] | Some implementations (offset by 1) |
Some sources define the failure function with different indexing: one convention uses f(j) = LPS[j-1], another defines next[j] = LPS[j] + 1, and so on. When reading code or papers, always check the exact definition being used. The core concept is the same, but indexing variations can cause subtle bugs.
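As a small, hedged illustration of one such variation (hypothetical names, not the exact convention of any particular source): a failure table indexed by "number of characters matched so far" is simply the LPS array shifted by one slot.

```python
lps = [0, 0, 0, 0, 1, 2, 0]   # LPS array for "ABCDABD"

# fail[j] = length of the longest border of the first j characters, for j = 0..m,
# so fail[j] == lps[j - 1] for j >= 1 (and fail[0] is conventionally 0 or -1).
fail = [0] + lps
print(fail)                    # [0, 0, 0, 0, 0, 1, 2, 0]
```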
The "Border" Terminology:
In formal string algorithmics, a border of string S is a string that is both a proper prefix and proper suffix of S. The LPS value at position i is the length of the longest border of pattern[0..i].
This terminology makes some statements cleaner: for example, a pattern of length L whose longest border has length b has shortest period L - b, which is exactly the periodicity observation we used for "ABABAB" above.
We'll use "LPS" in this course for consistency, but know that "failure function," "prefix function," and "border array" all refer to the same structure.
We've thoroughly examined the LPS (failure function) array, the central data structure of the KMP algorithm. Let's consolidate our understanding:

- LPS[i] is the length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i]; LPS[0] is always 0.
- On a mismatch at pattern position j > 0, we resume comparing at position LPS[j-1]; the text pointer never moves backward.
- After a full match, setting j = LPS[m-1] lets the search continue and find overlapping occurrences.
- The array can equally be read as the failure transitions of a finite automaton built from the pattern.
What's next:
We've seen what the LPS array contains and how it's used during matching. But we haven't yet seen how to compute the LPS array efficiently. A naive approach (checking all prefixes against all suffixes) would take O(m²) or even O(m³) time, undermining the whole point of KMP.
The next page reveals the elegant algorithm for building the LPS array in O(m) time—using the same insight recursively that makes KMP matching efficient.
You now understand the LPS array deeply: what it represents, how to compute values by hand, how it guides the matching process, and how it can be viewed as encoding a finite automaton. This foundation prepares you to understand the elegant recursive algorithm for building the LPS array efficiently.