Every time you type a misspelled word and see a red underline with suggestions, every time a search engine understands "amazno" means "amazon," every time a database search finds "Jon Smith" when you typed "John Smith"—edit distance is at work behind the scenes.
The theory we've developed isn't purely academic. It's the foundation for systems used by billions of people daily. In this final page, we'll see how edit distance translates from textbook algorithm to production system, with a deep focus on spell checking as the paradigmatic application.
By the end of this page, you will:
- understand how spell checkers use edit distance to generate and rank suggestions,
- learn about data structures (like BK-trees) that make spell checking efficient,
- explore applications beyond spell checking (bioinformatics, fuzzy search, NLP), and
- implement a complete, working spell checker from scratch.
At its core, spell checking has two subproblems:
1. Error Detection: Is this word spelled correctly?
2. Suggestion Generation: If not, what did the user probably mean?
Error detection is straightforward—check if the word exists in a dictionary. Suggestion generation is where edit distance shines.
The Suggestion Problem:
Given a misspelled word w and a dictionary D of valid words, find the words in D that are "most similar" to w. Edit distance provides the precise definition of similarity:
$$\text{suggestions}(w) = \{d \in D : \text{distance}(w, d) \leq k\}$$
Where k is a threshold (typically 1-3 for spell checking). Words with smaller edit distance are better suggestions.
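This definition translates almost directly into code. A minimal, unoptimized sketch (using the `edit_distance` function implemented below):

```python
def suggestions(w: str, D: set, k: int = 2) -> set:
    """All dictionary words within edit distance k of w."""
    return {d for d in D if edit_distance(w, d) <= k}
```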
Key Challenges:
Scale: A typical dictionary has 100,000+ words. Computing edit distance against every word is expensive.
Speed: Spell checking must be real-time (as you type). Users expect instant feedback.
Ranking: Multiple words might have the same edit distance. How do we rank them?
Context: surrounding words matter. In "teh cat", "teh" almost certainly means "the", but in "a cup of teh", it more likely means "tea".
In his famous essay 'How to Write a Spelling Corrector,' Peter Norvig observes that 80% of spelling errors are within edit distance 1 of the correct word, and nearly 99% are within distance 2. This insight is crucial: we only need to search a small neighborhood around each misspelled word.
Let's start with the simplest approach: compute edit distance from the query to every word in the dictionary, then return the closest ones.
```python
from typing import List, Tuple

def edit_distance(s: str, t: str) -> int:
    """Standard O(mn) edit distance."""
    m, n = len(s), len(t)
    # Space optimization: only keep two rows
    prev = list(range(n + 1))
    curr = [0] * (n + 1)
    for i in range(1, m + 1):
        curr[0] = i
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1]
            else:
                curr[j] = 1 + min(curr[j - 1], prev[j], prev[j - 1])
        prev, curr = curr, prev
    return prev[n]

class NaiveSpellChecker:
    """
    Naive spelling corrector - computes distance to all words.
    Simple but O(|D| × m × n) per query - too slow for large dictionaries.
    """

    def __init__(self, dictionary: List[str]):
        self.dictionary = set(w.lower() for w in dictionary)  # O(1) membership check
        self.word_list = list(self.dictionary)

    def check(self, word: str) -> bool:
        """Check if word is spelled correctly."""
        return word.lower() in self.dictionary

    def suggest(
        self,
        word: str,
        max_distance: int = 2,
        max_suggestions: int = 5
    ) -> List[Tuple[str, int]]:
        """
        Get spelling suggestions for a word.

        Returns:
            List of (word, distance) tuples, sorted by distance then alphabetically
        """
        word = word.lower()
        candidates = []
        for dict_word in self.word_list:
            dist = edit_distance(word, dict_word)
            if dist <= max_distance:
                candidates.append((dict_word, dist))
        # Sort by distance first, then alphabetically
        candidates.sort(key=lambda x: (x[1], x[0]))
        return candidates[:max_suggestions]

# Simple usage example
def demo_naive_checker():
    # Small dictionary for demonstration
    dictionary = [
        "the", "there", "their", "they", "them", "these", "those",
        "this", "that", "than", "then", "thing", "think",
        "hello", "help", "held", "heal", "health", "healthy",
        "world", "word", "work", "words", "worker",
        "python", "program", "programming", "code", "coder",
        "spell", "spells", "spelling", "spelled", "speller"
    ]

    checker = NaiveSpellChecker(dictionary)

    # Test cases
    tests = ["teh", "helo", "wrold", "progam", "speling"]
    for word in tests:
        if checker.check(word):
            print(f"'{word}' is spelled correctly")
        else:
            suggestions = checker.suggest(word)
            print(f"'{word}' → Suggestions: {suggestions}")

demo_naive_checker()
```

For a 100,000-word dictionary and an 8-character query, this approach does 100,000 × 8 × 8 = 6.4 million cell computations per query. At typical typing speeds (5-10 characters per second), this quickly becomes too slow. We need smarter approaches.
Instead of checking every dictionary word, we can generate candidates by applying all possible edit operations to the misspelled word, then check which candidates are valid dictionary words. This inverts the problem.
The Insight:
If "speling" differs from a correct word by edit distance 1, then that correct word can be obtained by:
That's only ~400 possibilities for distance 1, compared to checking 100,000+ dictionary words!
```python
from typing import Set, List, Tuple

class SmartSpellChecker:
    """
    Spell checker using candidate generation approach.

    Instead of computing distance to all words, we:
    1. Generate all strings within distance k
    2. Filter those that are in dictionary

    This generates O((54m)^k) candidate strings instead of O(|D| × m × n)
    distance work - much faster for typical values (k ≤ 2, m ≤ 20).
    """

    ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

    def __init__(self, dictionary: List[str], word_frequencies: dict = None):
        self.dictionary = set(w.lower() for w in dictionary)
        # Optional: word frequencies for ranking suggestions
        self.frequencies = word_frequencies or {}
        self.total_freq = sum(self.frequencies.values()) or 1

    def edits1(self, word: str) -> Set[str]:
        """Generate all strings that are one edit away from 'word'."""
        letters = self.ALPHABET
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        # All single deletions
        deletes = [L + R[1:] for L, R in splits if R]
        # All single transpositions (swap adjacent chars)
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        # All single replacements
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        # All single insertions
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word: str) -> Set[str]:
        """Generate all strings that are two edits away from 'word'."""
        return set(e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

    def known(self, words: Set[str]) -> Set[str]:
        """Return subset of words that are in dictionary."""
        return words & self.dictionary

    def candidates(self, word: str) -> Set[str]:
        """
        Generate correction candidates in priority order:
        1. The word itself (if it's correct)
        2. Known words at distance 1
        3. Known words at distance 2
        4. The original word (no correction found)
        """
        word = word.lower()
        # Priority order: return first non-empty set
        return (
            self.known({word}) or             # Already correct
            self.known(self.edits1(word)) or  # Distance 1
            self.known(self.edits2(word)) or  # Distance 2
            {word}                            # Give up, return original
        )

    def probability(self, word: str) -> float:
        """Probability of word based on frequency (for ranking)."""
        return self.frequencies.get(word, 1) / self.total_freq

    def correct(self, word: str) -> str:
        """Return the single best correction for word."""
        word = word.lower()
        candidates = self.candidates(word)
        # Pick candidate with highest probability
        return max(candidates, key=self.probability)

    def suggest(self, word: str, max_suggestions: int = 5) -> List[Tuple[str, int]]:
        """Return ranked list of suggestions with their edit distances."""
        word = word.lower()
        if word in self.dictionary:
            return [(word, 0)]

        # Distance 1 candidates
        dist1 = self.known(self.edits1(word))
        # Distance 2 candidates (excluding those already at distance 1)
        dist2 = self.known(self.edits2(word)) - dist1

        suggestions = []
        for w in dist1:
            suggestions.append((w, 1, self.probability(w)))
        for w in dist2:
            suggestions.append((w, 2, self.probability(w)))

        # Sort by distance first, then by probability (descending)
        suggestions.sort(key=lambda x: (x[1], -x[2]))
        return [(w, d) for w, d, _ in suggestions[:max_suggestions]]

# Demonstration with frequency-based ranking
def demo_smart_checker():
    # Words with frequencies (in a real system, from a corpus)
    word_freq = {
        "the": 1000, "there": 100, "their": 80, "they": 200,
        "hello": 50, "help": 150, "held": 30, "heal": 20,
        "world": 120, "word": 80, "work": 200, "words": 60,
        "spell": 25, "spelling": 40, "spelled": 15,
        "program": 100, "programming": 80
    }
    dictionary = list(word_freq.keys())
    checker = SmartSpellChecker(dictionary, word_freq)

    tests = ["teh", "helo", "wrold", "speling", "programing"]
    for word in tests:
        correction = checker.correct(word)
        suggestions = checker.suggest(word)
        print(f"'{word}':")
        print(f"  Best correction: '{correction}'")
        print(f"  All suggestions: {suggestions}")
        print()

demo_smart_checker()
```

For a typical word of length n, this approach generates roughly 54n + 25 candidate strings at distance 1 (a few hundred) and tens of thousands at distance 2. But each candidate costs only an O(1) set lookup, which is far cheaper than running the O(mn) DP against 100,000+ dictionary words. This is how production spell checkers achieve real-time performance.
For even better performance, especially with larger edit distance thresholds, we can use a BK-tree (Burkhard-Keller tree)—a data structure specifically designed for metric space queries.
Key Insight: The Triangle Inequality
Because edit distance is a metric, it satisfies the triangle inequality:
$$|d(a, c) - d(b, c)| \leq d(a, b)$$
This means: if we know the distance from query q to node n in our tree, we can bound the possible distances from q to n's children without computing them all!
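Concretely, every word x stored in the subtree reached through the edge labeled c was inserted by following that edge, so distance(n, x) = c exactly. The reverse triangle inequality then bounds the query distance without computing it:

$$\text{distance}(q, x) \geq |\text{distance}(q, n) - c|$$

If |distance(q, n) − c| > d for tolerance d, no word in that subtree can be within tolerance, and the entire subtree is skipped.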
BK-Tree Structure:
- Each node stores one dictionary word; the root holds an arbitrary first word.
- Each edge to a child is labeled with an integer: the edit distance between the child's word and its parent's word.
- To insert a word, compute its distance to the current node's word and descend along the edge with that label, creating a new child if that edge doesn't exist yet.

BK-Tree Query:
To find all words within distance d of query q:
1. Compute dist = distance(q, node.word), starting at the root.
2. If dist ≤ d, add node.word to the results.
3. Recurse only into children whose edge labels fall in [dist − d, dist + d]; the triangle inequality guarantees no match can live elsewhere.

Pruning Power:
For a query with tolerance d=2, we only explore children with labels in a range of width 2d+1=5. If nodes have many children (distances spread out), we prune most of the tree!
```python
from typing import List, Tuple, Optional, Dict

class BKTreeNode:
    """Node in a BK-tree."""

    def __init__(self, word: str):
        self.word = word
        self.children: Dict[int, 'BKTreeNode'] = {}

class BKTree:
    """
    Burkhard-Keller tree for efficient metric space queries.
    Enables finding all dictionary words within distance k of a query
    without computing distance to every word.
    """

    def __init__(self, distance_fn=None):
        self.root: Optional[BKTreeNode] = None
        self.distance = distance_fn or edit_distance
        self.size = 0

    def add(self, word: str) -> None:
        """Add a word to the BK-tree."""
        word = word.lower()
        if self.root is None:
            self.root = BKTreeNode(word)
            self.size = 1
            return

        node = self.root
        while True:
            dist = self.distance(word, node.word)
            if dist == 0:
                return  # Word already in tree
            if dist in node.children:
                node = node.children[dist]
            else:
                node.children[dist] = BKTreeNode(word)
                self.size += 1
                return

    def search(self, query: str, max_distance: int) -> List[Tuple[str, int]]:
        """
        Find all words within max_distance of query.

        Returns:
            List of (word, distance) tuples, sorted by distance
        """
        if self.root is None:
            return []

        query = query.lower()
        results = []
        candidates = [self.root]

        while candidates:
            node = candidates.pop()
            dist = self.distance(query, node.word)
            if dist <= max_distance:
                results.append((node.word, dist))

            # Triangle inequality: only explore children with edge labels in
            # [dist - max_distance, dist + max_distance]
            low = max(1, dist - max_distance)
            high = dist + max_distance
            for child_dist, child_node in node.children.items():
                if low <= child_dist <= high:
                    candidates.append(child_node)

        return sorted(results, key=lambda x: x[1])

    def build_from_list(self, words: List[str]) -> 'BKTree':
        """Convenience method to build tree from word list."""
        for word in words:
            self.add(word)
        return self

class BKTreeSpellChecker:
    """Spell checker using BK-tree for efficient lookups."""

    def __init__(self, dictionary: List[str]):
        self.tree = BKTree().build_from_list(dictionary)
        self.dictionary = set(w.lower() for w in dictionary)

    def check(self, word: str) -> bool:
        return word.lower() in self.dictionary

    def suggest(
        self,
        word: str,
        max_distance: int = 2,
        max_suggestions: int = 5
    ) -> List[Tuple[str, int]]:
        results = self.tree.search(word, max_distance)
        return results[:max_suggestions]

# Demo and performance comparison
def demo_bk_tree():
    import time

    # Generate a larger dictionary for performance testing:
    # common English words plus inflected variations
    base_words = [
        "the", "be", "to", "of", "and", "a", "in", "that", "have", "I",
        "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
        "this", "but", "his", "by", "from", "they", "we", "say", "her",
        "she", "or", "an", "will", "my", "one", "all", "would", "there",
        "their", "what", "program", "programming", "computer", "algorithm",
        "function", "variable", "spell", "spelling", "check", "checker",
        "dictionary", "word", "words"
    ]

    # Add some variations
    dictionary = []
    for word in base_words:
        dictionary.append(word)
        if len(word) > 3:
            dictionary.append(word + "s")
            dictionary.append(word + "ing")
            dictionary.append(word + "ed")

    print(f"Dictionary size: {len(dictionary)} words")

    # Build BK-tree
    start = time.time()
    checker = BKTreeSpellChecker(dictionary)
    build_time = time.time() - start
    print(f"BK-tree build time: {build_time*1000:.2f}ms")

    # Test queries
    test_words = ["programing", "speling", "dictionery", "algoritm"]
    for word in test_words:
        start = time.time()
        suggestions = checker.suggest(word, max_distance=2)
        query_time = time.time() - start
        print(f"\n'{word}' ({query_time*1000:.3f}ms):")
        print(f"  Suggestions: {suggestions[:5]}")

demo_bk_tree()
```

BK-trees shine when: (1) you have a large, static dictionary, (2) you're doing many queries, and (3) the tolerance is small relative to word length. For spell checking with max_distance=2, BK-trees typically examine only 10-15% of the dictionary, giving a 5-10x speedup over a linear scan. For larger tolerances, the advantage diminishes.
Edit distance gets us candidates, but choosing among equally-distant suggestions requires additional signals. A good spell checker considers multiple factors.
```python
from typing import List, Tuple
import math

# Simplified QWERTY keyboard layout for distance calculation
KEYBOARD_LAYOUT = {
    'q': (0, 0), 'w': (0, 1), 'e': (0, 2), 'r': (0, 3), 't': (0, 4),
    'y': (0, 5), 'u': (0, 6), 'i': (0, 7), 'o': (0, 8), 'p': (0, 9),
    'a': (1, 0), 's': (1, 1), 'd': (1, 2), 'f': (1, 3), 'g': (1, 4),
    'h': (1, 5), 'j': (1, 6), 'k': (1, 7), 'l': (1, 8),
    'z': (2, 0), 'x': (2, 1), 'c': (2, 2), 'v': (2, 3), 'b': (2, 4),
    'n': (2, 5), 'm': (2, 6)
}

def keyboard_distance(c1: str, c2: str) -> float:
    """Euclidean distance between two keys on the keyboard."""
    c1, c2 = c1.lower(), c2.lower()
    if c1 not in KEYBOARD_LAYOUT or c2 not in KEYBOARD_LAYOUT:
        return 3.0  # Treat as far apart if not on the keyboard
    r1, c1_pos = KEYBOARD_LAYOUT[c1]
    r2, c2_pos = KEYBOARD_LAYOUT[c2]
    return math.sqrt((r1 - r2) ** 2 + (c1_pos - c2_pos) ** 2)

def keyboard_weighted_edit_distance(s: str, t: str) -> float:
    """
    Edit distance with keyboard-aware substitution costs.
    Adjacent keys cost less to substitute than distant keys.
    """
    m, n = len(s), len(t)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Substitution cost based on keyboard distance
                kbd_dist = keyboard_distance(s[i - 1], t[j - 1])
                # Normalize: adjacent keys ~0.5 cost, far keys ~1.5 cost
                sub_cost = 0.5 + kbd_dist / 6.0
                dp[i][j] = min(
                    dp[i][j - 1] + 1.0,          # Insert
                    dp[i - 1][j] + 1.0,          # Delete
                    dp[i - 1][j - 1] + sub_cost  # Substitute
                )
    return dp[m][n]

class AdvancedSpellChecker:
    """Spell checker with advanced ranking using multiple signals."""

    def __init__(self, dictionary: List[str], frequencies: dict = None):
        self.dictionary = set(w.lower() for w in dictionary)
        self.word_list = [w.lower() for w in dictionary]
        self.frequencies = frequencies or {}
        self.total_freq = sum(self.frequencies.values()) or 1

    def score(self, original: str, suggestion: str, edit_dist: int) -> float:
        """
        Compute a ranking score (higher = better suggestion).

        Combines:
        - Inverse edit distance (smaller is better)
        - Word frequency (more common is better)
        - Keyboard distance (nearby typos are common)
        """
        # Base: inverse edit distance (heavily weighted)
        base_score = 1.0 / (1.0 + edit_dist)

        # Frequency bonus (log scale to prevent domination)
        freq = self.frequencies.get(suggestion, 1)
        freq_score = math.log(1 + freq / self.total_freq * 1000)

        # Keyboard proximity bonus for substitutions
        kbd_score = 0
        if len(original) == len(suggestion):
            for c1, c2 in zip(original, suggestion):
                if c1 != c2:
                    # Lower keyboard distance = higher bonus
                    kbd_dist = keyboard_distance(c1, c2)
                    kbd_score += 1.0 / (1.0 + kbd_dist)

        # Combine scores (tune weights for your use case)
        return base_score * 10 + freq_score * 3 + kbd_score * 2

    def suggest(
        self,
        word: str,
        max_distance: int = 2,
        max_suggestions: int = 5
    ) -> List[Tuple[str, int, float]]:
        """
        Get ranked suggestions with scores.

        Returns:
            List of (word, distance, score) tuples, sorted by score descending
        """
        word = word.lower()
        candidates = []
        for dict_word in self.word_list:
            dist = edit_distance(word, dict_word)
            if 0 < dist <= max_distance:  # Exclude exact match (dist=0)
                score = self.score(word, dict_word, dist)
                candidates.append((dict_word, dist, score))

        # Sort by score descending
        candidates.sort(key=lambda x: -x[2])
        return candidates[:max_suggestions]

# Demonstration
def demo_advanced_ranking():
    dictionary = ["the", "they", "them", "there", "three", "tree", "free"]
    frequencies = {
        "the": 10000,  # Very common
        "they": 500, "them": 400, "there": 800,
        "three": 200, "tree": 100, "free": 150
    }

    checker = AdvancedSpellChecker(dictionary, frequencies)

    # "thr": 'r' is adjacent to 'e' on the keyboard
    print("Checking 'thr' (likely meant 'the' or 'three'):")
    suggestions = checker.suggest("thr", max_distance=3)
    for word, dist, score in suggestions:
        print(f"  {word}: distance={dist}, score={score:.2f}")

    print("\nChecking 'yhe' ('y' is adjacent to 't' on QWERTY):")
    suggestions = checker.suggest("yhe", max_distance=2)
    for word, dist, score in suggestions:
        print(f"  {word}: distance={dist}, score={score:.2f}")

demo_advanced_ranking()
```

Edit distance extends far beyond spell checking. Here are major application areas where the same algorithmic foundation applies.
DNA and Protein Sequence Alignment
Biological sequences (DNA, RNA, proteins) mutate over time through:
- substitutions (point mutations: one base replaced by another),
- insertions (extra bases added), and
- deletions (bases lost).

These correspond exactly to our edit operations! Edit distance measures evolutionary distance between sequences.
Key Algorithms:
- Needleman-Wunsch: global alignment of two whole sequences—essentially weighted edit distance with biologically motivated costs.
- Smith-Waterman: local alignment—the same DP with a "floor at zero" rule, finding the best-matching subregions of two sequences.
Substitution Matrices:
In biology, not all substitutions are equally likely. For proteins, the BLOSUM and PAM matrices assign substitution scores based on observed mutation rates. For DNA, transitions such as A→G (purine→purine) are more common than transversions such as A→C (purine→pyrimidine), so they are penalized less.
```python
# Simplified biological substitution costs (lower = more likely mutation)
sub_cost[('A', 'G')] = 1  # Transition (purine -> purine): common, cheap
sub_cost[('A', 'C')] = 2  # Transversion (purine -> pyrimidine): rarer, costlier
```
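To make this concrete, here is a minimal sketch of a weighted edit distance over DNA strings with transition/transversion-aware substitution costs. The cost values (1 for transitions, 2 for transversions, 1.5 for indels) are illustrative assumptions, not values taken from BLOSUM or PAM:

```python
PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}

def dna_sub_cost(a: str, b: str) -> float:
    """Cheaper substitution within a base class (transition) than across (transversion)."""
    if a == b:
        return 0.0
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return 1.0 if same_class else 2.0

def dna_edit_distance(s: str, t: str, indel_cost: float = 1.5) -> float:
    """Weighted edit distance over DNA strings - same DP, different costs."""
    m, n = len(s), len(t)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel_cost,                          # Deletion
                dp[i][j - 1] + indel_cost,                          # Insertion
                dp[i - 1][j - 1] + dna_sub_cost(s[i - 1], t[j - 1])  # Match/substitute
            )
    return dp[m][n]

print(dna_edit_distance("GATTACA", "GACTATA"))
```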
Moving from textbook algorithm to production system requires careful optimization. Here are techniques used in real spell checkers and fuzzy search systems.
```python
def edit_distance_bounded(s: str, t: str, max_dist: int) -> int:
    """
    Edit distance with early termination if it exceeds the threshold.
    Returns max_dist + 1 if the actual distance exceeds max_dist.

    Time: O(max_dist * min(m, n)) in the common case.
    """
    m, n = len(s), len(t)

    # Quick length check: distance is at least the length difference
    if abs(m - n) > max_dist:
        return max_dist + 1

    # Ensure n >= m for slightly simpler logic
    if m > n:
        s, t = t, s
        m, n = n, m

    # Strip common prefix (cannot affect the distance)
    while m > 0 and n > 0 and s[0] == t[0]:
        s, t = s[1:], t[1:]
        m, n = m - 1, n - 1

    # Strip common suffix
    while m > 0 and n > 0 and s[-1] == t[-1]:
        s, t = s[:-1], t[:-1]
        m, n = m - 1, n - 1

    # Edge case after stripping (n >= m, so m == 0 covers both empty cases)
    if m == 0:
        return n if n <= max_dist else max_dist + 1

    BIG = max_dist + 1
    # Banded algorithm: only cells with |i - j| <= max_dist can stay within
    # the threshold, so everything outside the band is treated as BIG
    prev = [j if j <= max_dist else BIG for j in range(n + 1)]

    for i in range(1, m + 1):
        curr = [BIG] * (n + 1)
        if i <= max_dist:
            curr[0] = i  # dp[i][0] = i, valid only inside the band

        # Band: only compute j in range [j_start, j_end]
        j_start = max(1, i - max_dist)
        j_end = min(n, i + max_dist)

        min_in_row = BIG
        for j in range(j_start, j_end + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1]
            else:
                curr[j] = 1 + min(curr[j - 1],   # Insert
                                  prev[j],       # Delete
                                  prev[j - 1])   # Substitute
            min_in_row = min(min_in_row, curr[j])

        # Early termination: if the minimum in the band exceeds the threshold,
        # no completion can bring the distance back under it
        if min_in_row > max_dist:
            return max_dist + 1
        prev = curr

    return prev[n] if prev[n] <= max_dist else max_dist + 1

# Test the optimization
def benchmark_bounded():
    import time

    # Generate test cases
    s = "abcdefghij" * 10  # 100 chars
    t = "abcdefghik" * 10  # 100 chars, differs at the end of each segment

    # Standard unbounded
    start = time.time()
    for _ in range(100):
        d1 = edit_distance(s, t)
    t1 = time.time() - start

    # Bounded with low threshold (should be faster)
    start = time.time()
    for _ in range(100):
        d2 = edit_distance_bounded(s, t, 5)
    t2 = time.time() - start

    print(f"Standard: {d1}, time: {t1*1000:.1f}ms")
    print(f"Bounded(5): {d2}, time: {t2*1000:.1f}ms")
    print(f"Speedup: {t1/t2:.1f}x")

benchmark_bounded()
```

In practice, most queries are "easy"—either exact matches (no computation) or close misses where early termination kicks in. Optimize for the common case, and profile your actual workload before implementing exotic optimizations.
We've completed our comprehensive exploration of edit distance, from mathematical foundations to production applications. Let's consolidate the key insights from this module.
What You've Mastered:
You now understand one of the most important string algorithms in computer science. Edit distance appears in coding interviews, but more importantly, it's the foundation for systems you use every day. Whether you're building a search feature, working on NLP, analyzing genetic data, or just implementing autocomplete, the techniques from this module apply directly.
The Pattern Generalizes:
The DP pattern we've studied—2D table over two sequences, transitions based on element comparisons, reconstruction via backtracking—applies to many related problems: longest common subsequence, longest palindromic subsequence, sequence alignment, and more. Master edit distance, and you've mastered a fundamental pattern in string DP.
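As a quick illustration of how the pattern transfers, here is a minimal longest-common-subsequence sketch—the same 2D table over two sequences, with only the transition rule changed:

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence - same DP shape as edit distance."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1             # Match: extend subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # Skip a char from s or t
    return dp[m][n]

print(lcs_length("kitten", "sitting"))  # 4 ("ittn")
```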
Congratulations! You've completed the String DP — Edit Distance module. You understand the theory deeply, can implement efficient solutions, reconstruct edit sequences, and apply the algorithm to real-world problems. This knowledge forms a cornerstone of string algorithm expertise that will serve you throughout your career.