On the surface, prefix matching seems trivial: given "prog", find all strings starting with "prog". But production typeahead systems face challenges that transform this simple problem into a complex engineering undertaking.
This page explores the full spectrum of prefix matching challenges and their solutions.
By the end of this page, you will understand how to handle multi-word queries, implement word-boundary prefix matching, normalize text for robust matching, handle international text (CJK, Arabic, etc.), and implement fuzzy matching for typo tolerance—all while maintaining sub-100ms latency.
Users type queries with multiple words. The typeahead system must understand how these words relate to potential suggestions.
When a user types "machine le", we need to determine: is this a prefix of "machine learning" or two separate terms "machine" and "le*"?
Strategy 1: Phrase Prefix (Last Word is Prefix)
Treat all words except the last as complete words, and only the last word as a prefix:
```typescript
function phrasePrefix(query: string): { completeWords: string[]; prefix: string } {
  // filter(Boolean) drops the empty string produced by splitting ""
  const words = query.trim().split(/\s+/).filter(Boolean);

  if (words.length === 0) {
    return { completeWords: [], prefix: '' };
  }
  if (words.length === 1) {
    return { completeWords: [], prefix: words[0] };
  }
  return {
    completeWords: words.slice(0, -1),
    prefix: words[words.length - 1],
  };
}

// "machine le" → complete: ["machine"], prefix: "le"
// Matches: "machine learning", "machine learning python"
// Does NOT match: "how machines learn" (word order matters)

function matchPhrasePrefix(
  suggestion: string,
  completeWords: string[],
  prefix: string
): boolean {
  const suggestionWords = suggestion.toLowerCase().split(/\s+/);

  // Check if complete words appear in order at the start
  for (let i = 0; i < completeWords.length; i++) {
    if (i >= suggestionWords.length) return false;
    if (suggestionWords[i] !== completeWords[i].toLowerCase()) return false;
  }

  // Check if prefix matches next word
  const nextWordIndex = completeWords.length;
  if (nextWordIndex >= suggestionWords.length) return false;
  return suggestionWords[nextWordIndex].startsWith(prefix.toLowerCase());
}
```

Strategy 2: All Words as Prefixes (Boolean AND)
Treat every word as a prefix and require all to match:
```typescript
function matchAllPrefixes(
  suggestion: string,
  queryWords: string[]
): boolean {
  const suggestionLower = suggestion.toLowerCase();
  const suggestionWords = suggestionLower.split(/\s+/);

  // Each query word must match as prefix of some suggestion word
  return queryWords.every(queryWord => {
    const qLower = queryWord.toLowerCase();
    return suggestionWords.some(sWord => sWord.startsWith(qLower));
  });
}

// "le mach" →
// Matches: "machine learning" (le→learning, mach→machine)
// Matches: "learn machine" (same reason)
// Matches: "machine learning for beginners" (same prefixes matched)

// This is more flexible but can over-match
// "py learn" might match "learn python" or "learning pygame"
```

Strategy 3: Hybrid with Position Weighting
Combine strategies with scoring that prefers phrase-order matches:
```typescript
interface MatchResult {
  matches: boolean;
  score: number;
  matchType: 'exact_phrase' | 'ordered_prefix' | 'unordered_prefix' | 'no_match';
}

function hybridMatch(suggestion: string, query: string): MatchResult {
  const queryWords = query.toLowerCase().trim().split(/\s+/);
  const suggestionWords = suggestion.toLowerCase().split(/\s+/);

  // Check phrase prefix (highest score)
  if (matchesPhrasePrefix(suggestionWords, queryWords)) {
    return { matches: true, score: 1.0, matchType: 'exact_phrase' };
  }

  // Check ordered prefix (medium score)
  if (matchesOrderedPrefixes(suggestionWords, queryWords)) {
    return { matches: true, score: 0.7, matchType: 'ordered_prefix' };
  }

  // Check unordered prefix (lowest score)
  if (matchesUnorderedPrefixes(suggestionWords, queryWords)) {
    return { matches: true, score: 0.4, matchType: 'unordered_prefix' };
  }

  return { matches: false, score: 0, matchType: 'no_match' };
}

// matchesPhrasePrefix and matchesUnorderedPrefixes follow the strategies
// shown earlier; only the ordered-prefix check is spelled out here.
function matchesOrderedPrefixes(
  suggestionWords: string[],
  queryWords: string[]
): boolean {
  // Query words must match in order, but not necessarily contiguously
  // "mach learn" should match "machine deep learning"
  let sIndex = 0;
  for (const qWord of queryWords) {
    // Find this query word prefix in remaining suggestion words
    let found = false;
    while (sIndex < suggestionWords.length) {
      if (suggestionWords[sIndex].startsWith(qWord)) {
        found = true;
        sIndex++;
        break;
      }
      sIndex++;
    }
    if (!found) return false;
  }
  return true;
}
```

Most production systems use phrase prefix as the primary strategy (it matches the user's mental model), with ordered-prefix as a fallback. Unordered matching is typically reserved for fuzzy search features rather than core autocomplete. Score-weight your strategies accordingly.
Text normalization ensures that superficially different strings match when semantically equivalent. This is critical for user experience—users shouldn't need to guess the exact casing or spelling used in the corpus.
The simplest but most impactful normalization:
```typescript
// Simple approach: lowercase everything
function normalizeCase(text: string): string {
  return text.toLowerCase();
}

// But beware of locale-specific issues!
// Turkish: 'I'.toLowerCase() should be 'ı' not 'i'
// German: 'ß'.toUpperCase() is 'SS'

function normalizeCaseLocale(text: string, locale: string = 'en'): string {
  return text.toLocaleLowerCase(locale);
}

// For search indices, often use case folding (a Unicode standard)
function caseFold(text: string): string {
  // NFKC normalization plus lowercasing approximates full case folding
  return text.normalize('NFKC').toLowerCase();
}
```

Handle accented characters for languages like French, Spanish, German:
```typescript
function removeDiacritics(text: string): string {
  // NFD: Decomposes characters (é → e + combining-acute)
  // Then remove the combining marks
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, ''); // Remove combining diacritical marks
}

// "café" → "cafe"
// "résumé" → "resume"
// "naïve" → "naive"
// "Zürich" → "Zurich"

// But preserve original for display!
interface NormalizedSuggestion {
  original: string;   // "Café Latte" - shown to user
  normalized: string; // "cafe latte" - used for matching
}

// Index stores both; search normalizes the query the same way
function search(query: string): Suggestion[] {
  const normalizedQuery = removeDiacritics(normalizeCase(query));
  return index.search(normalizedQuery); // Returns with original text
}
```

Unicode text can represent the same visual character in multiple ways. Normalization ensures consistent representation:
| Form | Name | Use Case |
|---|---|---|
| NFC | Canonical Composition | Most common; combines characters (e + ́ → é) |
| NFD | Canonical Decomposition | Separates characters (é → e + ́); useful for diacritic removal |
| NFKC | Compatibility Composition | Maps compatibility characters (fi → fi, ① → 1) |
| NFKD | Compatibility Decomposition | Most aggressive; use for search indices |
```typescript
function normalizeForIndex(text: string): string {
  return text
    // 1. Unicode normalization
    .normalize('NFKC')
    // 2. Case folding
    .toLocaleLowerCase('en')
    // 3. Remove diacritics
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    // 4. Normalize whitespace
    .replace(/\s+/g, ' ')
    .trim()
    // 5. Remove punctuation (optional, depends on use case)
    .replace(/[^\p{L}\p{N}\s]/gu, '');
}

// "  Café  Latte™ " → "cafe lattetm" (NFKC maps ™ to the letters "tm")
// "naïve résumé" → "naive resume"
// "Ｈｅｌｌｏ Ｗｏｒｌｄ" (full-width) → "hello world"
```

The same normalization must be applied to both the indexed suggestions AND the user query at search time. Mismatched normalization is a common source of "why doesn't my search work" bugs. Centralize normalization logic and use it everywhere.
Building a typeahead system for global users introduces fundamental challenges. Different languages have different concepts of "words," "prefixes," and even character boundaries.
These languages don't use spaces between words. A sentence like "机器学习" (machine learning) is four characters with no explicit word boundaries:
```typescript
// The problem: Where are the "words" in "机器学习"?
// 机器 (machine) + 学习 (learning) - but this requires linguistic knowledge

// Solution 1: Character-level prefix matching
// Simple but over-matches: "机器" matches "机器学习" and "机器人"
function cjkCharacterMatch(suggestion: string, query: string): boolean {
  return suggestion.includes(query); // Substring, not just prefix
}

// Solution 2: Dictionary-based segmentation
// Use libraries like jieba (Chinese), MeCab (Japanese), Okt (Korean)
import Segment from 'segment'; // Chinese word segmentation

const segmenter = new Segment();
segmenter.useDefault();

function segmentChinese(text: string): string[] {
  return segmenter.doSegment(text, { simple: true });
}

// "机器学习教程" → ["机器", "学习", "教程"]

// Solution 3: N-gram indexing
// Index all overlapping character sequences
function generateNgrams(text: string, n: number = 2): string[] {
  const ngrams: string[] = [];
  for (let i = 0; i <= text.length - n; i++) {
    ngrams.push(text.slice(i, i + n));
  }
  return ngrams;
}

// "学习" → ["学习"] (2-gram)
// Can match within "机器学习" without full segmentation
```

RTL languages have unique challenges for prefix matching:
```typescript
// Arabic example: "الترجمة" (translation)
// Visually rendered right-to-left, but stored in logical (typing) order

// No code change needed for the prefix itself - just understand that:
// - The user types alef (ا) first
// - The string is stored with alef at index 0
// - So prefix matching over the stored string works normally

// BUT: Arabic has complex letter forms based on position
// "ب" (bā') looks different at start, middle, end of word
// Initial: بـ  Medial: ـبـ  Final: ـب  Isolated: ب

// Solution: Normalize presentation forms to base letters for matching
function normalizeArabic(text: string): string {
  // Map presentation forms to base letters
  const arabicNormMap: Record<string, string> = {
    // ... mapping of presentation forms to base
  };
  return [...text]
    .map(char => arabicNormMap[char] || char)
    .join('');
}

// Hebrew considerations:
// - Final letter forms (ם vs מ, ן vs נ, etc.)
// - Vowel points (niqqud) typically stripped for search
```

Users may mix scripts in a single query ("Python チュートリアル" = Python tutorial in Japanese):
```typescript
function detectScripts(text: string): Set<string> {
  const scripts = new Set<string>();

  for (const char of text) {
    const code = char.charCodeAt(0);
    if (code >= 0x0000 && code <= 0x007F) scripts.add('latin');
    else if (code >= 0x4E00 && code <= 0x9FFF) scripts.add('cjk');
    else if (code >= 0x3040 && code <= 0x309F) scripts.add('hiragana');
    else if (code >= 0x30A0 && code <= 0x30FF) scripts.add('katakana');
    else if (code >= 0x0600 && code <= 0x06FF) scripts.add('arabic');
    else if (code >= 0x0590 && code <= 0x05FF) scripts.add('hebrew');
    else if (code >= 0xAC00 && code <= 0xD7AF) scripts.add('korean');
    // ... more scripts
  }

  return scripts;
}

function handleMixedScriptQuery(query: string): string[] {
  // Split into script-homogeneous segments
  // (getCharScript and normalizeForScript are per-script helpers)
  const segments: { script: string; text: string }[] = [];
  let currentScript = '';
  let currentText = '';

  for (const char of query) {
    const charScript = getCharScript(char);
    if (charScript !== currentScript && currentText) {
      segments.push({ script: currentScript, text: currentText });
      currentText = '';
    }
    currentScript = charScript;
    currentText += char;
  }
  if (currentText) {
    segments.push({ script: currentScript, text: currentText });
  }

  // Apply appropriate normalization per segment
  return segments.map(s => normalizeForScript(s.text, s.script));
}
```

For production internationalization, use the ICU library (International Components for Unicode). It provides locale-aware text segmentation, collation, and normalization. Available in most languages: icu4c (C/C++), icu4j (Java), Intl API (JavaScript). Don't reinvent linguistic wheels.
Users make typos. A typeahead system that only matches exact prefixes frustrates users who mis-type. Fuzzy matching allows "prgoramming" to match "programming."
The classic metric for string similarity: minimum edits (insert, delete, replace) to transform one string into another.
```typescript
function levenshteinDistance(a: string, b: string): number {
  const m = a.length;
  const n = b.length;

  // DP table: dp[i][j] = edit distance between a[0..i-1] and b[0..j-1]
  const dp: number[][] = Array(m + 1)
    .fill(null)
    .map(() => Array(n + 1).fill(0));

  // Base cases
  for (let i = 0; i <= m; i++) dp[i][0] = i; // Delete all from a
  for (let j = 0; j <= n; j++) dp[0][j] = j; // Insert all to get b

  // Fill DP table
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      if (a[i - 1] === b[j - 1]) {
        dp[i][j] = dp[i - 1][j - 1]; // No edit needed
      } else {
        dp[i][j] = 1 + Math.min(
          dp[i - 1][j],     // Delete from a
          dp[i][j - 1],     // Insert into a
          dp[i - 1][j - 1]  // Replace
        );
      }
    }
  }

  return dp[m][n];
}

// "prgoramming" vs "programming": distance = 2 (transposed "og" costs two substitutions)
// "pyton" vs "python": distance = 1 (missing 'h')
```

Naive fuzzy matching is O(n × m × k) where n = corpus size, m = query length, k = average suggestion length. This is far too slow for real-time typeahead.
Build a finite automaton that accepts all strings within edit distance d of the query. Then traverse the trie in parallel with the automaton:
```typescript
// A Levenshtein automaton for query "cat" with distance 1
// accepts: "cat" (0), "cats" (1), "bat" (1), "at" (1), "cart" (1)
// rejects: "dog" (3), "catch" (2)

interface AutomatonState {
  // The state encodes how much of the query we've matched
  // and how many errors we've used
  position: number;
  errorCount: number;

  // Transitions on input character
  transitions: Map<string | 'any', AutomatonState[]>;

  // Can we accept (match) from this state?
  canMatch: boolean;
}

// buildLevenshteinAutomaton, step, and isDead are assumed helpers
function fuzzySearchWithAutomaton(
  trie: Trie,
  query: string,
  maxDistance: number
): Suggestion[] {
  const automaton = buildLevenshteinAutomaton(query, maxDistance);
  const results: Suggestion[] = [];

  function traverse(
    trieNode: TrieNode,
    autoState: AutomatonState
  ): void {
    // If automaton can match and trie node is terminal, we have a result
    if (autoState.canMatch && trieNode.isTerminal) {
      results.push({
        word: trieNode.word!,
        score: trieNode.metadata!.score
      });
    }

    // Try all trie children
    for (const [char, childTrie] of trieNode.children) {
      // Check what automaton states we can reach
      const nextStates = step(autoState, char);
      for (const nextState of nextStates) {
        if (!isDead(nextState)) {
          traverse(childTrie, nextState);
        }
      }
    }
  }

  traverse(trie.root, automaton.initialState);
  return results;
}

// Time complexity: O(m × |Σ| × d) to build automaton
// + O(k × d) per trie node visited, where d = max distance
// Much better than O(n × m × k) naive approach
```

Match based on how words sound, not how they're spelled. Useful for names and products:
```typescript
// Soundex: Encode words by their sound (English-centric)
function soundex(word: string): string {
  const codes: Record<string, string> = {
    'b': '1', 'f': '1', 'p': '1', 'v': '1',
    'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
    'd': '3', 't': '3',
    'l': '4',
    'm': '5', 'n': '5',
    'r': '6',
  };

  const first = word[0].toUpperCase();
  let code = first;
  let lastCode = codes[word[0].toLowerCase()] || '';

  for (let i = 1; i < word.length && code.length < 4; i++) {
    const c = word[i].toLowerCase();
    const newCode = codes[c];

    if (newCode && newCode !== lastCode) {
      code += newCode;
      lastCode = newCode;
    } else if (!newCode) {
      // Simplification: any un-coded letter (vowels, h, w, y) separates
      // consonants; strict Soundex treats h and w as non-separating
      lastCode = '';
    }
  }

  return code.padEnd(4, '0');
}

// "Robert" → "R163"
// "Rupert" → "R163" (same code!)
// "Programming" → "P626"
// "Programing" → "P626" (same code - handles misspelling)

// Build a phonetic index alongside the prefix index
interface DualIndex {
  prefixTrie: Trie;
  phoneticIndex: Map<string, string[]>; // Soundex → words
}

function fuzzySearch(query: string): Suggestion[] {
  const prefixMatches = prefixTrie.search(query);
  const phoneticCode = soundex(query);
  const phoneticMatches = phoneticIndex.get(phoneticCode) || [];

  // Merge and deduplicate
  return mergeResults(prefixMatches, phoneticMatches);
}
```

Production systems often use a tiered approach: exact prefix matches first (score 1.0), then edit distance 1 matches (score 0.8), then phonetic matches (score 0.5). This ensures exact matches always rank highest while still surfacing relevant fuzzy matches.
Users expect prefix matching to respect word boundaries. "learn" should match "machine learning" but perhaps not "relearn" (unless we explicitly want infix matching).
Strategy 1: First Word Prefix Only
Only match if the query is a prefix of the first word:
```typescript
function matchFirstWordPrefix(
  suggestion: string,
  query: string
): boolean {
  const firstWord = suggestion.split(/\s+/)[0];
  return firstWord.toLowerCase().startsWith(query.toLowerCase());
}

// Query: "mac"
// ✓ "MacBook Pro" (first word starts with mac)
// ✓ "Machine Learning" (first word starts with mac)
// ✗ "Apple MacBook" (first word is Apple)
```

Strategy 2: Any Word Prefix (Word Boundary)
Match if query is prefix of ANY word in the suggestion:
```typescript
function matchAnyWordPrefix(
  suggestion: string,
  query: string
): { matches: boolean; position: number } {
  const words = suggestion.split(/\s+/);
  const queryLower = query.toLowerCase();

  for (let i = 0; i < words.length; i++) {
    if (words[i].toLowerCase().startsWith(queryLower)) {
      return { matches: true, position: i };
    }
  }
  return { matches: false, position: -1 };
}

// Query: "learn"
// "Machine Learning" → (true, 1) - matches 2nd word
// "Learn Python" → (true, 0) - matches 1st word
// "Relearn Basics" → (false, -1) - "learn" is not a word prefix

// Use position for ranking: earlier position = higher score
function scoreByPosition(position: number): number {
  return position === 0 ? 1.0 : 1.0 / (position + 1);
}
```

Strategy 3: Multi-Index Approach
Build separate trie indices for different word positions:
```typescript
interface MultiPositionIndex {
  // Index by word position
  firstWordTrie: Trie; // Only first words
  allWordsTrie: Trie;  // All words with pointers to full suggestions

  // Posting lists: word → [(suggestionId, position), ...]
  wordToSuggestions: Map<string, { id: number; position: number }[]>;
}

function buildMultiPositionIndex(suggestions: string[]): MultiPositionIndex {
  const firstWordTrie = new Trie();
  const allWordsTrie = new Trie();
  const wordToSuggestions = new Map<string, { id: number; position: number }[]>();

  for (let id = 0; id < suggestions.length; id++) {
    const suggestion = suggestions[id];
    const words = suggestion.split(/\s+/);

    // Index first word
    firstWordTrie.insert(words[0].toLowerCase(), { suggestionId: id });

    // Index all words
    for (let pos = 0; pos < words.length; pos++) {
      const wordLower = words[pos].toLowerCase();
      allWordsTrie.insert(wordLower, { suggestionId: id, position: pos });

      // Posting list
      if (!wordToSuggestions.has(wordLower)) {
        wordToSuggestions.set(wordLower, []);
      }
      wordToSuggestions.get(wordLower)!.push({ id, position: pos });
    }
  }

  return { firstWordTrie, allWordsTrie, wordToSuggestions };
}

function search(query: string, index: MultiPositionIndex): Suggestion[] {
  // First try exact first-word match (highest relevance)
  const firstWordMatches = index.firstWordTrie.search(query);
  if (firstWordMatches.length >= 5) {
    return firstWordMatches.slice(0, 10); // Enough first-word matches
  }

  // Supplement with any-word matches
  const anyWordMatches = index.allWordsTrie.search(query);
  return mergeByScore(firstWordMatches, anyWordMatches).slice(0, 10);
}
```

For technical domains (code, product names), consider CamelCase/snake_case boundaries: 'getUser' should be searchable as 'get' or 'user'. Split on case transitions and underscores when indexing: 'getUserById' → ['get', 'user', 'by', 'id'].
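That identifier splitting can be done with a couple of regex passes before indexing. A minimal sketch; `splitIdentifier` is a name of my choosing:

```typescript
// Split identifiers on lower→upper case transitions, acronym boundaries,
// and underscores, then lowercase each token for indexing.
function splitIdentifier(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')    // getUser → get User
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2') // HTTPServer → HTTP Server
    .split(/[_\s]+/)
    .filter(Boolean)
    .map(w => w.toLowerCase());
}

console.log(splitIdentifier('getUserById'));     // ["get", "user", "by", "id"]
console.log(splitIdentifier('snake_case_name')); // ["snake", "case", "name"]
console.log(splitIdentifier('HTTPServer'));      // ["http", "server"]
```

Each token (and the unsplit identifier itself) would then be inserted into the word-position index above, so "user" finds `getUserById` just like a word prefix finds a phrase.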
Even with tries, matching at scale requires careful optimization. Here are proven techniques.
Very short prefixes (1-2 characters) match too many suggestions. Cache their results:
```typescript
class ShortPrefixCache {
  // Pre-computed results for all 1-2 character prefixes
  private cache: Map<string, Suggestion[]>;

  constructor(trie: Trie, topK: number = 10) {
    this.cache = new Map();

    // Generate all 1-char prefixes
    for (let c = 'a'.charCodeAt(0); c <= 'z'.charCodeAt(0); c++) {
      const prefix = String.fromCharCode(c);
      this.cache.set(prefix, trie.getTopK(prefix, topK));
    }

    // Generate all 2-char prefixes (26 × 26 = 676)
    for (let c1 = 'a'.charCodeAt(0); c1 <= 'z'.charCodeAt(0); c1++) {
      for (let c2 = 'a'.charCodeAt(0); c2 <= 'z'.charCodeAt(0); c2++) {
        const prefix = String.fromCharCode(c1) + String.fromCharCode(c2);
        this.cache.set(prefix, trie.getTopK(prefix, topK));
      }
    }

    // Optional: 3-char prefixes (26³ = 17,576 entries)
    // Only if memory allows and 2-char results are too broad
  }

  search(prefix: string, k: number): Suggestion[] | null {
    if (prefix.length <= 2) {
      return this.cache.get(prefix.toLowerCase()) || null;
    }
    return null; // Fall through to trie search
  }
}

// Usage pattern:
function search(prefix: string): Suggestion[] {
  const cached = shortPrefixCache.search(prefix, 10);
  if (cached) return cached;
  return trie.search(prefix);
}
```

Keystrokes often extend the previous query. Reuse previous computation:
```typescript
class IncrementalSearchSession {
  private lastQuery: string = '';
  private lastNode: TrieNode | null = null;
  private lastResults: Suggestion[] = [];

  search(query: string, trie: Trie): Suggestion[] {
    // Check if this is an extension of the last query
    if (query.startsWith(this.lastQuery) && this.lastNode) {
      // Navigate from last position instead of root
      const extension = query.slice(this.lastQuery.length);
      let node = this.lastNode;

      for (const char of extension) {
        const child = node.children.get(char.toLowerCase());
        if (!child) {
          // No match for extension
          this.reset();
          return [];
        }
        node = child;
      }

      this.lastQuery = query;
      this.lastNode = node;
      this.lastResults = node.getTopK(10);
      return this.lastResults;
    }

    // Not an extension, start fresh
    this.reset();
    this.lastNode = trie.findNode(query);
    this.lastQuery = query;

    if (this.lastNode) {
      this.lastResults = this.lastNode.getTopK(10);
      return this.lastResults;
    }
    return [];
  }

  reset(): void {
    this.lastQuery = '';
    this.lastNode = null;
    this.lastResults = [];
  }
}

// Benefits:
// - "mach" → "machi" → "machin" → "machine"
//   Each keystroke traverses only 1 node instead of the full path
// - Dramatic speedup for interactive typing
```

Client-side optimization reduces unnecessary requests:
```typescript
class TypeaheadClient {
  private debounceMs: number = 100; // Wait 100ms after last keystroke
  private minChars: number = 2;     // Don't search until 2+ characters
  private timeoutId: number | null = null;
  private lastQuery: string = '';
  private controller: AbortController | null = null;

  onInput(query: string): void {
    // Cancel pending request
    if (this.controller) {
      this.controller.abort();
    }

    // Clear pending debounce
    if (this.timeoutId) {
      clearTimeout(this.timeoutId);
    }

    // Skip short queries
    if (query.length < this.minChars) {
      this.clearResults();
      return;
    }

    // Debounce
    this.timeoutId = setTimeout(() => {
      this.executeSearch(query);
    }, this.debounceMs);
  }

  private async executeSearch(query: string): Promise<void> {
    // Avoid duplicate requests
    if (query === this.lastQuery) return;
    this.lastQuery = query;

    // Abort previous in-flight request
    this.controller = new AbortController();

    try {
      const response = await fetch(`/api/suggest?q=${encodeURIComponent(query)}`, {
        signal: this.controller.signal,
      });
      const suggestions = await response.json();
      this.displayResults(suggestions);
    } catch (e: any) {
      if (e.name !== 'AbortError') {
        console.error('Search failed:', e);
      }
    }
  }

  // displayResults and clearResults are UI hooks, omitted here
}

// Optimization impact:
// Without debouncing: "machine" = 7 requests
// With 100ms debounce: "machine" typed quickly = 1-2 requests
```

Prefix matching performance is a full-stack concern. Client-side debouncing, edge caching, incremental search sessions, and trie optimizations all contribute. A 20ms server response means nothing if the network round-trip is 200ms—that's why edge deployment and caching matter.
How do we know if our prefix matching is working well? Quality metrics help guide optimization efforts.
| Metric | Definition | Target |
|---|---|---|
| Prefix Recall | % of relevant suggestions retrieved for given prefix | 90% for exact prefix |
| Fuzzy Recall | % of relevant suggestions retrieved despite typos | 70% at edit distance 1 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | 0.5 (avg rank 2 or better) |
| P@K (Precision at K) | % of top-K results that are relevant | 80% for K=5 |
| Character Savings Rate | Avg keystrokes saved by using suggestions | 50% |
| Zero Results Rate | % of queries with no suggestions | < 5% |
```typescript
interface SearchEvaluation {
  query: string;
  expectedResult: string;  // What the user was looking for
  actualResults: string[]; // What the system returned
}

function computeMRR(evaluations: SearchEvaluation[]): number {
  let sumReciprocalRanks = 0;

  for (const ev of evaluations) { // note: "eval" is a reserved word
    const rank = ev.actualResults.findIndex(
      r => r.toLowerCase() === ev.expectedResult.toLowerCase()
    ) + 1; // 1-indexed rank

    if (rank > 0) {
      sumReciprocalRanks += 1 / rank;
    }
    // If not found, adds 0 (reciprocal rank of ∞ is 0)
  }

  return sumReciprocalRanks / evaluations.length;
}

// MRR = 1.0 means perfect (always rank 1)
// MRR = 0.5 means average rank 2
// MRR = 0.33 means average rank 3

function computeCharacterSavings(evaluations: SearchEvaluation[]): number {
  let totalSaved = 0;
  let totalPossible = 0;

  for (const ev of evaluations) {
    const queryLength = ev.query.length;
    const targetLength = ev.expectedResult.length;

    if (ev.actualResults.includes(ev.expectedResult)) {
      totalSaved += targetLength - queryLength;
    }
    totalPossible += targetLength;
  }

  return totalSaved / totalPossible;
}
```

Metrics matter, but user behavior is the final judge. A/B test matching strategies with real users. Measure not just clicks on suggestions, but downstream outcomes: Did users find what they wanted? Did they complete their task? Matching changes can have surprising effects on user behavior.
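The metrics table also lists P@K, which isn't computed above. A minimal sketch, with `precisionAtK` as a name of my choosing and relevance simplified to a set of known-relevant (lowercased) strings:

```typescript
// Precision at K: the fraction of the top-K returned results that are
// relevant. Dividing by k (not the result count) penalizes short result lists.
function precisionAtK(
  results: string[],
  relevant: Set<string>,
  k: number
): number {
  if (k <= 0) return 0;
  const topK = results.slice(0, k);
  const hits = topK.filter(r => relevant.has(r.toLowerCase())).length;
  return hits / k;
}

// 2 of the top 3 results are relevant → P@3 ≈ 0.67
precisionAtK(
  ['machine learning', 'machine shop', 'machine learning python'],
  new Set(['machine learning', 'machine learning python']),
  3
);
```

In practice P@K would be averaged over an evaluation set, just like MRR above, and tracked per query-length bucket, since short prefixes are inherently harder to rank.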
Prefix matching is far more nuanced than simple string comparison. Let's consolidate the key takeaways:
```
User Input: "prgrm tutoral"
        ↓
Normalization: "prgrm tutoral" (lowercase, no diacritics)
        ↓
Tokenization: ["prgrm", "tutoral"]
        ↓
Fuzzy Expansion: ["program", "programming"], ["tutorial", "tutorials"]
        ↓
Prefix Match: Find all suggestions matching expanded terms
        ↓
Scoring: Rank by match quality, position, popularity
        ↓
Top-K: Return best 10 suggestions
        ↓
Display: "Programming Tutorial", "Program Tutorial", ...
```
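The pipeline above can be sketched end to end. This is a hypothetical skeleton: `suggest` and its four function parameters (`normalize`, `fuzzyExpand`, `prefixMatch`, `scoreMatch`) are names of my choosing, standing in for the components built throughout this page.

```typescript
interface ScoredSuggestion { suggestion: string; score: number }

// End-to-end matching pipeline: normalize → tokenize → expand → match →
// score → top-K. Each stage is injected so real implementations can plug in.
function suggest(
  rawInput: string,
  normalize: (s: string) => string,
  fuzzyExpand: (token: string) => string[],
  prefixMatch: (term: string) => string[],
  scoreMatch: (suggestion: string, terms: string[]) => number,
  k: number = 10
): string[] {
  const tokens = normalize(rawInput).split(/\s+/).filter(Boolean);
  const expanded = tokens.flatMap(fuzzyExpand);           // typo-tolerant terms
  const candidates = new Set(expanded.flatMap(prefixMatch)); // dedupe matches

  const scored: ScoredSuggestion[] = [...candidates].map(s => ({
    suggestion: s,
    score: scoreMatch(s, expanded),
  }));

  return scored
    .sort((a, b) => b.score - a.score) // best first
    .slice(0, k)
    .map(s => s.suggestion);
}
```

Structuring the pipeline as injected stages also makes each stage independently testable and A/B-testable, which matters given how much matching changes can shift user behavior.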
What's next:
With matching mechanics understood, the next page explores Ranking Suggestions—how to order millions of matching suggestions so the best ones appear first. Ranking determines whether typeahead feels magical or mediocre.
You now understand the breadth and depth of prefix matching challenges. From simple case-folding to complex CJK segmentation, from exact prefix to fuzzy tolerance—you can make informed decisions about matching strategies for any typeahead system.