The user has typed "prog" and is waiting. Every millisecond that passes without suggestions appearing is a millisecond of friction, a small erosion of trust in your product. Research shows that autocomplete latency directly impacts user satisfaction: 100ms feels instant, 300ms feels sluggish, and beyond 500ms users start retyping or abandoning.
Building autocomplete that works is one challenge. Building autocomplete that works fast—for billions of suggestions, millions of queries per second, with personalization and ranking—is an entirely different engineering discipline.
This page takes you through the complete performance optimization toolkit: from caching and precomputation through distributed architecture and hardware acceleration. You'll learn to design systems where speed isn't an afterthought but an architectural principle.
By completing this page, you'll understand latency budgeting, caching strategies for autocomplete, precomputation patterns, index sharding and replication, network optimization, and the production techniques used by Google, Amazon, and other major players to deliver sub-50ms autocomplete at global scale.
Performance optimization starts with a latency budget—a breakdown of where time is spent and how much each component can consume. Without a budget, optimization is guesswork.
End-to-End Latency Components
From keypress to display, autocomplete involves multiple latency-adding stages:
| Stage | Typical Latency | Optimization Lever |
|---|---|---|
| Client debounce | 50-150ms | UX tradeoff—shorter = more requests |
| Network RTT (client→server) | 10-100ms | Edge servers, CDN |
| Request parsing/auth | 1-5ms | Efficient serialization |
| Cache lookup | 0.1-1ms | Hot data in memory |
| Index query | 1-10ms | Optimized data structures |
| Ranking computation | 1-10ms | Precomputation, simpler models |
| Response serialization | 0.5-2ms | Efficient formats (protobuf) |
| Network RTT (server→client) | 10-100ms | Edge servers, CDN |
| Client rendering | 5-20ms | Efficient DOM updates |
Budget Allocation Strategy
For a target of 100ms end-to-end (perceived instant):
Client debounce: 50ms (deliberate delay to reduce requests)
Network (round trip): 30ms (with edge servers)
Server processing: 15ms (cache + index + rank)
Client rendering: 5ms (minimal DOM work)
─────────────────────────
Total: 100ms
This leaves only 15ms for all server-side work. At scale, this is an aggressive target requiring careful optimization.
```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LatencyBudget:
    """Track and enforce latency budgets."""
    total_budget_ms: float = 15.0

    # Per-component budgets
    component_budgets: Dict[str, float] = field(default_factory=lambda: {
        'cache_lookup': 1.0,
        'index_query': 5.0,
        'fuzzy_matching': 3.0,
        'ranking': 4.0,
        'serialization': 2.0,
    })

    # Actual timings
    actual_timings: Dict[str, float] = field(default_factory=dict)
    start_time: float = field(default_factory=time.perf_counter)

    @contextmanager
    def track(self, component: str):
        """Context manager to track component latency."""
        component_start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - component_start) * 1000
            self.actual_timings[component] = elapsed_ms

    def remaining_budget_ms(self) -> float:
        """How much time remains in total budget."""
        elapsed = (time.perf_counter() - self.start_time) * 1000
        return max(0, self.total_budget_ms - elapsed)

    def is_over_budget(self, component: Optional[str] = None) -> bool:
        """Check if we've exceeded budget."""
        if component:
            actual = self.actual_timings.get(component, 0)
            budget = self.component_budgets.get(component, float('inf'))
            return actual > budget
        elapsed = (time.perf_counter() - self.start_time) * 1000
        return elapsed > self.total_budget_ms

    def should_skip_expensive_operation(self) -> bool:
        """Determine if we should skip optional expensive operations."""
        remaining = self.remaining_budget_ms()
        # If less than 3ms remaining, skip fuzzy matching
        return remaining < 3.0


class AdaptiveTimeoutManager:
    """
    Dynamically adjust component timeouts based on budget.
    """

    def __init__(self, total_budget_ms: float = 15.0):
        self.total_budget_ms = total_budget_ms
        self.start_time = None

    def start_request(self):
        """Mark request start."""
        self.start_time = time.perf_counter()

    def get_timeout_for(self, component: str, default_budget_ms: float) -> float:
        """
        Get remaining timeout for a component.
        Returns the smaller of the default budget or the remaining time.
        """
        if self.start_time is None:
            return default_budget_ms

        elapsed = (time.perf_counter() - self.start_time) * 1000
        remaining = self.total_budget_ms - elapsed

        # Leave 2ms buffer for response serialization
        available = remaining - 2.0

        return max(0.1, min(default_budget_ms, available))
```

Optimize for 99th-percentile latency, not the median. Users remember the slow requests: if P50 is 10ms but P99 is 500ms, 1% of requests feel broken. Monitor and alert on tail latency, not just the average.
Caching is the single most impactful optimization for autocomplete. Query patterns follow a power law: the top 1% of prefixes generate 50%+ of traffic. An effective caching strategy can serve the majority of requests without hitting the index.
Cache Levels
```python
import hashlib
import json
import time
from typing import Any, Dict, List, Optional, Tuple


class AutocompleteCache:
    """
    Multi-level cache for autocomplete results.
    """

    def __init__(self, redis_client, local_capacity: int = 10000):
        self.redis = redis_client

        # L2: in-process LRU cache for the hottest prefixes
        self.local_cache: Dict[str, Tuple[List[str], float]] = {}
        self.local_capacity = local_capacity

        # L1: precomputed results for top prefixes (loaded at startup)
        self.precomputed: Dict[str, List[str]] = {}

        # TTLs
        self.local_ttl = 60    # 1 minute in-process
        self.redis_ttl = 300   # 5 minutes in Redis

    def cache_key(self, prefix: str, user_id: Optional[str] = None) -> str:
        """
        Generate cache key for prefix.
        Include user for personalized results.
        """
        if user_id:
            # Personalized: include user segment (not full ID for privacy)
            user_segment = hashlib.md5(user_id.encode()).hexdigest()[:4]
            return f"ac:{prefix}:{user_segment}"
        return f"ac:{prefix}"

    def get(self, prefix: str, user_id: Optional[str] = None) -> Optional[List[str]]:
        """
        Get cached results, checking all levels from fastest to slowest.
        Returns the cached suggestion list, or None on a full miss.
        """
        key = self.cache_key(prefix, user_id)

        # L1: Check precomputed (non-personalized only)
        if not user_id and prefix in self.precomputed:
            return self.precomputed[prefix]

        # L2: Check local cache
        if key in self.local_cache:
            results, timestamp = self.local_cache[key]
            if time.time() - timestamp < self.local_ttl:
                return results
            else:
                del self.local_cache[key]

        # L3: Check Redis
        try:
            cached = self.redis.get(key)
            if cached:
                results = json.loads(cached)
                # Promote to local cache
                self._local_set(key, results)
                return results
        except Exception:
            pass  # Redis failure - fall through

        return None

    def set(self, prefix: str, results: List[str],
            user_id: Optional[str] = None) -> None:
        """Cache results at appropriate levels."""
        key = self.cache_key(prefix, user_id)

        # Always set in local cache
        self._local_set(key, results)

        # Set in Redis for longer persistence
        try:
            self.redis.setex(key, self.redis_ttl, json.dumps(results))
        except Exception:
            pass  # Redis failure - local cache still works

    def _local_set(self, key: str, results: List[str]) -> None:
        """Set in local cache with LRU eviction."""
        # Simple eviction: remove oldest if over capacity
        if len(self.local_cache) >= self.local_capacity:
            oldest_key = min(self.local_cache.keys(),
                             key=lambda k: self.local_cache[k][1])
            del self.local_cache[oldest_key]

        self.local_cache[key] = (results, time.time())

    def load_precomputed(self, popular_prefixes: Dict[str, List[str]]) -> None:
        """
        Load precomputed results for popular prefixes.
        Called at application startup.
        """
        self.precomputed = popular_prefixes
        print(f"Loaded {len(popular_prefixes)} precomputed prefixes")


class PersonalizedCacheStrategy:
    """
    Handle caching for personalized results.

    Challenge: personalized results reduce cache hit rate.
    Solution: segment users into cohorts for shared caching.
    """

    def __init__(self, num_segments: int = 100):
        self.num_segments = num_segments
        self.segment_profiles: Dict[int, List[str]] = {}

    def get_user_segment(self, user: 'UserProfile') -> int:
        """
        Assign user to cache segment based on interests.
        Users with similar interests share a segment.

        'UserProfile' (with an `interests` dict) is assumed to be
        defined earlier in this module.
        """
        # Simple: hash top interests
        top_interests = sorted(
            user.interests.items(),
            key=lambda x: -x[1]
        )[:5]
        interest_key = ','.join(k for k, v in top_interests)
        segment = hash(interest_key) % self.num_segments
        return segment

    def should_personalize_cache(self, query: str, user: 'UserProfile') -> bool:
        """
        Determine if query should use personalized cache.

        Short prefixes: use personalized (high ambiguity)
        Long prefixes: use global (user intent is clearer)
        """
        if len(query) <= 2:
            return True   # High ambiguity, personalization helps
        if len(query) >= 6:
            return False  # Intent is clear, use global cache

        # Medium length: personalize if user has strong preferences
        return len(user.interests) > 10
```

| Prefix Length | Typical Hit Rate | Strategy |
|---|---|---|
| 1 character | 95%+ | Precompute all 26 letters |
| 2 characters | 80-90% | Precompute common bigrams |
| 3 characters | 60-80% | Cache in Redis |
| 4+ characters | 30-60% | Cache most popular, compute rest |
| Personalized | 20-40% | Segment-based caching |
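The first two rows of the table can be handled exhaustively: there are only 26 single-letter prefixes and at most 676 two-letter prefixes, so all of them can be precomputed at startup. Here is a minimal sketch of that idea; `fetch_suggestions(prefix)` is a hypothetical function that queries the index directly.

```python
import itertools
import string
from typing import Callable, Dict, List


def precompute_short_prefixes(fetch_suggestions: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Precompute results for every 1- and 2-character prefix (26 + up to 676 entries)."""
    precomputed: Dict[str, List[str]] = {}
    for letter in string.ascii_lowercase:
        precomputed[letter] = fetch_suggestions(letter)
    for a, b in itertools.product(string.ascii_lowercase, repeat=2):
        results = fetch_suggestions(a + b)
        if results:  # skip bigrams with no matches
            precomputed[a + b] = results
    return precomputed
```

The result can be handed straight to `AutocompleteCache.load_precomputed` at startup.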
Don't wait for organic traffic to fill caches after deployment. Pre-warm caches by replaying a sample of yesterday's queries before receiving production traffic. This ensures good performance from the first request.
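A minimal pre-warming sketch, assuming yesterday's queries are available as a plain-text log file and reusing the `AutocompleteCache` above; the log path and `fetch_suggestions` callable are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable, List


def prewarm_cache(cache: 'AutocompleteCache',
                  fetch_suggestions: Callable[[str], List[str]],
                  log_path: str,
                  sample_size: int = 50_000) -> None:
    """Replay the most common prefixes from yesterday's logs into the cache."""
    prefix_counts: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            query = line.strip().lower()
            if query:
                # Count every prefix the user would have typed on the way to this query
                for length in range(1, len(query) + 1):
                    prefix_counts[query[:length]] += 1

    # Warm the hottest prefixes first, before taking production traffic
    for prefix, _ in prefix_counts.most_common(sample_size):
        results = fetch_suggestions(prefix)  # hits the index directly
        cache.set(prefix, results)
```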
Precomputation moves work from query time to indexing time. For autocomplete, we can precompute top suggestions for the most frequent prefixes and store them directly.
What to Precompute
At query time, we compute: prefix matching → fuzzy expansion → scoring → ranking → top-K selection. Much of this can shift to preprocessing.
```python
import time
from collections import Counter
from typing import Dict, List, Set

# Note: Trie, BasicRanker, and RankingContext are defined on earlier pages of
# this module; _make_suggestion (not shown) wraps a raw term for the scorer.


class PrecomputedAutocomplete:
    """
    Precompute top suggestions for popular prefixes.
    """

    def __init__(self, trie: 'Trie', scorer: 'BasicRanker'):
        self.trie = trie
        self.scorer = scorer

        # Precomputed: prefix -> top 20 suggestions
        self.precomputed: Dict[str, List[str]] = {}

        # Track which prefixes we've precomputed
        self.precomputed_prefixes: Set[str] = set()

    def precompute_for_prefix(self, prefix: str, top_k: int = 20) -> None:
        """Precompute top suggestions for a prefix."""
        # Get all matching terms
        all_matches = self.trie.get_all_words_with_prefix(prefix)

        if not all_matches:
            return

        # Score and rank all matches
        context = RankingContext(
            query=prefix,
            user_id=None,      # Non-personalized
            location=None,
            timestamp=time.time(),
            previous_queries=[]
        )

        scored = []
        for term in all_matches:
            suggestion = self._make_suggestion(term, prefix)
            score = self.scorer.score(prefix, suggestion, context)
            scored.append((term, score))

        # Sort and take top K
        scored.sort(key=lambda x: -x[1])
        top_terms = [term for term, _ in scored[:top_k]]

        self.precomputed[prefix] = top_terms
        self.precomputed_prefixes.add(prefix)

    def precompute_all_popular(self, query_logs: List[str],
                               min_frequency: int = 100,
                               max_prefixes: int = 100000) -> None:
        """
        Precompute for prefixes appearing frequently in query logs.
        Run this as a batch job (e.g., daily).
        """
        # Count prefix frequencies
        prefix_counts: Counter = Counter()
        for query in query_logs:
            query = query.lower().strip()
            # Count all prefixes of this query
            for length in range(1, len(query) + 1):
                prefix_counts[query[:length]] += 1

        # Filter to frequent prefixes
        popular = [
            prefix for prefix, count in prefix_counts.items()
            if count >= min_frequency
        ]

        # Sort by frequency, take top N
        popular.sort(key=lambda p: -prefix_counts[p])
        popular = popular[:max_prefixes]

        print(f"Precomputing {len(popular)} popular prefixes...")

        for i, prefix in enumerate(popular):
            self.precompute_for_prefix(prefix)

            if (i + 1) % 10000 == 0:
                print(f"  Precomputed {i + 1}/{len(popular)}")

        print(f"Precomputation complete. Memory: ~{self._estimate_memory_mb():.1f} MB")

    def query(self, prefix: str, limit: int = 10) -> List[str]:
        """
        Query with precomputed optimization.
        """
        # Check precomputed first
        if prefix in self.precomputed:
            return self.precomputed[prefix][:limit]

        # Fall back to real-time computation
        return self._compute_live(prefix, limit)

    def _compute_live(self, prefix: str, limit: int) -> List[str]:
        """Fall back to live computation."""
        matches = self.trie.get_all_words_with_prefix(prefix)
        # Simplified ranking for real-time path
        matches.sort(key=lambda x: len(x))  # Shorter = more relevant
        return matches[:limit]

    def _estimate_memory_mb(self) -> float:
        """Estimate memory usage of precomputed data."""
        total_chars = sum(
            sum(len(term) for term in terms)
            for terms in self.precomputed.values()
        )
        # Rough estimate: chars + overhead
        return total_chars / 1_000_000 * 2


class IncrementalPrecomputation:
    """
    Update precomputed results incrementally as data changes.
    """

    def __init__(self):
        self.precomputed: Dict[str, List[str]] = {}
        self.last_computed: Dict[str, float] = {}
        self.staleness_threshold_hours = 24

    def is_stale(self, prefix: str) -> bool:
        """Check if precomputed results are outdated."""
        if prefix not in self.last_computed:
            return True
        age_hours = (time.time() - self.last_computed[prefix]) / 3600
        return age_hours > self.staleness_threshold_hours

    def mark_affected_prefixes(self, added_term: str,
                               removed_term: str = None) -> Set[str]:
        """
        Identify which precomputed prefixes are affected by term changes.
        """
        affected = set()

        terms_to_check = [added_term] if added_term else []
        if removed_term:
            terms_to_check.append(removed_term)

        for term in terms_to_check:
            # All prefixes of this term are affected
            for length in range(1, len(term) + 1):
                prefix = term[:length]
                if prefix in self.precomputed:
                    affected.add(prefix)

        return affected
```

Precomputation trades freshness for speed. Precomputed results are stale until regenerated. For trending topics or time-sensitive content, blend precomputed results with a real-time tail that captures recent changes.
For billion-term corpora and millions of QPS, a single server cannot handle the load. Sharding (horizontal partitioning) and replication (copies for redundancy and load distribution) are essential.
Sharding Strategies for Autocomplete
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Prefix-based | Shard by first N characters | Single shard per query | Uneven distribution (common letters) |
| Hash-based | Hash term → shard | Even distribution | Query hits all shards (scatter-gather) |
| Popularity-based | Hot terms on dedicated shards | Optimizes common queries | Complex routing logic |
| Range-based | Lexicographic ranges | Continuous prefixes together | Rebalancing on updates |
```python
import asyncio
from typing import Any, Dict, List, Optional, Tuple


class PrefixShardedIndex:
    """
    Shard autocomplete index by prefix for single-shard queries.
    """

    def __init__(self, num_shards: int = 26):
        self.num_shards = num_shards

        # In production: actual shard client connections
        self.shard_clients: Dict[int, Any] = {}

        # Prefix ranges for each shard
        self.shard_ranges: List[Tuple[str, str]] = self._compute_ranges()

    def _compute_ranges(self) -> List[Tuple[str, str]]:
        """
        Compute prefix ranges for even distribution.

        Simple version: one shard per letter.
        Production: analyze traffic for balanced splits.
        """
        ranges = []
        for i in range(self.num_shards):
            start_char = chr(ord('a') + i)
            if i == self.num_shards - 1:
                end_char = '{'  # Character after 'z'
            else:
                end_char = chr(ord('a') + i + 1)
            ranges.append((start_char, end_char))
        return ranges

    def get_shard_for_prefix(self, prefix: str) -> int:
        """Route prefix to appropriate shard."""
        if not prefix:
            return 0  # Default shard for empty prefix

        first_char = prefix[0].lower()
        for shard_id, (start, end) in enumerate(self.shard_ranges):
            if start <= first_char < end:
                return shard_id

        return self.num_shards - 1  # Default to last shard

    async def query_shard(self, shard_id: int, prefix: str, limit: int) -> List[str]:
        """Query a single shard."""
        client = self.shard_clients.get(shard_id)
        if not client:
            return []

        # In production: actual RPC call
        results = await client.query(prefix, limit)
        return results

    async def query(self, prefix: str, limit: int = 10) -> List[str]:
        """
        Query with routing to appropriate shard.
        Single shard for prefix queries (no scatter-gather).
        """
        shard_id = self.get_shard_for_prefix(prefix)
        return await self.query_shard(shard_id, prefix, limit)


class ReplicatedShardCluster:
    """
    Manage replicated shards for high availability and load distribution.
    """

    def __init__(self, num_shards: int, replicas_per_shard: int = 3):
        self.num_shards = num_shards
        self.replicas_per_shard = replicas_per_shard

        # For each shard: list of replica addresses
        self.shard_replicas: Dict[int, List[str]] = {}

        # Health tracking
        self.replica_health: Dict[str, bool] = {}

        # Load balancing: round-robin counters
        self.rr_counters: Dict[int, int] = {}

    def register_replica(self, shard_id: int, address: str) -> None:
        """Register a replica for a shard."""
        if shard_id not in self.shard_replicas:
            self.shard_replicas[shard_id] = []
            self.rr_counters[shard_id] = 0
        self.shard_replicas[shard_id].append(address)
        self.replica_health[address] = True

    def get_replica_for_query(self, shard_id: int) -> Optional[str]:
        """
        Select a healthy replica for querying.
        Uses round-robin for load balancing.
        """
        replicas = self.shard_replicas.get(shard_id, [])
        healthy = [r for r in replicas if self.replica_health.get(r, False)]

        if not healthy:
            return None

        # Round-robin selection
        counter = self.rr_counters.get(shard_id, 0)
        selected = healthy[counter % len(healthy)]
        self.rr_counters[shard_id] = counter + 1

        return selected

    def mark_unhealthy(self, address: str) -> None:
        """Mark a replica as unhealthy after failure."""
        self.replica_health[address] = False
        # In production: trigger health check retry

    async def query_with_failover(self, shard_id: int, prefix: str,
                                  limit: int, timeout_ms: float = 10) -> List[str]:
        """
        Query shard with automatic failover to other replicas.
        """
        replicas = self.shard_replicas.get(shard_id, [])
        healthy = [r for r in replicas if self.replica_health.get(r, False)]

        for replica in healthy:
            try:
                result = await asyncio.wait_for(
                    self._query_replica(replica, prefix, limit),
                    timeout=timeout_ms / 1000
                )
                return result
            except asyncio.TimeoutError:
                self.mark_unhealthy(replica)
                continue
            except Exception:
                self.mark_unhealthy(replica)
                continue

        # All replicas failed
        return []

    async def _query_replica(self, address: str, prefix: str, limit: int) -> List[str]:
        """Query a specific replica."""
        # In production: actual network call
        pass
```

For autocomplete, prefix-based sharding is almost always better than hash-based. It eliminates scatter-gather overhead since each query routes to exactly one shard. Handle uneven distribution by sub-sharding popular prefixes (e.g., 's' → 'sa-sh', 'si-sz'), as sketched below.
Network latency often dominates the autocomplete budget, especially for geographically distributed users. Optimizing network paths and protocols can shave tens of milliseconds.
Key Network Optimizations
```python
import asyncio
import gzip
import json
from typing import Dict, List, Optional


class OptimizedResponseFormatter:
    """
    Optimize response size and format for autocomplete.
    """

    def __init__(self, enable_compression: bool = True):
        self.enable_compression = enable_compression
        self.min_compression_size = 200  # bytes

    def format_response(self, suggestions: List[str], query: str,
                        metadata: Optional[Dict] = None) -> bytes:
        """
        Create compact response format.
        """
        # Minimal response structure
        response = {
            'q': query,        # Short key names
            's': suggestions,  # Just the essentials
        }
        if metadata:
            response['m'] = metadata

        json_bytes = json.dumps(response, separators=(',', ':')).encode('utf-8')

        # Compress if worthwhile
        if self.enable_compression and len(json_bytes) > self.min_compression_size:
            compressed = gzip.compress(json_bytes, compresslevel=6)
            if len(compressed) < len(json_bytes) * 0.8:
                return compressed

        return json_bytes

    def streaming_response(self, suggestions: List[str], query: str):
        """
        Stream response incrementally.
        Benefit: first suggestion can render while the rest are still being sent.
        """
        yield b'{"q":"'
        yield query.encode('utf-8')
        yield b'","s":['

        for i, suggestion in enumerate(suggestions):
            if i > 0:
                yield b','
            yield json.dumps(suggestion).encode('utf-8')
            # Flush after each suggestion

        yield b']}'


class RequestOptimizer:
    """
    Client-side request optimization strategies.
    """

    def __init__(self, debounce_ms: int = 50):
        self.debounce_ms = debounce_ms
        self.pending_request: Optional[asyncio.Task] = None

    async def query_with_debounce(self, prefix: str, query_fn) -> Optional[List[str]]:
        """
        Debounce rapid keystrokes into a single request.

        If user types "prog" in 100ms:
        - Without debounce: 4 requests (p, pr, pro, prog)
        - With 50ms debounce: 1 request (prog)
        """
        # Cancel any pending request
        if self.pending_request and not self.pending_request.done():
            self.pending_request.cancel()

        # Wait for debounce period
        await asyncio.sleep(self.debounce_ms / 1000)

        # Execute the query
        return await query_fn(prefix)

    async def query_with_cancellation(self, prefix: str, query_fn) -> Optional[List[str]]:
        """
        Cancel in-flight requests when a new query arrives.
        """
        # Cancel previous request
        if self.pending_request and not self.pending_request.done():
            self.pending_request.cancel()
            try:
                await self.pending_request
            except asyncio.CancelledError:
                pass

        # Start new request
        self.pending_request = asyncio.create_task(query_fn(prefix))

        try:
            return await self.pending_request
        except asyncio.CancelledError:
            return None
```

| Optimization | Latency Reduction | Implementation Effort |
|---|---|---|
| Edge deployment | 30-100ms | High (infrastructure) |
| HTTP/2 | 10-30ms (cold requests) | Low (configuration) |
| Request debouncing | N/A (reduces load) | Low (client code) |
| Response compression | 2-10ms (large responses) | Low (configuration) |
| Connection keep-alive | 15-30ms (repeated requests) | Low (configuration) |
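Connection keep-alive mostly means reusing one pooled HTTP session instead of opening a connection per keystroke. A minimal client-side sketch, assuming the aiohttp library and a hypothetical `/suggest` endpoint that returns the compact `{"q": ..., "s": [...]}` format from above:

```python
from typing import List, Optional

import aiohttp


class KeepAliveSuggestClient:
    """Reuse one HTTP session so repeated queries skip TCP/TLS setup (keep-alive)."""

    def __init__(self, base_url: str):
        self.base_url = base_url  # e.g. "https://suggest.example.com" (hypothetical)
        self._session: Optional[aiohttp.ClientSession] = None

    async def suggest(self, prefix: str, limit: int = 10) -> List[str]:
        if self._session is None or self._session.closed:
            # Connections in this session are pooled and kept alive across requests
            self._session = aiohttp.ClientSession()
        async with self._session.get(f"{self.base_url}/suggest",
                                     params={"q": prefix, "limit": limit}) as resp:
            payload = await resp.json()
            return payload.get("s", [])

    async def close(self) -> None:
        if self._session and not self._session.closed:
            await self._session.close()
```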
Mobile networks have higher latency and more variability than fixed networks. 4G typically adds 50-100ms base latency. For mobile, debounce aggressively (100-150ms), and consider larger batch responses to reduce round trips.
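A tiny sketch of adapting the debounce window to connection quality, following the guidance above. The network-type labels are assumptions about what the client can report (for example, via the browser's Network Information API); the exact values are tunable.

```python
def choose_debounce_ms(network_type: str) -> int:
    """Pick a debounce interval from a (hypothetical) client-reported network type."""
    # Debounce harder on slower, higher-latency links to cut request volume
    return {
        'wifi': 50,
        '5g': 75,
        '4g': 100,
        '3g': 150,
    }.get(network_type.lower(), 100)  # default to the conservative mobile setting
```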
Beyond caching and architecture, the underlying data structures significantly impact performance. Memory layout, cache efficiency, and algorithmic constants matter at scale.
Memory Layout Optimizations
```python
import array
import struct
from typing import Dict, List, Tuple


class CompactSuggestionStore:
    """
    Memory-optimized storage for autocomplete suggestions.

    Instead of storing strings directly, store:
    1. Term IDs (4 bytes each)
    2. Scores (4 bytes each as float32; float16 or fixed-point would save more)
    3. Actual strings in separate pooled storage

    Reduces memory 60%+ compared to Python objects.
    """

    def __init__(self):
        # String pool: all unique terms concatenated
        self.string_pool: bytearray = bytearray()

        # Term table: term_id -> (offset, length) in string pool
        self.term_offsets: array.array = array.array('I')  # uint32
        self.term_lengths: array.array = array.array('H')  # uint16

        # Reverse lookup: term -> term_id
        self.term_to_id: Dict[str, int] = {}

        # Suggestion lists: prefix -> packed (term_id, score) entries
        self.suggestions: Dict[str, bytes] = {}

    def add_term(self, term: str) -> int:
        """Add term to string pool, return term_id."""
        if term in self.term_to_id:
            return self.term_to_id[term]

        term_id = len(self.term_offsets)
        term_bytes = term.encode('utf-8')

        # Record offset and length
        self.term_offsets.append(len(self.string_pool))
        self.term_lengths.append(len(term_bytes))

        # Append to pool
        self.string_pool.extend(term_bytes)

        self.term_to_id[term] = term_id
        return term_id

    def get_term(self, term_id: int) -> str:
        """Retrieve term from string pool."""
        offset = self.term_offsets[term_id]
        length = self.term_lengths[term_id]
        return self.string_pool[offset:offset + length].decode('utf-8')

    def set_suggestions(self, prefix: str,
                        suggestions: List[Tuple[str, float]]) -> None:
        """Store suggestions as compact binary format."""
        # Format: [term_id (4 bytes), score (4 bytes)] × count
        packed = bytearray()
        for term, score in suggestions:
            term_id = self.add_term(term)
            packed.extend(struct.pack('If', term_id, score))

        self.suggestions[prefix] = bytes(packed)

    def get_suggestions(self, prefix: str) -> List[Tuple[str, float]]:
        """Retrieve and unpack suggestions."""
        if prefix not in self.suggestions:
            return []

        packed = self.suggestions[prefix]
        count = len(packed) // 8  # 4 + 4 bytes per entry

        result = []
        for i in range(count):
            offset = i * 8
            term_id, score = struct.unpack('If', packed[offset:offset + 8])
            result.append((self.get_term(term_id), score))

        return result

    def memory_usage_bytes(self) -> Dict[str, int]:
        """Report memory usage breakdown."""
        return {
            'string_pool': len(self.string_pool),
            'term_offsets': len(self.term_offsets) * 4,
            'term_lengths': len(self.term_lengths) * 2,
            'suggestions': sum(len(v) for v in self.suggestions.values()),
            'prefix_keys': sum(len(k) for k in self.suggestions.keys()),
        }


class SIMDAwareSearch:
    """
    Demonstrate SIMD-aware data layout for fast comparison.

    Note: Python doesn't directly support SIMD. In production,
    use NumPy (which uses SIMD internally) or write critical
    paths in Rust/C++.
    """

    def __init__(self):
        # Store first 16 bytes of each term for fast filtering
        # Aligns with SSE 128-bit registers
        self.prefixes_16: List[bytes] = []
        self.full_terms: List[str] = []

    def add_term(self, term: str) -> None:
        """Add term with SIMD-friendly prefix extraction."""
        # Pad to exactly 16 bytes
        prefix = term[:16].encode('utf-8').ljust(16, b'\x00')
        self.prefixes_16.append(prefix)
        self.full_terms.append(term)

    def filter_by_prefix(self, query_prefix: str) -> List[int]:
        """
        Fast filter using fixed-size prefix comparison.

        In native code, this would use _mm_cmpeq_epi8 and
        _mm_movemask_epi8 to compare 16 bytes in parallel.
        """
        query_bytes = query_prefix.encode('utf-8')
        query_len = len(query_bytes)

        matches = []
        for i, prefix in enumerate(self.prefixes_16):
            # Compare first query_len bytes
            if prefix[:query_len] == query_bytes:
                matches.append(i)

        return matches
```

These micro-optimizations matter at extreme scale but add complexity. Profile your system to identify actual bottlenecks. Often, a simple Python implementation with good caching outperforms a complex optimized version with poor cache strategy.
Performance optimization is an ongoing process. Without proper monitoring, regressions go unnoticed until users complain. Effective monitoring reveals where time is spent and guides optimization efforts.
Key Metrics to Track
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 latency | <15ms | >25ms |
| P99 latency | <50ms | >100ms |
| Cache hit rate | >70% | <50% |
| Error rate | <0.01% | >0.1% |
| Timeout rate | <0.1% | >1% |
| QPS per server | <80% of capacity | >90% of capacity |
```python
import statistics
import time
from collections import defaultdict
from typing import Dict, List, Tuple


class PerformanceMonitor:
    """
    Track and analyze autocomplete performance.
    """

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds

        # Latency samples: component -> list of (timestamp, latency_ms)
        self.latency_samples: Dict[str, List[Tuple[float, float]]] = defaultdict(list)

        # Counters
        self.request_count = 0
        self.cache_hits = 0
        self.cache_misses = 0
        self.errors = 0
        self.timeouts = 0

    def record_latency(self, component: str, latency_ms: float) -> None:
        """Record latency for a component."""
        now = time.time()
        self.latency_samples[component].append((now, latency_ms))
        self._prune_old_samples(component, now)

    def record_request(self, cache_hit: bool, error: bool = False,
                       timeout: bool = False) -> None:
        """Record request outcome."""
        self.request_count += 1
        if cache_hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
        if error:
            self.errors += 1
        if timeout:
            self.timeouts += 1

    def _prune_old_samples(self, component: str, now: float) -> None:
        """Remove samples older than window."""
        cutoff = now - self.window_seconds
        self.latency_samples[component] = [
            (ts, lat) for ts, lat in self.latency_samples[component]
            if ts >= cutoff
        ]

    def get_percentile(self, component: str, percentile: float) -> float:
        """Get latency percentile for component."""
        samples = [lat for _, lat in self.latency_samples[component]]
        if not samples:
            return 0.0
        samples.sort()
        idx = int(len(samples) * percentile / 100)
        return samples[min(idx, len(samples) - 1)]

    def get_stats(self, component: str) -> Dict[str, float]:
        """Get comprehensive stats for component."""
        samples = [lat for _, lat in self.latency_samples[component]]
        if not samples:
            return {'count': 0}

        return {
            'count': len(samples),
            'mean': statistics.mean(samples),
            'median': statistics.median(samples),
            'p95': self.get_percentile(component, 95),
            'p99': self.get_percentile(component, 99),
            'min': min(samples),
            'max': max(samples),
        }

    def get_cache_hit_rate(self) -> float:
        """Calculate cache hit rate."""
        total = self.cache_hits + self.cache_misses
        if total == 0:
            return 0.0
        return self.cache_hits / total

    def get_error_rate(self) -> float:
        """Calculate error rate."""
        if self.request_count == 0:
            return 0.0
        return self.errors / self.request_count

    def should_alert(self, component: str) -> bool:
        """Check if component is performing poorly."""
        stats = self.get_stats(component)
        if stats.get('count', 0) < 100:
            return False  # Not enough data

        # Alert thresholds
        if stats.get('p99', 0) > 100:
            return True  # P99 > 100ms
        if stats.get('median', 0) > 25:
            return True  # P50 > 25ms

        return False


class LatencyBreakdown:
    """
    Trace latency breakdown for a single request.
    Useful for debugging slow requests.
    """

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.start_time = time.perf_counter()
        self.checkpoints: List[Tuple[str, float]] = []

    def checkpoint(self, name: str) -> None:
        """Record a checkpoint."""
        elapsed = (time.perf_counter() - self.start_time) * 1000
        self.checkpoints.append((name, elapsed))

    def finish(self) -> Dict[str, float]:
        """Complete trace and return breakdown."""
        total = (time.perf_counter() - self.start_time) * 1000

        result = {'total_ms': total}
        prev = 0.0
        for name, elapsed in self.checkpoints:
            result[f'{name}_ms'] = elapsed - prev
            prev = elapsed

        return result
```

Average latency hides tail latency problems. A system with P50 = 10ms and P99 = 500ms has serious issues that don't show up in the average (~15ms). Always track percentiles, especially P95 and P99.
Performance isn't an afterthought—it's a core feature of autocomplete. Users don't consciously appreciate fast autocomplete, but they definitely notice when it's slow. Every optimization technique covered in this page contributes to the invisible excellence users expect.
Module Complete: Typeahead and Autocomplete Mastery
You've completed the comprehensive journey through autocomplete system design.
You now have the knowledge to design, implement, and optimize autocomplete systems that serve billions of users with sub-50ms latency.
Congratulations! From data structures through ranking to performance optimization, you can now build typeahead experiences that feel instant and intelligent, the standard users of modern applications expect.