Every time your system needs data, it asks a simple question: Is this data already in the cache? The answer—yes or no—determines whether you serve the request in microseconds or milliseconds. This binary outcome, multiplied across millions of requests, defines whether your system feels instantaneous or sluggish, whether your infrastructure costs are reasonable or astronomical.
Understanding the difference between cache hits and misses, and more importantly, understanding why hits occur and how to maximize them, is fundamental to building high-performance systems. In this page, we'll dissect the mechanics of cache access, explore the critical concept of hit rate, and examine how hit rate impacts every aspect of system behavior.
By the end of this page, you will understand the precise mechanics of cache hits and misses, how to calculate and interpret hit rates, how hit rate affects latency distribution and backend load, and the subtle factors that influence whether your cache is performing well or poorly.
A cache hit occurs when requested data is found in the cache and can be served directly without accessing the slower backend data source. This is the happy path—the scenario you want to optimize for.
The cache hit sequence:
1. The application requests data by key.
2. The cache performs a lookup, typically a hash-table operation.
3. The entry is found and is still valid (not expired or invalidated).
4. The cached value is returned directly to the application; the backend is never touched.
The entire sequence typically completes in well under a millisecond: nanoseconds to microseconds for an in-process cache, sub-millisecond to a few milliseconds for a network-accessible one.
Why cache hits are fast:
Cache hits are fast for several reinforcing reasons:
- Memory, not disk: caches hold data in RAM, avoiding disk I/O entirely.
- No query execution: a hit skips query parsing, planning, and execution on the backend; the answer was computed once and stored.
- Proximity: an in-process cache avoids the network altogether, and even a networked cache is usually a single fast round-trip away.
- Precomputed form: entries often store the final, serialized result, so nothing needs to be recomputed.
The hit's contribution to perceived performance:
Users experience the average response time of your system, which is heavily weighted by cache hits. If 95% of requests hit the cache (0.5ms each) and 5% miss (100ms each), your average response time is:
0.95 × 0.5ms + 0.05 × 100ms = 0.475ms + 5ms = 5.475ms
The 5% of misses dominate the average! This is why even small improvements in hit rate can dramatically improve perceived performance.
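The arithmetic above generalizes to a simple weighted average. A minimal sketch (the function name is illustrative; the latency figures match the example):

```python
def average_latency(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected response time as a hit-rate-weighted average of the two paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# The example from the text: 95% hits at 0.5ms, 5% misses at 100ms.
print(average_latency(0.95, 0.5, 100))  # 5.475

# Small hit-rate gains shave the average dramatically at the high end.
for h in (0.95, 0.97, 0.99):
    print(f"{h:.0%}: {average_latency(h, 0.5, 100):.3f} ms")
```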
A 'warm' cache is one that has been populated with data through normal operation. A 'cold' cache (just started, just cleared) has no data and will miss on everything. System performance immediately after a cache restart can be significantly worse until the cache warms up—a phenomenon called the 'cold start problem.'
A cache miss occurs when requested data is not found in the cache (or is found but stale/invalid). The application must then fetch data from the slower backend source. This is the expensive path that caching aims to minimize.
The cache miss sequence:
1. The application requests data by key.
2. The cache lookup finds nothing, or finds an entry that is expired or invalidated.
3. The application fetches the data from the slower backend (database, service call, computation).
4. The result is written into the cache so future requests can hit.
5. The value is returned to the application.
This sequence takes significantly longer—typically 10-1000x longer than a cache hit, depending on the backend.
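To make the two paths concrete, here is a minimal cache-aside sketch; the `fetch_from_backend` stub, the TTL value, and the in-memory dict stand in for whatever backend and cache store you actually use:

```python
import time

cache: dict = {}     # key -> (value, expires_at); stands in for a real cache store
TTL_SECONDS = 60

def fetch_from_backend(key):
    # Placeholder for the slow path: a database query, RPC, or computation.
    time.sleep(0.1)  # simulate ~100ms of backend latency
    return f"value-for-{key}"

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value                 # cache hit: served without the backend
        del cache[key]                   # expiration miss: entry exists but is stale
    value = fetch_from_backend(key)      # miss path: pay the backend cost
    cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value
```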
Types of cache misses:
Not all misses are equal. Understanding miss types helps diagnose cache behavior:
1. Compulsory Miss (Cold Miss)
The very first request for a piece of data. No cache can avoid this—the data has never been accessed before. Compulsory misses are inevitable; they represent the cost of populating an empty cache.
2. Capacity Miss
The cache is full and had to evict this entry to make room for other data. The data was previously cached but has been removed. Indicates the cache is undersized for the working set.
3. Conflict Miss
In caches with limited associativity (common in CPU caches), data may be evicted even when the cache isn't full because of hash collisions. Less common in application-level caches.
4. Coherence/Invalidation Miss
The entry was explicitly invalidated due to a data update. The data existed but was removed to maintain correctness. This is an intentional miss.
5. Expiration Miss
The entry's TTL expired. The data was cached but is now considered stale. Indicates TTL may be too short, or this is working as designed.
When debugging poor cache performance, classify your misses. High compulsory misses on startup are normal. Persistent capacity misses indicate you need a larger cache. Frequent invalidation misses suggest overly aggressive invalidation. The fix depends on the miss type.
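One way to act on this advice is to count misses by type as they occur. A hedged sketch building on the dict-based cache above (the "capacity or invalidation" bucket is approximate, since separating those two requires eviction metadata this toy cache doesn't keep):

```python
import time
from collections import Counter

cache: dict = {}           # key -> (value, expires_at)
ever_cached: set = set()   # keys that have been stored at least once
miss_types = Counter()

def classify_and_get(key, ttl=60.0, loader=lambda k: f"value-for-{k}"):
    now = time.monotonic()
    entry = cache.get(key)
    if entry is not None and now < entry[1]:
        return entry[0]                                  # hit: nothing to classify
    if entry is not None:
        miss_types["expiration"] += 1                    # cached, but TTL elapsed
    elif key in ever_cached:
        miss_types["capacity_or_invalidation"] += 1      # was cached, then removed
    else:
        miss_types["compulsory"] += 1                    # first-ever access
    value = loader(key)
    cache[key] = (value, now + ttl)
    ever_cached.add(key)
    return value
```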
The cache hit rate is the most important metric for evaluating cache effectiveness. It measures the percentage of requests that are served from the cache without accessing the backend.
Hit Rate Formula:
Hit Rate = (Cache Hits) / (Cache Hits + Cache Misses) × 100%
Alternatively stated:
Hit Rate = (Cache Hits) / (Total Requests) × 100%
A hit rate of 95% means 95 out of every 100 requests are served from cache. Only 5 requests reach the backend.
Miss Rate:
The complement of hit rate is miss rate:
Miss Rate = 100% - Hit Rate
Miss Rate = (Cache Misses) / (Total Requests) × 100%
At 95% hit rate, miss rate is 5%.
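In code, the calculation is one line; guarding against a zero denominator is the only subtlety worth noting (the counter values here are made up):

```python
def hit_rate(hits: int, misses: int) -> float:
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

hits, misses = 95_000, 5_000
print(f"hit rate:  {hit_rate(hits, misses):.1f}%")        # 95.0%
print(f"miss rate: {100 - hit_rate(hits, misses):.1f}%")  # 5.0%
```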
| Hit Rate | Miss Rate | Assessment | Typical Scenario |
|---|---|---|---|
| < 50% | > 50% | Poor - Cache likely misconfigured | Wrong data being cached, TTL too short |
| 50-80% | 20-50% | Moderate - Room for improvement | Diverse workload, undersized cache |
| 80-90% | 10-20% | Good - Effective caching | Well-configured general workload |
| 90-95% | 5-10% | Very Good - Optimized | Hot data well-cached |
| 95-99% | 1-5% | Excellent - High-performance | Highly cacheable workload |
| > 99% | < 1% | Exceptional - Rare to achieve | Static/infrequently changing data |
Interpreting hit rates:
Hit rate interpretation depends heavily on context:
- Cost of a miss: a 70% hit rate in front of a 10ms database query is a different situation than 70% in front of a multi-second computation.
- Workload cacheability: read-heavy, slowly changing data supports far higher hit rates than write-heavy or highly personalized data.
- Data volatility: frequently updated data forces invalidations and short TTLs, capping the achievable hit rate.
- Absolute traffic: at very high request volumes, even a small miss rate is a large absolute backend load.
Hit rate over time:
Hit rates aren't static. They change based on:
- Traffic patterns: daily and weekly cycles shift which data is hot.
- Deployments and restarts: a cold cache depresses hit rate until it warms up.
- Invalidation bursts: bulk data updates temporarily evict large swaths of entries.
- Working set drift: new content, products, or users shift access toward not-yet-cached keys.
Hit rates below 50% usually indicate a fundamental problem: caching the wrong data, TTLs too short, cache too small, or key design issues causing unnecessary uniqueness. A low hit rate means you're paying for cache infrastructure without getting corresponding benefit.
One of the most important, and counterintuitive, properties of hit rate is that its impact on system performance is non-linear. The same percentage-point improvement matters far more at an already-high hit rate than at a low one.
The backend load calculation:
If your system receives R requests per second and your cache hit rate is H, the requests reaching your backend are:
Backend Load = R × (1 - H)
Consider 100,000 requests per second:
| Hit Rate | Miss Rate | Backend QPS | Improvement vs. Previous Row |
|---|---|---|---|
| 80% | 20% | 20,000 | — |
| 85% | 15% | 15,000 | 5,000 fewer QPS (25% reduction) |
| 90% | 10% | 10,000 | 5,000 fewer QPS (33% reduction) |
| 95% | 5% | 5,000 | 5,000 fewer QPS (50% reduction) |
| 99% | 1% | 1,000 | 4,000 fewer QPS (80% reduction) |
Notice the pattern: moving from 90% to 95% hit rate halves your backend load. Moving from 95% to 99% cuts backend load by 80%. At high hit rates, each percentage point improvement has outsized impact.
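The table above can be reproduced directly from the formula; a small sketch:

```python
REQUESTS_PER_SECOND = 100_000

def backend_qps(hit_rate: float) -> float:
    # Backend Load = R * (1 - H)
    return REQUESTS_PER_SECOND * (1 - hit_rate)

prev = None
for h in (0.80, 0.85, 0.90, 0.95, 0.99):
    qps = backend_qps(h)
    if prev is None:
        print(f"{h:.0%} hit rate -> {qps:,.0f} backend QPS")
    else:
        print(f"{h:.0%} hit rate -> {qps:,.0f} backend QPS "
              f"({(prev - qps) / prev:.0%} reduction from previous row)")
    prev = qps
```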
The latency distribution effect:
Hit rate also determines your latency distribution. Let's model a system where cache hits complete in 1ms and cache misses take 100ms:
| Hit Rate | Average Latency | p50 Latency | p99 Latency |
|---|---|---|---|
| 50% | 50.5ms | ~50ms (mix) | 100ms |
| 80% | 20.8ms | 1ms | 100ms |
| 90% | 10.9ms | 1ms | 100ms |
| 95% | 5.95ms | 1ms | 100ms |
| 99% | 1.99ms | 1ms | 100ms |
Important observation: Your p50 (median) latency improves dramatically with hit rate, but your p99 latency remains at the miss latency until you achieve extremely high hit rates. This is why tail latency is often dominated by cache misses even in well-optimized systems.
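A quick Monte Carlo simulation makes the percentile behavior visible; it assumes only the same two-point latency model as the table (1ms hits, 100ms misses):

```python
import random
import statistics

HIT_MS, MISS_MS = 1.0, 100.0   # the model from the table above

def simulate(hit_rate, n=100_000):
    samples = sorted(HIT_MS if random.random() < hit_rate else MISS_MS
                     for _ in range(n))
    return statistics.fmean(samples), samples[n // 2], samples[int(n * 0.99)]

for h in (0.50, 0.90, 0.95, 0.99):
    avg, p50, p99 = simulate(h)
    print(f"{h:.0%} hit rate -> avg {avg:6.2f} ms, p50 {p50:.0f} ms, p99 {p99:.0f} ms")
```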
Capacity planning implications:
Because hit rate determines backend load, capacity planning must account for cache behavior:
- Provision the backend for realistic worst cases (a cold cache after a deploy or failover), not just the steady-state hit rate.
- Treat the cache tier as load-bearing infrastructure: if a cache serving a 99% hit rate fails, the backend abruptly sees 100x its normal traffic.
- Model growth in terms of misses rather than total requests, since backend load scales with R × (1 - H).
At 99% hit rate, your backend sees 1% of traffic. At 99.9% hit rate, it sees only 0.1%—a 10x reduction. This is why highly cacheable workloads (CDN, static content) can serve millions of requests per second with minimal origin infrastructure. Every nine you add multiplies the benefit.
Achieving high hit rates requires understanding and optimizing the factors that influence whether requests hit or miss. These factors are often interconnected and must be balanced against each other.
The working set concept:
The working set is the subset of data that is actively accessed during a given time window. Understanding your working set is crucial for cache sizing: if the cache can hold the working set, hit rates stay high; if it can't, capacity misses dominate.
Measuring working set:
Estimate working set by analyzing:
- The number of unique keys accessed per time window (hour, day, week).
- The average entry size.
- Their product, which gives the memory needed to hold that window's working set.
Example analysis:
- 1 hour: 50,000 unique keys accessed
- 1 day: 200,000 unique keys accessed
- 1 week: 500,000 unique keys accessed
- Average entry size: 2 KB
1-hour working set: ~100 MB
1-day working set: ~400 MB
1-week working set: ~1 GB
If your cache is 512 MB with a 1-hour TTL, the 1-hour working set (~100 MB) fits, but any key re-accessed on daily or weekly timescales will already have expired or been evicted: the configuration is undersized for the broader workload.
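The sizing check is simple multiplication; a sketch using the numbers from the example (decimal units, matching the text's arithmetic):

```python
AVG_ENTRY_BYTES = 2_000          # ~2 KB average entry, as in the example
CACHE_BYTES = 512_000_000        # a 512 MB cache

windows = {"1 hour": 50_000, "1 day": 200_000, "1 week": 500_000}

for window, unique_keys in windows.items():
    need = unique_keys * AVG_ENTRY_BYTES
    verdict = "fits" if need <= CACHE_BYTES else "does NOT fit"
    print(f"{window}: ~{need / 1e6:,.0f} MB working set ({verdict})")
```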
Real workloads typically follow Pareto-like distributions: a small percentage of data receives most of the traffic. This is good news for caching—a cache that can hold just the 'hot' data can achieve high hit rates. Analyze your access patterns; they're probably more cacheable than random distributions would suggest.
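You can check how much a skewed distribution helps with a short simulation. The sketch below generates a Zipf-like access stream and measures the hit rate of an idealized cache that pins only the hottest 1% of keys; all parameters are illustrative:

```python
import random
from collections import Counter

N_KEYS, N_REQUESTS, CACHE_SLOTS = 100_000, 1_000_000, 1_000  # cache holds 1% of keys

# Zipf-like popularity: the key at rank r is accessed with weight 1/r.
weights = [1 / rank for rank in range(1, N_KEYS + 1)]
stream = random.choices(range(N_KEYS), weights=weights, k=N_REQUESTS)

# Idealized cache: permanently holds the CACHE_SLOTS most popular keys.
hot = {key for key, _ in Counter(stream).most_common(CACHE_SLOTS)}
hits = sum(1 for key in stream if key in hot)
print(f"hit rate caching 1% of keys: {hits / N_REQUESTS:.1%}")  # well over half
```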
One of the most dangerous cache-related failure modes is the cache miss storm, also known as the thundering herd problem. This occurs when a large number of requests simultaneously miss the cache and all hit the backend, potentially overwhelming it.
How thundering herds form:
Imagine a popular cached entry—say, the homepage data—that 1,000 requests per second access. At exactly the moment this entry expires:
1. Every in-flight request for that key misses simultaneously.
2. Each of them independently queries the backend for the same data.
3. The backend absorbs a sudden burst of identical, expensive queries.
4. Each response is written back to the cache, almost all of them redundantly.
5. If the backend slows under the burst, requests queue up and the spike compounds.
Scenarios that trigger thundering herds:
- Hot-key expiration: a heavily accessed entry's TTL lapses during peak traffic.
- Cold cache: a restart, flush, or failover empties the cache and everything misses at once.
- Mass invalidation: a deploy or bulk data update invalidates many entries together.
- Synchronized TTLs: entries created at the same time expire at the same time.
Mitigation strategies:
- Request coalescing (single-flight): let one request fetch from the backend while concurrent requests for the same key wait for its result (sketched below).
- Stale-while-revalidate: serve the expired value while a single background refresh runs.
- TTL jitter: add randomness to expiration times so hot entries don't expire in lockstep.
- Probabilistic early refresh: refresh entries slightly before expiry so the TTL never lapses under load.
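A minimal thread-based single-flight sketch, assuming a threaded server; the class and method names are illustrative, and error propagation to waiting requests is omitted for brevity:

```python
import threading

class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (Event, result-holder dict)

    def load(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, result = entry
        if leader:
            try:
                result["value"] = loader(key)   # only this thread hits the backend
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                     # release all waiting followers
        else:
            event.wait()                        # followers block instead of querying
        return result["value"]
```

With this in place, 1,000 simultaneous misses on the same key become one backend query and 999 short waits.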
Thundering herds are often discovered in production during high-traffic periods. A site that works fine at moderate load suddenly crashes when a popular cache entry expires during peak traffic. Implement request coalescing proactively—don't wait for the outage.
You can't optimize what you don't measure. Effective cache management requires continuous monitoring of hit rates and related metrics. Let's explore how to instrument and interpret cache metrics.
```
# Cache metrics exposed by application
# Example Prometheus metric exposition

# Counter: Total cache hits
cache_hits_total{cache="product_cache"} 15234567

# Counter: Total cache misses
cache_misses_total{cache="product_cache"} 892341

# Gauge: Current number of cached entries
cache_entries{cache="product_cache"} 48234

# Gauge: Memory usage in bytes
cache_memory_bytes{cache="product_cache"} 268435456

# Counter: Evictions by reason
cache_evictions_total{cache="product_cache",reason="capacity"} 12456
cache_evictions_total{cache="product_cache",reason="expired"} 34521

# Histogram: Cache operation latency
cache_operation_duration_seconds_bucket{cache="product_cache",op="get",le="0.001"} 14000000
cache_operation_duration_seconds_bucket{cache="product_cache",op="get",le="0.01"} 15500000

# Derived metric (PromQL query)
# Hit rate = hits / (hits + misses)
# rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
```

Alerting on cache metrics:
Set up alerts for cache health:
- Hit rate dropping below its normal baseline (sustained, not momentary dips).
- Eviction rate spiking, especially capacity evictions, which signal an undersized cache.
- Memory usage approaching the configured limit.
- Cache operation latency climbing, which can indicate an overloaded cache tier.
- Backend QPS rising without a corresponding traffic increase, the downstream symptom of a falling hit rate.
A sketch of the hit-rate check follows.
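A minimal sketch of the hit-rate alert logic, computing a windowed rate from two samples of the monotonically increasing counters (the same idea as the PromQL `rate()` expression above); the threshold and sample values are assumptions:

```python
ALERT_THRESHOLD = 0.90   # assumed SLO: alert if the 5-minute hit rate falls below 90%

def windowed_hit_rate(hits_now, hits_then, misses_now, misses_then):
    """Hit rate over a window, from two samples of ever-increasing counters."""
    dh = hits_now - hits_then
    dm = misses_now - misses_then
    return dh / (dh + dm) if (dh + dm) else 1.0

# Counter samples taken five minutes apart (hypothetical values).
rate = windowed_hit_rate(15_234_567, 15_100_000, 892_341, 875_000)
if rate < ALERT_THRESHOLD:
    print(f"ALERT: cache hit rate dropped to {rate:.1%}")   # fires at ~88.6%
```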
Dashboards and visualization:
Effective cache dashboards show:
- Hit rate over time, per cache and per namespace.
- Backend QPS plotted alongside miss rate, since the two should move together.
- Evictions broken down by reason (capacity versus expiration).
- Memory usage against the configured limit.
- Cache operation latency percentiles.
Track hit rates per cache 'namespace' or data type. Aggregate hit rate can hide problems: 99% hit rate on session data might mask 40% hit rate on product data. Segmented metrics reveal which caches need attention.
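A minimal sketch of segmented tracking; the namespaces and counts are illustrative and reproduce the session/product example above:

```python
from collections import defaultdict

class SegmentedStats:
    """Track hit rate per cache namespace so one segment can't mask another."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, namespace, hit):
        (self.hits if hit else self.misses)[namespace] += 1

    def report(self):
        for ns in sorted(set(self.hits) | set(self.misses)):
            h, m = self.hits[ns], self.misses[ns]
            print(f"{ns:>8}: {h / (h + m):6.1%} hit rate over {h + m:,} requests")

stats = SegmentedStats()
for _ in range(99):
    stats.record("session", True)
stats.record("session", False)
for _ in range(40):
    stats.record("product", True)
for _ in range(60):
    stats.record("product", False)
stats.report()   # session 99.0%, product 40.0%; the 69.5% aggregate hides the gap
```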
Understanding the mechanics of cache hits and misses is fundamental to building and operating cached systems effectively. Let's consolidate the key insights:
- A hit is served in microseconds to low milliseconds; a miss costs 10-1000x more, so misses dominate average latency even at high hit rates.
- Misses come in distinct types (compulsory, capacity, conflict, invalidation, expiration), and the right fix depends on the type.
- Hit rate is hits divided by total requests, and its impact is non-linear: each additional 'nine' multiplies the reduction in backend load.
- p50 latency improves quickly with hit rate, but p99 stays pinned at the miss latency until hit rates are extremely high.
- Working set size relative to cache size largely determines the achievable hit rate, and skewed access patterns work in your favor.
- Thundering herds demand proactive mitigation: request coalescing, TTL jitter, stale-while-revalidate.
- Measure continuously, and segment hit-rate metrics by namespace so healthy caches don't hide unhealthy ones.
What's next:
Now that we understand how cache hits and misses work and their impact on systems, the next page explores the performance improvement potential of caching. We'll quantify just how much faster and more scalable systems become with effective caching, and explore the mathematical models that help us predict cache behavior.
You now understand the fundamental mechanics of cache hits and misses, how to measure hit rates, and the factors that influence cache effectiveness. This knowledge is essential for designing, operating, and troubleshooting cached systems.