In the hierarchy of distributed systems components, caching occupies a unique position: it is simultaneously one of the simplest concepts to explain and one of the most challenging to implement correctly at scale. The premise is deceptively simple—store frequently accessed data closer to the application to avoid repeated expensive operations. Yet the engineering reality of building a cache that serves millions of requests per second with sub-millisecond latency, maintains consistency across distributed nodes, handles failures gracefully, and scales horizontally while remaining cost-effective is anything but simple.
Every major technology company—Google, Facebook, Amazon, Netflix, Twitter—has invested heavily in building or customizing distributed caching infrastructure. These systems sit at the critical path of nearly every user interaction, and their performance characteristics directly impact user experience, system throughput, and infrastructure costs. A well-designed distributed cache can reduce database load by 99%, cut response times from hundreds of milliseconds to single-digit milliseconds, and enable applications to scale horizontally without proportional increases in backend infrastructure.
Before we can design such a system, we must establish clear requirements. In system design interviews and real-world engineering alike, the requirements phase isn't merely a formality—it's the foundation upon which all architectural decisions rest. A cache designed for session storage has fundamentally different requirements than one optimized for API response caching or content delivery. This page establishes the complete requirements framework for building a production-grade distributed caching system.
By the end of this page, you will understand the complete functional and non-functional requirements for a distributed caching system, be able to perform capacity estimation for cache infrastructure, identify the key architectural constraints that shape cache design decisions, and recognize the trade-offs inherent in different caching approaches.
To properly scope our distributed cache design, we must first understand why caching is necessary and what problems it solves. Caching exists because of a fundamental asymmetry in computing: some operations are expensive, and their results are needed repeatedly.
The Latency Hierarchy
Modern systems exhibit dramatic differences in access latency across storage tiers. Using commonly cited, order-of-magnitude figures: an L1 cache reference takes about 0.5 ns, a main-memory reference about 100 ns, an SSD random read on the order of 100 μs, a round trip within a datacenter about 500 μs, and a cross-region network round trip roughly 150 ms.
Notice the orders of magnitude between tiers. A single cross-region network round-trip is roughly 300 million times slower than an L1 cache hit. This latency hierarchy explains why caching at every layer of the stack is essential for performance-critical systems.
What Makes an Operation Cache-Worthy?
Not all data benefits equally from caching. The ideal cache candidates are expensive to produce (a complex database query or a remote service call), read far more often than they change, and tolerant of brief staleness.
The Caching Paradox
Caching introduces a fundamental tension: the data closest to the user is also the most likely to be stale. This paradox shapes every distributed cache design. We must balance proximity (for latency) against freshness (for correctness), and different applications strike this balance differently.
A news feed can tolerate seconds of staleness without user impact. A stock trading platform cannot tolerate even milliseconds. A collaborative editing tool like Google Docs requires near-real-time consistency. Understanding where your application falls on this spectrum is the first requirement decision.
The most important metric for any cache is the hit rate—the percentage of requests served from cache rather than the origin. A cache with a 99% hit rate reduces origin load by 100x; a cache with a 50% hit rate reduces it by only 2x. The leverage compounds near the top: improving the hit rate from 98% to 99% halves the remaining origin traffic. This is why cache optimization focuses so intensely on marginal hit rate improvements.
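A quick back-of-the-envelope calculation, using an illustrative request rate, makes this leverage concrete:

```python
# Origin traffic as a function of cache hit rate (illustrative numbers)
total_rps = 100_000  # assumed total request rate

for hit_rate in (0.50, 0.90, 0.98, 0.99):
    origin_rps = total_rps * (1 - hit_rate)
    reduction = total_rps / origin_rps
    print(f"hit rate {hit_rate:.0%}: {origin_rps:>7,.0f} RPS to origin ({reduction:.0f}x reduction)")

# Going from a 98% to a 99% hit rate halves origin traffic:
# 2,000 RPS -> 1,000 RPS, even though the hit rate improved by only one point.
```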
Functional requirements define what the system must do. For a distributed cache, these requirements encompass the core operations, data model, and features that clients expect.
Core Cache Operations
Every distributed cache must support three fundamental operations:
| Operation | Description | Semantics | Expected Latency |
|---|---|---|---|
| GET(key) | Retrieve value associated with key | Returns value if exists, null/miss otherwise | < 1ms p99 |
| SET(key, value, ttl) | Store key-value pair with optional TTL | Overwrites existing; atomic operation | < 1ms p99 |
| DELETE(key) | Remove key from cache | Idempotent; succeeds even if key absent | < 1ms p99 |
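As a sketch of the contract these operations imply, here is a minimal client-side interface; the names and types are illustrative rather than taken from any particular cache library:

```python
from typing import Optional, Protocol


class CacheClient(Protocol):
    """Minimal contract for the three core cache operations (illustrative sketch)."""

    def get(self, key: str) -> Optional[bytes]:
        """Return the value stored under key, or None on a cache miss."""
        ...

    def set(self, key: str, value: bytes, ttl_seconds: Optional[int] = None) -> None:
        """Store value under key, overwriting any existing entry; ttl_seconds is optional."""
        ...

    def delete(self, key: str) -> None:
        """Remove key if present; idempotent, so deleting an absent key is not an error."""
        ...
```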
Extended Operations
Production caches typically require additional operations beyond the basic three, such as batch retrieval (multi-get), atomic increment and decrement for counters, compare-and-swap for optimistic concurrency control, and updating the TTL of an existing key.
Data Model Requirements
The data model defines what types of values the cache can store:
Key Constraints: Keys are typically strings with maximum length limits (e.g., 250 bytes for Memcached; up to 512 MB for Redis, though practical limits are far lower). Keys should be kept short, since every key is held in memory alongside its value.
Value Types: At minimum, support opaque byte arrays. Advanced caches support structured types (strings, integers, lists, sets, sorted sets, hashes, HyperLogLog, streams) with type-specific operations.
Maximum Value Size: Define limits for individual values (e.g., 1 MB default for Memcached). Large values impact network transfer time and memory fragmentation.
Total Key Space: The system should support billions of keys across the cluster without degradation.
TTL and Expiration
Time-based expiration is fundamental to cache operation: every write should accept an optional per-key TTL, expired entries must never be returned to clients, and reclamation can happen lazily (on access) or proactively (via background sweeps).
A key functional decision is whether the cache operates independently (cache-aside pattern where clients manage both cache and database) or integrates with the storage layer (write-through where cache writes propagate to database, or write-behind where writes are buffered and batched). Most distributed caches implement cache-aside, leaving synchronization to the application layer.
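A minimal sketch of the cache-aside pattern described above, assuming a cache client with the GET/SET/DELETE operations from the table earlier and a hypothetical `db` object standing in for the origin database:

```python
import json

TTL_SECONDS = 300  # assumed freshness budget for this data


def get_user_profile(cache, db, user_id: int) -> dict:
    """Cache-aside read: check the cache, fall back to the database, then populate the cache."""
    key = f"user:profile:{user_id}"

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database work

    profile = db.load_user_profile(user_id)  # cache miss: read from the origin
    cache.set(key, json.dumps(profile).encode(), ttl_seconds=TTL_SECONDS)  # populate for later readers
    return profile


def update_user_profile(cache, db, user_id: int, profile: dict) -> None:
    """Write path: update the origin first, then invalidate (rather than update) the cached entry."""
    db.save_user_profile(user_id, profile)
    cache.delete(f"user:profile:{user_id}")  # the next read repopulates with fresh data
```

Deleting on write instead of updating keeps the write path simple and lets the read path own cache population.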
Non-functional requirements define how well the system must perform. For distributed caches, these requirements are often more constraining than functional requirements because caching is fundamentally a performance optimization—a slow cache defeats its own purpose.
Latency Requirements
Latency is the defining characteristic of a cache. If the cache is slow, applications will bypass it and hit the origin directly.
| Metric | Target | Rationale |
|---|---|---|
| p50 (median) GET latency | < 200 μs | Typical case should be imperceptible |
| p99 GET latency | < 1 ms | Tail latency affects user experience |
| p99.9 GET latency | < 5 ms | Extreme tail should remain reasonable |
| SET latency | < 1 ms p99 | Writes should not block application flow |
| Network overhead | Minimal | Cache should be co-located with compute |
Throughput Requirements
A distributed cache must handle massive request volumes. The largest production deployments, such as Facebook's memcached fleet, serve billions of requests per second across their clusters.
Our target system should support roughly 500,000 requests per second at peak across the cluster, consistent with the capacity estimation later on this page (~463,000 peak RPS), while preserving sub-millisecond latency and leaving room to grow by adding nodes.
Availability Requirements
Caches sit on the critical path of nearly every request. When the cache is unavailable, applications experience degraded performance (if they can fall back to origin) or complete outages (if the origin cannot handle the load).
| Requirement | Target | Implication |
|---|---|---|
| Availability SLA | 99.99% (four 9s) | < 53 minutes downtime per year |
| Mean Time To Recovery | < 30 seconds | Automatic failover for node failures |
| Data Durability | Not guaranteed | Cache is ephemeral; loss is acceptable |
| Graceful Degradation | Required | Partial cluster should serve partial traffic |
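The availability target translates directly into a downtime budget; a quick check of the four-nines figure:

```python
# Downtime budget implied by a 99.99% availability SLA
minutes_per_year = 365 * 24 * 60                     # 525,600 minutes
allowed_downtime = minutes_per_year * (1 - 0.9999)   # unavailability fraction

print(f"Allowed downtime per year:  {allowed_downtime:.1f} minutes")       # ~52.6 minutes
print(f"Allowed downtime per month: {allowed_downtime / 12:.1f} minutes")  # ~4.4 minutes
```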
Scalability Requirements
The cache must scale with both data volume and request rate: adding nodes should increase memory capacity and throughput near-linearly, and scaling operations (adding or removing nodes) must not require downtime or a full cache flush.
Cache unavailability creates a paradox: when the cache fails, all requests hit the origin simultaneously. If 1,000 concurrent users were being served by cache, suddenly 1,000 requests hit the database—likely crashing it. This 'thundering herd' or 'cache stampede' makes cache availability even more critical than database availability in some architectures.
Before designing the architecture, we must estimate the scale of our distributed cache. These calculations inform decisions about cluster size, memory allocation, network capacity, and cost.
Reference Architecture Assumptions
Let's design for a large-scale social media or e-commerce platform with the following assumptions:
- 100 million daily active users, each generating about 100 cache requests per day (10 billion requests/day)
- 1 billion unique cached objects, averaging 1 KB per value plus roughly 100 bytes per key
- A target cache hit rate of 95%
- Peak traffic roughly 4x the daily average
Request Volume Estimation
```python
# Daily request volume
daily_requests = 100_000_000 * 100  # 10 billion requests/day

# Requests per second (assuming uniform distribution, which understates peak)
rps_average = daily_requests / 86400  # ~115,740 RPS

# Peak RPS (typically 3-5x average for social media)
rps_peak = rps_average * 4  # ~463,000 RPS

# With 95% cache hit rate, origin receives:
origin_rps = rps_peak * 0.05  # ~23,000 RPS to origin

# Without cache (100% to origin):
# origin_rps = 463,000 RPS - likely database cannot handle this

print(f"Average RPS: {rps_average:,.0f}")
print(f"Peak RPS: {rps_peak:,.0f}")
print(f"Origin RPS (with cache): {origin_rps:,.0f}")
print(f"Cache reduces origin load by: {(1 - 0.05) * 100:.0f}%")
```
Memory Requirements
```python
import math

# Object storage
unique_objects = 1_000_000_000  # 1 billion unique objects
avg_object_size = 1024  # 1 KB
avg_key_size = 100  # bytes

# Raw data size
raw_data_size = unique_objects * (avg_object_size + avg_key_size)
raw_data_size_tb = raw_data_size / (1024 ** 4)
print(f"Raw data size: {raw_data_size_tb:.2f} TB")

# Memory overhead (hash table, pointers, TTL metadata)
# Typically 50-100% overhead for in-memory data structures
overhead_multiplier = 1.75
total_memory = raw_data_size * overhead_multiplier
total_memory_tb = total_memory / (1024 ** 4)
print(f"Total memory with overhead: {total_memory_tb:.2f} TB")

# Node sizing
memory_per_node_gb = 256  # Common for cache-optimized instances
memory_per_node_bytes = memory_per_node_gb * (1024 ** 3)
nodes_required = math.ceil(total_memory / memory_per_node_bytes)  # round up: ~7.2 -> 8 nodes
print(f"Nodes required (256 GB each): {nodes_required:.0f}")

# Add replication factor (e.g., 2 for single replica)
replication_factor = 2
total_nodes = nodes_required * replication_factor
print(f"Total nodes with replication: {total_nodes:.0f}")
```
Network Bandwidth Estimation
```python
# At peak: 463,000 RPS
# Average response size: 1 KB
# Request overhead: ~200 bytes (headers, key, protocol)

rps_peak = 463_000
response_size = 1024 + 200  # bytes per request/response pair

# Bandwidth per second
bandwidth_bytes_per_second = rps_peak * response_size
bandwidth_gbps = (bandwidth_bytes_per_second * 8) / (1024 ** 3)
print(f"Peak bandwidth: {bandwidth_gbps:.2f} Gbps")

# Per-node bandwidth (16 nodes, matching the replicated cluster size above)
nodes = 16
per_node_gbps = bandwidth_gbps / nodes
print(f"Per-node bandwidth: {per_node_gbps:.2f} Gbps")

# Each node needs at least 10 Gbps NICs with headroom
# Consider: client connections, replication traffic, monitoring
```
| Resource | Estimate | Notes |
|---|---|---|
| Peak Requests Per Second | ~460,000 RPS | 4x average for social media patterns |
| Total Memory Required | ~2 TB | 1 billion objects at 1KB + overhead |
| Cache Nodes (256 GB each) | ~8 nodes | Before replication |
| Total Nodes with Replication | ~16 nodes | 2x for single replica |
| Network Bandwidth | ~4.2 Gbps | Aggregate across cluster at peak |
| Per-Node Throughput Target | ~30,000 RPS | Peak RPS spread over 16 nodes; provision headroom for bursts |
In system design interviews, you're expected to perform these calculations on a whiteboard. Practice estimation with powers of 2 and 10. Key numbers to memorize: 86,400 seconds/day ≈ 100,000; 1 million seconds ≈ 11 days; 1 KB * 1 million objects = 1 GB. Round aggressively—the goal is order of magnitude, not precision.
Design constraints are the non-negotiable boundaries within which our cache must operate. These constraints eliminate certain design options and prioritize others.
Memory as the Primary Resource
Unlike databases that can spill to disk, caches are fundamentally memory-bound. This constraint shapes nearly every decision: the eviction policy determines what survives when memory fills, per-key metadata overhead must be kept small, and cost scales directly with RAM, the most expensive tier of storage.
Network as the Bottleneck
For in-datacenter caching, the network often becomes the constraint before CPU or memory: round-trip latency dominates small reads, serialization adds per-request overhead, and hot nodes can saturate their NICs long before exhausting RAM.
Consistency vs. Availability Trade-off
Distributed caches must navigate the CAP theorem:
Most caches choose AP (Availability + Partition tolerance): When network partitions occur, cache nodes continue serving requests independently, accepting that different clients may see different values.
Strong consistency is expensive: Requiring all replicas to agree before responding adds latency and reduces availability.
Eventual consistency is typical: Applications must tolerate stale reads and handle cache invalidation carefully.
Multi-Tenancy Considerations
Production caches serve multiple applications and teams: keys must be namespaced to prevent collisions, per-tenant quotas keep one application from evicting another's data, and noisy neighbors must not degrade latency for everyone else.
Unlike databases, caches are explicitly ephemeral. Data loss is expected and acceptable. This assumption enables aggressive performance optimizations—no write-ahead logging, no fsync, no replication quorums. When cache nodes fail, data is regenerated from the origin. If you need durability, you need a database, not a cache.
Understanding how applications use caches informs our design priorities. Different use cases exhibit different access patterns, size distributions, and consistency requirements.
| Use Case | Access Pattern | Object Size | TTL | Consistency |
|---|---|---|---|---|
| Session Storage | Read-heavy, write on login/activity | 1-10 KB | 30 min - 24 hr | High (low staleness tolerance) |
| API Response Caching | Read-heavy, write on data change | 1-100 KB | Seconds - minutes | Moderate |
| Database Query Cache | Read-heavy, invalidate on write | 100 B - 1 MB | Minutes - hours | Critical |
| Rate Limiting | Write-heavy (counters) | 8 bytes (counter) | Seconds - minutes | Moderate (approximation OK) |
| Feature Flags | Read-heavy, rare writes | 100-500 bytes | Minutes | Moderate |
| User Preferences | Read-heavy, occasional writes | 1-5 KB | Hours - days | Low |
| Content/Asset Metadata | Read-heavy, rarely changes | 500 B - 5 KB | Hours - days | Low |
| Leaderboards | Read and write heavy | Variable | Real-time | Moderate |
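As one concrete example from the table, the rate-limiting use case is typically built on atomic counters with TTLs. The sketch below shows a fixed-window limiter; it assumes the cache client exposes an atomic increment and a TTL-setting operation (as Redis and Memcached do), and the key scheme is purely illustrative:

```python
import time


def allow_request(cache, user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window rate limiter: permit at most `limit` requests per user per window."""
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"

    count = cache.incr(key)  # atomic increment; assumed to create the key at 1 if absent
    if count == 1:
        cache.expire(key, window_seconds)  # first request in the window sets the TTL

    return count <= limit
```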
Hot Key Patterns
Caches must handle 'hot keys'—individual keys that receive disproportionate traffic: a celebrity's profile, a viral post, a product page during a flash sale, or a global configuration value.
Hot keys challenge cache design because each key lives on a single node (or a small replica set), so that node's CPU and network bandwidth become the throughput ceiling for the key, and a single expiration or eviction can send a burst of traffic to the origin.
Hot keys can bring down entire cache clusters. Mitigation strategies include: local caching on application servers, key replication across multiple nodes, request coalescing (only one request to origin during refresh), and artificial key distribution (appending random suffixes and aggregating on read).
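Request coalescing, one of the mitigations listed above, can be sketched in-process: concurrent misses for the same key share a single origin fetch instead of stampeding. This is a simplified single-process illustration with a hypothetical cache client and loader callback:

```python
import threading
from typing import Any, Callable, Dict

_inflight: Dict[str, threading.Event] = {}
_inflight_lock = threading.Lock()


def coalesced_get(cache, key: str, load_from_origin: Callable[[], Any], ttl_seconds: int = 60) -> Any:
    """On a cache miss, only one caller fetches from the origin; the rest wait and re-read the cache."""
    value = cache.get(key)
    if value is not None:
        return value

    with _inflight_lock:
        event = _inflight.get(key)
        if event is None:
            event = _inflight[key] = threading.Event()
            is_leader = True
        else:
            is_leader = False

    if is_leader:
        try:
            value = load_from_origin()           # the single origin fetch for this key
            cache.set(key, value, ttl_seconds)   # repopulate before releasing waiters
        finally:
            with _inflight_lock:
                _inflight.pop(key, None)
            event.set()                          # wake the callers that piled up behind the miss
        return value

    event.wait()
    value = cache.get(key)                       # the leader has normally repopulated the cache
    return value if value is not None else load_from_origin()  # fallback if the leader failed
```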
Access Pattern: Zipf Distribution
Real-world cache access follows a Zipf (power-law) distribution: a small number of keys receive most of the requests. The 80/20 rule often applies—20% of keys serve 80% of requests, and in practice the skew is frequently more extreme (1% of keys serving 50%+ of requests).
This distribution is actually favorable for caching: a cache large enough to hold only the hot head of the key space still captures most of the traffic, which is why a cache far smaller than the full data set can achieve very high hit rates.
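A small simulation shows why this shape helps. Under the assumed parameters below (1 million keys and a Zipf exponent of 1.0, both illustrative), caching only a few percent of keys already covers most requests:

```python
import numpy as np

num_keys = 1_000_000   # assumed key-space size
s = 1.0                # assumed Zipf exponent

ranks = np.arange(1, num_keys + 1)
weights = 1.0 / ranks ** s      # popularity of the i-th most popular key
weights /= weights.sum()        # normalize to request probabilities
coverage = np.cumsum(weights)   # fraction of requests served by the top-k keys

for cached_fraction in (0.01, 0.05, 0.20):
    k = int(num_keys * cached_fraction)
    print(f"Caching the hottest {cached_fraction:.0%} of keys covers {coverage[k - 1]:.0%} of requests")
```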
Let's consolidate our requirements into a comprehensive specification that will guide the design of our distributed cache system.
Design Priorities (in order)
1. Latency: sub-millisecond reads are the reason the cache exists.
2. Availability: four nines, with graceful degradation when nodes fail.
3. Scalability: near-linear scaling of both memory capacity and throughput.
4. Cost efficiency: memory is the dominant cost, so per-key efficiency directly affects spend.
5. Consistency: eventual consistency is acceptable; staleness is managed with TTLs and invalidation.
What We Explicitly Do NOT Require
- Durability: cached data is ephemeral and is regenerated from the origin after loss.
- Strong consistency across replicas: requiring replica agreement would sacrifice latency and availability.
- Persistence machinery: no write-ahead logging, fsync, or replication quorums.
We have established clear functional and non-functional requirements, performed capacity estimation, identified key constraints, and documented common use cases. This foundation enables us to make principled design decisions. In the next page, we'll explore Cache Partitioning—how to distribute data across multiple nodes to achieve the scale and performance we require.