In the hierarchy of distributed systems components, caching occupies a unique position: it is simultaneously one of the simplest concepts to explain and one of the most challenging to implement correctly at scale. The premise is deceptively simple—store frequently accessed data closer to the application to avoid repeated expensive operations. Yet the engineering reality of building a cache that serves millions of requests per second with sub-millisecond latency, maintains consistency across distributed nodes, handles failures gracefully, and scales horizontally while remaining cost-effective is anything but simple.
Every major technology company—Google, Facebook, Amazon, Netflix, Twitter—has invested heavily in building or customizing distributed caching infrastructure. These systems sit at the critical path of nearly every user interaction, and their performance characteristics directly impact user experience, system throughput, and infrastructure costs. A well-designed distributed cache can reduce database load by 99%, cut response times from hundreds of milliseconds to single-digit milliseconds, and enable applications to scale horizontally without proportional increases in backend infrastructure.
Before we can design such a system, we must establish clear requirements. In system design interviews and real-world engineering alike, the requirements phase isn't merely a formality—it's the foundation upon which all architectural decisions rest. A cache designed for session storage has fundamentally different requirements than one optimized for API response caching or content delivery. This page establishes the complete requirements framework for building a production-grade distributed caching system.
By the end of this page, you will understand the complete functional and non-functional requirements for a distributed caching system, be able to perform capacity estimation for cache infrastructure, identify the key architectural constraints that shape cache design decisions, and recognize the trade-offs inherent in different caching approaches.
To properly scope our distributed cache design, we must first understand why caching is necessary and what problems it solves. Caching exists because of a fundamental asymmetry in computing: some operations are expensive, and their results are needed repeatedly.
The Latency Hierarchy
Modern systems exhibit dramatic differences in access latency across storage tiers. Using commonly cited, order-of-magnitude figures: an L1 cache reference takes about 0.5 ns, a main-memory reference about 100 ns, an SSD random read on the order of 100 μs, a round trip within a datacenter about 500 μs, and a cross-region network round trip roughly 150 ms.
Notice the orders of magnitude between tiers. A single cross-region network round-trip is roughly 300 million times slower than an L1 cache hit. This latency hierarchy explains why caching at every layer of the stack is essential for performance-critical systems.
What Makes an Operation Cache-Worthy?
Not all data benefits equally from caching. The ideal cache candidates are expensive to produce (a complex database query or a remote service call), read far more often than they change, and tolerant of brief staleness.
The Caching Paradox
Caching introduces a fundamental tension: the data closest to the user is also the most likely to be stale. This paradox shapes every distributed cache design. We must balance proximity (for latency) against freshness (for correctness), and different applications strike this balance differently.
A news feed can tolerate seconds of staleness without user impact. A stock trading platform cannot tolerate even milliseconds. A collaborative editing tool like Google Docs requires near-real-time consistency. Understanding where your application falls on this spectrum is the first requirement decision.
The most important metric for any cache is the hit rate—the percentage of requests served from cache rather than the origin. A cache with a 99% hit rate reduces origin load by 100x; a cache with a 50% hit rate reduces it by only 2x. The leverage compounds near the top: improving the hit rate from 98% to 99% halves the remaining origin traffic. This is why cache optimization focuses so intensely on marginal hit rate improvements.
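A quick back-of-the-envelope calculation, using an illustrative request rate, makes this leverage concrete:

```python
# Origin traffic as a function of cache hit rate (illustrative numbers)
total_rps = 100_000  # assumed total request rate

for hit_rate in (0.50, 0.90, 0.98, 0.99):
    origin_rps = total_rps * (1 - hit_rate)
    reduction = total_rps / origin_rps
    print(f"hit rate {hit_rate:.0%}: {origin_rps:>7,.0f} RPS to origin ({reduction:.0f}x reduction)")

# Going from a 98% to a 99% hit rate halves origin traffic:
# 2,000 RPS -> 1,000 RPS, even though the hit rate improved by only one point.
```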
Functional requirements define what the system must do. For a distributed cache, these requirements encompass the core operations, data model, and features that clients expect.
Core Cache Operations
Every distributed cache must support three fundamental operations:
| Operation | Description | Semantics | Expected Latency |
|---|---|---|---|
| GET(key) | Retrieve value associated with key | Returns value if exists, null/miss otherwise | < 1ms p99 |
| SET(key, value, ttl) | Store key-value pair with optional TTL | Overwrites existing; atomic operation | < 1ms p99 |
| DELETE(key) | Remove key from cache | Idempotent; succeeds even if key absent | < 1ms p99 |
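As a sketch of the contract these operations imply, here is a minimal client-side interface; the names and types are illustrative rather than taken from any particular cache library:

```python
from typing import Optional, Protocol


class CacheClient(Protocol):
    """Minimal contract for the three core cache operations (illustrative sketch)."""

    def get(self, key: str) -> Optional[bytes]:
        """Return the value stored under key, or None on a cache miss."""
        ...

    def set(self, key: str, value: bytes, ttl_seconds: Optional[int] = None) -> None:
        """Store value under key, overwriting any existing entry; ttl_seconds is optional."""
        ...

    def delete(self, key: str) -> None:
        """Remove key if present; idempotent, so deleting an absent key is not an error."""
        ...
```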
Extended Operations
Production caches typically require additional operations beyond the basic three, such as batch retrieval (multi-get), atomic increment and decrement for counters, compare-and-swap for optimistic concurrency control, and updating the TTL of an existing key.
Data Model Requirements
The data model defines what types of values the cache can store:
Key Constraints: Keys are typically strings with maximum length limits (e.g., 250 bytes for Memcached; up to 512 MB for Redis, though practical limits are far lower). Keys should be kept short, since every key is held in memory alongside its value.
Value Types: At minimum, support opaque byte arrays. Advanced caches support structured types (strings, integers, lists, sets, sorted sets, hashes, HyperLogLog, streams) with type-specific operations.
Maximum Value Size: Define limits for individual values (e.g., 1 MB default for Memcached). Large values impact network transfer time and memory fragmentation.
Total Key Space: The system should support billions of keys across the cluster without degradation.
TTL and Expiration
Time-based expiration is fundamental to cache operation: every write should accept an optional per-key TTL, expired entries must never be returned to clients, and reclamation can happen lazily (on access) or proactively (via background sweeps).
A key functional decision is whether the cache operates independently (cache-aside pattern where clients manage both cache and database) or integrates with the storage layer (write-through where cache writes propagate to database, or write-behind where writes are buffered and batched). Most distributed caches implement cache-aside, leaving synchronization to the application layer.
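A minimal sketch of the cache-aside pattern described above, assuming a cache client with the GET/SET/DELETE operations from the table earlier and a hypothetical `db` object standing in for the origin database:

```python
import json

TTL_SECONDS = 300  # assumed freshness budget for this data


def get_user_profile(cache, db, user_id: int) -> dict:
    """Cache-aside read: check the cache, fall back to the database, then populate the cache."""
    key = f"user:profile:{user_id}"

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database work

    profile = db.load_user_profile(user_id)  # cache miss: read from the origin
    cache.set(key, json.dumps(profile).encode(), ttl_seconds=TTL_SECONDS)  # populate for later readers
    return profile


def update_user_profile(cache, db, user_id: int, profile: dict) -> None:
    """Write path: update the origin first, then invalidate (rather than update) the cached entry."""
    db.save_user_profile(user_id, profile)
    cache.delete(f"user:profile:{user_id}")  # the next read repopulates with fresh data
```

Deleting on write instead of updating keeps the write path simple and lets the read path own cache population.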
Non-functional requirements define how well the system must perform. For distributed caches, these requirements are often more constraining than functional requirements because caching is fundamentally a performance optimization—a slow cache defeats its own purpose.
Latency Requirements
Latency is the defining characteristic of a cache. If the cache is slow, applications will bypass it and hit the origin directly.
| Metric | Target | Rationale |
|---|---|---|
| p50 (median) GET latency | < 200 μs | Typical case should be imperceptible |
| p99 GET latency | < 1 ms | Tail latency affects user experience |
| p99.9 GET latency | < 5 ms | Extreme tail should remain reasonable |
| SET latency | < 1 ms p99 | Writes should not block application flow |
| Network overhead | Minimal | Cache should be co-located with compute |
Throughput Requirements
A distributed cache must handle massive request volumes. The largest production deployments, such as Facebook's memcached fleet, serve billions of requests per second across their clusters.
Our target system should support roughly 500,000 requests per second at peak across the cluster, consistent with the capacity estimation later on this page (~463,000 peak RPS), while preserving sub-millisecond latency and leaving room to grow by adding nodes.
Availability Requirements
Caches sit on the critical path of nearly every request. When the cache is unavailable, applications experience degraded performance (if they can fall back to origin) or complete outages (if the origin cannot handle the load).
| Requirement | Target | Implication |
|---|---|---|
| Availability SLA | 99.99% (four 9s) | < 53 minutes downtime per year |
| Mean Time To Recovery | < 30 seconds | Automatic failover for node failures |
| Data Durability | Not guaranteed | Cache is ephemeral; loss is acceptable |
| Graceful Degradation | Required | Partial cluster should serve partial traffic |
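The availability target translates directly into a downtime budget; a quick check of the four-nines figure:

```python
# Downtime budget implied by a 99.99% availability SLA
minutes_per_year = 365 * 24 * 60                     # 525,600 minutes
allowed_downtime = minutes_per_year * (1 - 0.9999)   # unavailability fraction

print(f"Allowed downtime per year:  {allowed_downtime:.1f} minutes")       # ~52.6 minutes
print(f"Allowed downtime per month: {allowed_downtime / 12:.1f} minutes")  # ~4.4 minutes
```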
Scalability Requirements
The cache must scale with both data volume and request rate: adding nodes should increase memory capacity and throughput near-linearly, and scaling operations (adding or removing nodes) must not require downtime or a full cache flush.
Cache unavailability creates a paradox: when the cache fails, all requests hit the origin simultaneously. If 1,000 concurrent users were being served by cache, suddenly 1,000 requests hit the database—likely crashing it. This 'thundering herd' or 'cache stampede' makes cache availability even more critical than database availability in some architectures.
Before designing the architecture, we must estimate the scale of our distributed cache. These calculations inform decisions about cluster size, memory allocation, network capacity, and cost.
Reference Architecture Assumptions
Let's design for a large-scale social media or e-commerce platform with the following assumptions:
- 100 million daily active users, each generating about 100 cache requests per day (10 billion requests/day)
- 1 billion unique cached objects, averaging 1 KB per value plus roughly 100 bytes per key
- A target cache hit rate of 95%
- Peak traffic roughly 4x the daily average
Request Volume Estimation
```python
# Daily request volume
daily_requests = 100_000_000 * 100  # 10 billion requests/day

# Requests per second (assuming uniform distribution, which understates peak)
rps_average = daily_requests / 86400  # ~115,740 RPS

# Peak RPS (typically 3-5x average for social media)
rps_peak = rps_average * 4  # ~463,000 RPS

# With 95% cache hit rate, origin receives:
origin_rps = rps_peak * 0.05  # ~23,000 RPS to origin

# Without cache (100% to origin):
# origin_rps = 463,000 RPS - likely database cannot handle this

print(f"Average RPS: {rps_average:,.0f}")
print(f"Peak RPS: {rps_peak:,.0f}")
print(f"Origin RPS (with cache): {origin_rps:,.0f}")
print(f"Cache reduces origin load by: {(1 - 0.05) * 100:.0f}%")
```
Memory Requirements
```python
import math

# Object storage
unique_objects = 1_000_000_000  # 1 billion unique objects
avg_object_size = 1024  # 1 KB
avg_key_size = 100  # bytes

# Raw data size
raw_data_size = unique_objects * (avg_object_size + avg_key_size)
raw_data_size_tb = raw_data_size / (1024 ** 4)
print(f"Raw data size: {raw_data_size_tb:.2f} TB")

# Memory overhead (hash table, pointers, TTL metadata)
# Typically 50-100% overhead for in-memory data structures
overhead_multiplier = 1.75
total_memory = raw_data_size * overhead_multiplier
total_memory_tb = total_memory / (1024 ** 4)
print(f"Total memory with overhead: {total_memory_tb:.2f} TB")

# Node sizing
memory_per_node_gb = 256  # Common for cache-optimized instances
memory_per_node_bytes = memory_per_node_gb * (1024 ** 3)
nodes_required = math.ceil(total_memory / memory_per_node_bytes)  # round up: ~7.2 -> 8 nodes
print(f"Nodes required (256 GB each): {nodes_required:.0f}")

# Add replication factor (e.g., 2 for single replica)
replication_factor = 2
total_nodes = nodes_required * replication_factor
print(f"Total nodes with replication: {total_nodes:.0f}")
```
Network Bandwidth Estimation
```python
# At peak: 463,000 RPS
# Average response size: 1 KB
# Request overhead: ~200 bytes (headers, key, protocol)

rps_peak = 463_000
response_size = 1024 + 200  # bytes per request/response pair

# Bandwidth per second
bandwidth_bytes_per_second = rps_peak * response_size
bandwidth_gbps = (bandwidth_bytes_per_second * 8) / (1024 ** 3)
print(f"Peak bandwidth: {bandwidth_gbps:.2f} Gbps")

# Per-node bandwidth (16 nodes, matching the replicated cluster size above)
nodes = 16
per_node_gbps = bandwidth_gbps / nodes
print(f"Per-node bandwidth: {per_node_gbps:.2f} Gbps")

# Each node needs at least 10 Gbps NICs with headroom
# Consider: client connections, replication traffic, monitoring
```
| Resource | Estimate | Notes |
|---|---|---|
| Peak Requests Per Second | ~460,000 RPS | 4x average for social media patterns |
| Total Memory Required | ~2 TB | 1 billion objects at 1KB + overhead |
| Cache Nodes (256 GB each) | ~8 nodes | Before replication |
| Total Nodes with Replication | ~16 nodes | 2x for single replica |
| Network Bandwidth | ~4.2 Gbps | Aggregate across cluster at peak |
| Per-Node Throughput Target | ~30,000 RPS | Peak RPS spread over 16 nodes; provision headroom for bursts |
In system design interviews, you're expected to perform these calculations on a whiteboard. Practice estimation with powers of 2 and 10. Key numbers to memorize: 86,400 seconds/day ≈ 100,000; 1 million seconds ≈ 11 days; 1 KB * 1 million objects = 1 GB. Round aggressively—the goal is order of magnitude, not precision.
Design constraints are the non-negotiable boundaries within which our cache must operate. These constraints eliminate certain design options and prioritize others.
Memory as the Primary Resource
Unlike databases that can spill to disk, caches are fundamentally memory-bound. This constraint shapes nearly every decision: the eviction policy determines what survives when memory fills, per-key metadata overhead must be kept small, and cost scales directly with RAM, the most expensive tier of storage.
Network as the Bottleneck
For in-datacenter caching, the network often becomes the constraint before CPU or memory: round-trip latency dominates small reads, serialization adds per-request overhead, and hot nodes can saturate their NICs long before exhausting RAM.
Consistency vs. Availability Trade-off
Distributed caches must navigate the CAP theorem:
Most caches choose AP (Availability + Partition tolerance): When network partitions occur, cache nodes continue serving requests independently, accepting that different clients may see different values.
Strong consistency is expensive: Requiring all replicas to agree before responding adds latency and reduces availability.
Eventual consistency is typical: Applications must tolerate stale reads and handle cache invalidation carefully.
Multi-Tenancy Considerations
Production caches serve multiple applications and teams: keys must be namespaced to prevent collisions, per-tenant quotas keep one application from evicting another's data, and noisy neighbors must not degrade latency for everyone else.
Unlike databases, caches are explicitly ephemeral. Data loss is expected and acceptable. This assumption enables aggressive performance optimizations—no write-ahead logging, no fsync, no replication quorums. When cache nodes fail, data is regenerated from the origin. If you need durability, you need a database, not a cache.
Understanding how applications use caches informs our design priorities. Different use cases exhibit different access patterns, size distributions, and consistency requirements.
| Use Case | Access Pattern | Object Size | TTL | Consistency |
|---|---|---|---|---|
| Session Storage | Read-heavy, write on login/activity | 1-10 KB | 30 min - 24 hr | High (low staleness tolerance) |
| API Response Caching | Read-heavy, write on data change | 1-100 KB | Seconds - minutes | Moderate |
| Database Query Cache | Read-heavy, invalidate on write | 100 B - 1 MB | Minutes - hours | Critical |
| Rate Limiting | Write-heavy (counters) | 8 bytes (counter) | Seconds - minutes | Moderate (approximation OK) |
| Feature Flags | Read-heavy, rare writes | 100-500 bytes | Minutes | Moderate |
| User Preferences | Read-heavy, occasional writes | 1-5 KB | Hours - days | Low |
| Content/Asset Metadata | Read-heavy, rarely changes | 500 B - 5 KB | Hours - days | Low |
| Leaderboards | Read and write heavy | Variable | Real-time | Moderate |
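As one concrete example from the table, the rate-limiting use case is typically built on atomic counters with TTLs. The sketch below shows a fixed-window limiter; it assumes the cache client exposes an atomic increment and a TTL-setting operation (as Redis and Memcached do), and the key scheme is purely illustrative:

```python
import time


def allow_request(cache, user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window rate limiter: permit at most `limit` requests per user per window."""
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"

    count = cache.incr(key)  # atomic increment; assumed to create the key at 1 if absent
    if count == 1:
        cache.expire(key, window_seconds)  # first request in the window sets the TTL

    return count <= limit
```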
Hot Key Patterns
Caches must handle 'hot keys'—individual keys that receive disproportionate traffic: a celebrity's profile, a viral post, a product page during a flash sale, or a global configuration value.
Hot keys challenge cache design because each key lives on a single node (or a small replica set), so that node's CPU and network bandwidth become the throughput ceiling for the key, and a single expiration or eviction can send a burst of traffic to the origin.
Hot keys can bring down entire cache clusters. Mitigation strategies include: local caching on application servers, key replication across multiple nodes, request coalescing (only one request to origin during refresh), and artificial key distribution (appending random suffixes and aggregating on read).
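Request coalescing, one of the mitigations listed above, can be sketched in-process: concurrent misses for the same key share a single origin fetch instead of stampeding. This is a simplified single-process illustration with a hypothetical cache client and loader callback:

```python
import threading
from typing import Any, Callable, Dict

_inflight: Dict[str, threading.Event] = {}
_inflight_lock = threading.Lock()


def coalesced_get(cache, key: str, load_from_origin: Callable[[], Any], ttl_seconds: int = 60) -> Any:
    """On a cache miss, only one caller fetches from the origin; the rest wait and re-read the cache."""
    value = cache.get(key)
    if value is not None:
        return value

    with _inflight_lock:
        event = _inflight.get(key)
        if event is None:
            event = _inflight[key] = threading.Event()
            is_leader = True
        else:
            is_leader = False

    if is_leader:
        try:
            value = load_from_origin()           # the single origin fetch for this key
            cache.set(key, value, ttl_seconds)   # repopulate before releasing waiters
        finally:
            with _inflight_lock:
                _inflight.pop(key, None)
            event.set()                          # wake the callers that piled up behind the miss
        return value

    event.wait()
    value = cache.get(key)                       # the leader has normally repopulated the cache
    return value if value is not None else load_from_origin()  # fallback if the leader failed
```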
Access Pattern: Zipf Distribution
Real-world cache access follows a Zipf (power-law) distribution: a small number of keys receive most of the requests. The 80/20 rule often applies—20% of keys serve 80% of requests, and in practice the skew is frequently more extreme (1% of keys serving 50%+ of requests).
This distribution is actually favorable for caching: a cache large enough to hold only the hot head of the key space still captures most of the traffic, which is why a cache far smaller than the full data set can achieve very high hit rates.
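A small simulation shows why this shape helps. Under the assumed parameters below (1 million keys and a Zipf exponent of 1.0, both illustrative), caching only a few percent of keys already covers most requests:

```python
import numpy as np

num_keys = 1_000_000   # assumed key-space size
s = 1.0                # assumed Zipf exponent

ranks = np.arange(1, num_keys + 1)
weights = 1.0 / ranks ** s      # popularity of the i-th most popular key
weights /= weights.sum()        # normalize to request probabilities
coverage = np.cumsum(weights)   # fraction of requests served by the top-k keys

for cached_fraction in (0.01, 0.05, 0.20):
    k = int(num_keys * cached_fraction)
    print(f"Caching the hottest {cached_fraction:.0%} of keys covers {coverage[k - 1]:.0%} of requests")
```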
Let's consolidate our requirements into a comprehensive specification that will guide the design of our distributed cache system.
Design Priorities (in order)
1. Latency: sub-millisecond reads are the reason the cache exists.
2. Availability: four nines, with graceful degradation when nodes fail.
3. Scalability: near-linear scaling of both memory capacity and throughput.
4. Cost efficiency: memory is the dominant cost, so per-key efficiency directly affects spend.
5. Consistency: eventual consistency is acceptable; staleness is managed with TTLs and invalidation.
What We Explicitly Do NOT Require
- Durability: cached data is ephemeral and is regenerated from the origin after loss.
- Strong consistency across replicas: requiring replica agreement would sacrifice latency and availability.
- Persistence machinery: no write-ahead logging, fsync, or replication quorums.
We have established clear functional and non-functional requirements, performed capacity estimation, identified key constraints, and documented common use cases. This foundation enables us to make principled design decisions. In the next page, we'll explore Cache Partitioning—how to distribute data across multiple nodes to achieve the scale and performance we require.