Design a distributed key-value store similar to Amazon DynamoDB, Apache Cassandra, or Riak. The system should support put, get, and delete operations with tunable consistency, automatic data partitioning via consistent hashing, and fault-tolerant replication.
| Metric | Value |
|---|---|
| Data stored | 100 TB across the cluster |
| Key size | ≤ 256 bytes |
| Value size | ≤ 1 MB (typical: 1–10 KB) |
| Read QPS | 500,000 per second |
| Write QPS | 100,000 per second |
| Number of nodes | 100–1,000 |
| Replication factor (N) | 3 |
| Read/write latency (p99) | < 10ms (within a single DC) |
| Virtual nodes per physical node | 150 (for a balanced hash ring) |
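The "virtual nodes" row deserves a closer look: each physical node claims many points on the hash ring, which evens out key distribution and spreads a failed node's load across the whole cluster. A minimal sketch of such a ring, using MD5 for stable placement (the class name `HashRing` and the `node#i` vnode naming are illustrative, not from any particular system):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a point on the ring; MD5 is stable across processes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points

    def add_node(self, node: str) -> None:
        # Each physical node owns `vnodes` points on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        # Only this node's vnodes disappear; other keys keep their owners.
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # The first vnode clockwise from the key's hash owns the key.
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Note that when a node is removed, only the keys it owned move; with a naive `hash(key) % num_nodes` scheme, nearly every key would be remapped.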
- `Put(key, value)`: insert or update a key-value pair
- `Get(key)`: retrieve the value associated with a key
- `Delete(key)`: remove a key-value pair
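On a single node, this API reduces to a thin wrapper over a map. A minimal in-memory sketch (the `KVNode` class is hypothetical; a production store would persist to a log-structured engine and replicate deletes as tombstones rather than dropping keys):

```python
class KVNode:
    """Single-node store backing Put/Get/Delete; in-memory only."""

    def __init__(self):
        self._data = {}  # key -> value

    def put(self, key: str, value: bytes) -> None:
        # Put is an upsert: it inserts a new key or overwrites an existing one.
        self._data[key] = value

    def get(self, key: str):
        # Returns None for missing keys rather than raising.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # A distributed store would write a tombstone here so the delete
        # propagates to replicas; this sketch simply drops the key.
        self._data.pop(key, None)
```

The interesting parts of the design are not these three methods but where each key lives (partitioning) and how many copies exist (replication), covered below.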
- Data is automatically partitioned (sharded) across multiple nodes for horizontal scalability
- Data is replicated across multiple nodes for fault tolerance; the system remains available when individual nodes fail
- Consistency is tunable: clients can choose between strong consistency and eventual consistency per request
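Tunable consistency is usually expressed with quorums: of N replicas, a write must be acknowledged by W of them and a read must consult R. When R + W > N, every read quorum overlaps every write quorum in at least one replica, so a read is guaranteed to see the latest acknowledged write. A small sketch of that rule and a quorum-read reconciliation step (the helper names and the `(value, version)` reply shape are assumptions for illustration):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    # Strong consistency requires read and write quorums to overlap
    # in at least one replica: R + W > N.
    return r + w > n

def quorum_read(replies):
    """Reconcile R replica replies of (value, version): keep the freshest."""
    # With overlapping quorums, at least one reply carries the latest version.
    return max(replies, key=lambda reply: reply[1])
```

With N=3, a client wanting strong consistency might use R=2, W=2; a latency-sensitive client can drop to R=1, W=1 and accept eventual consistency.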
- Support TTL (time-to-live) for automatic key expiry
- Detect and resolve conflicting writes using vector clocks or last-writer-wins (LWW)
- Support range queries on ordered keys, or secondary indexes
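Vector clocks deserve a quick illustration, since they are what lets the system tell a stale write apart from a genuinely concurrent one. Each replica tags a value with a per-node counter map; one version supersedes another only if it has seen every event the other has. A minimal sketch (the class and method names are illustrative):

```python
class VectorClock:
    """Per-key vector clock: node id -> logical event counter."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node: str) -> None:
        # Called by the coordinating node on each write it accepts.
        self.counts[node] = self.counts.get(node, 0) + 1

    def descends_from(self, other: "VectorClock") -> bool:
        # True if self has seen every event recorded in other,
        # i.e. self supersedes other and other can be discarded.
        return all(self.counts.get(n, 0) >= c for n, c in other.counts.items())

    def concurrent_with(self, other: "VectorClock") -> bool:
        # Neither clock descends from the other: a genuine conflict,
        # surfaced to the client (or resolved by LWW) rather than hidden.
        return not self.descends_from(other) and not other.descends_from(self)

    def merge(self, other: "VectorClock") -> "VectorClock":
        # Element-wise max; applied after conflicting siblings are reconciled.
        nodes = set(self.counts) | set(other.counts)
        return VectorClock({n: max(self.counts.get(n, 0),
                                   other.counts.get(n, 0)) for n in nodes})
```

LWW is simpler (compare timestamps, keep the newest) but silently drops one of two concurrent writes; vector clocks preserve both siblings at the cost of client-side conflict resolution.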
Non-functional requirements define the system qualities critical to your users. Frame them as "The system should be able to..." statements; they will guide your deep dives later.

- Consider CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
- Frame NFRs for this specific system: "p99 read latency under 10 ms" is far more valuable than just "low latency".
- Attach concrete numbers ("p99 response time < 500ms", "99.9% availability", "10M DAU"); they drive architectural decisions.
- Choose the 3-5 most critical NFRs. Every system should be "scalable", but what makes THIS system's scaling uniquely challenging?