Design a distributed NoSQL database like Apache Cassandra or Amazon DynamoDB: a masterless, peer-to-peer, wide-column / key-value store. The system partitions data across a cluster using consistent hashing with virtual nodes, replicates each partition to N nodes with rack- and data-centre-aware placement, and offers tunable consistency (ONE/QUORUM/ALL per operation). Storage uses an LSM-tree engine (commit log → memtable → SSTables with compaction), and replicas are kept consistent via hinted handoff, read repair, and Merkle-tree anti-entropy repair, all while supporting multi-data-centre active-active replication.
| Metric | Value |
|---|---|
| Cluster size | 100–10,000 nodes |
| Total storage | 1+ PB |
| Writes per second (per node) | 50,000–100,000 |
| Reads per second (per node) | 10,000–50,000 |
| Write latency (p99) | < 5ms |
| Read latency (p99) | < 10ms |
| Replication factor | 3 (typical) |
| Virtual nodes per physical node | 256 |
| Data centres | 2–5 (active-active) |
| Partitions (token ranges) | Millions |
| Compaction throughput | 50–100 MB/s per node |
Key-value and wide-column data model: store data as rows identified by a partition key; each row contains a set of columns (wide-column); support compound primary keys (partition key + clustering key for sorted data within a partition); schema-flexible — columns can vary per row
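As a rough illustration of this data model, here is a minimal Python sketch; names such as `WideRow`, `upsert`, and `slice` are invented for the example and not any real driver API. It shows a partition holding many rows keyed by a clustering key, kept sorted within the partition, with columns free to vary per row.

```python
from dataclasses import dataclass, field
from typing import Any

# Minimal sketch of the wide-column model: one partition, many clustered rows.
# All names here are illustrative, not a real client API.

@dataclass
class WideRow:
    partition_key: str                        # decides which nodes own the row
    rows: dict = field(default_factory=dict)  # clustering key -> {column: value}

    def upsert(self, clustering_key: Any, columns: dict) -> None:
        """Columns can vary per clustering key (schema-flexible)."""
        self.rows.setdefault(clustering_key, {}).update(columns)

    def slice(self, start: Any, end: Any) -> list:
        """Clustering keys are kept sorted within a partition."""
        return [(k, self.rows[k]) for k in sorted(self.rows) if start <= k <= end]

# Example: sensor readings keyed by (sensor_id, timestamp).
partition = WideRow(partition_key="sensor-42")
partition.upsert(1700000000, {"temp": 21.5})
partition.upsert(1700000060, {"temp": 21.7, "humidity": 0.43})  # extra column is fine
print(partition.slice(1700000000, 1700000060))
```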
Data partitioning: automatically distribute data across a cluster of nodes using consistent hashing; each node responsible for a range of the hash ring; data evenly distributed to avoid hotspots; support adding/removing nodes with minimal data movement
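A minimal sketch of a hash ring with virtual nodes follows, using MD5-derived tokens purely for illustration (Cassandra's default partitioner is Murmur3-based): each physical node owns many vnode tokens, a key belongs to the first vnode clockwise from its token, and removing a node moves only that node's ranges.

```python
import bisect
import hashlib

# Sketch of consistent hashing with virtual nodes. Token function and vnode
# count are illustrative; real systems use e.g. Murmur3 and 256 vnodes/node.

def token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self.tokens: list[int] = []          # sorted vnode tokens
        self.owner: dict[int, str] = {}      # token -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            t = token(f"{node}#vnode{i}")
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def remove_node(self, node: str) -> None:
        # Only this node's vnode ranges move; other nodes keep their data.
        self.tokens = [t for t in self.tokens if self.owner[t] != node]
        self.owner = {t: n for t, n in self.owner.items() if n != node}

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's token owns it (wraps around).
        i = bisect.bisect_right(self.tokens, token(key)) % len(self.tokens)
        return self.owner[self.tokens[i]]

ring = Ring()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
print(ring.node_for("user:1234"))
```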
Replication: each piece of data replicated to N nodes (configurable replication factor, typically 3); replicas placed across different racks/data centres for fault tolerance; data remains available even when some replicas are down
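Building on the ring sketch above, here is one hedged way rack-aware placement could work: walk the ring clockwise from the key's token, skip nodes on racks already used, and fall back to the skipped nodes if there are fewer racks than replicas needed. The `rack_of` map and RF of 3 are assumptions for the example, and the logic is a simplification of strategies like Cassandra's NetworkTopologyStrategy.

```python
import bisect

# Simplified rack-aware replica selection, reusing Ring/token from the
# consistent-hashing sketch above.

def replicas_for(ring, key, rack_of, replication_factor=3):
    start = bisect.bisect_right(ring.tokens, token(key)) % len(ring.tokens)
    chosen, racks_used, skipped = [], set(), []
    for step in range(len(ring.tokens)):
        node = ring.owner[ring.tokens[(start + step) % len(ring.tokens)]]
        if node in chosen or node in skipped:
            continue
        if rack_of[node] in racks_used:
            skipped.append(node)              # hold back same-rack nodes for now
            continue
        chosen.append(node)
        racks_used.add(rack_of[node])
        if len(chosen) == replication_factor:
            return chosen
    # Fewer racks than replicas: fall back to nodes we skipped earlier.
    return (chosen + skipped)[:replication_factor]

rack_of = {"node-a": "rack1", "node-b": "rack1", "node-c": "rack2"}
print(replicas_for(ring, "user:1234", rack_of))
```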
Tunable consistency: support configurable consistency levels per operation — ONE (fastest, single replica), QUORUM (majority of replicas), ALL (all replicas); client chooses consistency-availability trade-off per read/write; strong consistency achievable with QUORUM read + QUORUM write (R + W > N)
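The quorum arithmetic is small enough to show directly. This sketch computes the acknowledgements each level needs and the R + W > N overlap check that makes QUORUM reads observe QUORUM writes; function names are illustrative.

```python
# Tunable consistency: acks required per level, and the R + W > N check
# that guarantees the read set overlaps the write set.

def required_acks(level: str, replication_factor: int) -> int:
    return {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[level]

def is_strongly_consistent(read_level, write_level, replication_factor=3):
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor   # read and write replica sets must overlap

print(required_acks("QUORUM", 3))                   # 2
print(is_strongly_consistent("QUORUM", "QUORUM"))   # True
print(is_strongly_consistent("ONE", "QUORUM"))      # False (1 + 2 = 3, not > 3)
```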
High availability writes: writes always succeed as long as at least one replica is reachable (with consistency level ONE); no single point of failure — all nodes are equal (masterless/peer-to-peer); no single coordinator or leader required
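One way to picture the availability guarantee, as a sketch only: any node can coordinate a write, fan it out to the replicas, and report success as soon as the requested number of acknowledgements arrive. The `send` callback and node names are stand-ins for real replica RPCs.

```python
# Sketch of a coordinator handling a write in a masterless cluster.

def coordinate_write(replicas, acks_needed, send):
    """send(replica) returns True if the replica acked, False if unreachable."""
    acked, missed = [], []
    for replica in replicas:
        (acked if send(replica) else missed).append(replica)
    if len(acked) >= acks_needed:
        return {"status": "ok", "acked": acked, "missed": missed}  # missed replicas get hints
    return {"status": "unavailable", "acked": acked}

up = {"node-a": False, "node-b": False, "node-c": True}   # two replicas down
result = coordinate_write(["node-a", "node-b", "node-c"],
                          acks_needed=1,                  # consistency level ONE
                          send=lambda r: up[r])
print(result["status"])   # "ok": the write succeeds with a single reachable replica
```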
Read and write path: writes → commit log (WAL) + memtable (in-memory sorted structure) → periodically flushed to SSTables (Sorted String Tables) on disk; reads → check memtable → check Bloom filters on SSTables → merge results from SSTables; LSM-tree storage engine
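A toy sketch of that LSM path, under heavy simplification: writes append to a commit log and update a memtable, the memtable flushes to immutable sorted SSTable segments, and reads check the memtable and then SSTables newest-to-oldest. A plain `set` of keys stands in for a real (probabilistic) Bloom filter, and flush/compaction thresholds are invented.

```python
# Toy LSM-tree write/read path. The "bloom" field is just a key set standing
# in for a real Bloom filter; sizes and thresholds are illustrative.

class LSMStore:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log: list[tuple[str, str]] = []   # WAL for crash recovery
        self.memtable: dict[str, str] = {}            # in-memory sorted-on-flush buffer
        self.sstables: list[dict] = []                # immutable segments, newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))          # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        keys = sorted(self.memtable)
        self.sstables.append({
            "keys": keys,                             # sorted, enabling range scans
            "rows": dict(self.memtable),
            "bloom": set(keys),                       # lets reads skip files cheaply
        })
        self.memtable.clear()
        self.commit_log.clear()                       # flushed data is now on "disk"

    def get(self, key: str):
        if key in self.memtable:                      # newest data wins
            return self.memtable[key]
        for sstable in reversed(self.sstables):       # newest SSTable first
            if key in sstable["bloom"]:
                return sstable["rows"].get(key)
        return None

store = LSMStore()
store.put("a", "1"); store.put("b", "2"); store.put("a", "3")
print(store.get("a"))   # "3" from the memtable, shadowing the flushed value
```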
Conflict resolution: concurrent writes to the same key on different replicas → resolve using last-write-wins (LWW) with timestamps; alternative: vector clocks for causal ordering; client-side conflict resolution for application-specific merge logic
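Last-write-wins reconciliation reduces to a comparison on write timestamps. The sketch below shows that merge rule; the `Versioned` type and tie-breaking on value (so replicas converge deterministically) are assumptions for the example.

```python
from dataclasses import dataclass

# Last-write-wins reconciliation: highest timestamp wins; ties broken by value
# so every replica picks the same winner.

@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_micros: int     # writer-supplied, typically the coordinator clock

def resolve_lww(replica_values):
    return max(replica_values, key=lambda v: (v.timestamp_micros, v.value))

a = Versioned("blue", 1700000000_000001)
b = Versioned("green", 1700000000_000007)   # later write
print(resolve_lww([a, b]).value)            # "green"
```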
Anti-entropy and repair: detect and repair inconsistencies between replicas — read repair (compare replica responses on read, fix stale replicas), Merkle tree-based anti-entropy repair (compare hash trees to find divergent data ranges), hinted handoff (temporarily store writes for down replicas)
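To make the Merkle-tree part concrete, here is a toy comparison: each replica hashes its key ranges into leaf buckets and builds a small binary hash tree; if the roots match, the replicas are in sync, otherwise only the differing buckets need to be streamed and repaired. Bucketing and tree shape are heavily simplified, and read repair / hinted handoff are not shown.

```python
import hashlib

# Toy Merkle-tree anti-entropy comparison between two replicas.

def leaf_hashes(data: dict, num_leaves: int = 4) -> list[str]:
    # num_leaves should be a power of two so levels pair up cleanly.
    buckets = [[] for _ in range(num_leaves)]
    for key in sorted(data):
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_leaves
        buckets[bucket].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def build_tree(leaves: list[str]) -> list[list[str]]:
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256((prev[i] + prev[i + 1]).encode()).hexdigest()
                       for i in range(0, len(prev), 2)])
    return levels                          # levels[-1][0] is the root hash

def divergent_buckets(tree_a, tree_b) -> list[int]:
    if tree_a[-1] == tree_b[-1]:           # roots match: nothing to repair
        return []
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]

replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "STALE", "k3": "v3"}
ta, tb = build_tree(leaf_hashes(replica_a)), build_tree(leaf_hashes(replica_b))
print(divergent_buckets(ta, tb))           # only the bucket holding k2 differs
```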
Multi-data-centre replication: replicate data across geographically distributed data centres; each data centre has its own set of replicas; cross-DC replication is asynchronous (to avoid cross-DC latency on writes); local reads/writes within each DC for low latency
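A small sketch of per-DC replication settings and a LOCAL_QUORUM-style write plan, assuming illustrative DC names (`dc-eu`, `dc-us`) and RF 3 in each: the client blocks only on local-DC acknowledgements, while remote DCs receive the write asynchronously.

```python
# Per-data-centre replication factors and a LOCAL_QUORUM-style write plan.
# DC names and factors are illustrative.

replication = {"dc-eu": 3, "dc-us": 3}          # replicas per data centre

def local_quorum_acks(local_dc: str) -> int:
    return replication[local_dc] // 2 + 1

def plan_write(local_dc: str) -> dict:
    return {
        "wait_for": local_quorum_acks(local_dc),                        # blocks the client
        "async_forward": [dc for dc in replication if dc != local_dc],  # no cross-DC wait
    }

print(plan_write("dc-eu"))   # {'wait_for': 2, 'async_forward': ['dc-us']}
```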
Secondary indexes and queries: support secondary indexes on non-primary-key columns for flexible querying; materialised views for denormalised query patterns; lightweight transactions (compare-and-set) using Paxos for conditional updates
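Lightweight transactions reduce to compare-and-set semantics at the API level. The sketch below shows only that contract ("apply the update if the current value matches the expected one"); a thread lock stands in for the Paxos round a real cluster would run across replicas, and all names are invented for the example.

```python
import threading

# Compare-and-set contract behind lightweight transactions. The lock is a
# single-process stand-in for a Paxos consensus round across replicas.

class CASRegister:
    def __init__(self):
        self._values: dict[str, str] = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key: str, expected, new_value: str) -> bool:
        with self._lock:
            if self._values.get(key) != expected:
                return False                # not applied; caller re-reads and retries
            self._values[key] = new_value
            return True

reg = CASRegister()
print(reg.compare_and_set("user:1", None, "alice"))     # True: "insert if not exists"
print(reg.compare_and_set("user:1", None, "bob"))       # False: key already taken
print(reg.compare_and_set("user:1", "alice", "alicia")) # True: expected value matched
```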
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system: 'P99 read latency under 10ms at QUORUM' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 write latency < 5ms', '99.99% availability per data centre', '1+ PB across thousands of nodes'. These drive architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?