Design a consensus protocol (like Raft or Paxos) that enables a cluster of servers to agree on a sequence of commands despite server crashes and network partitions. The protocol implements a replicated state machine: a single leader handles all client requests, replicates log entries to followers via AppendEntries RPCs, commits entries once a majority acknowledges them, and guarantees that committed entries are never lost. Leader election uses randomised timeouts and majority voting, ensuring at most one leader per term. The system tolerates F failures in a 2F+1-node cluster while maintaining linearizable consistency.
| Metric | Value |
|---|---|
| Cluster size | 3, 5, or 7 nodes (odd) |
| Fault tolerance | F failures with 2F+1 nodes |
| Heartbeat interval | 100–150ms |
| Election timeout | 150–300ms (randomised) |
| Leader election time | < 5 seconds (typically < 1 second) |
| Write latency (within DC) | 1–5ms (majority round-trip) |
| Write latency (cross-DC) | 50–200ms |
| Throughput | 10,000–100,000 ops/sec |
| Log entry size | ~100 bytes (typical KV command) |
| Snapshot interval | Every 10,000–100,000 entries |
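As a rough sketch of how the timing rows above translate into code, the constants and helper below mirror the heartbeat interval and the randomised election timeout; the names and exact values are illustrative, not taken from any particular implementation:

```go
// Sketch of timer configuration derived from the table above; constant and
// helper names are illustrative.
package raft

import (
	"math/rand"
	"time"
)

const (
	HeartbeatInterval  = 100 * time.Millisecond // leader-to-follower heartbeat cadence (table: 100-150ms)
	ElectionTimeoutMin = 150 * time.Millisecond // lower bound before a follower starts an election
	ElectionTimeoutMax = 300 * time.Millisecond // upper bound; randomisation avoids split votes
)

// randomElectionTimeout picks a fresh timeout in [Min, Max) so that followers
// rarely time out simultaneously, which keeps elections short.
func randomElectionTimeout() time.Duration {
	return ElectionTimeoutMin +
		time.Duration(rand.Int63n(int64(ElectionTimeoutMax-ElectionTimeoutMin)))
}
```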
Replicated state machine: maintain an identical log of commands across a cluster of N servers (typically 3, 5, or 7); each server applies log entries to its state machine in the same order; all servers converge to the same state; clients interact with any server (or the leader) and see a consistent view
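A minimal sketch of the apply side of the replicated state machine, assuming a key-value store as the state machine; the LogEntry and StateMachine types and the applyCommitted helper are illustrative names, not part of any specific library:

```go
// Apply-loop sketch: every server applies committed entries in log order, so
// all replicas converge to the same key-value state.
package raft

type LogEntry struct {
	Term    int    // term in which the leader created the entry
	Index   int    // position in the replicated log (1-based)
	Command []byte // opaque command, e.g. an encoded "SET key value"
}

type StateMachine struct {
	log         []LogEntry
	lastApplied int // highest log index applied to the state machine
	commitIndex int // highest log index known to be committed
	kv          map[string]string
}

// applyCommitted applies every committed-but-unapplied entry, in order.
func (sm *StateMachine) applyCommitted(apply func(cmd []byte, kv map[string]string)) {
	for sm.lastApplied < sm.commitIndex {
		sm.lastApplied++
		entry := sm.log[sm.lastApplied-1] // log indexes are 1-based in this sketch
		apply(entry.Command, sm.kv)
	}
}
```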
Leader election: the cluster elects a single leader, and at most one leader can exist per term; the leader handles all client requests and replicates log entries to followers; if the leader crashes, followers detect the failure via missed heartbeats and elect a new leader within a bounded time (typically < 5 seconds)
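A sketch of the candidate side of an election, assuming Raft-style terms and majority voting; the RPC transport is abstracted behind a sendRequestVote function, and all names here are illustrative:

```go
// Election sketch: a follower whose election timeout fires becomes a candidate,
// increments its term, votes for itself, and wins if a strict majority of the
// cluster grants its vote.
package raft

type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // used by voters for the up-to-date check (see safety below)
	LastLogTerm  int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Node struct {
	id              int
	currentTerm     int
	votedFor        int // -1 means no vote granted this term
	peers           []int
	sendRequestVote func(peer int, args RequestVoteArgs) RequestVoteReply
}

// startElection is a simplified, sequential version; real implementations send
// the RPCs in parallel and also attach the candidate's last log index and term.
func (n *Node) startElection() (wonElection bool) {
	n.currentTerm++
	n.votedFor = n.id
	votes := 1 // the candidate votes for itself
	args := RequestVoteArgs{Term: n.currentTerm, CandidateID: n.id}
	for _, peer := range n.peers {
		reply := n.sendRequestVote(peer, args)
		if reply.Term > n.currentTerm {
			n.currentTerm = reply.Term // a higher term exists: step down to follower
			return false
		}
		if reply.VoteGranted {
			votes++
		}
	}
	return votes > (len(n.peers)+1)/2 // strict majority of the full cluster
}
```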
Log replication: the leader appends client commands to its log → replicates each log entry to all followers → once a majority (quorum) of servers have persisted the entry, it is committed → the leader notifies the client of success; committed entries are never lost (even if the leader crashes)
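A sketch of how a leader might decide an entry is committed, assuming Raft-style per-follower matchIndex bookkeeping; note the usual restriction that the leader only commits entries from its own current term by counting replicas:

```go
// Commit-advancement sketch: find the highest index replicated on a majority
// (followers plus the leader itself) whose entry is from the current term.
package raft

import "sort"

type Leader struct {
	currentTerm int
	logTerms    []int       // logTerms[i] = term of the log entry at index i+1
	matchIndex  map[int]int // highest log index known to be replicated on each follower
	commitIndex int
}

func (l *Leader) maybeAdvanceCommit() {
	// The leader always has its entire log; add one entry per follower.
	indexes := []int{len(l.logTerms)}
	for _, m := range l.matchIndex {
		indexes = append(indexes, m)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(indexes)))
	// The value at the majority position is present on a quorum of servers.
	n := indexes[len(indexes)/2]
	if n > l.commitIndex && l.logTerms[n-1] == l.currentTerm {
		l.commitIndex = n // entries up to n may now be applied and acknowledged
	}
}
```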
Safety guarantee: a committed log entry is never overwritten or lost; every server that applies an entry at a given log index applies the same command, so no two servers ever apply different commands at the same index; the system is linearizable (behaves as if it were a single server)
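One concrete mechanism behind this guarantee, assuming a Raft-style election restriction, is the voter-side up-to-date check: a server grants its vote only to candidates whose log is at least as current as its own, so any elected leader already holds every committed entry. A sketch:

```go
// Election-restriction sketch: compare the term and index of each log's last
// entry; a later term wins, and within the same term a longer log wins.
package raft

// logIsUpToDate reports whether a candidate's log is at least as up-to-date
// as the voter's log.
func logIsUpToDate(candidateLastTerm, candidateLastIndex, myLastTerm, myLastIndex int) bool {
	if candidateLastTerm != myLastTerm {
		return candidateLastTerm > myLastTerm // later last term wins
	}
	return candidateLastIndex >= myLastIndex // same term: longer (or equal) log wins
}
```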
Fault tolerance: the system remains available as long as a majority of servers are alive and can communicate (safety is preserved even without a quorum); a cluster of 2F+1 nodes tolerates F simultaneous failures; a 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2 failures; progress requires a quorum (majority)
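The quorum arithmetic is simple enough to state directly; the helper names below are illustrative:

```go
// Quorum arithmetic: any two majorities of a 2F+1 cluster intersect, which is
// why committed entries survive leader changes and why F failures are tolerable.
package raft

// quorum returns how many servers must acknowledge an entry (or grant a vote):
// a strict majority of the cluster.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1 // 3 -> 2, 5 -> 3, 7 -> 4
}

// tolerableFailures returns how many simultaneous crashes still leave a quorum.
func tolerableFailures(clusterSize int) int {
	return (clusterSize - 1) / 2 // 3 -> 1, 5 -> 2, 7 -> 3
}
```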
Cluster membership changes: add or remove servers from the cluster without downtime; the change itself is replicated as a log entry; at no point can two disjoint majorities coexist (split-brain prevention); joint consensus or single-server changes ensure safety during transitions
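A sketch of a single-server membership change, assuming the configuration itself travels through the log as an ordinary entry; the ConfigChange and Config types and their encoding are illustrative:

```go
// Membership-change sketch: changing one server at a time guarantees that the
// old and new majorities overlap, so two disjoint majorities can never form.
package raft

type ConfigChange struct {
	Type   string // "add" or "remove"; illustrative encoding
	Server int
}

type Config struct {
	Members map[int]bool
}

// applyConfigChange updates the active configuration when the change entry is
// appended to the log (in Raft-style schemes, servers use the new configuration
// as soon as it is in their log, not only once it commits).
func (c *Config) applyConfigChange(cc ConfigChange) {
	switch cc.Type {
	case "add":
		c.Members[cc.Server] = true
	case "remove":
		delete(c.Members, cc.Server)
	}
}
```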
Log compaction: the log grows without bound; periodically take a snapshot of the state machine's current state → truncate the log up to the snapshot point; slow or recovering followers catch up via snapshot transfer instead of replaying the entire log
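A sketch of the truncation step, assuming snapshot metadata records the last log index and term it covers; the Snapshot and Log types are illustrative:

```go
// Compaction sketch: drop every log entry the snapshot already covers; the
// retained metadata lets followers resume from lastIncludedIndex instead of
// replaying from index 1.
package raft

type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry (needed for consistency checks)
	Data              []byte // serialized state machine, e.g. the whole KV map
}

type LogEntry struct {
	Term    int
	Index   int
	Command []byte
}

type Log struct {
	firstIndex int        // log index of entries[0]
	entries    []LogEntry
}

// compact discards entries up to and including the snapshot point.
func (l *Log) compact(s Snapshot) {
	keepFrom := s.LastIncludedIndex + 1 - l.firstIndex
	if keepFrom <= 0 {
		return // snapshot is older than our first retained entry; nothing to drop
	}
	if keepFrom >= len(l.entries) {
		l.entries = nil // snapshot covers everything we currently hold
	} else {
		l.entries = append([]LogEntry(nil), l.entries[keepFrom:]...)
	}
	l.firstIndex = s.LastIncludedIndex + 1
}
```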
Client interaction: clients send commands to the leader; if a client contacts a follower, the follower redirects to the leader; the leader responds only after the command is committed; exactly-once semantics (prevent duplicate application) via client-assigned sequence numbers
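A sketch of the deduplication table behind exactly-once semantics, assuming each client tags commands with a monotonically increasing sequence number and the state machine caches the last result per client; all names are illustrative:

```go
// Exactly-once sketch: a retried command (same client, same or older sequence
// number) returns the cached result instead of being applied a second time.
package raft

type ClientCommand struct {
	ClientID string
	Seq      int64  // monotonically increasing per client
	Op       []byte // the actual command payload
}

type Session struct {
	lastSeq    int64
	lastResult []byte
}

type DedupStateMachine struct {
	sessions map[string]*Session
	apply    func(op []byte) []byte // the underlying state machine
}

// Apply runs a committed command at most once per (ClientID, Seq).
func (sm *DedupStateMachine) Apply(cmd ClientCommand) []byte {
	s, ok := sm.sessions[cmd.ClientID]
	if ok && cmd.Seq <= s.lastSeq {
		return s.lastResult // duplicate delivery: return the cached result
	}
	result := sm.apply(cmd.Op)
	sm.sessions[cmd.ClientID] = &Session{lastSeq: cmd.Seq, lastResult: result}
	return result
}
```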
Read consistency: reads must return the most recently committed value (linearizable reads); naive approach: all reads go through the leader (simple, but the leader becomes a bottleneck and must still verify its leadership to avoid serving stale data); optimised: the leader confirms it is still the leader at read time, either via a heartbeat-maintained leader lease or via the read-index protocol; lease-based follower reads for read scaling
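A sketch of the read-index flow, assuming the leader can confirm its leadership with one heartbeat round and block until the state machine has applied up to the recorded index; the function-valued fields stand in for real RPC and apply machinery and are purely illustrative:

```go
// Read-index sketch: record the commit index, prove leadership, wait for the
// state machine to catch up to that index, then serve the read locally.
package raft

import "errors"

type ReadIndexLeader struct {
	commitIndex       int
	confirmLeadership func() bool             // one heartbeat round acked by a majority
	waitForApply      func(index int)         // blocks until lastApplied >= index
	readFromState     func(key string) (string, bool)
}

// LinearizableGet observes every write committed before the read was received.
func (l *ReadIndexLeader) LinearizableGet(key string) (string, error) {
	readIndex := l.commitIndex
	if !l.confirmLeadership() {
		return "", errors.New("not leader: redirect the client")
	}
	l.waitForApply(readIndex)
	value, ok := l.readFromState(key)
	if !ok {
		return "", errors.New("key not found")
	}
	return value, nil
}
```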
Persistence and recovery: each server persists its current term, votedFor, and log entries to durable storage (disk); on crash and restart, the server recovers from persistent state; the server rejoins the cluster and catches up on missed log entries from the leader
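A sketch of the persistence boundary, assuming a simple JSON-on-disk encoding for illustration (a real implementation would use a write-ahead log with atomic rename); the key point is that the fsync must complete before the server replies to a vote or AppendEntries request:

```go
// Persistence sketch: currentTerm, votedFor, and the log are flushed to durable
// storage before any RPC reply, so a crash-and-restart cannot vote twice in a
// term or forget entries it already acknowledged.
package raft

import (
	"encoding/json"
	"os"
)

type LogEntry struct {
	Term    int
	Index   int
	Command []byte
}

type PersistentState struct {
	CurrentTerm int
	VotedFor    int
	Log         []LogEntry
}

// save writes the state and fsyncs it; only after Sync returns may the server
// acknowledge a vote or an AppendEntries request.
func save(path string, st PersistentState) error {
	data, err := json.Marshal(st)
	if err != nil {
		return err
	}
	f, err := os.Create(path) // a production system would write to a temp file and rename
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return err
	}
	return f.Sync() // the durability point
}

// load recovers the state after a restart; the server then rejoins as a
// follower and catches up on missed entries from the current leader.
func load(path string) (PersistentState, error) {
	var st PersistentState
	data, err := os.ReadFile(path)
	if err != nil {
		return st, err
	}
	err = json.Unmarshal(data, &st)
	return st, err
}
```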
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'Low latency search under 100ms' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?