Design a distributed file system like HDFS (Hadoop Distributed File System) or GFS (Google File System) that stores petabytes of data across thousands of commodity servers. The system splits files into large blocks (128 MB) and replicates each block across multiple DataNodes with rack-aware placement. All metadata is managed by a central NameNode and held in memory for speed. The system supports pipelined writes and parallel reads with data locality, handles DataNode failures via heartbeat-based detection and automatic re-replication, and keeps the NameNode highly available through Active-Standby failover with a quorum-based shared edit log.
| Metric | Value |
|---|---|
| Total storage capacity | 100+ PB (petabytes) |
| Number of DataNodes | 10,000+ |
| Number of files | 500 million |
| Number of blocks | 5 billion |
| Block size | 128 MB |
| Replication factor | 3 (configurable) |
| NameNode metadata (in memory) | ~75 GB (500M files) |
| DataNode heartbeat interval | 3 seconds |
| Block report interval | 6 hours |
| Read throughput (per DataNode) | 100–200 MB/s |
| Write throughput (pipeline) | 60–100 MB/s |
| NameNode failover time | < 30 seconds |
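The ~75 GB metadata figure in the table can be sanity-checked with the widely cited HDFS rule of thumb of roughly 150 bytes of NameNode heap per namespace object. A rough sketch (illustrative arithmetic, not a measured number):

```python
# Back-of-envelope check of the NameNode metadata figure above, using the
# common HDFS heuristic of ~150 bytes of heap per namespace object.
BYTES_PER_OBJECT = 150          # rough heap cost per file entry
FILES = 500_000_000             # 500M files (from the table)

metadata_bytes = FILES * BYTES_PER_OBJECT
print(f"{metadata_bytes / 1e9:.0f} GB")   # ~75 GB for file entries alone
```

Block entries add further heap on top of this, which is why NameNode memory is the primary scalability limit on namespace size.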
File storage: store files of arbitrary size (bytes to terabytes); files written once, read many times (write-once-read-many — WORM); support append operations; files split into fixed-size chunks (64 MB or 128 MB blocks) distributed across cluster nodes
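The block-splitting rule above is simple fixed-size chunking; the last block is allowed to be short. A minimal sketch (the function name is hypothetical):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, offset, length) for each fixed-size block.
    The final block may be smaller than block_size; HDFS does not pad it."""
    blocks = []
    offset, index = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
print(len(split_into_blocks(300 * 1024 * 1024)))  # 3
```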
Namespace management: hierarchical file system namespace (directories and files with paths like /user/data/file.csv); metadata managed by a central NameNode (master); metadata includes: file→block mapping, block→DataNode mapping, permissions, replication factor
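The two core maps the NameNode maintains can be sketched as plain in-memory structures. This is a deliberately simplified model (class and field names are hypothetical), but the real NameNode likewise keeps the namespace tree and block map entirely in heap:

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    """Simplified per-file metadata held by the NameNode."""
    replication: int = 3
    permissions: str = "rw-r--r--"
    block_ids: list = field(default_factory=list)   # file -> blocks

class Namespace:
    def __init__(self):
        self.files: dict[str, FileMeta] = {}        # path -> file metadata
        self.block_locations: dict[int, set] = {}   # block -> DataNodes

    def create(self, path, replication=3):
        self.files[path] = FileMeta(replication=replication)

    def add_block(self, path, block_id, datanodes):
        self.files[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanodes)

ns = Namespace()
ns.create("/user/data/file.csv")
ns.add_block("/user/data/file.csv", 1001, ["dn1", "dn2", "dn3"])
```

Note that in real HDFS the block→DataNode map is not persisted; it is rebuilt at startup from DataNode block reports.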
Replication: each block replicated across multiple DataNodes (default replication factor = 3); replicas placed on different racks for fault tolerance (rack-aware placement: 1 replica on the writer's local rack, 2 on two different nodes of a single remote rack); ensures data survives DataNode and rack failures
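The default placement policy can be sketched as follows (a simplification: real HDFS also accounts for node load, available space, and falls back when racks are scarce):

```python
import random

def place_replicas(datanodes_by_rack, local_rack):
    """Sketch of HDFS default placement: first replica on the local rack,
    second and third on two different nodes of one remote rack."""
    first = random.choice(datanodes_by_rack[local_rack])
    remote_racks = [r for r in datanodes_by_rack if r != local_rack]
    remote = random.choice(remote_racks)
    second, third = random.sample(datanodes_by_rack[remote], 2)
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
replicas = place_replicas(cluster, local_rack="rack1")
```

This layout survives the loss of any single rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.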
Read path: client contacts NameNode for file metadata (list of blocks + DataNode locations per block) → client reads blocks directly from DataNodes in parallel; NameNode not involved in data transfer (avoids bottleneck); client selects nearest DataNode replica for each block
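The read path above amounts to one metadata RPC followed by direct DataNode reads. A client-side sketch, with `FakeNameNode` and the `read_block` callable as stand-ins for the real RPCs:

```python
def read_file(namenode, path, read_block):
    """Client-side read sketch: one metadata call, then direct reads.
    Block data never flows through the NameNode."""
    blocks = namenode.get_block_locations(path)   # [(block_id, [datanodes])]
    data = bytearray()
    for block_id, locations in blocks:
        replica = locations[0]                    # nearest replica listed first
        data += read_block(replica, block_id)     # read directly from DataNode
    return bytes(data)

class FakeNameNode:
    def get_block_locations(self, path):
        return [(1, ["dn1", "dn2"]), (2, ["dn3", "dn1"])]

store = {("dn1", 1): b"hello ", ("dn3", 2): b"world"}
print(read_file(FakeNameNode(), "/f", lambda dn, b: store[(dn, b)]))
# b'hello world'
```

In a real client the blocks of a large file can also be fetched from different DataNodes in parallel rather than sequentially.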
Write path: client contacts NameNode to create file → NameNode allocates blocks and selects DataNodes → client writes block data to first DataNode → DataNode pipelines the write to second and third replicas in a chain; write acknowledged after all replicas confirm
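The chained write can be sketched as a recursive forward with acks unwinding back up the pipeline (the `send` callable is a stand-in for persisting a packet on a DataNode):

```python
def pipeline_write(datanodes, block, send):
    """Sketch of a pipelined write: the client sends to the first DataNode,
    which forwards downstream; the ack returns only after every replica
    in the chain has accepted the data."""
    def write_to(chain, data):
        node, rest = chain[0], chain[1:]
        send(node, data)                 # persist locally (stand-in)
        if rest:
            write_to(rest, data)         # forward to the next replica
        return True                      # ack propagates back up the chain
    return write_to(datanodes, block)

acks = []
pipeline_write(["dn1", "dn2", "dn3"], b"block-data",
               lambda n, d: acks.append(n))
print(acks)  # ['dn1', 'dn2', 'dn3']
```

Pipelining uses each DataNode's outbound bandwidth once, instead of making the client upload three copies itself.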
Fault tolerance: detect DataNode failures via heartbeats (every 3 seconds); on DataNode failure → NameNode marks blocks on failed node as under-replicated → triggers re-replication to restore replication factor; data remains available from surviving replicas during re-replication
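NameNode-side failure detection is just bookkeeping over last-heartbeat timestamps. A sketch, using the stock HDFS timeout of about 10.5 minutes (2 × 5-minute recheck + 10 × 3-second heartbeat) before a silent node is declared dead:

```python
import time

HEARTBEAT_INTERVAL = 3        # seconds (from the table)
DEAD_AFTER = 10 * 60 + 30     # ~10.5 min of silence => declared dead

class HeartbeatMonitor:
    """Track last heartbeat per DataNode; nodes that fall silent are
    flagged dead, and their blocks become under-replicated and are
    queued for re-replication."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items() if now - t > DEAD_AFTER]

m = HeartbeatMonitor()
m.heartbeat("dn1", now=0)
m.heartbeat("dn2", now=600)
print(m.dead_nodes(now=700))  # ['dn1'] (silent for 700s > 630s)
```

The long timeout is deliberate: declaring nodes dead too eagerly would trigger re-replication storms during transient network blips.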
Consistency: single-writer model — only one writer per file at a time, enforced by a lease the NameNode grants to the writer; once a file is closed, it is immutable (can only be appended or deleted); the writer sees its own writes (read-your-writes), while concurrent readers may lag behind in-progress writes until data is flushed and the block is finalized
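Lease-based single-writer enforcement is a small piece of NameNode state. A minimal sketch (lease expiry and recovery are not modeled):

```python
class LeaseManager:
    """Sketch of single-writer enforcement: the NameNode grants one write
    lease per file; a second writer is rejected until the lease is
    released (or expires, which this sketch omits)."""
    def __init__(self):
        self.leases = {}            # path -> client holding the lease

    def acquire(self, path, client):
        holder = self.leases.get(path)
        if holder is not None and holder != client:
            return False            # someone else is writing this file
        self.leases[path] = client
        return True

    def release(self, path, client):
        if self.leases.get(path) == client:
            del self.leases[path]

lm = LeaseManager()
print(lm.acquire("/f", "client-A"))   # True
print(lm.acquire("/f", "client-B"))   # False: single writer per file
lm.release("/f", "client-A")
print(lm.acquire("/f", "client-B"))   # True
```

In real HDFS, leases must be renewed periodically so that a crashed writer's file can eventually be recovered and closed.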
Data integrity: every block has a checksum (CRC32); DataNodes verify checksums on read and periodically scan stored blocks; corrupt blocks detected → re-replicated from healthy replicas; NameNode tracks which blocks need checksum repair
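CRC32 verification on read is straightforward; Python's standard library makes the sketch one-liner-simple:

```python
import zlib

def checksum(block: bytes) -> int:
    """Compute the CRC32 stored alongside a block replica."""
    return zlib.crc32(block)

def verify(block: bytes, stored_crc: int) -> bool:
    """DataNodes recompute the CRC on every read and during periodic
    background scans; a mismatch marks the replica corrupt, and the
    block is re-replicated from a healthy copy."""
    return zlib.crc32(block) == stored_crc

data = b"block contents"
crc = checksum(data)
print(verify(data, crc))                   # True
print(verify(b"bit-flipped " + data, crc)) # False: corrupt replica detected
```

In practice HDFS checksums fixed-size chunks (512 bytes by default) within a block rather than the whole 128 MB at once, so a read can be verified incrementally.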
NameNode high availability: NameNode is a single point of failure; HA architecture: Active NameNode + Standby NameNode with shared edit log (stored in JournalNodes using Quorum Journal Manager); automatic failover via ZooKeeper; failover in < 30 seconds
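The critical safety property of the Quorum Journal Manager is fencing: after failover, a stale Active must not be able to write to the shared edit log. A simplified single-JournalNode sketch of the epoch-based fencing idea (real QJM runs a quorum of JournalNodes and a Paxos-like recovery protocol):

```python
class JournalNode:
    """Sketch of QJM-style fencing: a JournalNode rejects writes from any
    NameNode whose epoch is older than the newest it has seen, so a
    stale Active cannot corrupt the shared edit log after failover."""
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def write(self, epoch, edit):
        if epoch < self.promised_epoch:
            return False                 # fenced out: stale writer
        self.promised_epoch = epoch
        self.edits.append(edit)
        return True

jn = JournalNode()
print(jn.write(1, "mkdir /a"))   # True  (epoch-1 Active)
print(jn.write(2, "mkdir /b"))   # True  (new Active after failover)
print(jn.write(1, "mkdir /c"))   # False (old Active is fenced out)
```

Because a write succeeds only with a quorum of JournalNodes, at most one NameNode epoch can make progress at a time.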
Snapshots and trash: support directory-level snapshots (point-in-time copy) for backup and recovery; deleted files moved to trash directory (retained for configurable period) before permanent deletion; enables accidental delete recovery
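The trash mechanism is a soft delete with time-based expiry. A sketch (class and method names are hypothetical; the retention period plays the role of a setting like `fs.trash.interval`):

```python
from datetime import datetime, timedelta

TRASH_RETENTION = timedelta(hours=24)   # configurable retention period

class Trash:
    """Sketch: a delete moves the path into a trash area with a timestamp;
    a background job permanently removes entries past the retention
    period, while anything younger can still be restored."""
    def __init__(self):
        self.entries = {}               # trashed path -> deletion time

    def delete(self, path, now):
        self.entries[path] = now        # soft delete: still recoverable

    def restore(self, path):
        return self.entries.pop(path, None) is not None

    def purge_expired(self, now):
        expired = [p for p, t in self.entries.items()
                   if now - t > TRASH_RETENTION]
        for p in expired:
            del self.entries[p]         # permanent deletion
        return expired

t = Trash()
t.delete("/user/data/file.csv", datetime(2024, 1, 1))
print(t.purge_expired(datetime(2024, 1, 3)))  # ['/user/data/file.csv']
```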