Design a distributed file system like HDFS (Hadoop Distributed File System) or GFS (Google File System) that stores petabytes of data across thousands of commodity servers. The system splits files into large blocks (128 MB) and replicates each block across multiple DataNodes with rack-aware placement. All metadata is managed by a central NameNode and held in memory for speed. The system supports pipelined writes and parallel reads with data locality, handles DataNode failures via heartbeat-based detection and automatic re-replication, and keeps the NameNode highly available through Active-Standby failover with a quorum-based shared edit log.
| Metric | Value |
|---|---|
| Total storage capacity | 100+ PB (petabytes) |
| Number of DataNodes | 10,000+ |
| Number of files | 500 million |
| Number of blocks | 5 billion |
| Block size | 128 MB |
| Replication factor | 3 (configurable) |
| NameNode metadata (in memory) | ~75 GB (500M files) |
| DataNode heartbeat interval | 3 seconds |
| Block report interval | 6 hours |
| Read throughput (per DataNode) | 100–200 MB/s |
| Write throughput (pipeline) | 60–100 MB/s |
| NameNode failover time | < 30 seconds |
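The ~75 GB metadata figure in the table can be sanity-checked with the widely cited HDFS rule of thumb of roughly 150 bytes of NameNode heap per namespace object. A rough sketch (illustrative arithmetic, not a measured number):

```python
# Back-of-envelope check of the NameNode metadata figure above, using the
# common HDFS heuristic of ~150 bytes of heap per namespace object.
BYTES_PER_OBJECT = 150          # rough heap cost per file entry
FILES = 500_000_000             # 500M files (from the table)

metadata_bytes = FILES * BYTES_PER_OBJECT
print(f"{metadata_bytes / 1e9:.0f} GB")   # ~75 GB for file entries alone
```

Block entries add further heap on top of this, which is why NameNode memory is the primary scalability limit on namespace size.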
File storage: store files of arbitrary size (bytes to terabytes); files written once, read many times (write-once-read-many — WORM); support append operations; files split into fixed-size chunks (64 MB or 128 MB blocks) distributed across cluster nodes
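The block-splitting rule above is simple fixed-size chunking; the last block is allowed to be short. A minimal sketch (the function name is hypothetical):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_index, offset, length) for each fixed-size block.
    The final block may be smaller than block_size; HDFS does not pad it."""
    blocks = []
    offset, index = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
print(len(split_into_blocks(300 * 1024 * 1024)))  # 3
```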
Namespace management: hierarchical file system namespace (directories and files with paths like /user/data/file.csv); metadata managed by a central NameNode (master); metadata includes: file→block mapping, block→DataNode mapping, permissions, replication factor
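The two core maps the NameNode maintains can be sketched as plain in-memory structures. This is a deliberately simplified model (class and field names are hypothetical), but the real NameNode likewise keeps the namespace tree and block map entirely in heap:

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    """Simplified per-file metadata held by the NameNode."""
    replication: int = 3
    permissions: str = "rw-r--r--"
    block_ids: list = field(default_factory=list)   # file -> blocks

class Namespace:
    def __init__(self):
        self.files: dict[str, FileMeta] = {}        # path -> file metadata
        self.block_locations: dict[int, set] = {}   # block -> DataNodes

    def create(self, path, replication=3):
        self.files[path] = FileMeta(replication=replication)

    def add_block(self, path, block_id, datanodes):
        self.files[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanodes)

ns = Namespace()
ns.create("/user/data/file.csv")
ns.add_block("/user/data/file.csv", 1001, ["dn1", "dn2", "dn3"])
```

Note that in real HDFS the block→DataNode map is not persisted; it is rebuilt at startup from DataNode block reports.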
Replication: each block replicated across multiple DataNodes (default replication factor = 3); replicas placed on different racks for fault tolerance (rack-aware placement: 1 replica on the writer's local rack, 2 on two different nodes of a single remote rack); ensures data survives DataNode and rack failures
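The default placement policy can be sketched as follows (a simplification: real HDFS also accounts for node load, available space, and falls back when racks are scarce):

```python
import random

def place_replicas(datanodes_by_rack, local_rack):
    """Sketch of HDFS default placement: first replica on the local rack,
    second and third on two different nodes of one remote rack."""
    first = random.choice(datanodes_by_rack[local_rack])
    remote_racks = [r for r in datanodes_by_rack if r != local_rack]
    remote = random.choice(remote_racks)
    second, third = random.sample(datanodes_by_rack[remote], 2)
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
replicas = place_replicas(cluster, local_rack="rack1")
```

This layout survives the loss of any single rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.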
Read path: client contacts NameNode for file metadata (list of blocks + DataNode locations per block) → client reads blocks directly from DataNodes in parallel; NameNode not involved in data transfer (avoids bottleneck); client selects nearest DataNode replica for each block
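The read path above amounts to one metadata RPC followed by direct DataNode reads. A client-side sketch, with `FakeNameNode` and the `read_block` callable as stand-ins for the real RPCs:

```python
def read_file(namenode, path, read_block):
    """Client-side read sketch: one metadata call, then direct reads.
    Block data never flows through the NameNode."""
    blocks = namenode.get_block_locations(path)   # [(block_id, [datanodes])]
    data = bytearray()
    for block_id, locations in blocks:
        replica = locations[0]                    # nearest replica listed first
        data += read_block(replica, block_id)     # read directly from DataNode
    return bytes(data)

class FakeNameNode:
    def get_block_locations(self, path):
        return [(1, ["dn1", "dn2"]), (2, ["dn3", "dn1"])]

store = {("dn1", 1): b"hello ", ("dn3", 2): b"world"}
print(read_file(FakeNameNode(), "/f", lambda dn, b: store[(dn, b)]))
# b'hello world'
```

In a real client the blocks of a large file can also be fetched from different DataNodes in parallel rather than sequentially.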
Write path: client contacts NameNode to create file → NameNode allocates blocks and selects DataNodes → client writes block data to first DataNode → DataNode pipelines the write to second and third replicas in a chain; write acknowledged after all replicas confirm
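The chained write can be sketched as a recursive forward with acks unwinding back up the pipeline (the `send` callable is a stand-in for persisting a packet on a DataNode):

```python
def pipeline_write(datanodes, block, send):
    """Sketch of a pipelined write: the client sends to the first DataNode,
    which forwards downstream; the ack returns only after every replica
    in the chain has accepted the data."""
    def write_to(chain, data):
        node, rest = chain[0], chain[1:]
        send(node, data)                 # persist locally (stand-in)
        if rest:
            write_to(rest, data)         # forward to the next replica
        return True                      # ack propagates back up the chain
    return write_to(datanodes, block)

acks = []
pipeline_write(["dn1", "dn2", "dn3"], b"block-data",
               lambda n, d: acks.append(n))
print(acks)  # ['dn1', 'dn2', 'dn3']
```

Pipelining uses each DataNode's outbound bandwidth once, instead of making the client upload three copies itself.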
Fault tolerance: detect DataNode failures via heartbeats (every 3 seconds); on DataNode failure → NameNode marks blocks on failed node as under-replicated → triggers re-replication to restore replication factor; data remains available from surviving replicas during re-replication
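NameNode-side failure detection is just bookkeeping over last-heartbeat timestamps. A sketch, using the stock HDFS timeout of about 10.5 minutes (2 × 5-minute recheck + 10 × 3-second heartbeat) before a silent node is declared dead:

```python
import time

HEARTBEAT_INTERVAL = 3        # seconds (from the table)
DEAD_AFTER = 10 * 60 + 30     # ~10.5 min of silence => declared dead

class HeartbeatMonitor:
    """Track last heartbeat per DataNode; nodes that fall silent are
    flagged dead, and their blocks become under-replicated and are
    queued for re-replication."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items() if now - t > DEAD_AFTER]

m = HeartbeatMonitor()
m.heartbeat("dn1", now=0)
m.heartbeat("dn2", now=600)
print(m.dead_nodes(now=700))  # ['dn1'] (silent for 700s > 630s)
```

The long timeout is deliberate: declaring nodes dead too eagerly would trigger re-replication storms during transient network blips.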
Consistency: single-writer model — only one writer per file at a time, enforced by a lease the NameNode grants to the writer; once a file is closed, it is immutable (can only be appended or deleted); the writer sees its own writes (read-your-writes), while concurrent readers may lag behind in-progress writes until data is flushed and the block is finalized
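Lease-based single-writer enforcement is a small piece of NameNode state. A minimal sketch (lease expiry and recovery are not modeled):

```python
class LeaseManager:
    """Sketch of single-writer enforcement: the NameNode grants one write
    lease per file; a second writer is rejected until the lease is
    released (or expires, which this sketch omits)."""
    def __init__(self):
        self.leases = {}            # path -> client holding the lease

    def acquire(self, path, client):
        holder = self.leases.get(path)
        if holder is not None and holder != client:
            return False            # someone else is writing this file
        self.leases[path] = client
        return True

    def release(self, path, client):
        if self.leases.get(path) == client:
            del self.leases[path]

lm = LeaseManager()
print(lm.acquire("/f", "client-A"))   # True
print(lm.acquire("/f", "client-B"))   # False: single writer per file
lm.release("/f", "client-A")
print(lm.acquire("/f", "client-B"))   # True
```

In real HDFS, leases must be renewed periodically so that a crashed writer's file can eventually be recovered and closed.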
Data integrity: every block has a checksum (CRC32); DataNodes verify checksums on read and periodically scan stored blocks; corrupt blocks detected → re-replicated from healthy replicas; NameNode tracks which blocks need checksum repair
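CRC32 verification on read is straightforward; Python's standard library makes the sketch one-liner-simple:

```python
import zlib

def checksum(block: bytes) -> int:
    """Compute the CRC32 stored alongside a block replica."""
    return zlib.crc32(block)

def verify(block: bytes, stored_crc: int) -> bool:
    """DataNodes recompute the CRC on every read and during periodic
    background scans; a mismatch marks the replica corrupt, and the
    block is re-replicated from a healthy copy."""
    return zlib.crc32(block) == stored_crc

data = b"block contents"
crc = checksum(data)
print(verify(data, crc))                   # True
print(verify(b"bit-flipped " + data, crc)) # False: corrupt replica detected
```

In practice HDFS checksums fixed-size chunks (512 bytes by default) within a block rather than the whole 128 MB at once, so a read can be verified incrementally.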
NameNode high availability: NameNode is a single point of failure; HA architecture: Active NameNode + Standby NameNode with shared edit log (stored in JournalNodes using Quorum Journal Manager); automatic failover via ZooKeeper; failover in < 30 seconds
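The critical safety property of the Quorum Journal Manager is fencing: after failover, a stale Active must not be able to write to the shared edit log. A simplified single-JournalNode sketch of the epoch-based fencing idea (real QJM runs a quorum of JournalNodes and a Paxos-like recovery protocol):

```python
class JournalNode:
    """Sketch of QJM-style fencing: a JournalNode rejects writes from any
    NameNode whose epoch is older than the newest it has seen, so a
    stale Active cannot corrupt the shared edit log after failover."""
    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def write(self, epoch, edit):
        if epoch < self.promised_epoch:
            return False                 # fenced out: stale writer
        self.promised_epoch = epoch
        self.edits.append(edit)
        return True

jn = JournalNode()
print(jn.write(1, "mkdir /a"))   # True  (epoch-1 Active)
print(jn.write(2, "mkdir /b"))   # True  (new Active after failover)
print(jn.write(1, "mkdir /c"))   # False (old Active is fenced out)
```

Because a write succeeds only with a quorum of JournalNodes, at most one NameNode epoch can make progress at a time.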
Snapshots and trash: support directory-level snapshots (point-in-time copy) for backup and recovery; deleted files moved to trash directory (retained for configurable period) before permanent deletion; enables accidental delete recovery
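The trash mechanism is a soft delete with time-based expiry. A sketch (class and method names are hypothetical; the retention period plays the role of a setting like `fs.trash.interval`):

```python
from datetime import datetime, timedelta

TRASH_RETENTION = timedelta(hours=24)   # configurable retention period

class Trash:
    """Sketch: a delete moves the path into a trash area with a timestamp;
    a background job permanently removes entries past the retention
    period, while anything younger can still be restored."""
    def __init__(self):
        self.entries = {}               # trashed path -> deletion time

    def delete(self, path, now):
        self.entries[path] = now        # soft delete: still recoverable

    def restore(self, path):
        return self.entries.pop(path, None) is not None

    def purge_expired(self, now):
        expired = [p for p, t in self.entries.items()
                   if now - t > TRASH_RETENTION]
        for p in expired:
            del self.entries[p]         # permanent deletion
        return expired

t = Trash()
t.delete("/user/data/file.csv", datetime(2024, 1, 1))
print(t.purge_expired(datetime(2024, 1, 3)))  # ['/user/data/file.csv']
```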