In 2003, Google published a paper describing the Google File System (GFS), a distributed file system designed to store petabytes of data across thousands of commodity servers. This paper inspired Doug Cutting and Mike Cafarella to create an open-source implementation called the Hadoop Distributed File System (HDFS), which would go on to become the foundation of the big data revolution.
HDFS fundamentally changed how organizations think about data storage. Before HDFS, storing petabytes of data required expensive, specialized hardware—SAN arrays, NAS appliances, and custom high-availability systems costing millions of dollars. HDFS proved that you could achieve similar reliability and far greater scalability using thousands of cheap commodity servers, accepting that failures are inevitable and designing the system to handle them gracefully.
Today, HDFS remains the backbone of countless data platforms, powering analytics at companies like Facebook (storing over 300 petabytes), Yahoo (40+ petabytes), and thousands of enterprises worldwide. Understanding HDFS architecture isn't just academically interesting—it's essential knowledge for anyone designing systems that must handle massive scale.
By the end of this page, you will understand HDFS's master-slave architecture, the role of NameNodes and DataNodes, how data is distributed and replicated for fault tolerance, the write and read paths in detail, federation and high availability mechanisms, and the trade-offs that make HDFS optimal for batch processing workloads.
Before diving into architecture, we must understand the assumptions and design goals that shaped HDFS. These aren't arbitrary choices—they reflect deep insights about hardware economics, failure patterns, and workload characteristics.
HDFS was designed with specific assumptions:

- Hardware failure is the norm, not the exception—in a cluster of thousands of commodity nodes, some component is always failing.
- Workloads are dominated by streaming, sequential reads of large files, not low-latency random access.
- Files are large (hundreds of megabytes to terabytes), and the file count is modest relative to total capacity.
- The access pattern is write-once, read-many: files are written once, closed, and then read repeatedly.
- Moving computation to the data is cheaper than moving data to the computation.
These assumptions have profound implications. HDFS is not a general-purpose file system. It excels at batch processing workloads but is poorly suited for interactive queries, random writes, or applications requiring millions of small files. Choosing HDFS for the wrong workload leads to frustration; understanding its design philosophy prevents this mistake.
The core insight behind HDFS:
Traditional storage systems tried to prevent failures through expensive, specialized hardware—redundant disk controllers, battery-backed caches, and enterprise-grade components. This approach doesn't scale economically to petabyte levels.
HDFS takes a fundamentally different approach: accept that failures are inevitable and build recovery into the software layer. Using commodity hardware and aggressive replication, HDFS achieves reliability through redundancy rather than prevention. A three-way replica on a $500 commodity server is more cost-effective than a single copy on a $50,000 enterprise array—and often more reliable because failures are truly independent.
HDFS employs a master-slave architecture with two primary component types: a single NameNode (the master) and multiple DataNodes (the slaves). This separation of concerns is fundamental to understanding how HDFS operates.
Key architectural insight: The NameNode never handles actual data—it only manages metadata. All data transfers happen directly between clients and DataNodes. This design prevents the NameNode from becoming a throughput bottleneck, allowing the cluster to scale data bandwidth by simply adding more DataNodes.
Let's examine each component in detail:
| Component | Responsibility | State Characteristics |
|---|---|---|
| NameNode | Manages namespace, file-to-block mapping, block locations | All state in memory; persists metadata to edit log and fsimage |
| Secondary NameNode | Periodically merges edit log with fsimage (checkpointing) | Not a failover node; reduces NameNode recovery time |
| DataNode | Stores and retrieves data blocks on local disk | Stateless from NameNode perspective; reports blocks on startup |
| Client | Initiates reads/writes, handles data pipeline logic | Uses HDFS client library (Java, C, or WebHDFS) |
The NameNode is the central coordinator of an HDFS cluster. It maintains the entire file system namespace—the directory tree, file attributes, and the mapping from files to blocks. Understanding the NameNode's internal structure is crucial for capacity planning and troubleshooting.
NameNode's In-Memory Data Structures:
The NameNode keeps the entire block map in memory. For a cluster with 100 million blocks, this requires significant RAM (roughly 150-200 bytes per block). This design enables O(1) lookup for any block location but limits cluster size by NameNode memory. A typical NameNode with 128GB RAM can manage ~400 million blocks, which at 128MB block size equals ~50PB of data.
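The sizing arithmetic above can be checked with a short script. The 150 bytes per block entry and the 50% heap share reserved for the block map are rough working assumptions used on this page, not HDFS constants:

```python
def max_blocks(ram_bytes, bytes_per_block_entry=150, metadata_fraction=0.5):
    """Rough estimate of how many blocks a NameNode heap can track.

    metadata_fraction hedges for the namespace tree, JVM overhead, etc.
    """
    return int(ram_bytes * metadata_fraction / bytes_per_block_entry)

def addressable_capacity(num_blocks, block_size=128 * 1024**2):
    """Raw data (one replica) addressable by that many block entries."""
    return num_blocks * block_size

ram = 128 * 1024**3                       # 128 GB NameNode heap
blocks = max_blocks(ram)                  # ~450 million block entries
petabytes = addressable_capacity(blocks) / 1024**5
print(f"{blocks / 1e6:.0f}M blocks, ~{petabytes:.0f} PB addressable")
```

With these assumptions the script lands in the same ballpark as the text: a few hundred million blocks and roughly 50 PB per NameNode.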
Metadata Persistence Mechanism:
The NameNode must persist metadata durably to survive restarts. It uses a combination of periodic checkpointing and continuous journaling:
On each metadata operation: The change is first written to the edit log (WAL pattern) before being applied to in-memory structures.
Periodically: The Secondary NameNode (or Checkpoint Node) fetches the FsImage and recent edit logs, applies the edits to create a new FsImage, and returns it to the NameNode.
On NameNode restart: It loads the latest FsImage, replays all edit logs since that checkpoint, and waits for DataNodes to report their blocks. Only after receiving a sufficient number of block reports does the NameNode leave "safe mode" and begin normal operations.
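The journal-then-apply (WAL) pattern described above can be sketched in a few lines. `MiniNameNode` and its list-backed journal are toy stand-ins for illustration, not the real implementation:

```python
import json

class MiniNameNode:
    """Toy write-ahead-log pattern: journal first, then mutate memory."""
    def __init__(self):
        self.namespace = {}     # in-memory state (the FsImage equivalent)
        self.edit_log = []      # durable journal (a list stands in for disk)

    def mkdir(self, path):
        entry = {"op": "mkdir", "path": path}
        self.edit_log.append(json.dumps(entry))   # 1) journal the edit
        self.namespace[path] = {}                 # 2) apply it in memory

    def checkpoint(self):
        """Fold current state into a new FsImage and truncate the journal."""
        fsimage = dict(self.namespace)
        self.edit_log.clear()
        return fsimage

nn = MiniNameNode()
nn.mkdir("/user")
nn.mkdir("/user/data")
print(len(nn.edit_log), nn.checkpoint())   # → 2 {'/user': {}, '/user/data': {}}
```

Because the edit is journaled before it is applied, a crash between the two steps loses nothing: replaying the journal on restart reproduces the in-memory state.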
```
# NameNode Startup Sequence
1. Load FsImage into memory
   └─ Contains: namespace tree, file metadata, block IDs
   └─ Does NOT contain: block locations (DataNode mapping)
2. Replay Edit Log
   └─ Apply all transactions since checkpoint
   └─ Rebuild current in-memory state
3. Enter Safe Mode
   └─ Read-only operation allowed
   └─ Block reports received from DataNodes
   └─ Block map reconstructed progressively
4. Check Block Health
   └─ Count blocks with minimum replicas
   └─ Threshold: 99.9% of blocks must have ≥1 replica
5. Leave Safe Mode
   └─ Full read/write operations enabled
   └─ Background replication begins for under-replicated blocks
```

DataNodes are the workhorses of HDFS—they store the actual data blocks and handle all data transfer operations. Understanding how DataNodes manage storage is essential for capacity planning and performance tuning.
Block Storage Model:
When a file is written to HDFS, it's split into fixed-size blocks (default 128MB, configurable per file). Each block is stored as a separate file on the DataNode's local file system, along with a corresponding metadata file containing checksums.
| Version | Default Block Size | Rationale |
|---|---|---|
| HDFS 1.x | 64 MB | Original design, balanced memory overhead vs. locality |
| HDFS 2.x+ | 128 MB | Increased to reduce NameNode memory pressure at scale |
| Large Clusters | 256 MB or 512 MB | Custom configuration for massive files to minimize metadata |
Why large blocks?
HDFS uses large blocks compared to traditional file systems (4KB-64KB) for several important reasons:
Reduced NameNode memory: Each block requires ~150 bytes in NameNode memory. A 1TB file at 128MB blocks = 8,192 blocks; at 4KB blocks = 268 million blocks.
Optimized for sequential reads: Large blocks minimize seek overhead. MapReduce jobs typically scan entire blocks sequentially.
Lower network overhead: Fewer block transfers mean fewer TCP connections and metadata lookups.
Efficient replication: Replicating larger chunks is more efficient than many small transfers.
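The NameNode-memory point is easy to check numerically (150 bytes per block entry is the same rough estimate used earlier on this page):

```python
import math

def block_count(file_size, block_size):
    """Number of HDFS blocks needed to store one file."""
    return math.ceil(file_size / block_size)

one_tb = 1024**4

for label, bs in [("128 MB", 128 * 1024**2), ("4 KB", 4 * 1024)]:
    n = block_count(one_tb, bs)
    # ~150 bytes of NameNode heap per block entry (rough estimate)
    print(f"{label:>7} blocks: {n:>12,} entries, ~{n * 150 / 1024**2:,.0f} MB of heap")
```

A single 1 TB file costs about a megabyte of NameNode heap at 128 MB blocks, versus tens of gigabytes at 4 KB blocks—a difference of more than four orders of magnitude.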
DataNode Operations:
DataNodes continuously verify stored blocks through a 'block scanner' that reads all blocks and validates checksums over a configurable period (default: 3 weeks). This proactive scanning catches disk corruption before it's discovered during reads, giving the NameNode time to create additional replicas.
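A sketch of the scanner's checksum verification. It follows the real layout only loosely—HDFS stores a CRC32 checksum per small chunk of data (default 512 bytes, `dfs.bytes-per-checksum`) in a sidecar `.meta` file:

```python
import zlib

CHUNK = 512  # HDFS checksums data in small chunks (dfs.bytes-per-checksum)

def checksums(block: bytes):
    """Compute a CRC32 per chunk, as stored in the block's .meta file."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def scan(block: bytes, stored):
    """Return indices of corrupted chunks (empty list means block is clean)."""
    return [i for i, (a, b) in enumerate(zip(checksums(block), stored)) if a != b]

data = bytes(range(256)) * 8                     # a 2 KB toy "block"
meta = checksums(data)
assert scan(data, meta) == []                    # clean block passes
corrupted = data[:600] + b"\x00" + data[601:]    # flip one byte in chunk 1
print("corrupt chunks:", scan(corrupted, meta))  # → [1]
```

Per-chunk checksums let the scanner pinpoint corruption without rereading the whole block, and let reads detect bit rot on the fly.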
Replication is HDFS's primary mechanism for ensuring data durability and availability. By default, every block is replicated to three DataNodes. The placement of these replicas follows a carefully designed policy that balances reliability, write bandwidth, and read performance.
Rack-Aware Replica Placement:
HDFS assumes that nodes are organized into racks, and that network bandwidth within a rack is higher than across racks. The default placement policy for three replicas is:

1. First replica: on the writer's node if it is a DataNode, otherwise on a randomly chosen node.
2. Second replica: on a node in a different rack (off-rack).
3. Third replica: on a different node in the same rack as the second replica.
Why this specific placement?
The placement policy optimizes multiple objectives:
| Objective | How Policy Addresses It |
|---|---|
| Rack failure tolerance | Two racks contain replicas; one full rack can fail |
| Write performance | Only one cross-rack hop; second and third replicas share rack |
| Read performance | High probability of finding replica on same rack as reader |
| Data locality | First replica placed on writer's node when possible |
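The policy in the table can be sketched in a few lines. The cluster map and node names are hypothetical, and the real `BlockPlacementPolicyDefault` also weighs disk usage and node load:

```python
import random

def place_replicas(cluster, writer_node):
    """Pick 3 DataNodes per the default policy: 1st on the writer's node,
    2nd on a different rack, 3rd on the 2nd replica's rack (other node)."""
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second = random.choice(cluster[remote_rack])
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    return [writer_node, second, third]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas(cluster, "dn2"))  # e.g. ['dn2', 'dn5', 'dn4']
```

Note the outcome guaranteed by construction: exactly two racks hold replicas, and no node holds two copies of the same block.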
Replication Pipeline:
When writing a block, HDFS uses a pipeline replication strategy rather than parallel writes:

1. The client streams data packets to the first DataNode in the pipeline.
2. The first DataNode persists each packet and forwards it to the second DataNode.
3. The second DataNode does the same, forwarding to the third.
4. Acknowledgments flow back through the pipeline in reverse; the client considers a packet durable once the ack from the full pipeline arrives.
This pipeline approach efficiently uses network bandwidth—the client only needs bandwidth to reach one DataNode, and the replication work is distributed.
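A toy simulation of this hop-by-hop flow. It is deliberately simplified: real DataNodes persist and forward concurrently, and handle timeouts and pipeline recovery:

```python
def pipeline_write(packet, datanodes, store):
    """Forward a packet hop-by-hop; the ack propagates back only after
    every downstream node has stored it (simplified: no failures)."""
    if not datanodes:
        return True                     # end of pipeline: ack flows back
    head, *rest = datanodes
    store.setdefault(head, []).append(packet)    # persist locally...
    return pipeline_write(packet, rest, store)   # ...then forward downstream

store = {}
packets = [b"pkt0", b"pkt1"]
acked = all(pipeline_write(p, ["dn1", "dn2", "dn3"], store) for p in packets)
print(acked, {dn: len(pkts) for dn, pkts in store.items()})
# → True {'dn1': 2, 'dn2': 2, 'dn3': 2}
```

Each packet ends up on all three nodes, yet the client only ever sent it once—the key bandwidth saving of pipelining over client-driven parallel writes.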
When the NameNode detects an under-replicated block (via missing heartbeats or block reports), it schedules re-replication to another DataNode. The priority is based on how under-replicated the block is: blocks with only one replica are replicated before those with two. This automatic recovery is how HDFS tolerates DataNode failures without data loss.
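A minimal sketch of this prioritization using a heap (the real NameNode tracks several priority levels, e.g. distinguishing blocks with no live replica at all):

```python
import heapq

def replication_queue(block_replicas, target=3):
    """Order under-replicated blocks: fewest live replicas first."""
    heap = [(live, blk) for blk, live in block_replicas.items() if live < target]
    heapq.heapify(heap)
    return [blk for _, blk in (heapq.heappop(heap) for _ in range(len(heap)))]

# blk_c has a single surviving replica, so it is re-replicated first
print(replication_queue({"blk_a": 2, "blk_b": 3, "blk_c": 1}))  # → ['blk_c', 'blk_a']
```

Fully replicated blocks (`blk_b`) never enter the queue; the most endangered block is always scheduled first.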
Understanding the complete data flow for reads and writes reveals why HDFS achieves high throughput and how it maintains consistency. Let's trace both paths step by step.
HDFS Write Path (Creating a New File):
1. Create: The client asks the NameNode to create /user/data/file.txt. The NameNode checks permissions, validates the path doesn't already exist, and creates an entry in the namespace.
2. Allocate: As the client accumulates data, it asks the NameNode for a new block; the NameNode returns a block ID and a list of target DataNodes chosen by the placement policy.
3. Stream: The client streams the block through the DataNode pipeline in packets, collecting acknowledgments as described above.
4. Complete: When the file is closed, the client flushes any remaining packets and notifies the NameNode, which commits the file once the minimum replica count is acknowledged.

HDFS provides write-once semantics. Once a file is closed, it cannot be modified—only appended to (support added in Hadoop 2.x). This simplifies consistency and enables aggressive caching.
The original HDFS design had a critical weakness: a single NameNode was a single point of failure. If the NameNode crashed, the entire cluster was unavailable until manual recovery. HDFS 2.0 introduced High Availability (HA) to address this limitation.
NameNode HA Architecture:
HDFS HA provides two NameNodes in an active-standby configuration:

- Active NameNode: serves all client requests and writes every namespace change to a shared edit log.
- Standby NameNode: continuously tails the shared edit log, keeping a hot copy of the namespace in memory.
- JournalNodes: a quorum of lightweight daemons (typically 3 or 5) that store the shared edit log; an edit must reach a majority to be committed.
- DataNodes: send heartbeats and block reports to both NameNodes, so the Standby's block map is always current.
- ZooKeeper Failover Controllers (ZKFC): one per NameNode; monitor health and coordinate automatic failover through ZooKeeper.
Failover Process:

1. The Active NameNode's ZKFC detects a failure via its local health check, or the Active's ZooKeeper session expires.
2. The Standby's ZKFC acquires the active-state lock in ZooKeeper.
3. The old Active is fenced so it can no longer mutate shared state.
4. The Standby applies any remaining edits from the JournalNodes and transitions to Active.
Typical failover time: 30-60 seconds (configurable), compared to hours for manual recovery in non-HA configurations.
Split-brain—where both NameNodes believe they're Active—would cause catastrophic data corruption. HDFS uses 'fencing' mechanisms to guarantee the old Active is truly stopped before promoting the Standby. Common fencing methods include SSH-based process killing, STONITH (Shoot The Other Node In The Head) via power control, or preventing the old Active from writing to JournalNodes.
Even with HA, the NameNode remains a scalability bottleneck—all namespace operations funnel through a single NameNode's memory. For clusters with hundreds of millions of files, this hits hard limits. HDFS Federation addresses this through horizontal namespace scaling.
Federation Architecture:
Federation allows multiple independent NameNodes to share datanodes, each managing a portion of the namespace:
| Concept | Description |
|---|---|
| Namespace Volume | Each NameNode manages an independent namespace (directory tree). No coordination between NameNodes. |
| Block Pool | Blocks belonging to a single namespace. Each NameNode has its own block pool. |
| DataNode | Stores blocks from ALL block pools, serving all NameNodes. Single DataNode, multiple pools. |
| Client ViewFS | Client-side mount table mapping paths to NameNodes. Creates unified namespace view. |
```xml
<!-- Client-side mount table configuration -->
<configuration>
  <property>
    <name>fs.viewfs.mounttable.default.link./user</name>
    <value>hdfs://namenode1:8020/user</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./data</name>
    <value>hdfs://namenode2:8020/data</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./logs</name>
    <value>hdfs://namenode3:8020/logs</value>
  </property>
</configuration>
```

Benefits of Federation:

- Namespace scalability: total file and block counts grow with the number of NameNodes rather than being capped by one NameNode's heap.
- Performance isolation: heavy metadata traffic on one namespace (e.g., /logs) doesn't slow operations on another.
- Higher aggregate metadata throughput: namespace operations are spread across independent NameNodes.
Limitations:

- No rebalancing across namespaces: the split is static; files don't migrate between NameNodes automatically.
- Client complexity: every client needs a consistent mount-table configuration (ViewFS) to see a unified namespace.
- Availability is per-namespace: each NameNode remains a single point of failure for its own namespace unless it is also deployed with HA.
- Operational overhead: more NameNodes to provision, monitor, and upgrade.
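To make the ViewFS idea concrete, here is a rough sketch of client-side resolution via longest-prefix matching against the mount table shown earlier (the paths and NameNode URIs are the same placeholders, not a real deployment):

```python
def resolve(path, mounts):
    """Longest-prefix match of a path against a ViewFS-style mount table."""
    best = max((m for m in mounts if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise FileNotFoundError(f"no mount covers {path}")
    return mounts[best] + path[len(best):]

mounts = {
    "/user": "hdfs://namenode1:8020/user",
    "/data": "hdfs://namenode2:8020/data",
    "/logs": "hdfs://namenode3:8020/logs",
}
print(resolve("/data/2024/events.parquet", mounts))
# → hdfs://namenode2:8020/data/2024/events.parquet
```

The routing happens entirely on the client; no NameNode knows about any namespace but its own.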
HDFS's design choices optimize for specific workloads at the expense of others. Understanding these trade-offs prevents misapplication and guides technology selection.
Each file, directory, and block consumes ~150 bytes in NameNode memory. With millions of small files (e.g., storing 1KB files), you hit NameNode memory limits long before exhausting DataNode storage. Solutions include HAR files (Hadoop Archives), SequenceFiles (container format), or using HBase for small object storage.
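The small-files arithmetic, using the same rough 150-byte-per-object estimate and treating each file as one namespace entry plus one block entry:

```python
def namenode_heap(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap use: one entry per file + one per block."""
    return num_files * (1 + blocks_per_file) * bytes_per_object

# 100 million 1 KB files vs. the same 100 GB as 800 one-block 128 MB files
small = namenode_heap(100_000_000)             # ~28 GB of heap for 100 GB of data
large = namenode_heap(800)                     # a few hundred KB
print(f"{small / 1024**3:.0f} GB vs {large / 1024:.0f} KB")
```

The same 100 GB of data costs five orders of magnitude more NameNode memory when stored as 1 KB files—which is why the small-files problem bites long before the disks fill up.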
When NOT to use HDFS:
| Workload | Why HDFS Is Unsuitable | Better Alternative |
|---|---|---|
| Interactive queries requiring <100ms latency | NameNode lookup overhead, no caching | Apache Ignite, ClickHouse |
| Millions of small files (<1 MB) | NameNode memory exhaustion | Object stores, HBase |
| Random read/write workloads | Block-level access, no byte-level updates | PostgreSQL, MongoDB |
| Real-time streaming writes | Batch-oriented, high latency | Kafka, Pulsar |
| Mutable data with ACID transactions | Append-only, no transactions | Delta Lake, Apache Iceberg |
HDFS excels when your workload matches its assumptions: large files, write-once/read-many, sequential access, and batch processing. For other patterns, consider purpose-built systems.
We've explored the architecture that enables HDFS to store petabytes of data reliably on commodity hardware. Let's consolidate the key concepts:

- A single NameNode holds all metadata in memory; DataNodes store the blocks, and all data moves directly between clients and DataNodes.
- Files are split into large blocks (128 MB by default) to keep metadata small and favor sequential I/O.
- Every block is replicated three times with rack-aware placement, and the NameNode automatically re-replicates after failures.
- Writes flow through a DataNode pipeline; files are write-once, which keeps consistency simple.
- HA (JournalNodes + ZKFC) removes the NameNode as a single point of failure, and Federation scales the namespace horizontally.
What's Next:
HDFS pioneered distributed storage for big data, but it's not the only approach. In the next page, we'll explore Ceph, a unified storage platform that provides object, block, and file interfaces with a fundamentally different architecture—eliminating the single-point-of-failure metadata server entirely through distributed hashing.
You now understand HDFS's architecture at a depth sufficient for system design discussions. You can explain master-slave separation, block replication, fault tolerance mechanisms, and the trade-offs that make HDFS optimal for batch processing at scale.