In 2003, Google published a paper describing the Google File System (GFS), a distributed file system designed to store petabytes of data across thousands of commodity servers. This paper inspired Doug Cutting and Mike Cafarella to create an open-source implementation called the Hadoop Distributed File System (HDFS), which would go on to become the foundation of the big data revolution.
HDFS fundamentally changed how organizations think about data storage. Before HDFS, storing petabytes of data required expensive, specialized hardware—SAN arrays, NAS appliances, and custom high-availability systems costing millions of dollars. HDFS proved that you could achieve similar reliability and far greater scalability using thousands of cheap commodity servers, accepting that failures are inevitable and designing the system to handle them gracefully.
Today, HDFS remains the backbone of countless data platforms, powering analytics at companies like Facebook (storing over 300 petabytes), Yahoo (40+ petabytes), and thousands of enterprises worldwide. Understanding HDFS architecture isn't just academically interesting—it's essential knowledge for anyone designing systems that must handle massive scale.
By the end of this page, you will understand HDFS's master-slave architecture, the role of NameNodes and DataNodes, how data is distributed and replicated for fault tolerance, the write and read paths in detail, federation and high availability mechanisms, and the trade-offs that make HDFS optimal for batch processing workloads.
Before diving into architecture, we must understand the assumptions and design goals that shaped HDFS. These aren't arbitrary choices—they reflect deep insights about hardware economics, failure patterns, and workload characteristics.
HDFS was designed with specific assumptions:

- Hardware failure is the norm, not the exception—in a cluster of thousands of commodity nodes, some component is always failing.
- Workloads are dominated by streaming, sequential reads of large files, not low-latency random access.
- Files are large (hundreds of megabytes to terabytes), and the file count is modest relative to total capacity.
- The access pattern is write-once, read-many: files are written once, closed, and then read repeatedly.
- Moving computation to the data is cheaper than moving data to the computation.
These assumptions have profound implications. HDFS is not a general-purpose file system. It excels at batch processing workloads but is poorly suited for interactive queries, random writes, or applications requiring millions of small files. Choosing HDFS for the wrong workload leads to frustration; understanding its design philosophy prevents this mistake.
The core insight behind HDFS:
Traditional storage systems tried to prevent failures through expensive, specialized hardware—redundant disk controllers, battery-backed caches, and enterprise-grade components. This approach doesn't scale economically to petabyte levels.
HDFS takes a fundamentally different approach: accept that failures are inevitable and build recovery into the software layer. Using commodity hardware and aggressive replication, HDFS achieves reliability through redundancy rather than prevention. A three-way replica on a $500 commodity server is more cost-effective than a single copy on a $50,000 enterprise array—and often more reliable because failures are truly independent.
HDFS employs a master-slave architecture with two primary component types: a single NameNode (the master) and multiple DataNodes (the slaves). This separation of concerns is fundamental to understanding how HDFS operates.
Key architectural insight: The NameNode never handles actual data—it only manages metadata. All data transfers happen directly between clients and DataNodes. This design prevents the NameNode from becoming a throughput bottleneck, allowing the cluster to scale data bandwidth by simply adding more DataNodes.
Let's examine each component in detail:
| Component | Responsibility | State Characteristics |
|---|---|---|
| NameNode | Manages namespace, file-to-block mapping, block locations | All state in memory; persists metadata to edit log and fsimage |
| Secondary NameNode | Periodically merges edit log with fsimage (checkpointing) | Not a failover node; reduces NameNode recovery time |
| DataNode | Stores and retrieves data blocks on local disk | Stateless from NameNode perspective; reports blocks on startup |
| Client | Initiates reads/writes, handles data pipeline logic | Uses HDFS client library (Java, C, or WebHDFS) |
The NameNode is the central coordinator of an HDFS cluster. It maintains the entire file system namespace—the directory tree, file attributes, and the mapping from files to blocks. Understanding the NameNode's internal structure is crucial for capacity planning and troubleshooting.
NameNode's In-Memory Data Structures:
The NameNode keeps the entire block map in memory. For a cluster with 100 million blocks, this requires significant RAM (roughly 150-200 bytes per block). This design enables O(1) lookup for any block location but limits cluster size by NameNode memory. A typical NameNode with 128GB RAM can manage ~400 million blocks, which at 128MB block size equals ~50PB of data.
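The sizing arithmetic above can be checked with a short script. The 150 bytes per block entry and the 50% heap share reserved for the block map are rough working assumptions used on this page, not HDFS constants:

```python
def max_blocks(ram_bytes, bytes_per_block_entry=150, metadata_fraction=0.5):
    """Rough estimate of how many blocks a NameNode heap can track.

    metadata_fraction hedges for the namespace tree, JVM overhead, etc.
    """
    return int(ram_bytes * metadata_fraction / bytes_per_block_entry)

def addressable_capacity(num_blocks, block_size=128 * 1024**2):
    """Raw data (one replica) addressable by that many block entries."""
    return num_blocks * block_size

ram = 128 * 1024**3                       # 128 GB NameNode heap
blocks = max_blocks(ram)                  # ~450 million block entries
petabytes = addressable_capacity(blocks) / 1024**5
print(f"{blocks / 1e6:.0f}M blocks, ~{petabytes:.0f} PB addressable")
```

With these assumptions the script lands in the same ballpark as the text: a few hundred million blocks and roughly 50 PB per NameNode.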
Metadata Persistence Mechanism:
The NameNode must persist metadata durably to survive restarts. It uses a combination of periodic checkpointing and continuous journaling:
On each metadata operation: The change is first written to the edit log (WAL pattern) before being applied to in-memory structures.
Periodically: The Secondary NameNode (or Checkpoint Node) fetches the FsImage and recent edit logs, applies the edits to create a new FsImage, and returns it to the NameNode.
On NameNode restart: It loads the latest FsImage, replays all edit logs since that checkpoint, and waits for DataNodes to report their blocks. Only after receiving a sufficient number of block reports does the NameNode leave "safe mode" and begin normal operations.
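The journal-then-apply (WAL) pattern described above can be sketched in a few lines. `MiniNameNode` and its list-backed journal are toy stand-ins for illustration, not the real implementation:

```python
import json

class MiniNameNode:
    """Toy write-ahead-log pattern: journal first, then mutate memory."""
    def __init__(self):
        self.namespace = {}     # in-memory state (the FsImage equivalent)
        self.edit_log = []      # durable journal (a list stands in for disk)

    def mkdir(self, path):
        entry = {"op": "mkdir", "path": path}
        self.edit_log.append(json.dumps(entry))   # 1) journal the edit
        self.namespace[path] = {}                 # 2) apply it in memory

    def checkpoint(self):
        """Fold current state into a new FsImage and truncate the journal."""
        fsimage = dict(self.namespace)
        self.edit_log.clear()
        return fsimage

nn = MiniNameNode()
nn.mkdir("/user")
nn.mkdir("/user/data")
print(len(nn.edit_log), nn.checkpoint())   # → 2 {'/user': {}, '/user/data': {}}
```

Because the edit is journaled before it is applied, a crash between the two steps loses nothing: replaying the journal on restart reproduces the in-memory state.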
```
# NameNode Startup Sequence
1. Load FsImage into memory
   └─ Contains: namespace tree, file metadata, block IDs
   └─ Does NOT contain: block locations (DataNode mapping)
2. Replay Edit Log
   └─ Apply all transactions since checkpoint
   └─ Rebuild current in-memory state
3. Enter Safe Mode
   └─ Read-only operation allowed
   └─ Block reports received from DataNodes
   └─ Block map reconstructed progressively
4. Check Block Health
   └─ Count blocks with minimum replicas
   └─ Threshold: 99.9% of blocks must have ≥1 replica
5. Leave Safe Mode
   └─ Full read/write operations enabled
   └─ Background replication begins for under-replicated blocks
```

DataNodes are the workhorses of HDFS—they store the actual data blocks and handle all data transfer operations. Understanding how DataNodes manage storage is essential for capacity planning and performance tuning.
Block Storage Model:
When a file is written to HDFS, it's split into fixed-size blocks (default 128MB, configurable per file). Each block is stored as a separate file on the DataNode's local file system, along with a corresponding metadata file containing checksums.
| Version | Default Block Size | Rationale |
|---|---|---|
| HDFS 1.x | 64 MB | Original design, balanced memory overhead vs. locality |
| HDFS 2.x+ | 128 MB | Increased to reduce NameNode memory pressure at scale |
| Large Clusters | 256 MB or 512 MB | Custom configuration for massive files to minimize metadata |
Why large blocks?
HDFS uses large blocks compared to traditional file systems (4KB-64KB) for several important reasons:
Reduced NameNode memory: Each block requires ~150 bytes in NameNode memory. A 1TB file at 128MB blocks = 8,192 blocks; at 4KB blocks = 268 million blocks.
Optimized for sequential reads: Large blocks minimize seek overhead. MapReduce jobs typically scan entire blocks sequentially.
Lower network overhead: Fewer block transfers mean fewer TCP connections and metadata lookups.
Efficient replication: Replicating larger chunks is more efficient than many small transfers.
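The NameNode-memory point is easy to check numerically (150 bytes per block entry is the same rough estimate used earlier on this page):

```python
import math

def block_count(file_size, block_size):
    """Number of HDFS blocks needed to store one file."""
    return math.ceil(file_size / block_size)

one_tb = 1024**4

for label, bs in [("128 MB", 128 * 1024**2), ("4 KB", 4 * 1024)]:
    n = block_count(one_tb, bs)
    # ~150 bytes of NameNode heap per block entry (rough estimate)
    print(f"{label:>7} blocks: {n:>12,} entries, ~{n * 150 / 1024**2:,.0f} MB of heap")
```

A single 1 TB file costs about a megabyte of NameNode heap at 128 MB blocks, versus tens of gigabytes at 4 KB blocks—a difference of more than four orders of magnitude.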
DataNode Operations:
DataNodes continuously verify stored blocks through a 'block scanner' that reads all blocks and validates checksums over a configurable period (default: 3 weeks). This proactive scanning catches disk corruption before it's discovered during reads, giving the NameNode time to create additional replicas.
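A sketch of the scanner's checksum verification. It follows the real layout only loosely—HDFS stores a CRC32 checksum per small chunk of data (default 512 bytes, `dfs.bytes-per-checksum`) in a sidecar `.meta` file:

```python
import zlib

CHUNK = 512  # HDFS checksums data in small chunks (dfs.bytes-per-checksum)

def checksums(block: bytes):
    """Compute a CRC32 per chunk, as stored in the block's .meta file."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def scan(block: bytes, stored):
    """Return indices of corrupted chunks (empty list means block is clean)."""
    return [i for i, (a, b) in enumerate(zip(checksums(block), stored)) if a != b]

data = bytes(range(256)) * 8                     # a 2 KB toy "block"
meta = checksums(data)
assert scan(data, meta) == []                    # clean block passes
corrupted = data[:600] + b"\x00" + data[601:]    # flip one byte in chunk 1
print("corrupt chunks:", scan(corrupted, meta))  # → [1]
```

Per-chunk checksums let the scanner pinpoint corruption without rereading the whole block, and let reads detect bit rot on the fly.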
Replication is HDFS's primary mechanism for ensuring data durability and availability. By default, every block is replicated to three DataNodes. The placement of these replicas follows a carefully designed policy that balances reliability, write bandwidth, and read performance.
Rack-Aware Replica Placement:
HDFS assumes that nodes are organized into racks, and that network bandwidth within a rack is higher than across racks. The default placement policy for three replicas is:

1. First replica: on the writer's node if it is a DataNode, otherwise on a randomly chosen node.
2. Second replica: on a node in a different rack (off-rack).
3. Third replica: on a different node in the same rack as the second replica.
Why this specific placement?
The placement policy optimizes multiple objectives:
| Objective | How Policy Addresses It |
|---|---|
| Rack failure tolerance | Two racks contain replicas; one full rack can fail |
| Write performance | Only one cross-rack hop; second and third replicas share rack |
| Read performance | High probability of finding replica on same rack as reader |
| Data locality | First replica placed on writer's node when possible |
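The policy in the table can be sketched in a few lines. The cluster map and node names are hypothetical, and the real `BlockPlacementPolicyDefault` also weighs disk usage and node load:

```python
import random

def place_replicas(cluster, writer_node):
    """Pick 3 DataNodes per the default policy: 1st on the writer's node,
    2nd on a different rack, 3rd on the 2nd replica's rack (other node)."""
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second = random.choice(cluster[remote_rack])
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    return [writer_node, second, third]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas(cluster, "dn2"))  # e.g. ['dn2', 'dn5', 'dn4']
```

Note the outcome guaranteed by construction: exactly two racks hold replicas, and no node holds two copies of the same block.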
Replication Pipeline:
When writing a block, HDFS uses a pipeline replication strategy rather than parallel writes:

1. The client streams data packets to the first DataNode in the pipeline.
2. The first DataNode persists each packet and forwards it to the second DataNode.
3. The second DataNode does the same, forwarding to the third.
4. Acknowledgments flow back through the pipeline in reverse; the client considers a packet durable once the ack from the full pipeline arrives.
This pipeline approach efficiently uses network bandwidth—the client only needs bandwidth to reach one DataNode, and the replication work is distributed.
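A toy simulation of this hop-by-hop flow. It is deliberately simplified: real DataNodes persist and forward concurrently, and handle timeouts and pipeline recovery:

```python
def pipeline_write(packet, datanodes, store):
    """Forward a packet hop-by-hop; the ack propagates back only after
    every downstream node has stored it (simplified: no failures)."""
    if not datanodes:
        return True                     # end of pipeline: ack flows back
    head, *rest = datanodes
    store.setdefault(head, []).append(packet)    # persist locally...
    return pipeline_write(packet, rest, store)   # ...then forward downstream

store = {}
packets = [b"pkt0", b"pkt1"]
acked = all(pipeline_write(p, ["dn1", "dn2", "dn3"], store) for p in packets)
print(acked, {dn: len(pkts) for dn, pkts in store.items()})
# → True {'dn1': 2, 'dn2': 2, 'dn3': 2}
```

Each packet ends up on all three nodes, yet the client only ever sent it once—the key bandwidth saving of pipelining over client-driven parallel writes.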
When the NameNode detects an under-replicated block (via missing heartbeats or block reports), it schedules re-replication to another DataNode. The priority is based on how under-replicated the block is: blocks with only one replica are replicated before those with two. This automatic recovery is how HDFS tolerates DataNode failures without data loss.
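A minimal sketch of this prioritization using a heap (the real NameNode tracks several priority levels, e.g. distinguishing blocks with no live replica at all):

```python
import heapq

def replication_queue(block_replicas, target=3):
    """Order under-replicated blocks: fewest live replicas first."""
    heap = [(live, blk) for blk, live in block_replicas.items() if live < target]
    heapq.heapify(heap)
    return [blk for _, blk in (heapq.heappop(heap) for _ in range(len(heap)))]

# blk_c has a single surviving replica, so it is re-replicated first
print(replication_queue({"blk_a": 2, "blk_b": 3, "blk_c": 1}))  # → ['blk_c', 'blk_a']
```

Fully replicated blocks (`blk_b`) never enter the queue; the most endangered block is always scheduled first.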
Understanding the complete data flow for reads and writes reveals why HDFS achieves high throughput and how it maintains consistency. Let's trace both paths step by step.
HDFS Write Path (Creating a New File):
1. Create: The client asks the NameNode to create /user/data/file.txt. The NameNode checks permissions, validates the path doesn't already exist, and creates an entry in the namespace.
2. Allocate: As the client accumulates data, it asks the NameNode for a new block; the NameNode returns a block ID and a list of target DataNodes chosen by the placement policy.
3. Stream: The client streams the block through the DataNode pipeline in packets, collecting acknowledgments as described above.
4. Complete: When the file is closed, the client flushes any remaining packets and notifies the NameNode, which commits the file once the minimum replica count is acknowledged.

HDFS provides write-once semantics. Once a file is closed, it cannot be modified—only appended to (support added in Hadoop 2.x). This simplifies consistency and enables aggressive caching.
The original HDFS design had a critical weakness: a single NameNode was a single point of failure. If the NameNode crashed, the entire cluster was unavailable until manual recovery. HDFS 2.0 introduced High Availability (HA) to address this limitation.
NameNode HA Architecture:
HDFS HA provides two NameNodes in an active-standby configuration:

- Active NameNode: serves all client requests and writes every namespace change to a shared edit log.
- Standby NameNode: continuously tails the shared edit log, keeping a hot copy of the namespace in memory.
- JournalNodes: a quorum of lightweight daemons (typically 3 or 5) that store the shared edit log; an edit must reach a majority to be committed.
- DataNodes: send heartbeats and block reports to both NameNodes, so the Standby's block map is always current.
- ZooKeeper Failover Controllers (ZKFC): one per NameNode; monitor health and coordinate automatic failover through ZooKeeper.
Failover Process:

1. The Active NameNode's ZKFC detects a failure via its local health check, or the Active's ZooKeeper session expires.
2. The Standby's ZKFC acquires the active-state lock in ZooKeeper.
3. The old Active is fenced so it can no longer mutate shared state.
4. The Standby applies any remaining edits from the JournalNodes and transitions to Active.
Typical failover time: 30-60 seconds (configurable), compared to hours for manual recovery in non-HA configurations.
Split-brain—where both NameNodes believe they're Active—would cause catastrophic data corruption. HDFS uses 'fencing' mechanisms to guarantee the old Active is truly stopped before promoting the Standby. Common fencing methods include SSH-based process killing, STONITH (Shoot The Other Node In The Head) via power control, or preventing the old Active from writing to JournalNodes.
Even with HA, the NameNode remains a scalability bottleneck—all namespace operations funnel through a single NameNode's memory. For clusters with hundreds of millions of files, this hits hard limits. HDFS Federation addresses this through horizontal namespace scaling.
Federation Architecture:
Federation allows multiple independent NameNodes to share datanodes, each managing a portion of the namespace:
| Concept | Description |
|---|---|
| Namespace Volume | Each NameNode manages an independent namespace (directory tree). No coordination between NameNodes. |
| Block Pool | Blocks belonging to a single namespace. Each NameNode has its own block pool. |
| DataNode | Stores blocks from ALL block pools, serving all NameNodes. Single DataNode, multiple pools. |
| Client ViewFS | Client-side mount table mapping paths to NameNodes. Creates unified namespace view. |
```xml
<!-- Client-side mount table configuration -->
<configuration>
  <property>
    <name>fs.viewfs.mounttable.default.link./user</name>
    <value>hdfs://namenode1:8020/user</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./data</name>
    <value>hdfs://namenode2:8020/data</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.default.link./logs</name>
    <value>hdfs://namenode3:8020/logs</value>
  </property>
</configuration>
```

Benefits of Federation:

- Namespace scalability: total file and block counts grow with the number of NameNodes rather than being capped by one NameNode's heap.
- Performance isolation: heavy metadata traffic on one namespace (e.g., /logs) doesn't slow operations on another.
- Higher aggregate metadata throughput: namespace operations are spread across independent NameNodes.
Limitations:

- No rebalancing across namespaces: the split is static; files don't migrate between NameNodes automatically.
- Client complexity: every client needs a consistent mount-table configuration (ViewFS) to see a unified namespace.
- Availability is per-namespace: each NameNode remains a single point of failure for its own namespace unless it is also deployed with HA.
- Operational overhead: more NameNodes to provision, monitor, and upgrade.
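To make the ViewFS idea concrete, here is a rough sketch of client-side resolution via longest-prefix matching against the mount table shown earlier (the paths and NameNode URIs are the same placeholders, not a real deployment):

```python
def resolve(path, mounts):
    """Longest-prefix match of a path against a ViewFS-style mount table."""
    best = max((m for m in mounts if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise FileNotFoundError(f"no mount covers {path}")
    return mounts[best] + path[len(best):]

mounts = {
    "/user": "hdfs://namenode1:8020/user",
    "/data": "hdfs://namenode2:8020/data",
    "/logs": "hdfs://namenode3:8020/logs",
}
print(resolve("/data/2024/events.parquet", mounts))
# → hdfs://namenode2:8020/data/2024/events.parquet
```

The routing happens entirely on the client; no NameNode knows about any namespace but its own.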
HDFS's design choices optimize for specific workloads at the expense of others. Understanding these trade-offs prevents misapplication and guides technology selection.
Each file, directory, and block consumes ~150 bytes in NameNode memory. With millions of small files (e.g., storing 1KB files), you hit NameNode memory limits long before exhausting DataNode storage. Solutions include HAR files (Hadoop Archives), SequenceFiles (container format), or using HBase for small object storage.
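The small-files arithmetic, using the same rough 150-byte-per-object estimate and treating each file as one namespace entry plus one block entry:

```python
def namenode_heap(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap use: one entry per file + one per block."""
    return num_files * (1 + blocks_per_file) * bytes_per_object

# 100 million 1 KB files vs. the same 100 GB as 800 one-block 128 MB files
small = namenode_heap(100_000_000)             # ~28 GB of heap for 100 GB of data
large = namenode_heap(800)                     # a few hundred KB
print(f"{small / 1024**3:.0f} GB vs {large / 1024:.0f} KB")
```

The same 100 GB of data costs five orders of magnitude more NameNode memory when stored as 1 KB files—which is why the small-files problem bites long before the disks fill up.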
When NOT to use HDFS:
| Workload | Why HDFS Is Unsuitable | Better Alternative |
|---|---|---|
| Interactive queries requiring <100ms latency | NameNode lookup overhead, no caching | Apache Ignite, ClickHouse |
| Millions of small files (<1 MB) | NameNode memory exhaustion | Object stores, HBase |
| Random read/write workloads | Block-level access, no byte-level updates | PostgreSQL, MongoDB |
| Real-time streaming writes | Batch-oriented, high latency | Kafka, Pulsar |
| Mutable data with ACID transactions | Append-only, no transactions | Delta Lake, Apache Iceberg |
HDFS excels when your workload matches its assumptions: large files, write-once/read-many, sequential access, and batch processing. For other patterns, consider purpose-built systems.
We've explored the architecture that enables HDFS to store petabytes of data reliably on commodity hardware. Let's consolidate the key concepts:

- A single NameNode holds all metadata in memory; DataNodes store the blocks, and all data moves directly between clients and DataNodes.
- Files are split into large blocks (128 MB by default) to keep metadata small and favor sequential I/O.
- Every block is replicated three times with rack-aware placement, and the NameNode automatically re-replicates after failures.
- Writes flow through a DataNode pipeline; files are write-once, which keeps consistency simple.
- HA (JournalNodes + ZKFC) removes the NameNode as a single point of failure, and Federation scales the namespace horizontally.
What's Next:
HDFS pioneered distributed storage for big data, but it's not the only approach. In the next page, we'll explore Ceph, a unified storage platform that provides object, block, and file interfaces with a fundamentally different architecture—eliminating the single-point-of-failure metadata server entirely through distributed hashing.
You now understand HDFS's architecture at a depth sufficient for system design discussions. You can explain master-slave separation, block replication, fault tolerance mechanisms, and the trade-offs that make HDFS optimal for batch processing at scale.