When you type cat /data/reports/quarterly-sales.csv on a local system, the kernel follows a well-defined path: it traverses directory entries from the root, resolves the inode, finds the disk block addresses, and reads the data. The whole process is deterministic, fast, and entirely local.
But in a distributed file system, /data/reports/quarterly-sales.csv could exist on any of thousands of machines. The file might be split across multiple nodes. There might be copies on different continents. The machine that held the file yesterday might be dead today.
How does a DFS translate a human-readable file path into the physical locations where data actually resides? This is the problem of naming and location in distributed file systems—and solving it elegantly is essential to creating the illusion of a unified filesystem.
By the end of this page, you will understand how distributed file systems implement naming services, achieve location transparency, and handle the mapping between logical file paths and physical data locations. You'll see the mechanisms that make distributed storage appear as seamless as local storage.
Naming in distributed file systems refers to the process of identifying and locating resources—files, directories, and their constituent data blocks—across a network of machines. This problem is fundamentally more complex than local naming: files are spread across many machines, they are split into blocks and replicated, and the nodes that hold them can fail, move, or be replaced at any time.
The naming hierarchy:
Distributed file systems typically implement a multi-level naming hierarchy:
| Level | Name Type | Example | Purpose |
|---|---|---|---|
| User-Level | Path name | /data/users/alice/doc.txt | Human-readable identifier |
| System-Level | File identifier | file_id: 0x7A3B2C1D | Unique within namespace |
| Storage-Level | Block/chunk ID | chunk: 0x7A3B2C1D-0001 | Identifies data units |
| Physical-Level | Location address | node12:/disk3/block_1a2b | Actual storage location |
The naming service is the component responsible for translating between these levels. When a client requests /data/users/alice/doc.txt, the naming service must resolve the path to a file identifier, map that identifier to its constituent blocks or chunks, and determine which physical nodes currently store each of them.
This translation must happen for every file access—making the naming service a critical performance bottleneck if not designed carefully.
It's crucial to distinguish between naming (what something is called) and location (where it physically resides). Good DFS design separates these concerns: names remain stable even as physical locations change. This separation enables transparent migration, replication, and failure recovery without changing how clients reference files.
Location transparency is a fundamental property of distributed file systems: clients access files without knowing or caring about their physical location. The file path /data/report.txt works identically whether the file is stored locally, on a server across the room, or on a node in a different data center.
Degrees of transparency:
Distributed systems provide varying degrees of transparency, each hiding more complexity from the user: access transparency (remote files are used with the same operations as local ones), location transparency (names reveal nothing about where data lives), migration transparency (data can move without names changing), and replication transparency (clients never see how many copies exist or where they are).
Implementing location transparency:
Location transparency is achieved through indirection—instead of encoding physical locations in file names, the system maintains a separate mapping from names to locations. This mapping can be updated independently of the names themselves.
Without Location Transparency:
Path: //server12.datacenter-west.company.com/disk3/partition2/files/report.txt
Problem: If server12 fails or file moves, path is invalid
With Location Transparency:
Path: /data/report.txt
System lookup: /data/report.txt → [node7, node12, node23] (replicas)
Client connects to one of the returned nodes
Benefit: Path remains valid even if underlying nodes change
The indirection layer—typically the metadata service—absorbs all location changes, presenting a stable interface to clients.
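The mechanics of this indirection fit in a few lines. Below is a minimal, hypothetical sketch (the node names are invented) showing how a registry maps a stable logical path to a mutable replica list, so migration updates the mapping without ever touching the name clients use:

```python
# Minimal sketch of name-to-location indirection (illustrative only).
# The mapping layer can change where data lives without changing the
# name that clients use.

class LocationRegistry:
    """Maps stable logical paths to a mutable set of replica nodes."""

    def __init__(self):
        self._replicas = {}  # logical path -> list of node addresses

    def register(self, path, nodes):
        self._replicas[path] = list(nodes)

    def lookup(self, path):
        # Clients call this instead of embedding node names in the path.
        return list(self._replicas[path])

    def migrate(self, path, old_node, new_node):
        # Physical location changes; the logical name stays the same.
        nodes = self._replicas[path]
        self._replicas[path] = [new_node if n == old_node else n for n in nodes]


registry = LocationRegistry()
registry.register("/data/report.txt", ["node7", "node12", "node23"])
print(registry.lookup("/data/report.txt"))   # ['node7', 'node12', 'node23']

# node12 is decommissioned; its replica moves to node30 -- clients are unaffected
registry.migrate("/data/report.txt", "node12", "node30")
print(registry.lookup("/data/report.txt"))   # ['node7', 'node30', 'node23']
```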
How a distributed file system organizes its namespace—the hierarchical structure of directories and files—has profound implications for scalability, performance, and administration. Several organizational strategies exist:
Strategy 1: Unified Global Namespace
All clients see a single, consistent directory tree regardless of which server they contact. The entire namespace is logically centralized (though physically distributed).
Strategy 2: Federated Namespace
The namespace is divided into independent sub-namespaces, each managed by a separate metadata server. A federation layer unifies them.
Strategy 3: Mount-Based Integration
Distributed storage is 'mounted' at specific points in a local filesystem tree. Paths below the mount point are handled by the DFS.
Local filesystem:
/
├── bin/
├── home/
└── mnt/
└── dfs/ ← Mount point
├── data/ ← These paths go to DFS
└── users/
This is the traditional approach used by NFS and many POSIX-compliant DFS implementations.
Strategy 4: Object/Flat Namespace
No hierarchical directory structure—files are addressed by unique keys in a flat namespace. Directories are simulated or don't exist.
Object storage:
Bucket: my-data
├── reports/q1/sales.csv (just a key, not a path)
├── reports/q2/sales.csv
└── users/alice/profile.json
No actual directory traversal—keys are matched as strings.
This is the approach of S3 and most object storage systems.
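To make the contrast concrete, here is a small illustrative sketch (with made-up keys) of how a flat key space simulates directory listings purely by prefix matching on strings:

```python
# Sketch: "directories" in a flat object namespace are just key-prefix filters.
# Bucket contents are a set of string keys; there is no directory tree to walk.

bucket = {
    "reports/q1/sales.csv": b"...",
    "reports/q2/sales.csv": b"...",
    "users/alice/profile.json": b"...",
}

def list_prefix(store, prefix, delimiter="/"):
    """Simulate a directory listing by string-matching key prefixes."""
    seen = set()
    for key in store:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Collapse anything past the next delimiter into a pseudo-directory.
        seen.add(rest.split(delimiter, 1)[0] + (delimiter if delimiter in rest else ""))
    return sorted(seen)

print(list_prefix(bucket, "reports/"))  # ['q1/', 'q2/']
print(list_prefix(bucket, ""))          # ['reports/', 'users/']
```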
HDFS Federation addresses the single-NameNode bottleneck by partitioning the namespace horizontally. Each NameNode manages an independent portion of the namespace (a 'namespace volume'). For example, NameNode1 handles /user/*, NameNode2 handles /data/*. Clients determine which NameNode to contact based on the path prefix. This scales metadata capacity linearly with NameNodes but requires careful planning of namespace partitioning.
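The routing decision itself is simple. The sketch below shows a client-side mount table that picks a metadata server by longest matching path prefix, which is roughly what HDFS clients do with a ViewFs mount table under federation (the server names here are hypothetical):

```python
# Sketch of client-side routing under a federated namespace (illustrative).
# Each path prefix maps to the metadata server responsible for that
# namespace volume.

MOUNT_TABLE = {
    "/user": "nn1.example.com:8020",
    "/data": "nn2.example.com:8020",
}

def route(path):
    """Pick the metadata server by longest matching path prefix."""
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise LookupError(f"No namespace volume mounted for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(route("/user/alice/doc.txt"))   # nn1.example.com:8020
print(route("/data/reports/q1.csv"))  # nn2.example.com:8020
```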
Name resolution is the process of translating a path name into the information needed to access the file's data. In distributed systems, this is more complex than local systems because each component of the path might reside on a different server.
Resolution approaches:
Iterative resolution: the client resolves one path component at a time. /a/b/c requires three round trips: resolve /a, then /a/b, then /a/b/c. High latency but simple implementation.
Recursive (server-side) resolution: the client sends the full path and the naming server walks it internally, returning the answer in a single round trip.
Cached resolution: the client checks a local cache of previously resolved paths and only contacts the naming server on a miss.
Computed resolution: no lookup happens at all; the location is derived from the name itself, typically by hashing (the approach used by object stores and by Ceph's CRUSH, discussed below).
HDFS resolution example:
Let's trace how HDFS resolves the path /user/alice/data/report.csv:
1. The client sends the full path to the NameNode in a single call.
2. The NameNode walks its in-memory namespace: user (inode 2) → alice (inode 47) → data (inode 1023) → report.csv (inode 5612).
3. For each of the file's blocks, it returns the current replica locations:
   blk_1001: [DataNode4:50010, DataNode7:50010, DataNode12:50010]
   blk_1002: [DataNode2:50010, DataNode9:50010, DataNode15:50010]
The entire namespace traversal happens in memory on the NameNode—that's why NameNode memory is a critical resource.
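As a back-of-the-envelope example, a commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file, directory, or block). A namespace of 100 million files averaging one block each is therefore on the order of 200 million objects, or about 30 GB of heap just for metadata. The Python sketch below contrasts the resolution strategies discussed above (iterative, recursive, cached, and computed) in one conceptual naming service: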
```python
# Conceptual name resolution in a DFS

class NameResolutionService:
    """
    Demonstrates different name resolution strategies.
    """

    def __init__(self, namespace_tree, block_locations):
        """
        namespace_tree: hierarchical structure of directories/files
        block_locations: mapping of block_id -> list of node addresses
        """
        self.namespace = namespace_tree
        self.locations = block_locations
        self.cache = {}  # Path -> resolved location cache

    def iterative_resolve(self, path: str) -> dict:
        """
        Iterative resolution: resolve each component separately.
        Simulates multiple round-trips to naming server.
        """
        components = path.strip("/").split("/")
        current = self.namespace["/"]
        for component in components:
            # Each step would be a network round-trip
            print(f"  Resolving /{component} in current directory...")
            if component not in current["children"]:
                raise FileNotFoundError(f"Path not found: {component}")
            current = current["children"][component]
        return self._get_locations(current)

    def recursive_resolve(self, path: str) -> dict:
        """
        Recursive resolution: server resolves entire path at once.
        Single round-trip, full path sent to server.
        """
        # Single call to naming service
        print(f"  Resolving full path: {path}")
        current = self.namespace["/"]
        for component in path.strip("/").split("/"):
            current = current["children"].get(component)
            if current is None:
                raise FileNotFoundError(path)
        return self._get_locations(current)

    def cached_resolve(self, path: str) -> dict:
        """
        Cached resolution: check cache before querying.
        """
        if path in self.cache:
            print(f"  Cache hit for: {path}")
            return self.cache[path]
        print(f"  Cache miss, performing lookup: {path}")
        result = self.recursive_resolve(path)
        self.cache[path] = result
        return result

    def computed_resolve(self, object_key: str, num_nodes: int = 100) -> list:
        """
        Computed resolution: hash-based location determination.
        No server lookup needed - location computed from key.
        """
        # Consistent hashing to determine storage nodes
        hash_value = hash(object_key) % num_nodes
        # Return primary and replica nodes
        primary = f"node_{hash_value}"
        replica1 = f"node_{(hash_value + 1) % num_nodes}"
        replica2 = f"node_{(hash_value + 2) % num_nodes}"
        print(f"  Computed locations for '{object_key}': {[primary, replica1, replica2]}")
        return [primary, replica1, replica2]

    def _get_locations(self, file_node: dict) -> dict:
        """Get block locations for a file node."""
        if file_node["type"] != "file":
            raise IsADirectoryError("Path is a directory")
        block_locs = {}
        for block_id in file_node["blocks"]:
            block_locs[block_id] = self.locations.get(block_id, [])
        return {
            "file_id": file_node["id"],
            "size": file_node["size"],
            "blocks": block_locs
        }


# Example namespace structure (simplified)
namespace = {
    "/": {
        "type": "dir",
        "children": {
            "data": {
                "type": "dir",
                "children": {
                    "report.csv": {
                        "type": "file",
                        "id": "file_001",
                        "size": 268435456,  # 256 MB
                        "blocks": ["blk_1001", "blk_1002"]
                    }
                }
            }
        }
    }
}
```

When a client opens a file, the DFS returns a file handle—a reference that the client uses for subsequent operations. The design of file handles has significant implications for system behavior.
What's in a file handle?
File handles typically contain information that allows reopening the file without full path resolution:
| Component | NFS v3 | HDFS | Ceph |
|---|---|---|---|
| File Identifier | File system ID + inode number | Block IDs + locations | Object ID + stripe layout |
| Version/Generation | Generation number | Block token | Epoch number |
| Security Context | None (auth at mount) | Delegation token | Capability bits |
| Location Info | Server address | DataNode addresses | OSD (Object Storage Daemon) map |
| Validity | Indefinite (may become stale) | Time-limited tokens | Leases with expiry |
Location binding strategies:
Early binding: Location is determined when the file is opened and embedded in the handle. Fast subsequent accesses but problematic if nodes fail or data migrates.
Late binding: Handle contains only the file identifier. Location is resolved on each access. More resilient to changes but higher overhead.
Hybrid binding: Handle contains a cached location that's verified on use. If verification fails, fresh resolution occurs.
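As a sketch of the hybrid strategy, the snippet below keeps a cached location in the handle, verifies it on use, and falls back to fresh resolution only when verification fails. The verification and resolution calls are stubbed out as plain functions, and all names are hypothetical:

```python
# Sketch of hybrid binding (illustrative): the handle carries a cached
# location that is verified on use; if verification fails, the client
# falls back to fresh resolution through the metadata service.

class FileHandle:
    def __init__(self, file_id, cached_node):
        self.file_id = file_id
        self.cached_node = cached_node  # early-bound hint, may go stale

def read_with_hybrid_binding(handle, node_is_alive, resolve_location):
    """node_is_alive and resolve_location stand in for real RPCs."""
    node = handle.cached_node
    if node is None or not node_is_alive(node, handle.file_id):
        # Late-binding fallback: ask the metadata service again.
        node = resolve_location(handle.file_id)
        handle.cached_node = node  # refresh the hint for next time
    return f"reading {handle.file_id} from {node}"

# Example wiring with stubbed-out services:
alive = lambda node, fid: node == "node7"
resolve = lambda fid: "node7"
h = FileHandle("file_001", cached_node="node12")    # stale hint
print(read_with_hybrid_binding(h, alive, resolve))  # falls back, reads from node7
```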
NFS file handle example:
NFS uses early binding with file handles that contain:
NFS File Handle (NFSv3):
├── fsid (32 bits) — Identifies the file system
├── fileid (64 bits) — Inode number
├── generation (32 bits) — Reuse counter (to detect deleted inodes)
└── Server determines remaining opaque data
Total: up to 64 bytes (configurable)
This handle uniquely identifies a file. If the client sends a handle for a deleted file, the generation number won't match, and the server returns ESTALE (stale handle).
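A hedged sketch of the generation check (not actual NFS server code) makes the ESTALE behavior concrete: the server compares the generation stored in the handle against the generation currently recorded for that inode number:

```python
# Sketch of the generation-number check behind ESTALE (illustrative).
# The inode table records the generation assigned when an inode number
# was last (re)used.

import errno

ESTALE = getattr(errno, "ESTALE", 116)  # 116 on Linux; not defined on every platform

inode_table = {5612: 3}  # inode number -> current generation

def lookup_by_handle(fileid, generation):
    current = inode_table.get(fileid)
    if current is None or current != generation:
        # The inode was deleted (and possibly its number reused for a new file).
        raise OSError(ESTALE, "Stale file handle")
    return f"inode {fileid}, generation {generation}: handle is valid"

print(lookup_by_handle(5612, 3))   # handle still valid
inode_table[5612] = 4              # file deleted, inode number reused
# lookup_by_handle(5612, 3) would now raise OSError: [Errno 116] Stale file handle
```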
A subtle issue with file handles: what happens if a file is renamed or moved while a client has it open? In NFS, the handle remains valid because it references the inode, not the path. But this creates a semantic surprise: a subsequent lookup of the original path finds nothing, yet the handle still works. Some DFS implementations invalidate handles on rename, forcing clients to re-resolve—a different tradeoff.
Traditional distributed file systems like NFS use mounting to integrate remote filesystems into the local namespace. This mechanism determines which remote resources are visible and where they appear in the local directory tree.
The mount operation:
Mounting establishes a binding between a local path (the mount point) and a remote file system (the export). After mounting:
# On NFS client:
mount -t nfs server:/exports/data /mnt/shared
# Now /mnt/shared/* accesses server:/exports/data/*
ls /mnt/shared/reports/ → Lists server:/exports/data/reports/
Server-side export configuration:
Servers define which directories are accessible remotely (exports) and to whom:
# /etc/exports (NFS server configuration)
/exports/data 192.168.1.0/24(rw,sync,no_root_squash)
/exports/public *(ro,async)
/exports/secure client1.example.com(rw,sec=krb5p)
Export options control the access mode (rw vs. ro), write durability (sync vs. async), how client identities are mapped (for example, root squashing), and the required security flavor (for example, sec=krb5p for Kerberos with integrity and privacy).
Automounting: Dynamic namespace construction
Automounting delays the actual mount until the path is accessed:
/etc/auto.master:
/projects /etc/auto.projects
/etc/auto.projects:
alpha -rw,soft fileserver:/projects/alpha
beta -rw,soft fileserver:/projects/beta
gamma -rw,soft fileserver:/projects/gamma
Client behavior:
$ ls /projects/
(empty or cached entries)
$ cd /projects/alpha
(automounter intercepts, mounts fileserver:/projects/alpha)
(access proceeds as if always mounted)
$ # After timeout (e.g., 5 minutes of inactivity)
(automounter unmounts, freeing resources)
This pattern scales to thousands of potential mount points without resource exhaustion.
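A toy model of that behavior, with an invented map and timings, might look like the following; a real automounter hooks into the kernel rather than being called explicitly, but the mount-on-access and unmount-on-idle logic is the same idea:

```python
# Sketch of automounter behavior (illustrative): mount on first access
# under a trigger point, unmount after an idle timeout.

import time

AUTO_MAP = {                       # key under /projects -> remote export
    "alpha": "fileserver:/projects/alpha",
    "beta":  "fileserver:/projects/beta",
}
IDLE_TIMEOUT = 300                 # seconds of inactivity before unmount
mounted = {}                       # key -> last access timestamp

def access(key):
    if key not in AUTO_MAP:
        raise FileNotFoundError(f"/projects/{key}")
    if key not in mounted:
        print(f"mounting {AUTO_MAP[key]} at /projects/{key}")
    mounted[key] = time.monotonic()    # record activity
    return f"/projects/{key}"

def expire_idle():
    now = time.monotonic()
    for key in [k for k, t in mounted.items() if now - t > IDLE_TIMEOUT]:
        print(f"unmounting /projects/{key} (idle)")
        del mounted[key]

access("alpha")   # triggers the mount
expire_idle()     # nothing idle yet; unmount would happen after the timeout
```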
Behind location transparency is a location service—the component that maintains and answers queries about where data resides. Different DFS designs implement location services differently.
Centralized Location Service (HDFS NameNode model):
The NameNode maintains an in-memory map of every block's locations:
// Conceptual NameNode data structures
class NameNode {
// File/directory namespace (persisted to EditLog)
Map<Path, INode> namespace;
// Block to locations mapping (reconstructed from DataNode reports)
Map<BlockId, List<DataNodeInfo>> blockLocations;
// Replicas and their states
Map<BlockId, Map<DataNodeId, ReplicaState>> replicaStates;
public LocatedBlocks getBlockLocations(Path file) {
INode inode = namespace.get(file);
List<BlockInfo> blocks = inode.getBlocks();
LocatedBlocks result = new LocatedBlocks();
for (BlockInfo block : blocks) {
List<DataNodeInfo> locations = blockLocations.get(block.getId());
result.add(new LocatedBlock(block, locations));
}
return result; // Return blocks with their current locations
}
}
Key insight: Block locations are not persisted. They're reconstructed at startup from DataNode block reports. This simplifies consistency (the DataNodes are authoritative about what they store) but extends startup time.
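A minimal sketch of that reconstruction, assuming a block report is simply the list of block IDs a DataNode currently holds (node and block names are made up):

```python
# Sketch of rebuilding the block-location map from DataNode block reports
# (illustrative). The namespace (which blocks belong to which file) is
# persisted; who currently stores each block is learned from the reports.

from collections import defaultdict

block_locations = defaultdict(set)   # block_id -> set of DataNode ids

def process_block_report(datanode_id, reported_blocks):
    """Called when a DataNode reports the blocks it currently holds."""
    # Drop stale entries for this DataNode, then record the fresh view.
    for holders in block_locations.values():
        holders.discard(datanode_id)
    for block_id in reported_blocks:
        block_locations[block_id].add(datanode_id)

process_block_report("dn4", ["blk_1001"])
process_block_report("dn7", ["blk_1001"])
process_block_report("dn2", ["blk_1002"])
print(dict(block_locations))
# {'blk_1001': {'dn4', 'dn7'}, 'blk_1002': {'dn2'}}  (set ordering may vary)
```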
Distributed Location Service (Ceph CRUSH model):
Ceph takes a radically different approach: there's no location lookup because locations are computed.
CRUSH (Controlled Replication Under Scalable Hashing) is a pseudo-random placement algorithm:
CRUSH Algorithm (simplified):
function CRUSH(object_name, cluster_map, replication_factor):
placement_group = hash(object_name) mod num_placement_groups
selected_osds = []
for r in range(replication_factor):
# Straw algorithm selects OSDs considering weights and failures
osd = straw_select(placement_group, r, cluster_map)
selected_osds.append(osd)
return selected_osds
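For readers who prefer runnable code, here is a hedged Python approximation of straw-style weighted selection (not Ceph's actual implementation): each candidate OSD draws a pseudo-random "straw" scaled by its weight, and the longest straw wins, so placement is deterministic, weight-proportional, and mostly stable when individual weights change:

```python
# Illustrative approximation of straw-style weighted selection; OSD names
# and weights are made up.

import hashlib
import math

OSD_WEIGHTS = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}

def _hash01(*parts):
    """Deterministic hash of the inputs, mapped into (0, 1]."""
    h = hashlib.sha256("/".join(map(str, parts)).encode()).hexdigest()
    return (int(h, 16) % (2**32) + 1) / 2**32

def straw_select(pg, replica_index, weights, exclude=()):
    best, best_straw = None, -math.inf
    for osd, weight in weights.items():
        if osd in exclude or weight <= 0:
            continue
        # log of a (0,1] draw is <= 0; dividing by a larger weight pulls it
        # toward 0, so heavier OSDs tend to win.
        straw = math.log(_hash01(pg, replica_index, osd)) / weight
        if straw > best_straw:
            best, best_straw = osd, straw
    return best

def crush(object_name, weights, replication_factor=3, num_pgs=128):
    pg = int(hashlib.sha256(object_name.encode()).hexdigest(), 16) % num_pgs
    chosen = []
    for r in range(replication_factor):
        chosen.append(straw_select(pg, r, weights, exclude=chosen))
    return chosen

print(crush("reports/q1/sales.csv", OSD_WEIGHTS))  # e.g. ['osd.2', 'osd.0', 'osd.3']
```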
Advantages: no location lookup service is needed—any client or OSD that holds the current cluster map can compute an object's placement locally, so there is no central location database to query, scale, or protect.
Disadvantages: placement is determined entirely by the algorithm, so individual objects cannot be pinned to specific nodes, and any change to the cluster map (adding, removing, or reweighting OSDs) deterministically remaps and physically moves a fraction of the data.
In a distributed system, data locations change constantly: nodes fail, new nodes are added, data is rebalanced, hot spots are migrated. The naming and location system must handle these changes gracefully.
Triggers for location changes include node failures, the addition or decommissioning of nodes, rebalancing of data across the cluster, and the migration of hot spots onto less loaded machines.
Notification mechanisms:
How do clients learn about location changes?
1. Server-initiated invalidation: The metadata server tracks which clients have cached what and sends explicit invalidation messages.
Example: NFS delegations
- Client A gets read delegation for file F
- Client B wants to write file F
- Server recalls A's delegation before allowing B's write
- A must flush caches and acknowledge before B proceeds
2. Client-side timeout/refresh: Clients consider cached locations valid only for a limited time, periodically refreshing.
Example: HDFS block locations
- Client caches block locations from NameNode
- Cache entries have implicit TTL (typically minutes)
- On access, if cached location fails, client re-queries NameNode
- NameNode returns current locations including any changes
3. Version-based validation: Locations include version numbers. Clients validate the version before using cached data.
Example: Optimistic concurrency
- Cached location: {block_id: X, version: 42, nodes: [A, B, C]}
- On access, client sends version to node
- If version matches, proceed; if not, refresh from metadata server
Well-designed DFS location systems degrade gracefully. If a cached location fails, the client tries alternate replicas before consulting the metadata server. If the metadata server is slow, stale caches still allow reads of unchanged files. This layered approach maximizes availability while eventually achieving consistency.
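A compact sketch of that layered fallback, with the metadata lookup and replica reads stubbed out as plain functions (all names hypothetical):

```python
# Sketch of the layered fallback described above (illustrative): use the
# cached replica list while it is fresh, try alternate replicas before
# bothering the metadata service, and only then refresh the cache.

import time

class BlockLocationCache:
    def __init__(self, metadata_lookup, ttl_seconds=300):
        self._lookup = metadata_lookup        # function: block_id -> [nodes]
        self._ttl = ttl_seconds
        self._cache = {}                      # block_id -> (timestamp, [nodes])

    def _locations(self, block_id, force_refresh=False):
        entry = self._cache.get(block_id)
        if force_refresh or entry is None or time.monotonic() - entry[0] > self._ttl:
            entry = (time.monotonic(), self._lookup(block_id))
            self._cache[block_id] = entry
        return entry[1]

    def read(self, block_id, try_read):
        """try_read(node, block_id) returns data or raises on failure."""
        for attempt in range(2):                      # cached pass, then refreshed pass
            for node in self._locations(block_id, force_refresh=(attempt == 1)):
                try:
                    return try_read(node, block_id)   # try each replica in turn
                except ConnectionError:
                    continue                          # replica down: try the next one
        raise IOError(f"all replicas unreachable for {block_id}")

# Example wiring with stubbed services:
def reader(node, blk):
    if node == "node7":
        raise ConnectionError("node7 unreachable")
    return f"{blk} from {node}"

cache = BlockLocationCache(lambda blk: ["node7", "node12", "node23"])
print(cache.read("blk_1001", reader))   # node7 fails, falls back to node12
```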
We've explored how distributed file systems translate human-readable paths into physical data locations. Let's consolidate the key concepts:
- Naming is layered: user-level paths map to system-level file identifiers, then to block or chunk IDs, and finally to physical storage locations.
- Location transparency comes from indirection: names stay stable while a separate mapping layer absorbs migration, replication, and failure.
- Namespaces can be organized as a unified global tree, a federation of volumes, mount-based integration, or a flat object key space.
- Name resolution can be iterative, recursive, cached, or computed, trading round trips against simplicity and flexibility.
- File handles bind names to locations early, late, or with a verified cache, each with different resilience and performance tradeoffs.
- Location services range from centralized in-memory maps (the HDFS NameNode) to purely computed placement (Ceph's CRUSH), and propagate changes through invalidation, TTLs, or version checks.
What's next:
Now that we understand how files are named and located, we'll explore caching strategies in distributed file systems. Caching is crucial for performance—but in a distributed system, caching introduces complex consistency challenges that we must carefully navigate.
You now understand how distributed file systems implement naming services and achieve location transparency. You can analyze different approaches to namespace organization, name resolution, and location tracking. Next, we'll see how caching accelerates distributed file access while managing consistency.