When you type cat /data/reports/quarterly-sales.csv on a local system, the kernel follows a well-defined path: it traverses directory entries from the root, resolves the inode, finds the disk block addresses, and reads the data. The whole process is deterministic, fast, and entirely local.
But in a distributed file system, /data/reports/quarterly-sales.csv could exist on any of thousands of machines. The file might be split across multiple nodes. There might be copies on different continents. The machine that held the file yesterday might be dead today.
How does a DFS translate a human-readable file path into the physical locations where data actually resides? This is the problem of naming and location in distributed file systems—and solving it elegantly is essential to creating the illusion of a unified filesystem.
By the end of this page, you will understand how distributed file systems implement naming services, achieve location transparency, and handle the mapping between logical file paths and physical data locations. You'll see the mechanisms that make distributed storage appear as seamless as local storage.
Naming in distributed file systems refers to the process of identifying and locating resources—files, directories, and their constituent data blocks—across a network of machines. This problem is fundamentally more complex than local naming: files are spread across many machines, they are split into blocks and replicated, and the nodes that hold them can fail, move, or be replaced at any time.
The naming hierarchy:
Distributed file systems typically implement a multi-level naming hierarchy:
| Level | Name Type | Example | Purpose |
|---|---|---|---|
| User-Level | Path name | /data/users/alice/doc.txt | Human-readable identifier |
| System-Level | File identifier | file_id: 0x7A3B2C1D | Unique within namespace |
| Storage-Level | Block/chunk ID | chunk: 0x7A3B2C1D-0001 | Identifies data units |
| Physical-Level | Location address | node12:/disk3/block_1a2b | Actual storage location |
The naming service is the component responsible for translating between these levels. When a client requests /data/users/alice/doc.txt, the naming service must resolve the path to a file identifier, map that identifier to its constituent blocks or chunks, and determine which physical nodes currently store each of them.
This translation must happen for every file access—making the naming service a critical performance bottleneck if not designed carefully.
It's crucial to distinguish between naming (what something is called) and location (where it physically resides). Good DFS design separates these concerns: names remain stable even as physical locations change. This separation enables transparent migration, replication, and failure recovery without changing how clients reference files.
Location transparency is a fundamental property of distributed file systems: clients access files without knowing or caring about their physical location. The file path /data/report.txt works identically whether the file is stored locally, on a server across the room, or on a node in a different data center.
Degrees of transparency:
Distributed systems provide varying degrees of transparency, each hiding more complexity from the user: access transparency (remote files are used with the same operations as local ones), location transparency (names reveal nothing about where data lives), migration transparency (data can move without names changing), and replication transparency (clients never see how many copies exist or where they are).
Implementing location transparency:
Location transparency is achieved through indirection—instead of encoding physical locations in file names, the system maintains a separate mapping from names to locations. This mapping can be updated independently of the names themselves.
Without Location Transparency:
Path: //server12.datacenter-west.company.com/disk3/partition2/files/report.txt
Problem: If server12 fails or file moves, path is invalid
With Location Transparency:
Path: /data/report.txt
System lookup: /data/report.txt → [node7, node12, node23] (replicas)
Client connects to one of the returned nodes
Benefit: Path remains valid even if underlying nodes change
The indirection layer—typically the metadata service—absorbs all location changes, presenting a stable interface to clients.
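The mechanics of this indirection fit in a few lines. Below is a minimal, hypothetical sketch (the node names are invented) showing how a registry maps a stable logical path to a mutable replica list, so migration updates the mapping without ever touching the name clients use:

```python
# Minimal sketch of name-to-location indirection (illustrative only).
# The mapping layer can change where data lives without changing the
# name that clients use.

class LocationRegistry:
    """Maps stable logical paths to a mutable set of replica nodes."""

    def __init__(self):
        self._replicas = {}  # logical path -> list of node addresses

    def register(self, path, nodes):
        self._replicas[path] = list(nodes)

    def lookup(self, path):
        # Clients call this instead of embedding node names in the path.
        return list(self._replicas[path])

    def migrate(self, path, old_node, new_node):
        # Physical location changes; the logical name stays the same.
        nodes = self._replicas[path]
        self._replicas[path] = [new_node if n == old_node else n for n in nodes]


registry = LocationRegistry()
registry.register("/data/report.txt", ["node7", "node12", "node23"])
print(registry.lookup("/data/report.txt"))   # ['node7', 'node12', 'node23']

# node12 is decommissioned; its replica moves to node30 -- clients are unaffected
registry.migrate("/data/report.txt", "node12", "node30")
print(registry.lookup("/data/report.txt"))   # ['node7', 'node30', 'node23']
```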
How a distributed file system organizes its namespace—the hierarchical structure of directories and files—has profound implications for scalability, performance, and administration. Several organizational strategies exist:
Strategy 1: Unified Global Namespace
All clients see a single, consistent directory tree regardless of which server they contact. The entire namespace is logically centralized (though physically distributed).
Strategy 2: Federated Namespace
The namespace is divided into independent sub-namespaces, each managed by a separate metadata server. A federation layer unifies them.
Strategy 3: Mount-Based Integration
Distributed storage is 'mounted' at specific points in a local filesystem tree. Paths below the mount point are handled by the DFS.
Local filesystem:
/
├── bin/
├── home/
└── mnt/
└── dfs/ ← Mount point
├── data/ ← These paths go to DFS
└── users/
This is the traditional approach used by NFS and many POSIX-compliant DFS implementations.
Strategy 4: Object/Flat Namespace
No hierarchical directory structure—files are addressed by unique keys in a flat namespace. Directories are simulated or don't exist.
Object storage:
Bucket: my-data
├── reports/q1/sales.csv (just a key, not a path)
├── reports/q2/sales.csv
└── users/alice/profile.json
No actual directory traversal—keys are matched as strings.
This is the approach of S3 and most object storage systems.
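To make the contrast concrete, here is a small illustrative sketch (with made-up keys) of how a flat key space simulates directory listings purely by prefix matching on strings:

```python
# Sketch: "directories" in a flat object namespace are just key-prefix filters.
# Bucket contents are a set of string keys; there is no directory tree to walk.

bucket = {
    "reports/q1/sales.csv": b"...",
    "reports/q2/sales.csv": b"...",
    "users/alice/profile.json": b"...",
}

def list_prefix(store, prefix, delimiter="/"):
    """Simulate a directory listing by string-matching key prefixes."""
    seen = set()
    for key in store:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Collapse anything past the next delimiter into a pseudo-directory.
        seen.add(rest.split(delimiter, 1)[0] + (delimiter if delimiter in rest else ""))
    return sorted(seen)

print(list_prefix(bucket, "reports/"))  # ['q1/', 'q2/']
print(list_prefix(bucket, ""))          # ['reports/', 'users/']
```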
HDFS Federation addresses the single-NameNode bottleneck by partitioning the namespace horizontally. Each NameNode manages an independent portion of the namespace (a 'namespace volume'). For example, NameNode1 handles /user/*, NameNode2 handles /data/*. Clients determine which NameNode to contact based on the path prefix. This scales metadata capacity linearly with NameNodes but requires careful planning of namespace partitioning.
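The routing decision itself is simple. The sketch below shows a client-side mount table that picks a metadata server by longest matching path prefix, which is roughly what HDFS clients do with a ViewFs mount table under federation (the server names here are hypothetical):

```python
# Sketch of client-side routing under a federated namespace (illustrative).
# Each path prefix maps to the metadata server responsible for that
# namespace volume.

MOUNT_TABLE = {
    "/user": "nn1.example.com:8020",
    "/data": "nn2.example.com:8020",
}

def route(path):
    """Pick the metadata server by longest matching path prefix."""
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise LookupError(f"No namespace volume mounted for {path}")
    return MOUNT_TABLE[max(matches, key=len)]

print(route("/user/alice/doc.txt"))   # nn1.example.com:8020
print(route("/data/reports/q1.csv"))  # nn2.example.com:8020
```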
Name resolution is the process of translating a path name into the information needed to access the file's data. In distributed systems, this is more complex than local systems because each component of the path might reside on a different server.
Resolution approaches:
Iterative resolution: the client resolves one path component at a time. /a/b/c requires three round trips: resolve /a, then /a/b, then /a/b/c. High latency but simple implementation.
Recursive (server-side) resolution: the client sends the full path and the naming server walks it internally, returning the answer in a single round trip.
Cached resolution: the client checks a local cache of previously resolved paths and only contacts the naming server on a miss.
Computed resolution: no lookup happens at all; the location is derived from the name itself, typically by hashing (the approach used by object stores and by Ceph's CRUSH, discussed below).
HDFS resolution example:
Let's trace how HDFS resolves the path /user/alice/data/report.csv:
1. The client sends the full path to the NameNode in a single call.
2. The NameNode walks its in-memory namespace: user (inode 2) → alice (inode 47) → data (inode 1023) → report.csv (inode 5612).
3. For each of the file's blocks, it returns the current replica locations:
   blk_1001: [DataNode4:50010, DataNode7:50010, DataNode12:50010]
   blk_1002: [DataNode2:50010, DataNode9:50010, DataNode15:50010]
The entire namespace traversal happens in memory on the NameNode—that's why NameNode memory is a critical resource.
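As a back-of-the-envelope example, a commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (file, directory, or block). A namespace of 100 million files averaging one block each is therefore on the order of 200 million objects, or about 30 GB of heap just for metadata. The Python sketch below contrasts the resolution strategies discussed above (iterative, recursive, cached, and computed) in one conceptual naming service: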
```python
# Conceptual name resolution in a DFS

class NameResolutionService:
    """
    Demonstrates different name resolution strategies.
    """

    def __init__(self, namespace_tree, block_locations):
        """
        namespace_tree: hierarchical structure of directories/files
        block_locations: mapping of block_id -> list of node addresses
        """
        self.namespace = namespace_tree
        self.locations = block_locations
        self.cache = {}  # Path -> resolved location cache

    def iterative_resolve(self, path: str) -> dict:
        """
        Iterative resolution: resolve each component separately.
        Simulates multiple round-trips to naming server.
        """
        components = path.strip("/").split("/")
        current = self.namespace["/"]
        for component in components:
            # Each step would be a network round-trip
            print(f"  Resolving /{component} in current directory...")
            if component not in current["children"]:
                raise FileNotFoundError(f"Path not found: {component}")
            current = current["children"][component]
        return self._get_locations(current)

    def recursive_resolve(self, path: str) -> dict:
        """
        Recursive resolution: server resolves entire path at once.
        Single round-trip, full path sent to server.
        """
        # Single call to naming service
        print(f"  Resolving full path: {path}")
        current = self.namespace["/"]
        for component in path.strip("/").split("/"):
            current = current["children"].get(component)
            if current is None:
                raise FileNotFoundError(path)
        return self._get_locations(current)

    def cached_resolve(self, path: str) -> dict:
        """
        Cached resolution: check cache before querying.
        """
        if path in self.cache:
            print(f"  Cache hit for: {path}")
            return self.cache[path]
        print(f"  Cache miss, performing lookup: {path}")
        result = self.recursive_resolve(path)
        self.cache[path] = result
        return result

    def computed_resolve(self, object_key: str, num_nodes: int = 100) -> list:
        """
        Computed resolution: hash-based location determination.
        No server lookup needed - location computed from key.
        """
        # Consistent hashing to determine storage nodes
        hash_value = hash(object_key) % num_nodes
        # Return primary and replica nodes
        primary = f"node_{hash_value}"
        replica1 = f"node_{(hash_value + 1) % num_nodes}"
        replica2 = f"node_{(hash_value + 2) % num_nodes}"
        print(f"  Computed locations for '{object_key}': {[primary, replica1, replica2]}")
        return [primary, replica1, replica2]

    def _get_locations(self, file_node: dict) -> dict:
        """Get block locations for a file node."""
        if file_node["type"] != "file":
            raise IsADirectoryError("Path is a directory")
        block_locs = {}
        for block_id in file_node["blocks"]:
            block_locs[block_id] = self.locations.get(block_id, [])
        return {
            "file_id": file_node["id"],
            "size": file_node["size"],
            "blocks": block_locs
        }


# Example namespace structure (simplified)
namespace = {
    "/": {
        "type": "dir",
        "children": {
            "data": {
                "type": "dir",
                "children": {
                    "report.csv": {
                        "type": "file",
                        "id": "file_001",
                        "size": 268435456,  # 256 MB
                        "blocks": ["blk_1001", "blk_1002"]
                    }
                }
            }
        }
    }
}
```

When a client opens a file, the DFS returns a file handle—a reference that the client uses for subsequent operations. The design of file handles has significant implications for system behavior.
What's in a file handle?
File handles typically contain information that allows reopening the file without full path resolution:
| Component | NFS v3 | HDFS | Ceph |
|---|---|---|---|
| File Identifier | File system ID + inode number | Block IDs + locations | Object ID + stripe layout |
| Version/Generation | Generation number | Block token | Epoch number |
| Security Context | None (auth at mount) | Delegation token | Capability bits |
| Location Info | Server address | DataNode addresses | OSD (Object Storage Daemon) map |
| Validity | Indefinite (may become stale) | Time-limited tokens | Leases with expiry |
Location binding strategies:
Early binding: Location is determined when the file is opened and embedded in the handle. Fast subsequent accesses but problematic if nodes fail or data migrates.
Late binding: Handle contains only the file identifier. Location is resolved on each access. More resilient to changes but higher overhead.
Hybrid binding: Handle contains a cached location that's verified on use. If verification fails, fresh resolution occurs.
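As a sketch of the hybrid strategy, the snippet below keeps a cached location in the handle, verifies it on use, and falls back to fresh resolution only when verification fails. The verification and resolution calls are stubbed out as plain functions, and all names are hypothetical:

```python
# Sketch of hybrid binding (illustrative): the handle carries a cached
# location that is verified on use; if verification fails, the client
# falls back to fresh resolution through the metadata service.

class FileHandle:
    def __init__(self, file_id, cached_node):
        self.file_id = file_id
        self.cached_node = cached_node  # early-bound hint, may go stale

def read_with_hybrid_binding(handle, node_is_alive, resolve_location):
    """node_is_alive and resolve_location stand in for real RPCs."""
    node = handle.cached_node
    if node is None or not node_is_alive(node, handle.file_id):
        # Late-binding fallback: ask the metadata service again.
        node = resolve_location(handle.file_id)
        handle.cached_node = node  # refresh the hint for next time
    return f"reading {handle.file_id} from {node}"

# Example wiring with stubbed-out services:
alive = lambda node, fid: node == "node7"
resolve = lambda fid: "node7"
h = FileHandle("file_001", cached_node="node12")    # stale hint
print(read_with_hybrid_binding(h, alive, resolve))  # falls back, reads from node7
```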
NFS file handle example:
NFS uses early binding with file handles that contain:
NFS File Handle (NFSv3):
├── fsid (32 bits) — Identifies the file system
├── fileid (64 bits) — Inode number
├── generation (32 bits) — Reuse counter (to detect deleted inodes)
└── Server determines remaining opaque data
Total: up to 64 bytes (configurable)
This handle uniquely identifies a file. If the client sends a handle for a deleted file, the generation number won't match, and the server returns ESTALE (stale handle).
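A hedged sketch of the generation check (not actual NFS server code) makes the ESTALE behavior concrete: the server compares the generation stored in the handle against the generation currently recorded for that inode number:

```python
# Sketch of the generation-number check behind ESTALE (illustrative).
# The inode table records the generation assigned when an inode number
# was last (re)used.

import errno

ESTALE = getattr(errno, "ESTALE", 116)  # 116 on Linux; not defined on every platform

inode_table = {5612: 3}  # inode number -> current generation

def lookup_by_handle(fileid, generation):
    current = inode_table.get(fileid)
    if current is None or current != generation:
        # The inode was deleted (and possibly its number reused for a new file).
        raise OSError(ESTALE, "Stale file handle")
    return f"inode {fileid}, generation {generation}: handle is valid"

print(lookup_by_handle(5612, 3))   # handle still valid
inode_table[5612] = 4              # file deleted, inode number reused
# lookup_by_handle(5612, 3) would now raise OSError: [Errno 116] Stale file handle
```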
A subtle issue with file handles: what happens if a file is renamed or moved while a client has it open? In NFS, the handle remains valid because it references the inode, not the path. But this creates a semantic surprise: a subsequent lookup of the original path finds nothing, yet the handle still works. Some DFS implementations invalidate handles on rename, forcing clients to re-resolve—a different tradeoff.
Traditional distributed file systems like NFS use mounting to integrate remote filesystems into the local namespace. This mechanism determines which remote resources are visible and where they appear in the local directory tree.
The mount operation:
Mounting establishes a binding between a local path (the mount point) and a remote file system (the export). After mounting:
# On NFS client:
mount -t nfs server:/exports/data /mnt/shared
# Now /mnt/shared/* accesses server:/exports/data/*
ls /mnt/shared/reports/ → Lists server:/exports/data/reports/
Server-side export configuration:
Servers define which directories are accessible remotely (exports) and to whom:
# /etc/exports (NFS server configuration)
/exports/data 192.168.1.0/24(rw,sync,no_root_squash)
/exports/public *(ro,async)
/exports/secure client1.example.com(rw,sec=krb5p)
Export options control the access mode (rw vs. ro), write durability (sync vs. async), how client identities are mapped (for example, root squashing), and the required security flavor (for example, sec=krb5p for Kerberos with integrity and privacy).
Automounting: Dynamic namespace construction
Automounting delays the actual mount until the path is accessed:
/etc/auto.master:
/projects /etc/auto.projects
/etc/auto.projects:
alpha -rw,soft fileserver:/projects/alpha
beta -rw,soft fileserver:/projects/beta
gamma -rw,soft fileserver:/projects/gamma
Client behavior:
$ ls /projects/
(empty or cached entries)
$ cd /projects/alpha
(automounter intercepts, mounts fileserver:/projects/alpha)
(access proceeds as if always mounted)
$ # After timeout (e.g., 5 minutes of inactivity)
(automounter unmounts, freeing resources)
This pattern scales to thousands of potential mount points without resource exhaustion.
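A toy model of that behavior, with an invented map and timings, might look like the following; a real automounter hooks into the kernel rather than being called explicitly, but the mount-on-access and unmount-on-idle logic is the same idea:

```python
# Sketch of automounter behavior (illustrative): mount on first access
# under a trigger point, unmount after an idle timeout.

import time

AUTO_MAP = {                       # key under /projects -> remote export
    "alpha": "fileserver:/projects/alpha",
    "beta":  "fileserver:/projects/beta",
}
IDLE_TIMEOUT = 300                 # seconds of inactivity before unmount
mounted = {}                       # key -> last access timestamp

def access(key):
    if key not in AUTO_MAP:
        raise FileNotFoundError(f"/projects/{key}")
    if key not in mounted:
        print(f"mounting {AUTO_MAP[key]} at /projects/{key}")
    mounted[key] = time.monotonic()    # record activity
    return f"/projects/{key}"

def expire_idle():
    now = time.monotonic()
    for key in [k for k, t in mounted.items() if now - t > IDLE_TIMEOUT]:
        print(f"unmounting /projects/{key} (idle)")
        del mounted[key]

access("alpha")   # triggers the mount
expire_idle()     # nothing idle yet; unmount would happen after the timeout
```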
Behind location transparency is a location service—the component that maintains and answers queries about where data resides. Different DFS designs implement location services differently.
Centralized Location Service (HDFS NameNode model):
The NameNode maintains an in-memory map of every block's locations:
// Conceptual NameNode data structures
class NameNode {
// File/directory namespace (persisted to EditLog)
Map<Path, INode> namespace;
// Block to locations mapping (reconstructed from DataNode reports)
Map<BlockId, List<DataNodeInfo>> blockLocations;
// Replicas and their states
Map<BlockId, Map<DataNodeId, ReplicaState>> replicaStates;
public LocatedBlocks getBlockLocations(Path file) {
INode inode = namespace.get(file);
List<BlockInfo> blocks = inode.getBlocks();
LocatedBlocks result = new LocatedBlocks();
for (BlockInfo block : blocks) {
List<DataNodeInfo> locations = blockLocations.get(block.getId());
result.add(new LocatedBlock(block, locations));
}
return result; // Return blocks with their current locations
}
}
Key insight: Block locations are not persisted. They're reconstructed at startup from DataNode block reports. This simplifies consistency (the DataNodes are authoritative about what they store) but extends startup time.
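A minimal sketch of that reconstruction, assuming a block report is simply the list of block IDs a DataNode currently holds (node and block names are made up):

```python
# Sketch of rebuilding the block-location map from DataNode block reports
# (illustrative). The namespace (which blocks belong to which file) is
# persisted; who currently stores each block is learned from the reports.

from collections import defaultdict

block_locations = defaultdict(set)   # block_id -> set of DataNode ids

def process_block_report(datanode_id, reported_blocks):
    """Called when a DataNode reports the blocks it currently holds."""
    # Drop stale entries for this DataNode, then record the fresh view.
    for holders in block_locations.values():
        holders.discard(datanode_id)
    for block_id in reported_blocks:
        block_locations[block_id].add(datanode_id)

process_block_report("dn4", ["blk_1001"])
process_block_report("dn7", ["blk_1001"])
process_block_report("dn2", ["blk_1002"])
print(dict(block_locations))
# {'blk_1001': {'dn4', 'dn7'}, 'blk_1002': {'dn2'}}  (set ordering may vary)
```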
Distributed Location Service (Ceph CRUSH model):
Ceph takes a radically different approach: there's no location lookup because locations are computed.
CRUSH (Controlled Replication Under Scalable Hashing) is a pseudo-random placement algorithm:
CRUSH Algorithm (simplified):
function CRUSH(object_name, cluster_map, replication_factor):
placement_group = hash(object_name) mod num_placement_groups
selected_osds = []
for r in range(replication_factor):
# Straw algorithm selects OSDs considering weights and failures
osd = straw_select(placement_group, r, cluster_map)
selected_osds.append(osd)
return selected_osds
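For readers who prefer runnable code, here is a hedged Python approximation of straw-style weighted selection (not Ceph's actual implementation): each candidate OSD draws a pseudo-random "straw" scaled by its weight, and the longest straw wins, so placement is deterministic, weight-proportional, and mostly stable when individual weights change:

```python
# Illustrative approximation of straw-style weighted selection; OSD names
# and weights are made up.

import hashlib
import math

OSD_WEIGHTS = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}

def _hash01(*parts):
    """Deterministic hash of the inputs, mapped into (0, 1]."""
    h = hashlib.sha256("/".join(map(str, parts)).encode()).hexdigest()
    return (int(h, 16) % (2**32) + 1) / 2**32

def straw_select(pg, replica_index, weights, exclude=()):
    best, best_straw = None, -math.inf
    for osd, weight in weights.items():
        if osd in exclude or weight <= 0:
            continue
        # log of a (0,1] draw is <= 0; dividing by a larger weight pulls it
        # toward 0, so heavier OSDs tend to win.
        straw = math.log(_hash01(pg, replica_index, osd)) / weight
        if straw > best_straw:
            best, best_straw = osd, straw
    return best

def crush(object_name, weights, replication_factor=3, num_pgs=128):
    pg = int(hashlib.sha256(object_name.encode()).hexdigest(), 16) % num_pgs
    chosen = []
    for r in range(replication_factor):
        chosen.append(straw_select(pg, r, weights, exclude=chosen))
    return chosen

print(crush("reports/q1/sales.csv", OSD_WEIGHTS))  # e.g. ['osd.2', 'osd.0', 'osd.3']
```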
Advantages: no location lookup service is needed—any client or OSD that holds the current cluster map can compute an object's placement locally, so there is no central location database to query, scale, or protect.
Disadvantages: placement is determined entirely by the algorithm, so individual objects cannot be pinned to specific nodes, and any change to the cluster map (adding, removing, or reweighting OSDs) deterministically remaps and physically moves a fraction of the data.
In a distributed system, data locations change constantly: nodes fail, new nodes are added, data is rebalanced, hot spots are migrated. The naming and location system must handle these changes gracefully.
Triggers for location changes include node failures, the addition or decommissioning of nodes, rebalancing of data across the cluster, and the migration of hot spots onto less loaded machines.
Notification mechanisms:
How do clients learn about location changes?
1. Server-initiated invalidation: The metadata server tracks which clients have cached what and sends explicit invalidation messages.
Example: NFS delegations
- Client A gets read delegation for file F
- Client B wants to write file F
- Server recalls A's delegation before allowing B's write
- A must flush caches and acknowledge before B proceeds
2. Client-side timeout/refresh: Clients consider cached locations valid only for a limited time, periodically refreshing.
Example: HDFS block locations
- Client caches block locations from NameNode
- Cache entries have implicit TTL (typically minutes)
- On access, if cached location fails, client re-queries NameNode
- NameNode returns current locations including any changes
3. Version-based validation: Locations include version numbers. Clients validate the version before using cached data.
Example: Optimistic concurrency
- Cached location: {block_id: X, version: 42, nodes: [A, B, C]}
- On access, client sends version to node
- If version matches, proceed; if not, refresh from metadata server
Well-designed DFS location systems degrade gracefully. If a cached location fails, the client tries alternate replicas before consulting the metadata server. If the metadata server is slow, stale caches still allow reads of unchanged files. This layered approach maximizes availability while eventually achieving consistency.
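A compact sketch of that layered fallback, with the metadata lookup and replica reads stubbed out as plain functions (all names hypothetical):

```python
# Sketch of the layered fallback described above (illustrative): use the
# cached replica list while it is fresh, try alternate replicas before
# bothering the metadata service, and only then refresh the cache.

import time

class BlockLocationCache:
    def __init__(self, metadata_lookup, ttl_seconds=300):
        self._lookup = metadata_lookup        # function: block_id -> [nodes]
        self._ttl = ttl_seconds
        self._cache = {}                      # block_id -> (timestamp, [nodes])

    def _locations(self, block_id, force_refresh=False):
        entry = self._cache.get(block_id)
        if force_refresh or entry is None or time.monotonic() - entry[0] > self._ttl:
            entry = (time.monotonic(), self._lookup(block_id))
            self._cache[block_id] = entry
        return entry[1]

    def read(self, block_id, try_read):
        """try_read(node, block_id) returns data or raises on failure."""
        for attempt in range(2):                      # cached pass, then refreshed pass
            for node in self._locations(block_id, force_refresh=(attempt == 1)):
                try:
                    return try_read(node, block_id)   # try each replica in turn
                except ConnectionError:
                    continue                          # replica down: try the next one
        raise IOError(f"all replicas unreachable for {block_id}")

# Example wiring with stubbed services:
def reader(node, blk):
    if node == "node7":
        raise ConnectionError("node7 unreachable")
    return f"{blk} from {node}"

cache = BlockLocationCache(lambda blk: ["node7", "node12", "node23"])
print(cache.read("blk_1001", reader))   # node7 fails, falls back to node12
```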
We've explored how distributed file systems translate human-readable paths into physical data locations. Let's consolidate the key concepts:
- Naming is layered: user-level paths map to system-level file identifiers, then to block or chunk IDs, and finally to physical storage locations.
- Location transparency comes from indirection: names stay stable while a separate mapping layer absorbs migration, replication, and failure.
- Namespaces can be organized as a unified global tree, a federation of volumes, mount-based integration, or a flat object key space.
- Name resolution can be iterative, recursive, cached, or computed, trading round trips against simplicity and flexibility.
- File handles bind names to locations early, late, or with a verified cache, each with different resilience and performance tradeoffs.
- Location services range from centralized in-memory maps (the HDFS NameNode) to purely computed placement (Ceph's CRUSH), and propagate changes through invalidation, TTLs, or version checks.
What's next:
Now that we understand how files are named and located, we'll explore caching strategies in distributed file systems. Caching is crucial for performance—but in a distributed system, caching introduces complex consistency challenges that we must carefully navigate.
You now understand how distributed file systems implement naming services and achieve location transparency. You can analyze different approaches to namespace organization, name resolution, and location tracking. Next, we'll see how caching accelerates distributed file access while managing consistency.