When Sun Microsystems designed NFS in the early 1980s, they made a radical choice: the server would maintain no state about client activities. Every request would be self-contained, carrying all information needed for processing. The server would have no memory of previous interactions.
This decision—seemingly simple yet profoundly consequential—shaped everything about NFS: its remarkable resilience to failures, its performance characteristics, its consistency semantics, and even its limitations. Understanding stateless design is essential for anyone working with NFS, because virtually every NFS behavior can be traced back to this foundational choice.
In this page, we'll explore why statelessness was chosen, how it enables NFS's legendary crash recovery, the engineering challenges it creates, and how NFS accommodates operations that inherently require state.
By the end of this page, you will understand the philosophy behind stateless design, how idempotent operations enable transparent crash recovery, the specific challenges statelessness creates for file locking and deletion, and the auxiliary protocols (NLM, NSM) that provide necessary state. You'll be able to predict NFS behavior in failure scenarios.
To appreciate why NFS chose statelessness, we must understand the alternative—and its problems.
The Stateful Approach
A stateful file server maintains information about each client's session:
Client Session State (stateful server would track):
- Which files this client has open
- Current file positions (offsets)
- Lock ownership
- Pending operations
- Authentication session
This seems natural—after all, local file systems track open files. But in a distributed system, maintaining state creates severe complications: the server must reconstruct or discard session state after a crash, detect and clean up state for clients that vanish without warning, and devote memory to every connected client, which limits scalability.
The Stateless Solution
NFS sidesteps these problems elegantly: the server simply doesn't maintain per-client state.
The server treats each request as independent. It looks up the file (via file handle), performs the operation, and forgets about the interaction immediately.
When NFS was designed, networks were much less reliable than today. Packet loss, cable failures, and server crashes were common. The stateless design made NFS robust in this hostile environment. A motto from the era: 'Simple systems work; complex systems fail in complex ways.'
The stateless design works because NFS operations are designed to be idempotent: executing the same operation multiple times produces the same result as executing it once.
Formal Definition:
An operation f is idempotent if: f(f(x)) = f(x)
In NFS terms: sending the same request twice (due to a retry after timeout) should not corrupt the file system or produce errors.
Why Idempotency Matters
Consider what happens with a lost network packet:
With an idempotent READ, this is completely safe—reading the same data twice is harmless. But what about operations that might not be naturally idempotent?
| Operation | Naturally Idempotent? | How NFS Handles It |
|---|---|---|
| READ | Yes | Reading same data multiple times is harmless |
| GETATTR | Yes | Querying attributes multiple times is harmless |
| WRITE (at offset) | Yes | Writing same data to same offset overwrites identically |
| LOOKUP | Yes | Looking up a name multiple times is harmless |
| SETATTR | Yes | Setting same attributes again is harmless |
| MKDIR (exclusive) | No | Server checks 'already exists', returns NFS3ERR_EXIST |
| CREATE (exclusive) | No | Use CREATE_VERF to detect duplicate requests |
| REMOVE | No | Server checks 'doesn't exist', returns NFS3ERR_NOENT |
| RENAME | Mostly | If already renamed, second attempt may succeed or fail safely |
| APPEND | No | Offset-based writes avoid this; append not supported in NFS |
Handling Non-Idempotent Operations
Some operations cannot be made naturally idempotent. NFS uses several strategies:
1. Exclusive Creation Verifier (NFSv3)
For CREATE with exclusive mode, the client sends a 64-bit verifier:
CREATE3args {
diropargs3 where; /* Directory + filename */
createhow3 how; /* UNCHECKED | GUARDED | EXCLUSIVE */
createverf3 verf; /* 64-bit verifier for EXCLUSIVE */
};
2. Server Reply Cache
Servers may maintain a cache of recent replies. If a duplicate request arrives:
Duplicate Request Detection:
Request arrives with XID=12345 from client 10.0.0.1
↓
Check reply cache: (client=10.0.0.1, xid=12345)
↓
Cache hit? → Return cached reply
Cache miss? → Process request, cache reply
/* Simplified NFS server reply cache (duplicate request detection) */

struct reply_cache_entry {
    struct sockaddr_storage client_addr;  /* Client address */
    uint32_t xid;                         /* RPC transaction ID */
    uint32_t program;                     /* RPC program number */
    uint32_t proc;                        /* Procedure number */
    time_t timestamp;                     /* When reply was cached */

    /* Cached reply */
    int reply_len;
    char reply_data[NFS_MAXREPLY];

    struct reply_cache_entry *next;       /* Hash chain */
};

/* Hash table of recent replies */
#define REPLY_CACHE_SIZE 1024
struct reply_cache_entry *reply_cache[REPLY_CACHE_SIZE];

/* Check for duplicate request, return cached reply if found */
struct reply_cache_entry* check_reply_cache(
    struct sockaddr_storage *client,
    uint32_t xid, uint32_t program, uint32_t proc)
{
    uint32_t hash = hash_reply(client, xid) % REPLY_CACHE_SIZE;
    struct reply_cache_entry *entry;

    for (entry = reply_cache[hash]; entry; entry = entry->next) {
        if (entry->xid == xid &&
            entry->program == program &&
            entry->proc == proc &&
            addr_equal(&entry->client_addr, client)) {
            /* Duplicate detected! */
            log_debug("Duplicate request detected: xid=%u, proc=%u",
                      xid, proc);
            return entry;  /* Return cached reply */
        }
    }
    return NULL;  /* Not a duplicate */
}

/* Cache a reply for future duplicate detection */
void cache_reply(struct sockaddr_storage *client,
                 uint32_t xid, uint32_t program, uint32_t proc,
                 void *reply, int reply_len)
{
    /* Only cache non-idempotent operations */
    if (is_idempotent(proc))
        return;

    struct reply_cache_entry *entry = alloc_cache_entry();
    memcpy(&entry->client_addr, client, sizeof(*client));
    entry->xid = xid;
    entry->program = program;
    entry->proc = proc;
    entry->timestamp = time(NULL);
    entry->reply_len = reply_len;
    memcpy(entry->reply_data, reply, reply_len);

    /* Insert into hash table */
    uint32_t hash = hash_reply(client, xid) % REPLY_CACHE_SIZE;
    entry->next = reply_cache[hash];
    reply_cache[hash] = entry;

    /* Expire old entries to prevent unbounded growth */
    expire_old_entries();
}

The reply cache is finite and entries expire.
If a client retries a non-idempotent request after the cache entry expires, the server will re-execute the operation, potentially causing errors (e.g., 'file already exists'). Applications must handle these cases gracefully.
The most celebrated benefit of NFS's stateless design is its exceptional crash recovery. Let's trace through exactly how different failure scenarios are handled.
Scenario 1: Server Crashes and Reboots
Consider an application reading a large file when the server suddenly reboots:
Timeline:
Client Server
| |
|--- READ(fh, off=0, 8K) --->| Server receives, processing
|<-- [data bytes 0-8K] ------| Response sent
| |
|--- READ(fh, off=8K, 8K) -->| Request in flight
| X SERVER CRASHES
| [timeout...] |
| | (Server rebooting...)
|--- READ(fh, off=8K, 8K) -->| Client retries
| | SERVER BACK UP
| |
| | Server receives retry
| | Validates file handle
| | Reads from disk
|<-- [data bytes 8K-16K] ----| Response sent
| |
| (Application continues |
| reading, unaware of |
| server crash!) |
Why this works:
- Every READ carries the file handle and offset, so the retry is fully self-contained
- The server needs no memory of the first attempt; it simply executes the request it receives
- READ is idempotent, so re-executing it returns the same data
Scenario 2: Client Crashes and Reboots
Timeline:
Client Server
| |
|--- WRITE(fh, off=0, data)-> | Server writes to cache
|<-- OK ---------------------- | Write acknowledged
X CLIENT CRASHES |
| (Server doesn't notice)
| (Client rebooting...) |
| |
| (Application restarts, |
| reopens file) |
|--- LOOKUP(dir_fh, "file") ->| Normal operation
|<-- file_fh + attrs --------- |
| |
Why this works:
- The server never knew the client had files open, so there is no server-side state to clean up
- The rebooted client simply starts fresh with new LOOKUPs, as in any normal session
- Writes the server already acknowledged are safe; only unsent client-side state is lost
/* Simplified NFS client RPC with retry logic */

int nfs_rpc_call(struct nfs_client *client, int procedure,
                 void *args, void *result)
{
    int timeout_ms = client->initial_timeout;
    int retries = 0;
    int major_timeout = 0;

    while (1) {
        /* Send request */
        int err = rpc_send_request(client, procedure, args);
        if (err)
            goto retry;

        /* Wait for response with timeout */
        err = rpc_wait_response(client, result, timeout_ms);
        if (err == 0) {
            return 0;  /* Success! */
        }
        if (err != -ETIMEDOUT) {
            return err;  /* Real error, not timeout */
        }

retry:
        retries++;

        /* Soft mount: give up after configured retries */
        if (client->mount_flags & NFS_MOUNT_SOFT) {
            if (retries >= client->max_retries) {
                printk(KERN_WARNING "NFS: server not responding, "
                       "giving up");
                return -EIO;  /* Return error to application */
            }
        }

        /* Hard mount: warn but keep trying */
        if ((retries % client->max_retries) == 0) {
            major_timeout++;
            printk(KERN_WARNING "NFS: server %s not responding, "
                   "still trying", client->hostname);
        }

        /* Exponential backoff, capped at max timeout */
        timeout_ms = min(timeout_ms * 2, client->max_timeout);

        /* For write operations, may need congestion control */
        if (is_write_procedure(procedure)) {
            nfs_congestion_wait(client);
        }
    }
}

Soft mounts are tempting (avoid hangs!) but dangerous. If a soft mount times out during a write, the application thinks the write failed and might take recovery action—but the server might have actually completed the write. This can cause data corruption. Use hard mounts for data you care about.
While statelessness works brilliantly for file I/O, it creates a fundamental problem for file locking. Locks are inherently stateful—they represent an ongoing relationship between a client and a file.
Why Locks Need State
Consider what a lock represents:
Client A: "I own the exclusive lock on file X"
↓
This is STATE - the server must remember it
to deny locks to Client B
↓
Client B: "Can I lock file X?"
↓
Server must consult state to answer
The Dilemma:
A stateless server cannot remember who holds which lock, yet UNIX applications expect flock() and fcntl() locking to work.
The Solution: Auxiliary Stateful Protocols
NFS solved this by separating concerns:
- The core NFS protocol stays stateless and handles all file I/O
- The Network Lock Manager (NLM) holds all lock state as a separate service
- The Network Status Monitor (NSM) tracks machine crashes so lock state can be rebuilt
This separation means that a crash affects only the locking subsystem, not basic file operations.
The Network Lock Manager (NLM) Protocol
NLM is a separate RPC service (program 100021) that handles lock operations:
Locking Sequence:
Client A Lock Manager Client B
| | |
|--- NLM_LOCK(file) ----->| |
|<-- NLM_LOCK_RES(ok) ----| [A owns lock] |
| | |
| |<--- NLM_LOCK(file) ---|
| | [blocked - A has it] |
| | |
|--- NLM_UNLOCK(file) --->| |
|<-- NLM_UNLOCK_RES(ok) --| |
| | |
| |--- NLM_GRANTED ------>|
| |<-- NLM_GRANTED_RES ---|
| | [B now owns lock] |
/* NLM (Network Lock Manager) Structures */

/* Lock owner identification */
struct nlm_lock {
    char *caller_name;     /* Client hostname */
    netobj fh;             /* NFS file handle */
    netobj oh;             /* Lock owner handle (opaque) */
    uint32_t svid;         /* System V id (PID on client) */
    uint64_t l_offset;     /* Lock start offset */
    uint64_t l_len;        /* Lock length (0 = to EOF) */
};

/* Lock request message */
struct nlm_lockargs {
    netobj cookie;         /* Request identifier */
    bool_t block;          /* Wait if lock unavailable? */
    bool_t exclusive;      /* Exclusive or shared? */
    struct nlm_lock alock; /* Lock details */
    bool_t reclaim;        /* Reclaiming after restart? */
    int state;             /* NSM state number */
};

/* Lock states for callback management */
enum nlm_stats {
    nlm_granted = 0,             /* Lock succeeded */
    nlm_denied = 1,              /* Lock conflicts with existing lock */
    nlm_denied_nolocks = 2,      /* Server out of lock resources */
    nlm_blocked = 3,             /* Request queued, will callback */
    nlm_denied_grace_period = 4, /* Server in recovery, retry later */
    nlm_deadlck = 5,             /* Would cause deadlock */
};

/* Server-side lock state tracking */
struct server_lock {
    struct list_head list;  /* All locks on this file */
    struct nlm_lock lock;   /* Lock details */
    int fl_type;            /* F_RDLCK or F_WRLCK */
    struct nlm_host *host;  /* Client that owns this lock */
    time_t timestamp;       /* For lease expiration */
};

NLM implements advisory locking—applications can ignore locks if they choose. Mandatory locking (where the OS enforces locks on all access) is complex in a distributed system and rarely used with NFS. Well-behaved applications honor advisory locks; malicious or buggy programs can bypass them.
Locks create a new crash recovery problem: what happens to locks when machines fail? This is where the Network Status Monitor (NSM) protocol comes in.
The Lock Recovery Problem
Scenario: Server Crashes While Holding Lock State
1. Client A holds lock on file X
2. Client B is blocked waiting for the lock
3. Server crashes, losing all lock state
4. Server reboots
5. ??? What happens to the locks?
Without a recovery mechanism:
- Client A's lock would silently vanish when the server's lock state is lost
- Client B could be granted a conflicting lock while A still believes it holds one
- Both clients could then modify the file concurrently, corrupting data
NSM Protocol Operation
NSM (program 100024) tracks the health of networked machines and provides crash notifications:
On Startup (SM_NOTIFY): When a machine boots, its statd daemon notifies all previously-registered peers:
"I'm machine X. My state counter is now 5."
State counter increments on each reboot, so peers know this is a fresh boot.
Monitor Registration (SM_MON): When a client locks a file, its lockd tells statd:
"Monitor server.example.com. If it reboots, tell me."
Crash Notification: When the server reboots and sends SM_NOTIFY, each client's statd alerts its local lockd, which then re-sends lock requests (with the reclaim flag set) during the server's grace period to rebuild the lost lock state.
The Grace Period
When the server's lock manager restarts, it enters a grace period (typically 90 seconds):
During grace period: the server accepts only reclaim requests (NLM_LOCK with the reclaim flag set) from clients re-asserting locks they held before the crash; new lock requests are refused with nlm_denied_grace_period.
After grace period: locks that were not reclaimed are considered abandoned, and normal lock processing resumes for all clients.
This mechanism prevents a race where new lock requests sneak in before legitimate owners can reclaim their locks.
# NSM maintains state in /var/lib/nfs/

# Current state counter (increments on boot)
$ cat /var/lib/nfs/state
5

# Monitored hosts (persist across reboots)
$ ls /var/lib/nfs/sm/
client1.example.com  client2.example.com  server.example.com

# Backup for recovery
$ ls /var/lib/nfs/sm.bak/

# When recovering after crash:
# 1. statd reads sm.bak/ to find previous monitors
# 2. Sends SM_NOTIFY to each host
# 3. Moves sm.bak/* to sm/*
# 4. Increments state counter

# Example state file content (implementation varies)
$ cat /var/lib/nfs/sm/client1.example.com
# Contains: IP address, state number when registered, callback program

NSM's persistent state files (/var/lib/nfs/sm/) must survive reboots. If these files are lost, the server won't know which clients to notify, and lock recovery fails. Place this directory on persistent storage, not tmpfs.
Statelessness creates an interesting problem with file deletion. In UNIX, a deleted file remains accessible to processes that have it open—they hold references to the inode. But NFS has no concept of 'open files'.
The UNIX Deletion Model
Local File System:
1. Process A opens file X (gets file descriptor)
2. Process B deletes file X
3. File disappears from directory (unlinked)
4. Process A can still read/write via its file descriptor
5. When Process A closes, the inode is freed
This works because the kernel tracks open files and delays
actual deletion until the reference count reaches zero.
The NFS Problem
With NFS, the server doesn't know the client has the file 'open':
1. Client A opens file X (gets file handle)
2. Client B deletes file X (REMOVE RPC)
3. Server removes directory entry AND inode (no references known)
4. Client A tries to read via its file handle
5. Server: "File not found!" (NFS3ERR_STALE)
6. Application on Client A gets unexpected error
This violates UNIX semantics that applications depend on.
The 'Silly Rename' Solution
The NFS client implements a workaround called silly rename:
When a file is deleted but has open references on the client:
Instead of issuing REMOVE, the client renames the file to .nfsXXXXXXXXXXXXXXXX (a unique temporary name) and sends the real REMOVE only when the last local reference is closed. This preserves UNIX semantics while working within NFS's stateless model.
# Demonstrating silly rename behavior

# Terminal 1: Open a file and keep it open
$ cd /mnt/nfs/shared
$ exec 3< testfile.txt   # Open file on fd 3
$ cat <&3                # Can read it
Hello from testfile

# Terminal 2: Delete the file while it's open in Terminal 1
$ rm /mnt/nfs/shared/testfile.txt
$ ls -la /mnt/nfs/shared/
total 4
drwxr-xr-x 2 root root   40 Jan 15 10:00 .
drwxr-xr-x 3 root root 4096 Jan 15 09:00 ..
-rw-r--r-- 1 root root   20 Jan 15 10:00 .nfs00000001a5e0003e   # Silly name!

# The file was renamed, not deleted

# Terminal 1: Still can read!
$ cat <&3
Hello from testfile

# Close the file descriptor
$ exec 3<&-

# Now check directory again
$ ls -la /mnt/nfs/shared/
total 0
drwxr-xr-x 2 root root   40 Jan 15 10:00 .
drwxr-xr-x 3 root root 4096 Jan 15 09:00 ..

# Silly-named file is now gone (deleted on close)

Silly rename has edge cases: if the client crashes before closing the file, the .nfsXXXX file remains on the server permanently. These orphaned files must be cleaned up manually. Also, if the rename itself fails (e.g., read-only export), the delete succeeds but the file becomes inaccessible.
NFSv4, released in 2003, represents a significant evolution: it deliberately incorporates limited statefulness to address the most painful limitations of the stateless model while preserving its benefits.
Key NFSv4 Stateful Features:
| Feature | Purpose | State Maintained |
|---|---|---|
| Delegations | Client caching without server round-trips | Server tracks which clients have delegations |
| Leases | Bounded lifetime for state, enabling cleanup | Every state has an expiration time |
| Lock Integration | Locking built into NFS protocol (not separate NLM) | Server tracks locks per client session |
| Open State | Server knows which files are open | Open files per client session |
| Sessions (v4.1) | Exactly-once semantics via slot tables | Operation sequence numbers per session |
Delegations: Caching with Confidence
A delegation is the server saying to a client: "I delegate control of this file to you. Until I recall the delegation, you can cache and modify without asking me."
Delegation Types:
- Read Delegation: "No one else will modify this file"
→ Client can cache reads without revalidating
- Write Delegation: "No one else will read or modify this file"
→ Client can cache writes locally without sending to server
The server tracks delegations. If another client wants access that conflicts with a delegation, the server issues a recall:
1. Client A has write delegation on file X
2. Client B tries to open file X
3. Server: "Wait, I need to recall A's delegation"
4. Server → Client A: CB_RECALL (give back your delegation)
5. Client A flushes cached writes, returns delegation
6. Server → Client B: "OK, proceed with open"
This requires the server to maintain state about delegations, but provides significant performance benefits.
Leases: State with Expiration
NFSv4 uses leases to bound the lifetime of state. Every piece of state (delegation, lock, open file) has a lease period:
This solves the client-crash cleanup problem:
Client crash scenario:
1. Client holds lock on file (with 90-second lease)
2. Client crashes, stops sending renewals
3. 90 seconds elapse
4. Server: "Lease expired, revoking state"
5. State cleaned up automatically
6. Other clients can now acquire lock
Renewal is efficient—a single SEQUENCE or RENEW operation extends all of a client's leases.
/* NFSv4 State Management Structures */

struct nfs4_client {
    clientid4 cl_clientid;              /* Unique client identifier */
    verifier4 cl_verifier;              /* Client boot verifier */
    time_t cl_time;                     /* Last renewal time */
    time_t cl_lease_time;               /* Lease duration (seconds) */

    struct list_head cl_openowners;     /* Open owner states */
    struct list_head cl_delegations;    /* Granted delegations */
    struct list_head cl_callbacks;      /* Callback pending */

    struct sockaddr_storage cl_cb_addr; /* Callback address */
    bool cl_cb_connected;               /* Callback channel up? */
};

/* Lease expiration check (called periodically) */
void nfs4_expire_clients(void)
{
    struct nfs4_client *client, *tmp;
    time_t cutoff = time(NULL) - server_lease_time;

    list_for_each_entry_safe(client, tmp, &client_list, cl_list) {
        if (client->cl_time < cutoff) {
            /* Lease expired - client didn't renew */
            log_info("Client %llx lease expired, revoking state",
                     client->cl_clientid);

            /* Revoke all delegations */
            revoke_client_delegations(client);

            /* Release all locks */
            release_client_locks(client);

            /* Close all open files */
            close_client_opens(client);

            /* Remove client record */
            destroy_client(client);
        }
    }
}

/* Client renews by any operation or explicit RENEW */
void nfs4_renew_client(struct nfs4_client *client)
{
    client->cl_time = time(NULL);
}

NFSv4's approach is pragmatic: maintain state where it provides clear benefits (caching, locking), but with bounded lifetime (leases) to enable automatic cleanup. Most operations remain stateless (reads, writes, lookups). This hybrid approach has proven successful for 20+ years.
The stateless protocol design is one of NFS's most important and influential architectural decisions. Let's consolidate what we've learned:
What's Next
With statelessness understood, we're ready to explore NFS Versions—the evolution from NFSv2 through NFSv4.2. We'll see how each version addressed limitations of its predecessors while maintaining compatibility, and understand the specific features and trade-offs of each version. This practical knowledge helps you choose the right NFS version for your deployment.
You now deeply understand NFS's stateless design philosophy—its motivations, mechanics, and implications. You can predict NFS behavior in failure scenarios, understand the role of auxiliary protocols like NLM and NSM, and appreciate NFSv4's evolution toward bounded statefulness. This foundation is essential for effective NFS deployment and troubleshooting.