When Sun Microsystems designed NFS in the early 1980s, they made a radical choice: the server would maintain no state about client activities. Every request would be self-contained, carrying all information needed for processing. The server would have no memory of previous interactions.
This decision—seemingly simple yet profoundly consequential—shaped everything about NFS: its remarkable resilience to failures, its performance characteristics, its consistency semantics, and even its limitations. Understanding stateless design is essential for anyone working with NFS, because virtually every NFS behavior can be traced back to this foundational choice.
In this page, we'll explore why statelessness was chosen, how it enables NFS's legendary crash recovery, the engineering challenges it creates, and how NFS accommodates operations that inherently require state.
By the end of this page, you will understand the philosophy behind stateless design, how idempotent operations enable transparent crash recovery, the specific challenges statelessness creates for file locking and deletion, and the auxiliary protocols (NLM, NSM) that provide necessary state. You'll be able to predict NFS behavior in failure scenarios.
To appreciate why NFS chose statelessness, we must understand the alternative—and its problems.
The Stateful Approach
A stateful file server maintains information about each client's session:
Client Session State (stateful server would track):
- Which files this client has open
- Current file positions (offsets)
- Lock ownership
- Pending operations
- Authentication session
This seems natural—after all, local file systems track open files. But in a distributed system, maintaining state creates severe complications: the server must reconstruct or discard session state after a crash, detect and clean up state for clients that vanish without warning, and devote memory to every connected client, which limits scalability.
The Stateless Solution
NFS sidesteps these problems elegantly: the server simply doesn't maintain per-client state.
The server treats each request as independent. It looks up the file (via file handle), performs the operation, and forgets about the interaction immediately.
When NFS was designed, networks were much less reliable than today. Packet loss, cable failures, and server crashes were common. The stateless design made NFS robust in this hostile environment. A motto from the era: 'Simple systems work; complex systems fail in complex ways.'
The stateless design works because NFS operations are designed to be idempotent: executing the same operation multiple times produces the same result as executing it once.
Formal Definition:
An operation f is idempotent if: f(f(x)) = f(x)
In NFS terms: sending the same request twice (due to a retry after timeout) should not corrupt the file system or produce errors.
Why Idempotency Matters
Consider what happens with a lost network packet:
With an idempotent READ, this is completely safe—reading the same data twice is harmless. But what about operations that might not be naturally idempotent?
| Operation | Naturally Idempotent? | How NFS Handles It |
|---|---|---|
| READ | Yes | Reading same data multiple times is harmless |
| GETATTR | Yes | Querying attributes multiple times is harmless |
| WRITE (at offset) | Yes | Writing same data to same offset overwrites identically |
| LOOKUP | Yes | Looking up a name multiple times is harmless |
| SETATTR | Yes | Setting same attributes again is harmless |
| MKDIR (exclusive) | No | Server checks 'already exists', returns NFS3ERR_EXIST |
| CREATE (exclusive) | No | Use CREATE_VERF to detect duplicate requests |
| REMOVE | No | Server checks 'doesn't exist', returns NFS3ERR_NOENT |
| RENAME | Mostly | If already renamed, second attempt may succeed or fail safely |
| APPEND | No | Offset-based writes avoid this; append not supported in NFS |
Handling Non-Idempotent Operations
Some operations cannot be made naturally idempotent. NFS uses several strategies:
1. Exclusive Creation Verifier (NFSv3)
For CREATE with exclusive mode, the client sends a 64-bit verifier:
CREATE3args {
diropargs3 where; /* Directory + filename */
createhow3 how; /* UNCHECKED | GUARDED | EXCLUSIVE */
createverf3 verf; /* 64-bit verifier for EXCLUSIVE */
};
2. Server Reply Cache
Servers may maintain a cache of recent replies. If a duplicate request arrives:
Duplicate Request Detection:
Request arrives with XID=12345 from client 10.0.0.1
↓
Check reply cache: (client=10.0.0.1, xid=12345)
↓
Cache hit? → Return cached reply
Cache miss? → Process request, cache reply
/* Simplified NFS server reply cache (duplicate request detection) */

struct reply_cache_entry {
    struct sockaddr_storage client_addr;  /* Client address */
    uint32_t xid;                         /* RPC transaction ID */
    uint32_t program;                     /* RPC program number */
    uint32_t proc;                        /* Procedure number */
    time_t timestamp;                     /* When reply was cached */

    /* Cached reply */
    int reply_len;
    char reply_data[NFS_MAXREPLY];

    struct reply_cache_entry *next;       /* Hash chain */
};

/* Hash table of recent replies */
#define REPLY_CACHE_SIZE 1024
struct reply_cache_entry *reply_cache[REPLY_CACHE_SIZE];

/* Check for duplicate request, return cached reply if found */
struct reply_cache_entry* check_reply_cache(
    struct sockaddr_storage *client,
    uint32_t xid, uint32_t program, uint32_t proc)
{
    uint32_t hash = hash_reply(client, xid) % REPLY_CACHE_SIZE;
    struct reply_cache_entry *entry;

    for (entry = reply_cache[hash]; entry; entry = entry->next) {
        if (entry->xid == xid &&
            entry->program == program &&
            entry->proc == proc &&
            addr_equal(&entry->client_addr, client)) {
            /* Duplicate detected! */
            log_debug("Duplicate request detected: xid=%u, proc=%u",
                      xid, proc);
            return entry;  /* Return cached reply */
        }
    }
    return NULL;  /* Not a duplicate */
}

/* Cache a reply for future duplicate detection */
void cache_reply(struct sockaddr_storage *client,
                 uint32_t xid, uint32_t program, uint32_t proc,
                 void *reply, int reply_len)
{
    /* Only cache non-idempotent operations */
    if (is_idempotent(proc))
        return;

    struct reply_cache_entry *entry = alloc_cache_entry();
    memcpy(&entry->client_addr, client, sizeof(*client));
    entry->xid = xid;
    entry->program = program;
    entry->proc = proc;
    entry->timestamp = time(NULL);
    entry->reply_len = reply_len;
    memcpy(entry->reply_data, reply, reply_len);

    /* Insert into hash table */
    uint32_t hash = hash_reply(client, xid) % REPLY_CACHE_SIZE;
    entry->next = reply_cache[hash];
    reply_cache[hash] = entry;

    /* Expire old entries to prevent unbounded growth */
    expire_old_entries();
}

The reply cache is finite and entries expire.
If a client retries a non-idempotent request after the cache entry expires, the server will re-execute the operation, potentially causing errors (e.g., 'file already exists'). Applications must handle these cases gracefully.
The most celebrated benefit of NFS's stateless design is its exceptional crash recovery. Let's trace through exactly how different failure scenarios are handled.
Scenario 1: Server Crashes and Reboots
Consider an application reading a large file when the server suddenly reboots:
Timeline:
Client Server
| |
|--- READ(fh, off=0, 8K) --->| Server receives, processing
|<-- [data bytes 0-8K] ------| Response sent
| |
|--- READ(fh, off=8K, 8K) -->| Request in flight
| X SERVER CRASHES
| [timeout...] |
| | (Server rebooting...)
|--- READ(fh, off=8K, 8K) -->| Client retries
| | SERVER BACK UP
| |
| | Server receives retry
| | Validates file handle
| | Reads from disk
|<-- [data bytes 8K-16K] ----| Response sent
| |
| (Application continues |
| reading, unaware of |
| server crash!) |
Why this works:
- Every READ carries the file handle and offset, so the retry is fully self-contained
- The server needs no memory of the first attempt; it simply executes the request it receives
- READ is idempotent, so re-executing it returns the same data
Scenario 2: Client Crashes and Reboots
Timeline:
Client Server
| |
|--- WRITE(fh, off=0, data)-> | Server writes to cache
|<-- OK ---------------------- | Write acknowledged
X CLIENT CRASHES |
| (Server doesn't notice)
| (Client rebooting...) |
| |
| (Application restarts, |
| reopens file) |
|--- LOOKUP(dir_fh, "file") ->| Normal operation
|<-- file_fh + attrs --------- |
| |
Why this works:
- The server never knew the client had files open, so there is no server-side state to clean up
- The rebooted client simply starts fresh with new LOOKUPs, as in any normal session
- Writes the server already acknowledged are safe; only unsent client-side state is lost
/* Simplified NFS client RPC with retry logic */

int nfs_rpc_call(struct nfs_client *client, int procedure,
                 void *args, void *result)
{
    int timeout_ms = client->initial_timeout;
    int retries = 0;
    int major_timeout = 0;

    while (1) {
        /* Send request */
        int err = rpc_send_request(client, procedure, args);
        if (err)
            goto retry;

        /* Wait for response with timeout */
        err = rpc_wait_response(client, result, timeout_ms);
        if (err == 0) {
            return 0;  /* Success! */
        }
        if (err != -ETIMEDOUT) {
            return err;  /* Real error, not timeout */
        }

retry:
        retries++;

        /* Soft mount: give up after configured retries */
        if (client->mount_flags & NFS_MOUNT_SOFT) {
            if (retries >= client->max_retries) {
                printk(KERN_WARNING "NFS: server not responding, "
                       "giving up");
                return -EIO;  /* Return error to application */
            }
        }

        /* Hard mount: warn but keep trying */
        if ((retries % client->max_retries) == 0) {
            major_timeout++;
            printk(KERN_WARNING "NFS: server %s not responding, "
                   "still trying", client->hostname);
        }

        /* Exponential backoff, capped at max timeout */
        timeout_ms = min(timeout_ms * 2, client->max_timeout);

        /* For write operations, may need congestion control */
        if (is_write_procedure(procedure)) {
            nfs_congestion_wait(client);
        }
    }
}

Soft mounts are tempting (avoid hangs!) but dangerous. If a soft mount times out during a write, the application thinks the write failed and might take recovery action—but the server might have actually completed the write. This can cause data corruption. Use hard mounts for data you care about.
While statelessness works brilliantly for file I/O, it creates a fundamental problem for file locking. Locks are inherently stateful—they represent an ongoing relationship between a client and a file.
Why Locks Need State
Consider what a lock represents:
Client A: "I own the exclusive lock on file X"
↓
This is STATE - the server must remember it
to deny locks to Client B
↓
Client B: "Can I lock file X?"
↓
Server must consult state to answer
The Dilemma:
A stateless server cannot remember who holds which lock, yet UNIX applications expect flock() and fcntl() locking to work.
The Solution: Auxiliary Stateful Protocols
NFS solved this by separating concerns:
- The core NFS protocol stays stateless and handles all file I/O
- The Network Lock Manager (NLM) holds all lock state as a separate service
- The Network Status Monitor (NSM) tracks machine crashes so lock state can be rebuilt
This separation means that a crash affects only the locking subsystem, not basic file operations.
The Network Lock Manager (NLM) Protocol
NLM is a separate RPC service (program 100021) that handles lock operations:
Locking Sequence:
Client A Lock Manager Client B
| | |
|--- NLM_LOCK(file) ----->| |
|<-- NLM_LOCK_RES(ok) ----| [A owns lock] |
| | |
| |<--- NLM_LOCK(file) ---|
| | [blocked - A has it] |
| | |
|--- NLM_UNLOCK(file) --->| |
|<-- NLM_UNLOCK_RES(ok) --| |
| | |
| |--- NLM_GRANTED ------>|
| |<-- NLM_GRANTED_RES ---|
| | [B now owns lock] |
/* NLM (Network Lock Manager) Structures */

/* Lock owner identification */
struct nlm_lock {
    char *caller_name;     /* Client hostname */
    netobj fh;             /* NFS file handle */
    netobj oh;             /* Lock owner handle (opaque) */
    uint32_t svid;         /* System V id (PID on client) */
    uint64_t l_offset;     /* Lock start offset */
    uint64_t l_len;        /* Lock length (0 = to EOF) */
};

/* Lock request message */
struct nlm_lockargs {
    netobj cookie;         /* Request identifier */
    bool_t block;          /* Wait if lock unavailable? */
    bool_t exclusive;      /* Exclusive or shared? */
    struct nlm_lock alock; /* Lock details */
    bool_t reclaim;        /* Reclaiming after restart? */
    int state;             /* NSM state number */
};

/* Lock states for callback management */
enum nlm_stats {
    nlm_granted = 0,             /* Lock succeeded */
    nlm_denied = 1,              /* Lock conflicts with existing lock */
    nlm_denied_nolocks = 2,      /* Server out of lock resources */
    nlm_blocked = 3,             /* Request queued, will callback */
    nlm_denied_grace_period = 4, /* Server in recovery, retry later */
    nlm_deadlck = 5,             /* Would cause deadlock */
};

/* Server-side lock state tracking */
struct server_lock {
    struct list_head list;  /* All locks on this file */
    struct nlm_lock lock;   /* Lock details */
    int fl_type;            /* F_RDLCK or F_WRLCK */
    struct nlm_host *host;  /* Client that owns this lock */
    time_t timestamp;       /* For lease expiration */
};

NLM implements advisory locking—applications can ignore locks if they choose. Mandatory locking (where the OS enforces locks on all access) is complex in a distributed system and rarely used with NFS. Well-behaved applications honor advisory locks; malicious or buggy programs can bypass them.
Locks create a new crash recovery problem: what happens to locks when machines fail? This is where the Network Status Monitor (NSM) protocol comes in.
The Lock Recovery Problem
Scenario: Server Crashes While Holding Lock State
1. Client A holds lock on file X
2. Client B is blocked waiting for the lock
3. Server crashes, losing all lock state
4. Server reboots
5. ??? What happens to the locks?
Without a recovery mechanism:
- Client A's lock would silently vanish when the server's lock state is lost
- Client B could be granted a conflicting lock while A still believes it holds one
- Both clients could then modify the file concurrently, corrupting data
NSM Protocol Operation
NSM (program 100024) tracks the health of networked machines and provides crash notifications:
On Startup (SM_NOTIFY): When a machine boots, its statd daemon notifies all previously-registered peers:
"I'm machine X. My state counter is now 5."
State counter increments on each reboot, so peers know this is a fresh boot.
Monitor Registration (SM_MON): When a client locks a file, its lockd tells statd:
"Monitor server.example.com. If it reboots, tell me."
Crash Notification: When the server reboots and sends SM_NOTIFY, each client's statd alerts its local lockd, which then re-sends lock requests (with the reclaim flag set) during the server's grace period to rebuild the lost lock state.
The Grace Period
When the server's lock manager restarts, it enters a grace period (typically 90 seconds):
During grace period: the server accepts only reclaim requests (NLM_LOCK with the reclaim flag set) from clients re-asserting locks they held before the crash; new lock requests are refused with nlm_denied_grace_period.
After grace period: locks that were not reclaimed are considered abandoned, and normal lock processing resumes for all clients.
This mechanism prevents a race where new lock requests sneak in before legitimate owners can reclaim their locks.
# NSM maintains state in /var/lib/nfs/

# Current state counter (increments on boot)
$ cat /var/lib/nfs/state
5

# Monitored hosts (persist across reboots)
$ ls /var/lib/nfs/sm/
client1.example.com  client2.example.com  server.example.com

# Backup for recovery
$ ls /var/lib/nfs/sm.bak/

# When recovering after crash:
# 1. statd reads sm.bak/ to find previous monitors
# 2. Sends SM_NOTIFY to each host
# 3. Moves sm.bak/* to sm/*
# 4. Increments state counter

# Example state file content (implementation varies)
$ cat /var/lib/nfs/sm/client1.example.com
# Contains: IP address, state number when registered, callback program

NSM's persistent state files (/var/lib/nfs/sm/) must survive reboots. If these files are lost, the server won't know which clients to notify, and lock recovery fails. Place this directory on persistent storage, not tmpfs.
Statelessness creates an interesting problem with file deletion. In UNIX, a deleted file remains accessible to processes that have it open—they hold references to the inode. But NFS has no concept of 'open files'.
The UNIX Deletion Model
Local File System:
1. Process A opens file X (gets file descriptor)
2. Process B deletes file X
3. File disappears from directory (unlinked)
4. Process A can still read/write via its file descriptor
5. When Process A closes, the inode is freed
This works because the kernel tracks open files and delays
actual deletion until the reference count reaches zero.
The NFS Problem
With NFS, the server doesn't know the client has the file 'open':
1. Client A opens file X (gets file handle)
2. Client B deletes file X (REMOVE RPC)
3. Server removes directory entry AND inode (no references known)
4. Client A tries to read via its file handle
5. Server: "File not found!" (NFS3ERR_STALE)
6. Application on Client A gets unexpected error
This violates UNIX semantics that applications depend on.
The 'Silly Rename' Solution
The NFS client implements a workaround called silly rename:
When a file is deleted but has open references on the client:
Instead of issuing REMOVE, the client renames the file to .nfsXXXXXXXXXXXXXXXX (a unique temporary name) and sends the real REMOVE only when the last local reference is closed. This preserves UNIX semantics while working within NFS's stateless model.
# Demonstrating silly rename behavior

# Terminal 1: Open a file and keep it open
$ cd /mnt/nfs/shared
$ exec 3< testfile.txt   # Open file on fd 3
$ cat <&3                # Can read it
Hello from testfile

# Terminal 2: Delete the file while it's open in Terminal 1
$ rm /mnt/nfs/shared/testfile.txt
$ ls -la /mnt/nfs/shared/
total 4
drwxr-xr-x 2 root root   40 Jan 15 10:00 .
drwxr-xr-x 3 root root 4096 Jan 15 09:00 ..
-rw-r--r-- 1 root root   20 Jan 15 10:00 .nfs00000001a5e0003e   # Silly name!

# The file was renamed, not deleted

# Terminal 1: Still can read!
$ cat <&3
Hello from testfile

# Close the file descriptor
$ exec 3<&-

# Now check directory again
$ ls -la /mnt/nfs/shared/
total 0
drwxr-xr-x 2 root root   40 Jan 15 10:00 .
drwxr-xr-x 3 root root 4096 Jan 15 09:00 ..

# Silly-named file is now gone (deleted on close)

Silly rename has edge cases: if the client crashes before closing the file, the .nfsXXXX file remains on the server permanently. These orphaned files must be cleaned up manually. Also, if the rename itself fails (e.g., read-only export), the delete succeeds but the file becomes inaccessible.
NFSv4, released in 2003, represents a significant evolution: it deliberately incorporates limited statefulness to address the most painful limitations of the stateless model while preserving its benefits.
Key NFSv4 Stateful Features:
| Feature | Purpose | State Maintained |
|---|---|---|
| Delegations | Client caching without server round-trips | Server tracks which clients have delegations |
| Leases | Bounded lifetime for state, enabling cleanup | Every state has an expiration time |
| Lock Integration | Locking built into NFS protocol (not separate NLM) | Server tracks locks per client session |
| Open State | Server knows which files are open | Open files per client session |
| Sessions (v4.1) | Exactly-once semantics via slot tables | Operation sequence numbers per session |
Delegations: Caching with Confidence
A delegation is the server saying to a client: "I delegate control of this file to you. Until I recall the delegation, you can cache and modify without asking me."
Delegation Types:
- Read Delegation: "No one else will modify this file"
→ Client can cache reads without revalidating
- Write Delegation: "No one else will read or modify this file"
→ Client can cache writes locally without sending to server
The server tracks delegations. If another client wants access that conflicts with a delegation, the server issues a recall:
1. Client A has write delegation on file X
2. Client B tries to open file X
3. Server: "Wait, I need to recall A's delegation"
4. Server → Client A: CB_RECALL (give back your delegation)
5. Client A flushes cached writes, returns delegation
6. Server → Client B: "OK, proceed with open"
This requires the server to maintain state about delegations, but provides significant performance benefits.
Leases: State with Expiration
NFSv4 uses leases to bound the lifetime of state. Every piece of state (delegation, lock, open file) has a lease period:
This solves the client-crash cleanup problem:
Client crash scenario:
1. Client holds lock on file (with 90-second lease)
2. Client crashes, stops sending renewals
3. 90 seconds elapse
4. Server: "Lease expired, revoking state"
5. State cleaned up automatically
6. Other clients can now acquire lock
Renewal is efficient—a single SEQUENCE or RENEW operation extends all of a client's leases.
/* NFSv4 State Management Structures */

struct nfs4_client {
    clientid4 cl_clientid;              /* Unique client identifier */
    verifier4 cl_verifier;              /* Client boot verifier */
    time_t cl_time;                     /* Last renewal time */
    time_t cl_lease_time;               /* Lease duration (seconds) */

    struct list_head cl_openowners;     /* Open owner states */
    struct list_head cl_delegations;    /* Granted delegations */
    struct list_head cl_callbacks;      /* Callback pending */

    struct sockaddr_storage cl_cb_addr; /* Callback address */
    bool cl_cb_connected;               /* Callback channel up? */
};

/* Lease expiration check (called periodically) */
void nfs4_expire_clients(void)
{
    struct nfs4_client *client, *tmp;
    time_t cutoff = time(NULL) - server_lease_time;

    list_for_each_entry_safe(client, tmp, &client_list, cl_list) {
        if (client->cl_time < cutoff) {
            /* Lease expired - client didn't renew */
            log_info("Client %llx lease expired, revoking state",
                     client->cl_clientid);

            /* Revoke all delegations */
            revoke_client_delegations(client);

            /* Release all locks */
            release_client_locks(client);

            /* Close all open files */
            close_client_opens(client);

            /* Remove client record */
            destroy_client(client);
        }
    }
}

/* Client renews by any operation or explicit RENEW */
void nfs4_renew_client(struct nfs4_client *client)
{
    client->cl_time = time(NULL);
}

NFSv4's approach is pragmatic: maintain state where it provides clear benefits (caching, locking), but with bounded lifetime (leases) to enable automatic cleanup. Most operations remain stateless (reads, writes, lookups). This hybrid approach has proven successful for 20+ years.
The stateless protocol design is one of NFS's most important and influential architectural decisions. Let's consolidate what we've learned:
What's Next
With statelessness understood, we're ready to explore NFS Versions—the evolution from NFSv2 through NFSv4.2. We'll see how each version addressed limitations of its predecessors while maintaining compatibility, and understand the specific features and trade-offs of each version. This practical knowledge helps you choose the right NFS version for your deployment.
You now deeply understand NFS's stateless design philosophy—its motivations, mechanics, and implications. You can predict NFS behavior in failure scenarios, understand the role of auxiliary protocols like NLM and NSM, and appreciate NFSv4's evolution toward bounded statefulness. This foundation is essential for effective NFS deployment and troubleshooting.