A spinlock is just the beginning. Production synchronization requires more: the ability to block when spinning becomes wasteful, wake efficiently when the lock becomes available, handle priority correctly, and provide fairness guarantees. These capabilities transform a simple spinlock into a full-featured mutex (mutual exclusion lock).
In this page, we'll build upon our Test-and-Set foundation to construct complete lock abstractions. We'll examine how operating systems combine hardware primitives with kernel services to create the locks that programmers actually use. Understanding this architecture reveals why different lock types exist and when to choose each.
The goal is architectural understanding—seeing how simple atomic operations compose into sophisticated synchronization primitives that balance competing concerns: latency, throughput, fairness, and efficiency.
By the end of this page, you will understand: (1) The architecture of mutex implementations, (2) How to combine spinning with blocking, (3) Wait queue management and wake-up strategies, (4) Implementing fairness and preventing starvation, and (5) Real-world mutex implementations in Linux and POSIX.
A mutex (mutual exclusion lock) extends the basic spinlock with capabilities that make it suitable for general-purpose synchronization. Unlike spinlocks that waste CPU cycles while waiting, mutexes can block—the waiting thread yields the CPU and sleeps until the lock becomes available.
Components of a Mutex:
```c
// Complete mutex structure
typedef struct mutex {
    // Primary state: 0=unlocked, 1=locked, 2+=locked with waiters
    atomic_int state;

    // Wait queue for blocked threads
    struct list_head wait_queue;

    // Spinlock protecting the wait queue
    spinlock_t wait_lock;

    // Owner thread (for recursive locks and debugging)
    struct thread *owner;

    // Recursion count (for recursive mutexes)
    int recursion_count;

#ifdef DEBUG_MUTEX
    const char *name;
    struct mutex_debug_info debug;
#endif
} mutex_t;

// Wait queue entry for each blocked thread
typedef struct mutex_waiter {
    struct list_node node;     // Linkage in wait queue
    struct thread *thread;     // The waiting thread
    int flags;                 // Priority, timeout info
} mutex_waiter_t;

// Initialize a mutex
void mutex_init(mutex_t *mutex) {
    atomic_store(&mutex->state, 0);
    list_init(&mutex->wait_queue);
    spinlock_init(&mutex->wait_lock);
    mutex->owner = NULL;
    mutex->recursion_count = 0;
}
```
The Three-State Model:
Production mutexes often use a three-state model for efficiency:
- 0: unlocked
- 1: locked, no waiters
- 2: locked, with one or more threads blocked on the wait queue
This distinction matters for the fast path: if state is 1 when unlocking, we know no one needs to be woken and can skip the expensive wake-up path.
Without the 'locked with waiters' state, every unlock would need to check the wait queue (expensive). With it, unlock can check state atomically: if state==1, just set to 0; if state==2, take the slow path and wake a waiter. This optimization is critical for uncontended performance.
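As a minimal sketch of this model (the constant names below are illustrative, not from any particular implementation), the three states and their transitions look like this:

```c
// Illustrative names for the three states of the 'state' field
enum mutex_state {
    MUTEX_UNLOCKED       = 0,   // nobody holds the lock
    MUTEX_LOCKED         = 1,   // held, no thread is blocked waiting
    MUTEX_LOCKED_WAITERS = 2,   // held, at least one thread is on the wait queue
};

// Transitions:
//   lock, uncontended:    0 -> 1   (single CAS, fast path)
//   lock, must block:     1 -> 2   (waiter marks the lock before sleeping)
//   unlock, no waiters:   1 -> 0   (single CAS, fast path; nothing to wake)
//   unlock, with waiters: 2 -> slow path: wake one waiter, then release
```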
Lock/Unlock Protocol Overview:
```
Mutex Lock Protocol:
====================
1. Try fast path: atomic CAS (0 → 1)
   If success: Lock acquired! Set owner, return

2. Slow path: spin briefly (if enabled)
   For N iterations: test-and-test-and-set
   If acquired during spin: Set owner, return

3. Block path: prepare to sleep
   a. Take wait_lock spinlock
   b. Add self to wait queue
   c. Update state to indicate waiters (1 → 2)
   d. Release wait_lock
   e. Call scheduler to sleep
   f. (wake up) → retry acquisition; re-queue if the race is lost

Mutex Unlock Protocol:
======================
1. Clear owner, release memory barrier

2. Fast path check: was state == 1?
   If yes: atomic CAS (1 → 0)
   If success: Done! No waiters to wake

3. Slow path: wake a waiter
   a. Take wait_lock spinlock
   b. Remove first waiter from queue
   c. Release the lock (state = 0); the woken thread restores state = 2
      if it finds other waiters still queued
   d. Release wait_lock
   e. Wake the removed thread
```
The key insight in mutex design is that most lock operations are uncontended. In well-designed systems, locks are held briefly and contention is rare. Therefore, the most important optimization is making the uncontended path as fast as possible.
The Ideal Fast Path:
For an uncontended lock, the entire operation should be a single atomic compare-and-exchange that changes the state from 0 to 1, plus a store recording the owner.
No function calls, no branches taken, no memory allocation, no kernel involvement.
Similarly, an uncontended unlock should be a single atomic compare-and-exchange that changes the state from 1 back to 0, after clearing the owner.
Again, pure user-space, predictable, fast.
```c
// Fast-path optimized mutex
// Designed for the uncontended case to be maximum speed

void mutex_lock(mutex_t *mutex) {
    // FAST PATH: Try to acquire immediately
    // Single atomic operation, no branches in common case
    if (likely(atomic_compare_exchange_strong_explicit(
            &mutex->state,
            &(int){0},              // Expected: unlocked
            1,                      // Desired: locked
            memory_order_acquire,
            memory_order_relaxed))) {
        // Success! Lock acquired on first try
        mutex->owner = current_thread();
        return;
    }

    // SLOW PATH: Lock is contended
    mutex_lock_slowpath(mutex);
}

void mutex_unlock(mutex_t *mutex) {
    mutex->owner = NULL;

    // FAST PATH: Try to release if no waiters
    if (likely(atomic_compare_exchange_strong_explicit(
            &mutex->state,
            &(int){1},              // Expected: locked, no waiters
            0,                      // Desired: unlocked
            memory_order_release,
            memory_order_relaxed))) {
        // Success! No waiters, we're done
        return;
    }

    // SLOW PATH: There are waiters to wake
    mutex_unlock_slowpath(mutex);
}

// likely() macro for branch prediction hint
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
```
Measuring Fast Path Performance:
| Operation | Cycles (x86-64) | Time @ 3GHz |
|---|---|---|
| Lock (CAS success) | 15-25 | 5-8 ns |
| Unlock (CAS success) | 10-20 | 3-7 ns |
| Lock+Unlock round trip | 25-45 | 8-15 ns |
| For comparison: function call | 5-10 | 2-3 ns |
| For comparison: L3 cache miss | 40-60 | 13-20 ns |
The fast path is often inlined at call sites, eliminating function call overhead. The slow path (which handles blocking) remains a separate function to keep the code size of the fast path minimal, improving instruction cache efficiency.
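A sketch of that layout, restating the fast path from the previous listing (the `static inline`/`__attribute__((noinline))` split and the file names are illustrative):

```c
// mutex.h -- fast path, inlined at every call site
static inline void mutex_lock(mutex_t *m) {
    int expected = 0;
    if (likely(atomic_compare_exchange_strong_explicit(
            &m->state, &expected, 1,
            memory_order_acquire, memory_order_relaxed))) {
        m->owner = current_thread();
        return;                      // common case: no call, branch predicted
    }
    mutex_lock_slowpath(m);          // rare case: one out-of-line call
}

// mutex.c -- slow path, deliberately kept out of line
__attribute__((noinline)) void mutex_lock_slowpath(mutex_t *m) {
    /* spinning, queueing, and blocking logic (see the slow path below) */
}
```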
Avoiding Unnecessary Work:
Fast-path design principles:
- Do one atomic read-modify-write and nothing more
- Make no function calls on the common path (inline the fast path at call sites)
- Never touch the wait queue, allocate memory, or enter the kernel unless contended
- Keep all blocking logic in a separate, out-of-line slow-path function
When the fast path fails, we enter the slow path. This is where the interesting design decisions happen: how long to spin before blocking, how to manage the wait queue, and how to wake waiters efficiently.
Adaptive Spinning:
Before blocking, it often makes sense to spin briefly. If the lock holder is actively running and about to release, spinning can be faster than the context switch overhead of blocking and waking.
```c
// Slow path with adaptive spinning
noinline void mutex_lock_slowpath(mutex_t *mutex) {
    int spins = 0;
    const int MAX_SPINS = 100;    // Tunable parameter

    // Phase 1: Adaptive spinning
    while (spins < MAX_SPINS) {
        // Check if lock is available
        if (atomic_load_explicit(&mutex->state, memory_order_relaxed) == 0) {
            // Try to acquire
            if (atomic_compare_exchange_weak_explicit(
                    &mutex->state, &(int){0}, 1,
                    memory_order_acquire, memory_order_relaxed)) {
                mutex->owner = current_thread();
                return;   // Acquired during spin!
            }
        }

        // Check if holder is still running (adaptive part)
        if (mutex->owner && !is_running(mutex->owner)) {
            // Holder is not on CPU - blocking is better than spinning
            break;
        }

        cpu_relax();
        spins++;
    }

    // Phase 2: Prepare to block
    mutex_waiter_t waiter;
    waiter.thread = current_thread();
    waiter.flags = 0;

    // Add ourselves to wait queue
    spin_lock(&mutex->wait_lock);

    // Double-check: lock might have been released
    if (atomic_load_explicit(&mutex->state, memory_order_relaxed) == 0) {
        if (atomic_compare_exchange_strong_explicit(
                &mutex->state, &(int){0},
                list_empty(&mutex->wait_queue) ? 1 : 2,
                memory_order_acquire, memory_order_relaxed)) {
            spin_unlock(&mutex->wait_lock);
            mutex->owner = current_thread();
            return;   // Acquired after taking wait_lock
        }
    }

    // Add to queue and mark that we're waiting
    list_add_tail(&waiter.node, &mutex->wait_queue);

    // Set state to indicate waiters
    int expected = 1;
    atomic_compare_exchange_strong(&mutex->state, &expected, 2);

    spin_unlock(&mutex->wait_lock);

    // Phase 3: Block and wait
    // (A real implementation must also close the window between dropping
    //  wait_lock and sleeping, e.g. with a prepare-to-wait protocol;
    //  that detail is omitted here for clarity.)
    while (1) {
        // Sleep until woken
        scheduler_sleep(&waiter);

        // Woken up - try to acquire.
        // The waker already removed us from the wait queue.
        spin_lock(&mutex->wait_lock);

        // We might have been woken spuriously or lost a race to a barging thread
        if (atomic_load(&mutex->state) == 0) {
            if (atomic_compare_exchange_strong_explicit(
                    &mutex->state, &(int){0},
                    list_empty(&mutex->wait_queue) ? 1 : 2,
                    memory_order_acquire, memory_order_relaxed)) {
                spin_unlock(&mutex->wait_lock);
                mutex->owner = current_thread();
                return;   // Finally acquired!
            }
        }

        // Race lost: put ourselves back on the queue and sleep again
        list_add_tail(&waiter.node, &mutex->wait_queue);
        expected = 1;
        atomic_compare_exchange_strong(&mutex->state, &expected, 2);
        spin_unlock(&mutex->wait_lock);
    }
}
```
Checking if the lock holder is running (is_running()) requires accessing scheduler data, which may itself require synchronization. This check must be very cheap or the benefit of avoiding unnecessary blocking is lost. Some implementations approximate this with heuristics rather than exact checking.
Unlock and Wake-up:
The unlock slow path must efficiently wake exactly one waiter (or all waiters, depending on design):
```c
// Unlock slow path: wake a waiter
noinline void mutex_unlock_slowpath(mutex_t *mutex) {
    mutex_waiter_t *waiter = NULL;

    spin_lock(&mutex->wait_lock);

    // Get first waiter from queue
    if (!list_empty(&mutex->wait_queue)) {
        waiter = list_first_entry(&mutex->wait_queue, mutex_waiter_t, node);
        list_del(&waiter->node);
    }

    // Release the lock. The woken thread restores state=2 when it acquires
    // and finds other waiters still queued (see the lock slow path), so the
    // next unlock still takes the slow path instead of losing a wake-up.
    // (A handoff variant would instead keep the lock held and transfer
    //  ownership directly to the woken thread.)
    atomic_store_explicit(&mutex->state, 0, memory_order_release);

    spin_unlock(&mutex->wait_lock);

    // Wake the waiter after dropping wait_lock, so the critical section
    // stays short and the woken thread doesn't immediately contend on it
    if (waiter) {
        scheduler_wake(waiter->thread);
    }
}
```
Wake Strategies:
There are several approaches to waking waiters; a sketch of the two simplest appears after the table:
| Strategy | Fairness | Throughput | Latency | Use Case |
|---|---|---|---|---|
| Wake-one | Low | High | Variable | General purpose |
| Handoff | High | Medium | Predictable | Fairness critical |
| Wake-all | High-ish | Low | Low (first) | Broadcast signals |
| Batched | Medium | High | Medium | High contention |
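As a sketch of the two simplest strategies (reusing the mutex_t, mutex_waiter_t, list, and scheduler helpers defined earlier on this page; not tied to any particular implementation):

```c
// Wake-one: give a single waiter the chance to acquire.
// This is essentially what the unlock slow path above does.
static void wake_one(mutex_t *mutex) {
    mutex_waiter_t *w = NULL;

    spin_lock(&mutex->wait_lock);
    if (!list_empty(&mutex->wait_queue)) {
        w = list_first_entry(&mutex->wait_queue, mutex_waiter_t, node);
        list_del(&w->node);
    }
    spin_unlock(&mutex->wait_lock);

    if (w)
        scheduler_wake(w->thread);       // wake outside the wait_lock
}

// Wake-all: wake every waiter and let them race for the lock.
// The first to run wins; the rest re-block (the "thundering herd" cost).
static void wake_all(mutex_t *mutex) {
    spin_lock(&mutex->wait_lock);
    while (!list_empty(&mutex->wait_queue)) {
        mutex_waiter_t *w = list_first_entry(&mutex->wait_queue,
                                             mutex_waiter_t, node);
        list_del(&w->node);
        scheduler_wake(w->thread);       // a production version would collect
                                         // waiters and wake them after unlocking
    }
    spin_unlock(&mutex->wait_lock);
}
```

Handoff and batched waking are refinements of wake-one: handoff transfers ownership directly to the woken thread, while batching wakes a small group at a time to amortize wake-up costs under heavy contention.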
Basic TAS spinlocks provide no fairness guarantees—a thread can wait indefinitely while others repeatedly acquire the lock. For many applications, this is unacceptable. Building fair locks using TAS requires additional mechanisms.
The Fairness Problem:
Consider a workload where threads T1-T10 all need the lock. With a basic TAS lock:
- The thread that just released the lock often re-acquires it immediately, because the lock's cache line is already in its cache
- Threads on distant cores or other NUMA nodes keep losing the race and can wait without bound
- Nothing limits how many times one thread can acquire the lock before another gets a turn
This unfairness arises because TAS doesn't remember who arrived first—it's a free-for-all.
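A small experiment makes this concrete. The harness below is illustrative only (thread count, run time, and the exact skew you observe will vary by machine): it hammers a bare TAS lock from several threads for a fixed time and counts acquisitions per thread; the distribution is typically far from even.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4

static atomic_flag lock = ATOMIC_FLAG_INIT;   // bare TAS spinlock
static atomic_bool stop = false;
static long counts[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                 // spin: no queue, no fairness
        counts[id]++;                         // tiny critical section
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    sleep(1);                                 // let the threads fight for a second
    atomic_store(&stop, true);

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d: %ld acquisitions\n", i, counts[i]);
    return 0;
}
```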
Starvation isn't theoretical. In a 2016 incident, a major website went down because a single thread held a lock while other threads starved. The starving threads included the health check handler, so the load balancer kept sending traffic to the 'dead' server. Fairness matters.
Ticket Locks: Fair TAS-based Locking:
Ticket locks use an atomic fetch-and-increment (which can itself be built from TAS or CAS) to implement a queueing discipline. The idea is borrowed from bakery ticket systems: take a number, wait for your number to be called.
```c
// Ticket Lock: Fair FIFO ordering built from an atomic fetch-and-add
typedef struct {
    atomic_uint next_ticket;    // Next ticket to be issued
    atomic_uint now_serving;    // Ticket currently being served
} ticket_lock_t;

#define TICKET_LOCK_INIT { .next_ticket = 0, .now_serving = 0 }

void ticket_lock(ticket_lock_t *lock) {
    // Atomically get our ticket number
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    // Wait until our number is called
    while (atomic_load_explicit(
               &lock->now_serving, memory_order_acquire) != my_ticket) {
        cpu_relax();
    }
    // Lock acquired!
}

void ticket_unlock(ticket_lock_t *lock) {
    // Advance to next ticket
    atomic_fetch_add_explicit(
        &lock->now_serving, 1, memory_order_release);
}

// Analysis:
// - FIFO fairness: threads served in arrival order
// - No starvation: bounded wait (at most N-1 threads ahead)
// - Simple implementation
//
// Drawbacks:
// - All waiters spin on same cache line (now_serving) - not scalable
// - More expensive than basic TAS in uncontended case
```
MCS Locks: Scalable Fair Locking:
The MCS lock (Mellor-Crummey and Scott, 1991) provides fairness while eliminating the cache line bouncing problem of ticket locks. Each waiter spins on its own private variable.
```c
// MCS Lock: Scalable, fair spinlock
// Each waiter spins on a local variable, eliminating cache line bouncing

typedef struct mcs_node {
    struct mcs_node *next;
    atomic_int locked;            // 0 = has lock, 1 = waiting
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;   // Tail of queue (or NULL if unlocked)
} mcs_lock_t;

void mcs_lock(mcs_lock_t *lock, mcs_node_t *node) {
    node->next = NULL;
    atomic_store(&node->locked, 1);   // We're waiting

    // Atomically add ourselves to the queue
    mcs_node_t *prev = atomic_exchange(&lock->tail, node);

    if (prev == NULL) {
        // Queue was empty - we have the lock!
        return;
    }

    // Link ourselves to predecessor
    prev->next = node;

    // Spin on OUR OWN locked flag (not shared!)
    while (atomic_load_explicit(&node->locked, memory_order_acquire)) {
        cpu_relax();
    }
    // Predecessor woke us - we have the lock
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *node) {
    // Check if we're the only one
    if (node->next == NULL) {
        // Try to atomically clear the tail
        mcs_node_t *expected = node;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL)) {
            // Success - lock is now free
            return;
        }
        // Someone is in the process of queueing behind us
        // Wait for them to link
        while (node->next == NULL) {
            cpu_relax();
        }
    }

    // Wake our successor by clearing their locked flag
    atomic_store_explicit(&node->next->locked, 0, memory_order_release);
}

// Usage requires the caller to provide a per-thread mcs_node:
//   thread_local mcs_node_t my_node;
//   mcs_lock(&lock, &my_node);
//   ... critical section ...
//   mcs_unlock(&lock, &my_node);
```
Comparing fair lock designs:
| Lock Type | Fairness | Cache Behavior | Per-Thread State | Complexity |
|---|---|---|---|---|
| TAS Spinlock | None | All spin on one line | None | Simple |
| Ticket Lock | FIFO | All spin on one line | O(1) | Simple |
| MCS Lock | FIFO | Spin on private line | O(N) total | Moderate |
| CLH Lock | FIFO | Spin on pred's line | O(N) total | Moderate |
Linux combines ticket lock simplicity with MCS scalability in its qspinlock. Under low contention, it behaves like a ticket lock. Under high contention, waiters form an MCS-style queue. This adaptive approach achieves both good fast-path performance and scalable contention handling.
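The kernel's actual qspinlock is considerably more intricate, but the shape of the idea can be sketched like this (conceptual only, reusing the mcs_lock_t/mcs_node_t types and cpu_relax() from above; this is not the Linux implementation):

```c
// Conceptual hybrid: a plain lock word for the light-contention case,
// plus an MCS queue that orders waiters once contention builds up.
typedef struct {
    atomic_int val;          // 0 = free, 1 = held
    mcs_lock_t queue;        // waiters line up here under contention
} hybrid_lock_t;

void hybrid_lock(hybrid_lock_t *lock, mcs_node_t *node) {
    // Light contention: a bounded number of direct attempts on the lock word
    for (int i = 0; i < 64; i++) {
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(
                &lock->val, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        cpu_relax();
    }

    // Heavy contention: join the MCS queue. Only the queue head spins
    // on the lock word, so the cache line stops bouncing.
    mcs_lock(&lock->queue, node);
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &lock->val, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
        cpu_relax();
    }
    mcs_unlock(&lock->queue, node);
}

void hybrid_unlock(hybrid_lock_t *lock) {
    atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

The point is only the shape: fairness and scalability come from the queue, while the light-contention cost stays close to a plain TAS lock.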
Let's examine how major operating systems and libraries implement mutexes, seeing how they balance the competing concerns we've discussed.
Linux Kernel Mutex:
The Linux kernel mutex (not to be confused with futex) is a sleeping lock optimized for the uncontended case.
```c
// Simplified Linux kernel mutex implementation
// Actual implementation is in kernel/locking/mutex.c

struct mutex {
    atomic_long_t     owner;       // Owner task + flags in low bits
    spinlock_t        wait_lock;   // Protects wait_list
    struct list_head  wait_list;
#ifdef CONFIG_DEBUG_MUTEXES
    void              *magic;
    struct lockdep_map dep_map;
#endif
};

// Flags stored in low bits of 'owner' pointer
#define MUTEX_FLAG_WAITERS  0x01
#define MUTEX_FLAG_HANDOFF  0x02
#define MUTEX_FLAG_PICKUP   0x04

void mutex_lock(struct mutex *lock) {
    // Fast path: single atomic if uncontended
    if (!__mutex_trylock_fast(lock)) {
        // Slow path: optimistic spin or block
        __mutex_lock_slowpath(lock);
    }
}

static __always_inline bool __mutex_trylock_fast(struct mutex *lock) {
    // Attempt CAS: NULL -> current task
    unsigned long curr = (unsigned long)current;
    unsigned long zero = 0UL;

    return atomic_long_try_cmpxchg_acquire(&lock->owner, &zero, curr);
}

void __mutex_lock_slowpath(struct mutex *lock) {
    // 1. Optimistic spinning (MCS-style queue for spinning)
    //    Only if owner is running on a CPU
    // 2. If spinning fails or not appropriate, block
    //    Add to wait_list, set WAITERS flag, sleep
    // 3. Handle handoff for fairness
    //    After sleeping too long, request handoff
}
```
POSIX pthread_mutex:
POSIX threads provide pthread_mutex_t with configurable behavior:
```c
// POSIX mutex with various configurations
#include <pthread.h>

// Normal mutex (default)
pthread_mutex_t normal_mutex = PTHREAD_MUTEX_INITIALIZER;

// Error-checking mutex (detects deadlock attempts)
pthread_mutex_t errorcheck_mutex;
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
pthread_mutex_init(&errorcheck_mutex, &attr);

// Recursive mutex (same thread can lock multiple times)
pthread_mutex_t recursive_mutex;
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
pthread_mutex_init(&recursive_mutex, &attr);

// Robust mutex (handles owner death)
pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

// Priority inheritance (prevents priority inversion)
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

// Lock operations
int rc = pthread_mutex_lock(&normal_mutex);        // Block until acquired
rc = pthread_mutex_trylock(&normal_mutex);         // Non-blocking attempt
struct timespec ts;
rc = pthread_mutex_timedlock(&normal_mutex, &ts);  // Timeout

// Unlock
pthread_mutex_unlock(&normal_mutex);

// Error checking examples:
// PTHREAD_MUTEX_ERRORCHECK:
//   - Returns EDEADLK if thread tries to lock mutex it owns
//   - Returns EPERM if thread tries to unlock mutex it doesn't own
```
Windows Critical Sections:
Windows provides CRITICAL_SECTION, a hybrid spinlock/mutex:
```c
// Windows Critical Section
#include <windows.h>

CRITICAL_SECTION cs;

// Initialize with spin count
// SpinCount: number of spins before blocking (0-4096)
// Higher on multi-processor systems
InitializeCriticalSectionAndSpinCount(&cs, 4000);

// Acquire
EnterCriticalSection(&cs);
// or non-blocking:
if (TryEnterCriticalSection(&cs)) {
    // acquired
}

// Critical section work...

// Release
LeaveCriticalSection(&cs);

// Cleanup
DeleteCriticalSection(&cs);

// Slim Reader/Writer Lock (SRW Lock) - lighter weight, no spin count
SRWLOCK srw = SRWLOCK_INIT;
AcquireSRWLockExclusive(&srw);
ReleaseSRWLockExclusive(&srw);

// Internal structure (simplified):
typedef struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG      LockCount;        // -1 = free, 0+ = locked + waiters
    LONG      RecursionCount;   // Recursive acquisition count
    HANDLE    OwningThread;     // Owner thread ID
    HANDLE    LockSemaphore;    // Blocking mechanism (auto-allocated)
    ULONG_PTR SpinCount;        // Spin count before blocking
} RTL_CRITICAL_SECTION;
```
For new code, Microsoft recommends SRW locks over Critical Sections. SRW locks are smaller (just one pointer-sized word), faster for uncontended cases, and provide reader/writer semantics. They don't support recursive locking, which is often a design smell anyway.
When using locks in systems with priority scheduling, a subtle but dangerous problem arises: priority inversion. A high-priority task can be blocked indefinitely by a lower-priority task, violating the fundamental premise of priority scheduling.
The Mars Pathfinder Bug:
In 1997, NASA's Mars Pathfinder mission experienced repeated system resets due to priority inversion. A low-priority thread holding a mutex was preempted by a medium-priority thread, while a high-priority thread waited for the mutex. The high-priority thread starved, triggering a watchdog reset.
Priority Inheritance Protocol:
When a high-priority thread blocks on a mutex, temporarily boost the lock holder's priority to match. This prevents medium-priority threads from preempting the holder.
```c
// Priority Inheritance implementation sketch

void mutex_lock_with_pi(mutex_t *mutex) {
    if (!try_acquire(mutex)) {
        // We're blocked - boost owner's priority
        struct thread *owner = mutex->owner;
        int my_priority = current_thread()->priority;

        if (owner->priority < my_priority) {
            // Boost owner to our priority
            boost_priority(owner, my_priority);
        }

        // Now block
        block_on_mutex(mutex);
    }
    // Acquired! (priority boosting cleaned up by unlock)
}

void mutex_unlock_with_pi(mutex_t *mutex) {
    // Restore original priority (if we were boosted)
    if (current_thread()->base_priority != current_thread()->priority) {
        restore_priority(current_thread());
    }

    // Wake highest-priority waiter
    struct thread *waiter = get_highest_priority_waiter(mutex);
    release_lock(mutex);

    if (waiter) {
        wake(waiter);
    }
}
```
Priority inversion is particularly dangerous in real-time systems where deadlines are hard. POSIX provides PTHREAD_PRIO_INHERIT and PTHREAD_PRIO_PROTECT attributes for mutexes. If your system has priority levels and shared locks, you MUST consider inversion—it will bite you eventually.
We have explored the complete architecture of lock implementations built upon Test-and-Set primitives. From simple spinlocks to sophisticated mutexes with fairness, blocking, and priority handling, we've seen how atomic operations compose into production-grade synchronization. Let's consolidate the key insights:
- Make the uncontended path a single atomic operation and push everything else into an out-of-line slow path
- Spin briefly, and adaptively, before blocking; block once the holder is off the CPU or the spin budget is spent
- Encode "waiters exist" in the lock state so unlock only pays the wake-up cost when someone is actually waiting
- Fairness requires remembering arrival order; ticket and MCS/CLH locks provide FIFO service at different scalability costs
- In priority-scheduled systems, pair locks with priority inheritance (or a similar protocol) to avoid priority inversion
What's Next:
Our final topic in this module addresses a subtle but critical issue: the ABA problem. When using Compare-and-Swap operations (a close relative of Test-and-Set), the possibility of values cycling back to old states creates correctness bugs that are devilishly hard to detect. Understanding and preventing ABA is essential for lock-free programming and certain advanced lock implementations.
You now understand how complete mutex implementations are built from atomic primitives. From fast paths through blocking mechanisms to fairness and priority protocols, you can reason about lock implementations at any level of detail. This knowledge is essential for debugging synchronization issues and designing efficient concurrent systems.