We've established what a mutex is and explored the blocking vs. spinning tradeoff. Now we must answer a deeper question: How is a mutex actually built?

Implementing a mutex correctly is surprisingly difficult. The code must work correctly even when multiple threads execute it simultaneously—precisely the condition the mutex is designed to prevent. This is the bootstrapping problem of synchronization: we need synchronization to build synchronization.

The answer lies in hardware atomic operations—special CPU instructions that complete indivisibly, even in the presence of concurrent access from other processors.
By the end of this page, you will understand how mutexes are implemented from the ground up—starting with atomic hardware primitives, building through basic spinlock algorithms, and culminating in production-quality blocking mutex implementations. You will understand the data structures, the atomic operations, and the careful ordering that makes it all work.
Mutex implementation begins with the CPU itself. Modern processors provide special instructions that execute atomically—they cannot be interrupted or interleaved with other operations on the same memory location.

Why Atomicity Matters:

Consider a naive lock implementation:

```c
if (lock == 0) {   // Step 1: Check if free
    lock = 1;      // Step 2: Mark as taken
    // Enter critical section
}
```

Two threads could both see lock == 0 (Step 1), then both set lock = 1 (Step 2), and both believe they acquired the lock. This is a classic race condition—the 'check-then-act' pattern is inherently unsafe without atomicity.

The solution: combine the check and the modification into a single atomic operation that the hardware guarantees cannot be split.
| Operation | Semantics | CPU Instructions |
|---|---|---|
| Test-and-Set (TAS) | Read old value, write 1, return old value (atomically) | x86: XCHG, LOCK BTS |
| Compare-and-Swap (CAS) | If value == expected, write new; return success/actual | x86: LOCK CMPXCHG |
| Fetch-and-Add | Read current, add delta, return old value (atomically) | x86: LOCK XADD |
| Load-Linked/Store-Conditional | LL loads, SC stores only if unmodified since LL | ARM: LDREX/STREX, RISC-V: LR/SC |
| Atomic Load | Read with memory ordering guarantees | x86: MOV (with barriers) |
| Atomic Store | Write with memory ordering guarantees | x86: MOV (with barriers) |
```c
// ATOMIC OPERATIONS - The building blocks of all locks

#include <stdatomic.h>
#include <stdbool.h>

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// TEST-AND-SET (TAS)
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: read the value, replace with 1, return the old value
// If old was 0, we just acquired the lock
// If old was 1, lock was already held

int test_and_set(atomic_int* lock) {
    return atomic_exchange(lock, 1);
}

// Usage:
//   while (test_and_set(&lock) == 1) { /* spin */ }
//   // Critical section
//   atomic_store(&lock, 0);   // Release

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// COMPARE-AND-SWAP (CAS)
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: if *addr == expected, write new_val and return true
// else update expected with actual value and return false

bool compare_and_swap(atomic_int* addr, int* expected, int new_val) {
    return atomic_compare_exchange_strong(addr, expected, new_val);
}

// Usage for lock acquisition:
//   int expected = 0;
//   while (!compare_and_swap(&lock, &expected, 1)) {
//       expected = 0;   // Reset for retry
//       /* spin */
//   }

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// FETCH-AND-ADD
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: add delta to value, return the OLD value

int fetch_and_add(atomic_int* value, int delta) {
    return atomic_fetch_add(value, delta);
}

// Used in ticket locks:
//   int my_ticket = fetch_and_add(&next_ticket, 1);
//   while (now_serving != my_ticket) { /* spin */ }
```

Atomic operations come with different memory ordering options: relaxed, acquire, release, acquire-release, and sequentially consistent. For mutex implementation, you typically need acquire ordering on lock acquisition and release ordering on unlock. Incorrect ordering leads to subtle bugs where changes made in the critical section become visible before or after they should.
The simplest mutex implementation is a spinlock: a lock where waiters spin in a loop until they acquire the lock. While not suitable for all scenarios, spinlocks are foundational—more sophisticated locks build upon these ideas.

Test-and-Set Spinlock (TAS Lock):
```c
// TEST-AND-SET SPINLOCK
// The simplest possible lock implementation

#include <stdatomic.h>

typedef struct {
    atomic_int locked;   // 0 = free, 1 = held
} tas_spinlock_t;

void tas_spinlock_init(tas_spinlock_t* lock) {
    atomic_store(&lock->locked, 0);
}

void tas_spinlock_acquire(tas_spinlock_t* lock) {
    // Keep trying until we successfully swap 0 → 1
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 1) {
        // Lock was already held (we read 1)
        // Spin until it looks free, then try again
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) == 1) {
            // Optional: reduce power consumption
            __builtin_ia32_pause();   // x86 PAUSE instruction
        }
    }
    // We successfully swapped 0 → 1, we own the lock
}

void tas_spinlock_release(tas_spinlock_t* lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

// WHY THE INNER LOOP?
// ────────────────────
// atomic_exchange is a READ-MODIFY-WRITE operation
// It requires exclusive ownership of the cache line
//
// If multiple threads are spinning with atomic_exchange,
// the cache line bounces between CPUs constantly (expensive!)
//
// The inner loop uses atomic_load (read-only)
// This keeps the cache line in SHARED state
// Only when we see 0 do we attempt the expensive exchange
//
// This is called "Test-and-Test-and-Set" (TTAS) optimization
```

Compare-and-Swap Spinlock (CAS Lock):

CAS provides similar functionality but with different semantics:
```c
// COMPARE-AND-SWAP SPINLOCK

typedef struct {
    atomic_int locked;
} cas_spinlock_t;

void cas_spinlock_acquire(cas_spinlock_t* lock) {
    int expected = 0;

    // CAS: if locked == 0, set to 1 and return true
    //      otherwise, set expected = actual and return false
    while (!atomic_compare_exchange_weak_explicit(
               &lock->locked, &expected, 1,
               memory_order_acquire,     // On success: acquire
               memory_order_relaxed))    // On failure: relaxed
    {
        // Failed: lock is held
        // Spin-wait with read-only loads (TTAS optimization)
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }
        expected = 0;   // Reset for next CAS attempt
    }
}

void cas_spinlock_release(cas_spinlock_t* lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

// CAS vs TAS
// ──────────
// - Both achieve the same goal
// - CAS provides more information (expected vs actual)
// - CAS is basis for lock-free algorithms (try operation)
// - TAS is slightly simpler for basic locks
// - Modern CPUs: roughly equivalent performance
```

These basic spinlocks have significant limitations: (1) No fairness—a thread can acquire the lock repeatedly while others starve; (2) No bounded waiting—a thread might wait forever; (3) Cache line bouncing under contention. For production use, more sophisticated algorithms like ticket locks or MCS locks are preferred.
The basic TAS/CAS spinlocks offer no fairness guarantees. A thread might acquire and release the lock repeatedly while another thread spins indefinitely. The ticket lock solves this by imposing FIFO ordering—like taking a number at a deli counter.

How Ticket Locks Work:

1. Two counters: next_ticket (next ticket to dispense) and now_serving (current ticket being served)
2. To acquire: atomically get the next ticket, then spin until your ticket is being served
3. To release: increment now_serving, allowing the next ticket holder to proceed
```c
// TICKET LOCK - Fair, FIFO-ordered spinlock

typedef struct {
    atomic_uint next_ticket;   // Next ticket to be dispensed
    atomic_uint now_serving;   // Ticket currently being served
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t* lock) {
    atomic_store(&lock->next_ticket, 0);
    atomic_store(&lock->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t* lock) {
    // Step 1: Get my ticket (atomically increment next_ticket)
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    // Step 2: Wait until it's my turn
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my_ticket) {
        __builtin_ia32_pause();
    }
    // It's my turn! I have the lock.
}

void ticket_lock_release(ticket_lock_t* lock) {
    // Increment now_serving to pass the lock to the next waiter
    // (We implicitly know our ticket number was now_serving)
    unsigned int current = atomic_load_explicit(
        &lock->now_serving, memory_order_relaxed);
    atomic_store_explicit(&lock->now_serving, current + 1, memory_order_release);
}

// FAIRNESS GUARANTEE
// ──────────────────
// Threads acquire the lock in the order they requested it (FIFO).
// If you get ticket N, you will enter after tickets 0..(N-1).
// No starvation possible (assuming bounded critical sections).
//
// CACHE BEHAVIOR
// ──────────────
// All waiters spin on the SAME variable (now_serving).
// When it changes, all CPUs invalidate their cache line.
// This "invalidation storm" is costly under high contention.
// MCS locks (next module) solve this with local spinning.
```

A clever optimization: instead of spinning continuously, waiters can calculate how long to wait based on (my_ticket - now_serving). If you're ticket 100 and now_serving is 50, you might sleep briefly rather than spin-waiting for 50 releases. This is proportional backoff, sketched below.
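To make that concrete, here is a minimal sketch of proportional backoff layered on the ticket lock above. The function name and the pause-count scaling (the factor of 50) are illustrative choices, not taken from any particular implementation:

```c
// Proportional backoff (sketch): the further away our turn is,
// the longer we pause before re-checking now_serving.
// Assumes the ticket_lock_t type defined above.
void ticket_lock_acquire_backoff(ticket_lock_t* lock) {
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    for (;;) {
        unsigned int serving = atomic_load_explicit(
            &lock->now_serving, memory_order_acquire);
        if (serving == my_ticket) {
            return;   // Our turn: we hold the lock
        }
        // Distance to our turn; scale the wait proportionally
        unsigned int distance = my_ticket - serving;
        for (unsigned int i = 0; i < distance * 50; i++) {   // 50: arbitrary tuning constant
            __builtin_ia32_pause();
        }
    }
}
```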
A production-quality blocking mutex is more complex than a spinlock. It must track:

1. Lock state: Is the mutex held or free?
2. Owner identity: Which thread holds the lock? (for recursive locks, debugging)
3. Wait queue: Which threads are waiting for this lock?
4. Recursion count: How many times has the owner locked this mutex? (for recursive locks)
5. Type/attributes: What kind of mutex is this? (normal, recursive, error-checking)
```c
// MUTEX DATA STRUCTURE (Conceptual - simplified from real implementations)

// Possible mutex states
#define MUTEX_UNLOCKED        0   // No owner, no waiters
#define MUTEX_LOCKED          1   // Owner present, no waiters
#define MUTEX_LOCKED_WAITERS  2   // Owner present, threads waiting

// Thread ID type (platform-specific)
typedef unsigned long tid_t;
#define NO_OWNER 0

// Simplified mutex structure
typedef struct {
    // ─── Core State ───────────────────────────────────────────
    atomic_int state;         // UNLOCKED, LOCKED, or LOCKED_WAITERS
    tid_t owner;              // Thread ID of current owner (0 if none)

    // ─── Recursion Support ────────────────────────────────────
    unsigned int recursion;   // How many times owner has locked (recursive only)

    // ─── Wait Queue ───────────────────────────────────────────
    // In real implementations, this might be:
    //  - A pointer to a kernel wait queue
    //  - A futex address
    //  - A linked list of waiting threads
    void* waiters;            // Opaque pointer to waiting threads

    // ─── Attributes ───────────────────────────────────────────
    int type;                 // NORMAL, RECURSIVE, ERRORCHECK, etc.
} mutex_t;

// ─── POSIX pthread_mutex_t (Conceptual Layout) ───────────
//
// In NPTL (Native POSIX Thread Library) on Linux:
//
//   typedef struct {
//     int __lock;               // Lock state (0=free, otherwise tid+flags)
//     unsigned int __count;     // Recursion count
//     int __owner;              // Owner thread ID
//     unsigned int __nusers;    // Number of users (for robust mutexes)
//     int __kind;               // Type: normal, recursive, errorcheck, etc.
//     int __spins;              // Spin count for adaptive mutexes
//     __pthread_list_t __list;  // Robust list linkage
//   } pthread_mutex_t;
//
// The actual layout is ABI-specified and cannot be changed.
```

State Encoding Tricks:

Real implementations often pack multiple pieces of information into a single integer to minimize memory usage and enable atomic operations:

- Linux futex-based mutex: Uses a single 32-bit word
  - 0 = unlocked
  - Positive = locked, value is owner TID
  - Bit flags for 'waiters present', 'owner died' (robust mutexes)

- Windows CRITICAL_SECTION: Uses multiple fields
  - LockCount: negative if owned, incremented by waiters
  - RecursionCount: how many times same thread locked
  - OwningThread: handle to owning thread
  - LockSemaphore: event object for blocking
Owner tracking enables: (1) Recursive locking—same thread can lock again; (2) Error detection—detect unlock by wrong thread; (3) Priority inheritance—boost owner's priority when high-priority thread waits; (4) Debugging—identify which thread caused a deadlock.
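To illustrate how owner tracking enables recursive locking and error detection, here is a minimal sketch built on the mutex_t fields above. The helpers current_tid(), acquire_underlying_lock(), and release_underlying_lock() are hypothetical stand-ins for the platform thread-ID call and the base lock/unlock paths:

```c
#include <errno.h>   // for EPERM

// Sketch: recursive acquire/release using the owner and recursion fields
// of the mutex_t structure shown above.
int recursive_mutex_lock(mutex_t* m) {
    tid_t me = current_tid();            // hypothetical: calling thread's ID

    // Reading owner without holding the lock is safe here:
    // only this thread ever stores its own ID into the field.
    if (m->owner == me) {
        m->recursion++;                  // Already ours: just bump the count
        return 0;
    }
    acquire_underlying_lock(m);          // hypothetical: spin/block until acquired
    m->owner = me;
    m->recursion = 1;
    return 0;
}

int recursive_mutex_unlock(mutex_t* m) {
    if (m->owner != current_tid()) {
        return EPERM;                    // Error detection: unlock by non-owner
    }
    if (--m->recursion > 0) {
        return 0;                        // Still held by us (nested lock)
    }
    m->owner = NO_OWNER;
    release_underlying_lock(m);          // hypothetical: release and wake a waiter
    return 0;
}
```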
A blocking mutex requires cooperation between userspace and kernel. The userspace code handles the fast path (uncontended lock/unlock), while the kernel handles the slow path (putting threads to sleep and waking them up).

The Fundamental Challenge:

We need to atomically both:

1. Check if we can get the lock
2. Go to sleep if we can't

Without atomicity, we might check, see the lock is held, then (before we sleep) the lock is released. We go to sleep anyway and might never wake up—this is the lost wakeup problem.

Solution: Futex (Fast Userspace Mutex)

Linux's futex system call solves this elegantly:
```c
// FUTEX-BASED MUTEX IMPLEMENTATION
// This is conceptually how NPTL pthread_mutex works on Linux

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>        // syscall()
#include <stdatomic.h>

// States:
//   0 = unlocked
//   1 = locked, no waiters
//   2 = locked, waiters present

typedef struct {
    atomic_int state;
} futex_mutex_t;

void futex_mutex_lock(futex_mutex_t* m) {
    int s;

    // FAST PATH: Try to acquire from unlocked state (0 → 1)
    if (atomic_compare_exchange_strong(&m->state, &(int){0}, 1)) {
        return;   // Got lock immediately, no kernel entry!
    }

    // SLOW PATH: Must block
    // Set state to 2 (locked with waiters) before sleeping
    while ((s = atomic_exchange(&m->state, 2)) != 0) {
        // State was not 0, so lock is held
        // Sleep until state changes from 2
        //
        // futex_wait atomically checks: if *addr == expected, sleep
        // This prevents the lost wakeup problem!
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);

        // Woken up - try again to acquire
        // Next iteration will atomic_exchange to try acquiring
    }
    // atomic_exchange returned 0: we got the lock (state is now 2)
}

void futex_mutex_unlock(futex_mutex_t* m) {
    // Atomically set to 0 and check if there were waiters
    if (atomic_exchange(&m->state, 0) == 2) {
        // There were waiters (state was 2)
        // Wake one of them up
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    // If state was 1 (no waiters), no one to wake
}

// WHY THIS IS FAST:
// ─────────────────
// UNCONTENDED CASE: lock() and unlock() are pure userspace!
//   - lock(): One atomic CAS, no syscall
//   - unlock(): One atomic exchange, no syscall (if no waiters)
//
// CONTENDED CASE: Kernel involvement only when necessary
//   - Waiters sleep in kernel, don't burn CPU
//   - Wakeup is targeted (not broadcast to all waiters)
```

How Futex Prevents Lost Wakeups:

The key insight is that FUTEX_WAIT checks the value atomically with going to sleep:

```
futex_wait(addr, expected):
    atomically {
        if (*addr == expected) {
            add_me_to_wait_queue(addr);
            go_to_sleep();
        } else {
            return immediately;   // Value changed, don't sleep
        }
    }
```

If the lock is released between checking the state and calling futex_wait, futex_wait will see the new value (0, not 2) and return immediately rather than sleeping. This is the magic that makes futex work.
The kernel doesn't store state for every futex. Instead, it maintains a hash table keyed by memory address. When a thread calls futex_wait, it is added to the hash bucket for that address; futex_wake walks the same bucket to find and wake waiters. This allows an unbounded number of userspace futexes with bounded kernel memory.
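A heavily simplified sketch of that idea follows; the structure and function names are illustrative only (the real kernel code also handles address hashing across processes, priority inheritance, and requeueing):

```c
#include <stdint.h>

// Conceptual sketch of the kernel-side futex table (illustrative names only)
#define FUTEX_BUCKETS 256

struct futex_waiter {
    struct futex_waiter* next;
    void* uaddr;                 // Userspace address the thread is waiting on
    /* ... per-thread wakeup handle ... */
};

struct futex_bucket {
    /* spinlock protecting the list */
    struct futex_waiter* waiters;
};

static struct futex_bucket futex_table[FUTEX_BUCKETS];

static struct futex_bucket* bucket_for(void* uaddr) {
    // Hash the userspace address into a fixed-size table:
    // unbounded futexes in userspace, bounded memory in the kernel.
    uintptr_t h = (uintptr_t)uaddr >> 4;
    return &futex_table[h % FUTEX_BUCKETS];
}
```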
Production mutex implementations typically combine spinning and blocking for optimal performance across varying contention levels.

Adaptive Spinning:

The idea is to spin briefly before resorting to the kernel. If the lock is released during the spin phase, we avoid the syscall overhead entirely.
```c
// ADAPTIVE MUTEX WITH SPIN-THEN-BLOCK
// Based on concepts from NPTL and similar implementations
// (cpu_pause(), futex_wait(), futex_wake() are assumed to be thin wrappers
//  around the PAUSE instruction and the futex syscalls shown earlier)

#define SPIN_COUNT 100   // Iterations to spin before blocking

typedef struct {
    atomic_int state;      // 0=free, 1=locked, 2=locked+waiters
    int adaptive_count;    // Spin iterations (can be tuned dynamically)
} adaptive_mutex_t;

void adaptive_mutex_lock(adaptive_mutex_t* m) {
    int expected = 0;

    // ULTRA-FAST PATH: Lock is free, acquire immediately
    if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
        return;
    }

    // SPIN PHASE: Try to acquire without blocking
    for (int i = 0; i < m->adaptive_count; i++) {
        // Check if lock might be free (read-only, no cache line bouncing)
        if (atomic_load_explicit(&m->state, memory_order_relaxed) == 0) {
            expected = 0;
            if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
                // Got it during spin!
                // OPTIMIZATION: Increase spin count for next time
                // (spinning was successful, so it's likely to work again)
                if (m->adaptive_count < 1000) {
                    m->adaptive_count += 10;
                }
                return;
            }
        }
        cpu_pause();   // Power-efficient spin
    }

    // BLOCK PHASE: Spinning didn't work, must sleep
    // OPTIMIZATION: Decrease spin count for next time
    if (m->adaptive_count > 10) {
        m->adaptive_count -= 5;
    }

    // Set waiters bit and enter slow path
    while (atomic_exchange(&m->state, 2) != 0) {
        futex_wait(&m->state, 2);
    }
}

void adaptive_mutex_unlock(adaptive_mutex_t* m) {
    if (atomic_exchange(&m->state, 0) == 2) {
        futex_wake(&m->state, 1);
    }
}

// ADAPTIVE SPINNING HEURISTICS
// ────────────────────────────
// The spin count adapts based on recent history:
//   - Successful spin → increase spin count (spinning works here)
//   - Unsuccessful spin → decrease spin count (save CPU)
//
// More sophisticated implementations may also consider:
//   - Is the lock holder running on a CPU?
//   - How long is this critical section typically?
//   - What is the current system load?
```

Lock Holder Tracking:

Some implementations track whether the lock holder is currently scheduled:

- If the holder is running on a CPU, spinning makes sense (holder is making progress)
- If the holder is not running (blocked on I/O, sleeping), spinning is wasteful

Linux's adaptive spinning in kernel mutexes uses this: it checks if the lock owner is currently on_cpu before deciding to spin or sleep.
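In spirit, that heuristic looks something like the sketch below. The names current state of affairs assumes: owner_is_on_cpu() is a hypothetical scheduler query, and tid_t/NO_OWNER come from the mutex structure shown earlier:

```c
#include <stdbool.h>

// Sketch: spin only while the lock owner is actually running.
// 'owner' is the owner thread ID recorded in the mutex (as in mutex_t above).
bool should_keep_spinning(tid_t owner) {
    if (owner == NO_OWNER) {
        return true;                   // Lock may be released at any moment: keep trying
    }
    // If the owner is scheduled on a CPU, it is making progress toward
    // releasing the lock, so spinning is likely to pay off.
    // If it has been preempted or is sleeping, spinning only burns cycles.
    return owner_is_on_cpu(owner);     // hypothetical: true if owner is currently running
}
```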
POSIX allows different mutex types via pthread_mutexattr_settype(): PTHREAD_MUTEX_NORMAL (basic), PTHREAD_MUTEX_ERRORCHECK (detects unlock from wrong thread), PTHREAD_MUTEX_RECURSIVE (allows relocking by owner), PTHREAD_MUTEX_DEFAULT (implementation-defined). Each type has different implementation requirements and overhead.
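For example, requesting an error-checking mutex through the standard attribute API looks like this (error handling abbreviated):

```c
#include <pthread.h>
#include <errno.h>
#include <stdio.h>

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t mtx;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mtx);
    // An error-checking mutex reports misuse instead of deadlocking:
    if (pthread_mutex_lock(&mtx) == EDEADLK) {   // relock by the same thread
        printf("relock detected\n");
    }

    pthread_mutex_unlock(&mtx);
    pthread_mutex_destroy(&mtx);
    return 0;
}
```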
Correct mutex implementation requires careful attention to memory ordering. Without proper barriers, the compiler or CPU might reorder instructions in ways that break the synchronization guarantee.

The Problem:

Modern CPUs and compilers reorder memory operations for performance:

- Store buffers delay writes to memory
- Compiler optimization moves loads/stores to improve pipelining
- Out-of-order execution completes instructions in whatever order is fastest

These optimizations are invisible to single-threaded code but can cause disaster in concurrent code.
```c
// MEMORY ORDERING DANGER WITHOUT PROPER BARRIERS

#include <stdatomic.h>

int data = 0;
atomic_int lock = ATOMIC_VAR_INIT(0);

void producer() {
    data = 42;   // Write data

    // WITHOUT proper memory ordering, CPU might reorder:
    //   1. Store 1 to 'lock' (make public that we're done)
    //   2. Store 42 to 'data' (happens AFTER lock release?!)
    atomic_store(&lock, 1);   // Signal we're done
}

void consumer() {
    // WITHOUT proper memory ordering:
    //   1. Load 'data' → get 0 (old value, speculatively loaded early)
    //   2. Load 'lock' → get 1 (now we think we can proceed)
    //   3. Use data = 0 (WRONG!)
    while (atomic_load(&lock) == 0);   // Wait for producer
    int x = data;                      // Might read stale value!
}

// THE FIX: Memory ordering in atomic operations

void producer_fixed() {
    data = 42;
    // RELEASE semantics: ensure data=42 is visible before lock=1
    atomic_store_explicit(&lock, 1, memory_order_release);
}

void consumer_fixed() {
    // ACQUIRE semantics: ensure we see all writes before the release
    while (atomic_load_explicit(&lock, memory_order_acquire) == 0);
    int x = data;   // GUARANTEED to see 42
}
```

| Operation | Required Ordering | Purpose |
|---|---|---|
| Lock acquisition (success) | Acquire | Subsequent reads see all prior writes by previous holder |
| Lock acquisition (failure) | Relaxed | We're going to spin/block anyway, no ordering needed |
| Lock release | Release | All writes in critical section visible before lock freed |
| Spin loop load | Relaxed | Just checking; actual acquire happens with CAS success |
x86 vs ARM:

Different CPU architectures have different memory models:

- x86 (TSO - Total Store Order): Relatively strong ordering. Most loads/stores are already ordered. Explicit barriers rarely needed except for complete fences.

- ARM (Weak ordering): Very aggressive reordering. Explicit barriers required for correct synchronization. DMB/DSB instructions or atomic operations with ordering.

This is why portable code should use C11/C++11 atomic operations with explicit memory order parameters—the compiler inserts appropriate barriers for the target architecture.
For correct synchronization, releases and acquires must be paired on the same memory location. A thread that releases a lock with release semantics creates a 'synchronizes-with' relationship with any thread that later acquires the same lock with acquire semantics. All writes before the release are guaranteed to be visible after the acquire.
Let's trace through a complete, production-like mutex implementation to see how all the pieces fit together.
```c
// COMPLETE MUTEX IMPLEMENTATION (Production-Style)
// Combines fast paths, adaptive spinning, and kernel sleeping

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>        // syscall()
#include <sys/syscall.h>
#include <linux/futex.h>

#define MUTEX_FREE    0
#define MUTEX_LOCKED  1
#define MUTEX_WAITERS 2

#define SPIN_LIMIT 100

typedef struct {
    atomic_int state;
    atomic_int owner;       // Optional: for debugging/recursive
    atomic_int recursion;   // For recursive mutexes
} complete_mutex_t;

// Named my_gettid to avoid clashing with glibc's gettid() declaration
static inline int my_gettid(void) {
    return syscall(SYS_gettid);
}

void mutex_lock(complete_mutex_t* m) {
    int tid = my_gettid();

    // ─── ULTRA-FAST PATH ─────────────────────────────────────
    // Uncontended: CAS from FREE to LOCKED
    int expected = MUTEX_FREE;
    if (atomic_compare_exchange_strong_explicit(
            &m->state, &expected, MUTEX_LOCKED,
            memory_order_acquire, memory_order_relaxed)) {
        atomic_store_explicit(&m->owner, tid, memory_order_relaxed);
        return;   // Got lock! No spinning, no syscall.
    }

    // ─── SPIN PHASE ──────────────────────────────────────────
    // Lock is held. Spin briefly hoping for quick release.
    for (int i = 0; i < SPIN_LIMIT; i++) {
        // Read-only check (avoids cache line exclusive ownership)
        if (atomic_load_explicit(&m->state, memory_order_relaxed) == MUTEX_FREE) {
            // Looks free! Try to acquire.
            expected = MUTEX_FREE;
            if (atomic_compare_exchange_weak_explicit(
                    &m->state, &expected, MUTEX_LOCKED,
                    memory_order_acquire, memory_order_relaxed)) {
                atomic_store(&m->owner, tid);
                return;
            }
        }
        __builtin_ia32_pause();   // CPU pause: save power, help HT
    }

    // ─── SLOW PATH: KERNEL BLOCKING ──────────────────────────
    // Spinning failed. Must sleep until woken.
    while (1) {
        // Mark that waiters exist (state = WAITERS)
        // If it was FREE, we got it (race with unlock)
        int old = atomic_exchange_explicit(
            &m->state, MUTEX_WAITERS, memory_order_acquire);

        if (old == MUTEX_FREE) {
            // Lock was free! Our exchange locked it.
            atomic_store(&m->owner, tid);
            return;
        }

        // Lock was held. Sleep until state changes.
        syscall(SYS_futex, &m->state,
                FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                MUTEX_WAITERS, NULL, NULL, 0);

        // Woken up (might be spurious). Loop and try again.
    }
}

void mutex_unlock(complete_mutex_t* m) {
    atomic_store_explicit(&m->owner, 0, memory_order_relaxed);

    // Atomically release and check for waiters
    int old = atomic_exchange_explicit(
        &m->state, MUTEX_FREE, memory_order_release);

    if (old == MUTEX_WAITERS) {
        // There were waiters. Wake one.
        syscall(SYS_futex, &m->state,
                FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                1, NULL, NULL, 0);
    }
    // If old was MUTEX_LOCKED (no waiters), no wake needed.
}
```

We have explored the complete journey of mutex implementation—from atomic hardware primitives to production-ready code. Let's consolidate the key insights.
What's Next:

Now that we understand how mutexes are implemented, we're ready to explore the POSIX pthread mutex API—the standard interface for mutexes in Unix-like systems. We'll cover initialization, attributes, error checking, and best practices for production use.
You now understand how mutexes are implemented from the ground up—from atomic operations to kernel syscalls. You understand spinlocks, blocking locks, hybrid approaches, and the memory ordering requirements. Next, we'll apply this knowledge to the practical pthread_mutex API.