We've established what a mutex is and explored the blocking vs. spinning tradeoff. Now we must answer a deeper question: How is a mutex actually built?

Implementing a mutex correctly is surprisingly difficult. The code must work correctly even when multiple threads execute it simultaneously—precisely the condition the mutex is designed to prevent. This is the bootstrapping problem of synchronization: we need synchronization to build synchronization.

The answer lies in hardware atomic operations—special CPU instructions that complete indivisibly, even in the presence of concurrent access from other processors.
By the end of this page, you will understand how mutexes are implemented from the ground up—starting with atomic hardware primitives, building through basic spinlock algorithms, and culminating in production-quality blocking mutex implementations. You will understand the data structures, the atomic operations, and the careful ordering that makes it all work.
Mutex implementation begins with the CPU itself. Modern processors provide special instructions that execute atomically—they cannot be interrupted or interleaved with other operations on the same memory location.

Why Atomicity Matters:

Consider a naive lock implementation:

```c
if (lock == 0) {   // Step 1: Check if free
    lock = 1;      // Step 2: Mark as taken
    // Enter critical section
}
```

Two threads could both see lock == 0 (Step 1), then both set lock = 1 (Step 2), and both believe they acquired the lock. This is a classic race condition—the 'check-then-act' pattern is inherently unsafe without atomicity.

The solution: combine the check and the modification into a single atomic operation that the hardware guarantees cannot be split.
| Operation | Semantics | CPU Instructions |
|---|---|---|
| Test-and-Set (TAS) | Read old value, write 1, return old value (atomically) | x86: XCHG, LOCK BTS |
| Compare-and-Swap (CAS) | If value == expected, write new; return success/actual | x86: LOCK CMPXCHG |
| Fetch-and-Add | Read current, add delta, return old value (atomically) | x86: LOCK XADD |
| Load-Linked/Store-Conditional | LL loads, SC stores only if unmodified since LL | ARM: LDREX/STREX, RISC-V: LR/SC |
| Atomic Load | Read with memory ordering guarantees | x86: MOV (with barriers) |
| Atomic Store | Write with memory ordering guarantees | x86: MOV (with barriers) |
```c
// ATOMIC OPERATIONS - The building blocks of all locks

#include <stdatomic.h>
#include <stdbool.h>

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// TEST-AND-SET (TAS)
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: read the value, replace with 1, return the old value
// If old was 0, we just acquired the lock
// If old was 1, lock was already held

int test_and_set(atomic_int* lock) {
    return atomic_exchange(lock, 1);
}

// Usage:
//   while (test_and_set(&lock) == 1) { /* spin */ }
//   // Critical section
//   atomic_store(&lock, 0);   // Release

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// COMPARE-AND-SWAP (CAS)
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: if *addr == expected, write new_val and return true
// else update expected with actual value and return false

bool compare_and_swap(atomic_int* addr, int* expected, int new_val) {
    return atomic_compare_exchange_strong(addr, expected, new_val);
}

// Usage for lock acquisition:
//   int expected = 0;
//   while (!compare_and_swap(&lock, &expected, 1)) {
//       expected = 0;   // Reset for retry
//       /* spin */
//   }

// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// FETCH-AND-ADD
// ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
// Atomically: add delta to value, return the OLD value

int fetch_and_add(atomic_int* value, int delta) {
    return atomic_fetch_add(value, delta);
}

// Used in ticket locks:
//   int my_ticket = fetch_and_add(&next_ticket, 1);
//   while (now_serving != my_ticket) { /* spin */ }
```

Atomic operations come with different memory ordering options: relaxed, acquire, release, acquire-release, and sequentially consistent. For mutex implementation, you typically need acquire ordering on lock acquisition and release ordering on unlock. Incorrect ordering leads to subtle bugs where changes made in the critical section become visible before or after they should.
The simplest mutex implementation is a spinlock: a lock where waiters spin in a loop until they acquire the lock. While not suitable for all scenarios, spinlocks are foundational—more sophisticated locks build upon these ideas.

Test-and-Set Spinlock (TAS Lock):
```c
// TEST-AND-SET SPINLOCK
// The simplest possible lock implementation

#include <stdatomic.h>

typedef struct {
    atomic_int locked;   // 0 = free, 1 = held
} tas_spinlock_t;

void tas_spinlock_init(tas_spinlock_t* lock) {
    atomic_store(&lock->locked, 0);
}

void tas_spinlock_acquire(tas_spinlock_t* lock) {
    // Keep trying until we successfully swap 0 → 1
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 1) {
        // Lock was already held (we read 1)
        // Spin until it looks free, then try again
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) == 1) {
            // Optional: reduce power consumption
            __builtin_ia32_pause();   // x86 PAUSE instruction
        }
    }
    // We successfully swapped 0 → 1, we own the lock
}

void tas_spinlock_release(tas_spinlock_t* lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

// WHY THE INNER LOOP?
// ────────────────────
// atomic_exchange is a READ-MODIFY-WRITE operation
// It requires exclusive ownership of the cache line
//
// If multiple threads are spinning with atomic_exchange,
// the cache line bounces between CPUs constantly (expensive!)
//
// The inner loop uses atomic_load (read-only)
// This keeps the cache line in SHARED state
// Only when we see 0 do we attempt the expensive exchange
//
// This is called "Test-and-Test-and-Set" (TTAS) optimization
```

Compare-and-Swap Spinlock (CAS Lock):

CAS provides similar functionality but with different semantics:
```c
// COMPARE-AND-SWAP SPINLOCK

typedef struct {
    atomic_int locked;
} cas_spinlock_t;

void cas_spinlock_acquire(cas_spinlock_t* lock) {
    int expected = 0;

    // CAS: if locked == 0, set to 1 and return true
    //      otherwise, set expected = actual and return false
    while (!atomic_compare_exchange_weak_explicit(
               &lock->locked, &expected, 1,
               memory_order_acquire,     // On success: acquire
               memory_order_relaxed))    // On failure: relaxed
    {
        // Failed: lock is held
        // Spin-wait with read-only loads (TTAS optimization)
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }
        expected = 0;   // Reset for next CAS attempt
    }
}

void cas_spinlock_release(cas_spinlock_t* lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

// CAS vs TAS
// ──────────
// - Both achieve the same goal
// - CAS provides more information (expected vs actual)
// - CAS is basis for lock-free algorithms (try operation)
// - TAS is slightly simpler for basic locks
// - Modern CPUs: roughly equivalent performance
```

These basic spinlocks have significant limitations: (1) No fairness—a thread can acquire the lock repeatedly while others starve; (2) No bounded waiting—a thread might wait forever; (3) Cache line bouncing under contention. For production use, more sophisticated algorithms like ticket locks or MCS locks are preferred.
The basic TAS/CAS spinlocks offer no fairness guarantees. A thread might acquire and release the lock repeatedly while another thread spins indefinitely. The ticket lock solves this by imposing FIFO ordering—like taking a number at a deli counter.

How Ticket Locks Work:

1. Two counters: next_ticket (next ticket to dispense) and now_serving (current ticket being served)
2. To acquire: atomically get the next ticket, then spin until your ticket is being served
3. To release: increment now_serving, allowing the next ticket holder to proceed
```c
// TICKET LOCK - Fair, FIFO-ordered spinlock

typedef struct {
    atomic_uint next_ticket;   // Next ticket to be dispensed
    atomic_uint now_serving;   // Ticket currently being served
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t* lock) {
    atomic_store(&lock->next_ticket, 0);
    atomic_store(&lock->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t* lock) {
    // Step 1: Get my ticket (atomically increment next_ticket)
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    // Step 2: Wait until it's my turn
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my_ticket) {
        __builtin_ia32_pause();
    }
    // It's my turn! I have the lock.
}

void ticket_lock_release(ticket_lock_t* lock) {
    // Increment now_serving to pass the lock to the next waiter
    // (We implicitly know our ticket number was now_serving)
    unsigned int current = atomic_load_explicit(
        &lock->now_serving, memory_order_relaxed);
    atomic_store_explicit(&lock->now_serving, current + 1, memory_order_release);
}

// FAIRNESS GUARANTEE
// ──────────────────
// Threads acquire the lock in the order they requested it (FIFO).
// If you get ticket N, you will enter after tickets 0..(N-1).
// No starvation possible (assuming bounded critical sections).
//
// CACHE BEHAVIOR
// ──────────────
// All waiters spin on the SAME variable (now_serving).
// When it changes, all CPUs invalidate their cache line.
// This "invalidation storm" is costly under high contention.
// MCS locks (next module) solve this with local spinning.
```

A clever optimization: instead of spinning continuously, waiters can calculate how long to wait based on (my_ticket - now_serving). If you're ticket 100 and now_serving is 50, you might sleep briefly rather than spin-waiting for 50 releases. This is proportional backoff, sketched below.
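To make that concrete, here is a minimal sketch of proportional backoff layered on the ticket lock above. The function name and the pause-count scaling (the factor of 50) are illustrative choices, not taken from any particular implementation:

```c
// Proportional backoff (sketch): the further away our turn is,
// the longer we pause before re-checking now_serving.
// Assumes the ticket_lock_t type defined above.
void ticket_lock_acquire_backoff(ticket_lock_t* lock) {
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    for (;;) {
        unsigned int serving = atomic_load_explicit(
            &lock->now_serving, memory_order_acquire);
        if (serving == my_ticket) {
            return;   // Our turn: we hold the lock
        }
        // Distance to our turn; scale the wait proportionally
        unsigned int distance = my_ticket - serving;
        for (unsigned int i = 0; i < distance * 50; i++) {   // 50: arbitrary tuning constant
            __builtin_ia32_pause();
        }
    }
}
```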
A production-quality blocking mutex is more complex than a spinlock. It must track:

1. Lock state: Is the mutex held or free?
2. Owner identity: Which thread holds the lock? (for recursive locks, debugging)
3. Wait queue: Which threads are waiting for this lock?
4. Recursion count: How many times has the owner locked this mutex? (for recursive locks)
5. Type/attributes: What kind of mutex is this? (normal, recursive, error-checking)
```c
// MUTEX DATA STRUCTURE (Conceptual - simplified from real implementations)

// Possible mutex states
#define MUTEX_UNLOCKED        0   // No owner, no waiters
#define MUTEX_LOCKED          1   // Owner present, no waiters
#define MUTEX_LOCKED_WAITERS  2   // Owner present, threads waiting

// Thread ID type (platform-specific)
typedef unsigned long tid_t;
#define NO_OWNER 0

// Simplified mutex structure
typedef struct {
    // ─── Core State ───────────────────────────────────────────
    atomic_int state;         // UNLOCKED, LOCKED, or LOCKED_WAITERS
    tid_t owner;              // Thread ID of current owner (0 if none)

    // ─── Recursion Support ────────────────────────────────────
    unsigned int recursion;   // How many times owner has locked (recursive only)

    // ─── Wait Queue ───────────────────────────────────────────
    // In real implementations, this might be:
    //  - A pointer to a kernel wait queue
    //  - A futex address
    //  - A linked list of waiting threads
    void* waiters;            // Opaque pointer to waiting threads

    // ─── Attributes ───────────────────────────────────────────
    int type;                 // NORMAL, RECURSIVE, ERRORCHECK, etc.
} mutex_t;

// ─── POSIX pthread_mutex_t (Conceptual Layout) ───────────
//
// In NPTL (Native POSIX Thread Library) on Linux:
//
//   typedef struct {
//     int __lock;               // Lock state (0=free, otherwise tid+flags)
//     unsigned int __count;     // Recursion count
//     int __owner;              // Owner thread ID
//     unsigned int __nusers;    // Number of users (for robust mutexes)
//     int __kind;               // Type: normal, recursive, errorcheck, etc.
//     int __spins;              // Spin count for adaptive mutexes
//     __pthread_list_t __list;  // Robust list linkage
//   } pthread_mutex_t;
//
// The actual layout is ABI-specified and cannot be changed.
```

State Encoding Tricks:

Real implementations often pack multiple pieces of information into a single integer to minimize memory usage and enable atomic operations:

- Linux futex-based mutex: Uses a single 32-bit word
  - 0 = unlocked
  - Positive = locked, value is owner TID
  - Bit flags for 'waiters present', 'owner died' (robust mutexes)

- Windows CRITICAL_SECTION: Uses multiple fields
  - LockCount: negative if owned, incremented by waiters
  - RecursionCount: how many times same thread locked
  - OwningThread: handle to owning thread
  - LockSemaphore: event object for blocking
Owner tracking enables: (1) Recursive locking—same thread can lock again; (2) Error detection—detect unlock by wrong thread; (3) Priority inheritance—boost owner's priority when high-priority thread waits; (4) Debugging—identify which thread caused a deadlock.
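To illustrate how owner tracking enables recursive locking and error detection, here is a minimal sketch built on the mutex_t fields above. The helpers current_tid(), acquire_underlying_lock(), and release_underlying_lock() are hypothetical stand-ins for the platform thread-ID call and the base lock/unlock paths:

```c
#include <errno.h>   // for EPERM

// Sketch: recursive acquire/release using the owner and recursion fields
// of the mutex_t structure shown above.
int recursive_mutex_lock(mutex_t* m) {
    tid_t me = current_tid();            // hypothetical: calling thread's ID

    // Reading owner without holding the lock is safe here:
    // only this thread ever stores its own ID into the field.
    if (m->owner == me) {
        m->recursion++;                  // Already ours: just bump the count
        return 0;
    }
    acquire_underlying_lock(m);          // hypothetical: spin/block until acquired
    m->owner = me;
    m->recursion = 1;
    return 0;
}

int recursive_mutex_unlock(mutex_t* m) {
    if (m->owner != current_tid()) {
        return EPERM;                    // Error detection: unlock by non-owner
    }
    if (--m->recursion > 0) {
        return 0;                        // Still held by us (nested lock)
    }
    m->owner = NO_OWNER;
    release_underlying_lock(m);          // hypothetical: release and wake a waiter
    return 0;
}
```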
A blocking mutex requires cooperation between userspace and kernel. The userspace code handles the fast path (uncontended lock/unlock), while the kernel handles the slow path (putting threads to sleep and waking them up).

The Fundamental Challenge:

We need to atomically both:

1. Check if we can get the lock
2. Go to sleep if we can't

Without atomicity, we might check, see the lock is held, then (before we sleep) the lock is released. We go to sleep anyway and might never wake up—this is the lost wakeup problem.

Solution: Futex (Fast Userspace Mutex)

Linux's futex system call solves this elegantly:
```c
// FUTEX-BASED MUTEX IMPLEMENTATION
// This is conceptually how NPTL pthread_mutex works on Linux

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>        // syscall()
#include <stdatomic.h>

// States:
//   0 = unlocked
//   1 = locked, no waiters
//   2 = locked, waiters present

typedef struct {
    atomic_int state;
} futex_mutex_t;

void futex_mutex_lock(futex_mutex_t* m) {
    int s;

    // FAST PATH: Try to acquire from unlocked state (0 → 1)
    if (atomic_compare_exchange_strong(&m->state, &(int){0}, 1)) {
        return;   // Got lock immediately, no kernel entry!
    }

    // SLOW PATH: Must block
    // Set state to 2 (locked with waiters) before sleeping
    while ((s = atomic_exchange(&m->state, 2)) != 0) {
        // State was not 0, so lock is held
        // Sleep until state changes from 2
        //
        // futex_wait atomically checks: if *addr == expected, sleep
        // This prevents the lost wakeup problem!
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);

        // Woken up - try again to acquire
        // Next iteration will atomic_exchange to try acquiring
    }
    // atomic_exchange returned 0: we got the lock (state is now 2)
}

void futex_mutex_unlock(futex_mutex_t* m) {
    // Atomically set to 0 and check if there were waiters
    if (atomic_exchange(&m->state, 0) == 2) {
        // There were waiters (state was 2)
        // Wake one of them up
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    // If state was 1 (no waiters), no one to wake
}

// WHY THIS IS FAST:
// ─────────────────
// UNCONTENDED CASE: lock() and unlock() are pure userspace!
//   - lock(): One atomic CAS, no syscall
//   - unlock(): One atomic exchange, no syscall (if no waiters)
//
// CONTENDED CASE: Kernel involvement only when necessary
//   - Waiters sleep in kernel, don't burn CPU
//   - Wakeup is targeted (not broadcast to all waiters)
```

How Futex Prevents Lost Wakeups:

The key insight is that FUTEX_WAIT checks the value atomically with going to sleep:

```
futex_wait(addr, expected):
    atomically {
        if (*addr == expected) {
            add_me_to_wait_queue(addr);
            go_to_sleep();
        } else {
            return immediately;   // Value changed, don't sleep
        }
    }
```

If the lock is released between checking the state and calling futex_wait, futex_wait will see the new value (0, not 2) and return immediately rather than sleeping. This is the magic that makes futex work.
The kernel doesn't store state for every futex. Instead, it maintains a hash table keyed by memory address. When a thread calls futex_wait, it is added to the hash bucket for that address; futex_wake walks the same bucket to find and wake waiters. This allows an unbounded number of userspace futexes with bounded kernel memory.
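A heavily simplified sketch of that idea follows; the structure and function names are illustrative only (the real kernel code also handles address hashing across processes, priority inheritance, and requeueing):

```c
#include <stdint.h>

// Conceptual sketch of the kernel-side futex table (illustrative names only)
#define FUTEX_BUCKETS 256

struct futex_waiter {
    struct futex_waiter* next;
    void* uaddr;                 // Userspace address the thread is waiting on
    /* ... per-thread wakeup handle ... */
};

struct futex_bucket {
    /* spinlock protecting the list */
    struct futex_waiter* waiters;
};

static struct futex_bucket futex_table[FUTEX_BUCKETS];

static struct futex_bucket* bucket_for(void* uaddr) {
    // Hash the userspace address into a fixed-size table:
    // unbounded futexes in userspace, bounded memory in the kernel.
    uintptr_t h = (uintptr_t)uaddr >> 4;
    return &futex_table[h % FUTEX_BUCKETS];
}
```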
Production mutex implementations typically combine spinning and blocking for optimal performance across varying contention levels.

Adaptive Spinning:

The idea is to spin briefly before resorting to the kernel. If the lock is released during the spin phase, we avoid the syscall overhead entirely.
```c
// ADAPTIVE MUTEX WITH SPIN-THEN-BLOCK
// Based on concepts from NPTL and similar implementations
// (cpu_pause(), futex_wait(), futex_wake() are assumed to be thin wrappers
//  around the PAUSE instruction and the futex syscalls shown earlier)

#define SPIN_COUNT 100   // Iterations to spin before blocking

typedef struct {
    atomic_int state;      // 0=free, 1=locked, 2=locked+waiters
    int adaptive_count;    // Spin iterations (can be tuned dynamically)
} adaptive_mutex_t;

void adaptive_mutex_lock(adaptive_mutex_t* m) {
    int expected = 0;

    // ULTRA-FAST PATH: Lock is free, acquire immediately
    if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
        return;
    }

    // SPIN PHASE: Try to acquire without blocking
    for (int i = 0; i < m->adaptive_count; i++) {
        // Check if lock might be free (read-only, no cache line bouncing)
        if (atomic_load_explicit(&m->state, memory_order_relaxed) == 0) {
            expected = 0;
            if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
                // Got it during spin!
                // OPTIMIZATION: Increase spin count for next time
                // (spinning was successful, so it's likely to work again)
                if (m->adaptive_count < 1000) {
                    m->adaptive_count += 10;
                }
                return;
            }
        }
        cpu_pause();   // Power-efficient spin
    }

    // BLOCK PHASE: Spinning didn't work, must sleep
    // OPTIMIZATION: Decrease spin count for next time
    if (m->adaptive_count > 10) {
        m->adaptive_count -= 5;
    }

    // Set waiters bit and enter slow path
    while (atomic_exchange(&m->state, 2) != 0) {
        futex_wait(&m->state, 2);
    }
}

void adaptive_mutex_unlock(adaptive_mutex_t* m) {
    if (atomic_exchange(&m->state, 0) == 2) {
        futex_wake(&m->state, 1);
    }
}

// ADAPTIVE SPINNING HEURISTICS
// ────────────────────────────
// The spin count adapts based on recent history:
//   - Successful spin → increase spin count (spinning works here)
//   - Unsuccessful spin → decrease spin count (save CPU)
//
// More sophisticated implementations may also consider:
//   - Is the lock holder running on a CPU?
//   - How long is this critical section typically?
//   - What is the current system load?
```

Lock Holder Tracking:

Some implementations track whether the lock holder is currently scheduled:

- If the holder is running on a CPU, spinning makes sense (holder is making progress)
- If the holder is not running (blocked on I/O, sleeping), spinning is wasteful

Linux's adaptive spinning in kernel mutexes uses this: it checks if the lock owner is currently on_cpu before deciding to spin or sleep.
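In spirit, that heuristic looks something like the sketch below. The names current state of affairs assumes: owner_is_on_cpu() is a hypothetical scheduler query, and tid_t/NO_OWNER come from the mutex structure shown earlier:

```c
#include <stdbool.h>

// Sketch: spin only while the lock owner is actually running.
// 'owner' is the owner thread ID recorded in the mutex (as in mutex_t above).
bool should_keep_spinning(tid_t owner) {
    if (owner == NO_OWNER) {
        return true;                   // Lock may be released at any moment: keep trying
    }
    // If the owner is scheduled on a CPU, it is making progress toward
    // releasing the lock, so spinning is likely to pay off.
    // If it has been preempted or is sleeping, spinning only burns cycles.
    return owner_is_on_cpu(owner);     // hypothetical: true if owner is currently running
}
```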
POSIX allows different mutex types via pthread_mutexattr_settype(): PTHREAD_MUTEX_NORMAL (basic), PTHREAD_MUTEX_ERRORCHECK (detects unlock from wrong thread), PTHREAD_MUTEX_RECURSIVE (allows relocking by owner), PTHREAD_MUTEX_DEFAULT (implementation-defined). Each type has different implementation requirements and overhead.
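For example, requesting an error-checking mutex through the standard attribute API looks like this (error handling abbreviated):

```c
#include <pthread.h>
#include <errno.h>
#include <stdio.h>

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t mtx;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mtx);
    // An error-checking mutex reports misuse instead of deadlocking:
    if (pthread_mutex_lock(&mtx) == EDEADLK) {   // relock by the same thread
        printf("relock detected\n");
    }

    pthread_mutex_unlock(&mtx);
    pthread_mutex_destroy(&mtx);
    return 0;
}
```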
Correct mutex implementation requires careful attention to memory ordering. Without proper barriers, the compiler or CPU might reorder instructions in ways that break the synchronization guarantee.

The Problem:

Modern CPUs and compilers reorder memory operations for performance:

- Store buffers delay writes to memory
- Compiler optimization moves loads/stores to improve pipelining
- Out-of-order execution completes instructions in whatever order is fastest

These optimizations are invisible to single-threaded code but can cause disaster in concurrent code.
```c
// MEMORY ORDERING DANGER WITHOUT PROPER BARRIERS

#include <stdatomic.h>

int data = 0;
atomic_int lock = ATOMIC_VAR_INIT(0);

void producer() {
    data = 42;   // Write data

    // WITHOUT proper memory ordering, CPU might reorder:
    //   1. Store 1 to 'lock' (make public that we're done)
    //   2. Store 42 to 'data' (happens AFTER lock release?!)
    atomic_store(&lock, 1);   // Signal we're done
}

void consumer() {
    // WITHOUT proper memory ordering:
    //   1. Load 'data' → get 0 (old value, speculatively loaded early)
    //   2. Load 'lock' → get 1 (now we think we can proceed)
    //   3. Use data = 0 (WRONG!)
    while (atomic_load(&lock) == 0);   // Wait for producer
    int x = data;                      // Might read stale value!
}

// THE FIX: Memory ordering in atomic operations

void producer_fixed() {
    data = 42;
    // RELEASE semantics: ensure data=42 is visible before lock=1
    atomic_store_explicit(&lock, 1, memory_order_release);
}

void consumer_fixed() {
    // ACQUIRE semantics: ensure we see all writes before the release
    while (atomic_load_explicit(&lock, memory_order_acquire) == 0);
    int x = data;   // GUARANTEED to see 42
}
```

| Operation | Required Ordering | Purpose |
|---|---|---|
| Lock acquisition (success) | Acquire | Subsequent reads see all prior writes by previous holder |
| Lock acquisition (failure) | Relaxed | We're going to spin/block anyway, no ordering needed |
| Lock release | Release | All writes in critical section visible before lock freed |
| Spin loop load | Relaxed | Just checking; actual acquire happens with CAS success |
x86 vs ARM:

Different CPU architectures have different memory models:

- x86 (TSO - Total Store Order): Relatively strong ordering. Most loads/stores are already ordered. Explicit barriers rarely needed except for complete fences.

- ARM (Weak ordering): Very aggressive reordering. Explicit barriers required for correct synchronization. DMB/DSB instructions or atomic operations with ordering.

This is why portable code should use C11/C++11 atomic operations with explicit memory order parameters—the compiler inserts appropriate barriers for the target architecture.
For correct synchronization, releases and acquires must be paired on the same memory location. A thread that releases a lock with release semantics creates a 'synchronizes-with' relationship with any thread that later acquires the same lock with acquire semantics. All writes before the release are guaranteed to be visible after the acquire.
Let's trace through a complete, production-like mutex implementation to see how all the pieces fit together.
```c
// COMPLETE MUTEX IMPLEMENTATION (Production-Style)
// Combines fast paths, adaptive spinning, and kernel sleeping

#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>        // syscall()
#include <sys/syscall.h>
#include <linux/futex.h>

#define MUTEX_FREE    0
#define MUTEX_LOCKED  1
#define MUTEX_WAITERS 2

#define SPIN_LIMIT 100

typedef struct {
    atomic_int state;
    atomic_int owner;       // Optional: for debugging/recursive
    atomic_int recursion;   // For recursive mutexes
} complete_mutex_t;

// Named my_gettid to avoid clashing with glibc's gettid() declaration
static inline int my_gettid(void) {
    return syscall(SYS_gettid);
}

void mutex_lock(complete_mutex_t* m) {
    int tid = my_gettid();

    // ─── ULTRA-FAST PATH ─────────────────────────────────────
    // Uncontended: CAS from FREE to LOCKED
    int expected = MUTEX_FREE;
    if (atomic_compare_exchange_strong_explicit(
            &m->state, &expected, MUTEX_LOCKED,
            memory_order_acquire, memory_order_relaxed)) {
        atomic_store_explicit(&m->owner, tid, memory_order_relaxed);
        return;   // Got lock! No spinning, no syscall.
    }

    // ─── SPIN PHASE ──────────────────────────────────────────
    // Lock is held. Spin briefly hoping for quick release.
    for (int i = 0; i < SPIN_LIMIT; i++) {
        // Read-only check (avoids cache line exclusive ownership)
        if (atomic_load_explicit(&m->state, memory_order_relaxed) == MUTEX_FREE) {
            // Looks free! Try to acquire.
            expected = MUTEX_FREE;
            if (atomic_compare_exchange_weak_explicit(
                    &m->state, &expected, MUTEX_LOCKED,
                    memory_order_acquire, memory_order_relaxed)) {
                atomic_store(&m->owner, tid);
                return;
            }
        }
        __builtin_ia32_pause();   // CPU pause: save power, help HT
    }

    // ─── SLOW PATH: KERNEL BLOCKING ──────────────────────────
    // Spinning failed. Must sleep until woken.
    while (1) {
        // Mark that waiters exist (state = WAITERS)
        // If it was FREE, we got it (race with unlock)
        int old = atomic_exchange_explicit(
            &m->state, MUTEX_WAITERS, memory_order_acquire);

        if (old == MUTEX_FREE) {
            // Lock was free! Our exchange locked it.
            atomic_store(&m->owner, tid);
            return;
        }

        // Lock was held. Sleep until state changes.
        syscall(SYS_futex, &m->state,
                FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                MUTEX_WAITERS, NULL, NULL, 0);

        // Woken up (might be spurious). Loop and try again.
    }
}

void mutex_unlock(complete_mutex_t* m) {
    atomic_store_explicit(&m->owner, 0, memory_order_relaxed);

    // Atomically release and check for waiters
    int old = atomic_exchange_explicit(
        &m->state, MUTEX_FREE, memory_order_release);

    if (old == MUTEX_WAITERS) {
        // There were waiters. Wake one.
        syscall(SYS_futex, &m->state,
                FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                1, NULL, NULL, 0);
    }
    // If old was MUTEX_LOCKED (no waiters), no wake needed.
}
```

We have explored the complete journey of mutex implementation—from atomic hardware primitives to production-ready code. Let's consolidate the key insights.
What's Next:

Now that we understand how mutexes are implemented, we're ready to explore the POSIX pthread mutex API—the standard interface for mutexes in Unix-like systems. We'll cover initialization, attributes, error checking, and best practices for production use.
You now understand how mutexes are implemented from the ground up—from atomic operations to kernel syscalls. You understand spinlocks, blocking locks, hybrid approaches, and the memory ordering requirements. Next, we'll apply this knowledge to the practical pthread_mutex API.