When a thread attempts to acquire a mutex that is already held by another thread, it must wait. This waiting is inevitable—mutual exclusion, by definition, means some threads must wait while others proceed. But how a thread waits has profound implications for system performance.

The fundamental question is: what should a waiting thread do?

There are two primary strategies, and the choice between them represents one of the most important design decisions in synchronization primitives.
By the end of this page, you will understand the fundamental tradeoff between blocking and spinning, when each approach is appropriate, the costs involved in both strategies, and how modern systems use hybrid approaches to get the best of both worlds.
Option 1: Spinning (Busy-Waiting)

The thread stays on the CPU and continuously checks whether the lock has become available:

```c
while (lock_is_held()) {
    // Do nothing, just keep checking
}
acquire_lock();
```

The thread is 'busy' in the sense that it is executing instructions (the loop), but it is not making progress toward any useful computation. It is waiting by burning CPU cycles.

Option 2: Blocking (Sleeping)

The thread voluntarily relinquishes the CPU and asks the operating system to wake it up when the lock becomes available:

```c
if (lock_is_held()) {
    add_me_to_wait_queue();
    go_to_sleep();   // CPU is given to another thread
    // ...later, when woken up...
}
acquire_lock();
```

The thread is 'blocked' in the sense that it cannot run—it is waiting in a queue maintained by the OS kernel. While blocked, it consumes no CPU time.
Spinning trades CPU efficiency for latency. Blocking trades latency for CPU efficiency. Neither is universally better—the optimal choice depends on expected wait time relative to context switch cost.
To understand when blocking makes sense, we must understand what it costs. When a thread blocks, the operating system performs a context switch: saving the current thread's state and loading another thread's state.

Direct Costs of a Context Switch:

1. Kernel Entry/Exit: Transitioning from user mode to kernel mode and back involves privilege level changes, stack switches, and security checks. This takes hundreds to thousands of CPU cycles.

2. State Saving/Restoring: The OS must save all CPU registers (general purpose, floating point, SIMD), the stack pointer, the program counter, and various flags. On modern x86-64, this is 16+ general-purpose registers and 16+ SSE/AVX registers—hundreds of bytes.

3. Scheduler Invocation: The kernel's scheduler must select the next thread to run, which involves examining data structures, possibly rebalancing queues, and making priority decisions.

4. TLB and Cache Effects: When switching to a different thread with a different address space, the TLB (Translation Lookaside Buffer) may need to be flushed. Even for same-process threads, cache working sets differ.
| Component | Approximate Cost | Notes |
|---|---|---|
| System call overhead | 100-500 cycles | User/kernel mode transition |
| Register save/restore | 100-200 cycles | All general + SIMD registers |
| Scheduler invocation | 500-2000 cycles | Depends on runqueue complexity |
| TLB flush (if needed) | 1000-10000 cycles | Full address space switch |
| Cache warm-up | 10000-100000+ cycles | If working set evicted |
| Total (same process) | 1000-5000 cycles | Thread switch, no TLB flush |
| Total (different process) | 5000-50000+ cycles | Full context switch + TLB |
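The table above gives ballpark figures. If you want a feel for the numbers on a specific machine, a rough measurement is easy: below is a minimal sketch (assuming Linux/POSIX with pthreads) that ping-pongs one byte between two threads over a pair of pipes. Each round trip forces two blocking wake-ups, so the elapsed time divided by twice the iteration count approximates the per-switch cost. It does not capture cache warm-up effects, so treat the result as a lower bound.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

static int ping[2], pong[2];   // ping: main -> worker, pong: worker -> main

static void* worker(void* arg) {
    char byte;
    for (int i = 0; i < ITERATIONS; i++) {
        read(ping[0], &byte, 1);    // block until main writes
        write(pong[1], &byte, 1);   // wake main
    }
    return NULL;
}

int main(void) {
    pipe(ping);
    pipe(pong);

    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    struct timespec start, end;
    char byte = 'x';
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(ping[1], &byte, 1);   // wake worker
        read(pong[0], &byte, 1);    // block until worker replies
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(tid, NULL);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("~%.0f ns per blocking switch (very rough)\n", ns / (2.0 * ITERATIONS));
    return 0;
}
```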
The Critical Insight:

On a 3 GHz processor, 5000 cycles is approximately 1.7 microseconds. This may seem tiny in human terms, but consider:

- A spinlock protecting a simple counter increment might hold the lock for 50-100 cycles (tens of nanoseconds)
- A context switch to block and later unblock takes ~10,000+ cycles round-trip

If the critical section is shorter than the context switch overhead, spinning is faster even though it 'wastes' CPU cycles. The thread that spins acquires the lock and proceeds in less time than it would take to sleep and wake up.
The biggest hidden cost of blocking is often cache pollution. When a thread blocks and another thread runs, the new thread's memory accesses evict the blocked thread's data from caches. When the blocked thread resumes, its working set must be reloaded from slower memory—potentially adding tens of thousands of cycles of latency.
Spinning seems wasteful—a thread is executing instructions but accomplishing nothing. To understand the true cost, we must consider what else could be running on that CPU.

CPU Utilization Cost:

Every cycle spent spinning is a cycle not spent doing useful work. If there are other runnable threads waiting for CPU time:

- Those threads are delayed by the spinner's CPU consumption
- System throughput decreases
- Latency for other work increases

On Single-CPU Systems:

Spinning on a single-CPU system is almost always pathological. If thread A spins waiting for thread B to release a lock, but B cannot run because A is consuming the only CPU, the system makes no progress until a timer interrupt preempts A; if preemption or interrupts are disabled, it hangs outright.

On Multi-CPU Systems:

Spinning is more reasonable when:
- The lock holder is running on a different CPU
- The lock will be released soon (short critical section)
- Other CPUs have idle cycles anyway
```c
// SPINNING COST ANALYSIS
//
// Scenario: Thread A holds lock, Thread B wants it
// Thread A's critical section: 1000 cycles
// Context switch cost: 5000 cycles one-way (10000 round-trip)

// CASE 1: B spins waiting for A
// ─────────────────────────────
// Time 0:      B tries to acquire, sees A holds lock
// Time 0-1000: B spins (1000 cycles wasted, but on-CPU)
// Time 1000:   A releases, B immediately acquires
//
// Total latency for B: ~1000 cycles
// CPU cycles 'wasted' by B: ~1000 cycles
// B gets started: Very fast

// CASE 2: B blocks waiting for A
// ─────────────────────────────
// Time 0:      B tries to acquire, sees A holds lock
// Time 0-5000: B performs context switch (enters kernel, sleeps)
// Time ~1000:  A releases lock, signals B (but B is still switching)
// Time 5000:   B's context switch completes, B is now sleeping
// Time 5000+:  A tries to wake B, another context switch for B
// Time 10000+: B finally acquires lock and continues
//
// Total latency for B: ~10000+ cycles
// CPU cycles freed: B's execution time was given to other threads
// B gets started: Much slower

// CROSSOVER POINT:
// If critical section < context switch cost → spinning wins
// If critical section > context switch cost → blocking wins
//
// Rule of thumb: Spin for microseconds, block for milliseconds
```

Energy Cost:

Spinning also has an energy cost. A spinning CPU is fully active, consuming maximum power. A blocked thread allows the CPU to enter lower power states (if no other work is pending). In mobile devices, data centers, and battery-powered systems, this difference matters.

The Spin-Wait Loop Optimization:

Modern CPUs recognize spin loops and provide special instructions:

- x86 PAUSE instruction: Reduces power consumption during the spin and improves performance on hyperthreaded CPUs by yielding pipeline resources to the other hyperthread
- ARM WFE/SEV: Wait-for-event allows efficient spinning with lower power

These instructions don't eliminate the cost of spinning, but they reduce it significantly.
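For concreteness, here is a minimal sketch of what a PAUSE-assisted spin loop might look like on x86-64 with GCC or Clang, using the `_mm_pause()` intrinsic. The lock word and the `try_acquire` helper are hypothetical placeholders for illustration, not a real API; ARM would use WFE/SEV instead.

```c
#include <immintrin.h>   // _mm_pause() on x86-64
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical lock word: 0 = free, 1 = held.
static atomic_int lock_word;

static bool try_acquire(void) {
    int expected = 0;
    return atomic_compare_exchange_strong(&lock_word, &expected, 1);
}

// Spin until the lock is acquired, issuing PAUSE between checks.
// PAUSE hints to the CPU that this is a spin-wait loop: it lowers power
// draw and frees pipeline resources for the sibling hyperthread.
void spin_acquire(void) {
    while (!try_acquire()) {
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0) {
            _mm_pause();
        }
    }
}
```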
Never use spinlocks on uniprocessor systems without preemption enabled. If the lock holder cannot run, the spinner will consume the entire timeslice (or worse, spin forever if interrupts are disabled). This is a classic system hang scenario.
The key to choosing between blocking and spinning is the expected wait time relative to the context switch cost. There exists a crossover point where the strategies are equally efficient.

Mathematical Analysis:

Let:
- W = expected wait time (how long until the lock is released)
- C = context switch cost (round-trip: sleep + wake)
- S = spin cost per unit time (essentially 1, since the thread occupies the CPU)

Spinning cost: W × S = W (the thread spins for the entire wait)
Blocking cost: C (fixed context switch overhead, regardless of wait time)

Crossover: W = C

- If W < C: Spinning is more efficient (spin time < context switch time)
- If W > C: Blocking is more efficient (save the CPU for other work)
- If W = C: Either approach has the same cost
```
CROSSOVER POINT VISUALIZATION

 Cost
  ^
  |                       /
  |                      /   Spinning (linear cost = W)
  |                     /
C |....................X...............  Blocking (fixed cost C)
  |                   /:
  |                  / :  ← Crossover point (W = C)
  |                 /  :
  |                /   :
  +------------------------------------>  Wait Time (W)
                       C

For W < C: spinning line is below blocking line → spinning is cheaper
For W > C: spinning line is above blocking line → blocking is cheaper
At  W = C: both strategies cost the same

REAL NUMBERS (approximate):
─────────────────────────────────────────────────────────
Context switch cost (C): ~5000 cycles ≈ 1.7 µs @ 3 GHz
Crossover point: ~1.7 µs of expected wait time

Critical sections shorter than ~1.7 µs → spinning probably wins
Critical sections longer than ~1.7 µs  → blocking probably wins

COMPLICATIONS:
• We don't know the wait time in advance (it's probabilistic)
• Context switch cost varies based on cache state
• Multiple waiters change the calculus
• System load affects whether spinning steals from useful work
```

The Uncertainty Problem:

The analysis above assumes we know the wait time W in advance. In reality, we don't. When a thread attempts to acquire a lock:

- It doesn't know how long the holder will take
- Critical section times may be highly variable
- The holder might be preempted, extending the wait time unexpectedly

This uncertainty motivates adaptive and hybrid approaches that attempt to estimate wait time dynamically.
As a software engineer, you can often predict whether critical sections are short or long: incrementing a counter is short (spin); performing I/O is long (block); accessing a complex data structure is usually short; calling external services is long. Use this domain knowledge to choose the appropriate lock type.
Since neither spinning nor blocking is universally superior, modern synchronization primitives often use hybrid approaches that attempt to get the benefits of both strategies.

Two-Phase Locking (Spin-Then-Block):

The most common hybrid is to spin for a short time, then block if the lock is still not available:

1. Phase 1 (Spin): Spin in a tight loop for N iterations (or T microseconds)
2. Phase 2 (Block): If the lock was not acquired during the spin phase, block

This captures the benefits of spinning for short waits while avoiding the catastrophic cost of spinning for long waits.
```c
// TWO-PHASE HYBRID MUTEX (Spin-Then-Block)
//
// This is conceptually how many real-world mutexes work
// (actual implementations are more sophisticated)

#define SPIN_LIMIT 1000   // Spin for at most 1000 iterations

void hybrid_mutex_lock(mutex_t* m) {
    // PHASE 1: Optimistic spinning
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (try_lock(m)) {
            return;   // Got the lock quickly!
        }
        // Pause instruction: reduces power, helps hyperthreading
        cpu_pause();
    }

    // PHASE 2: Lock not acquired during spin phase, block
    // This is the slow path - requires kernel involvement
    kernel_mutex_lock(m);
}

// The spin limit is chosen such that:
// - SPIN_LIMIT iterations ≈ context switch cost
// - If we spin longer than this, blocking would be more efficient

// ADVANTAGES:
// • Fast path (get lock during spin): No kernel entry, very low latency
// • Slow path (block): Avoids wasting CPU on long waits
// • Works well for variable-length critical sections

// LINUX IMPLEMENTATION:
// Linux's futex (Fast Userspace muTEX) uses this pattern.
// The mutex state is checked in userspace (fast spin),
// and only falls back to the kernel for sleeping (slow block).
```

Adaptive Spinning:

More sophisticated implementations adjust the spin duration dynamically based on observed behavior:

- Success-based adaptation: If spinning recently succeeded in acquiring the lock, spin longer next time. If spinning usually fails (transitions to blocking), spin less.
- Lock holder CPU tracking: Some implementations check whether the lock holder is currently running on a CPU. If yes, spin (the holder is making progress). If no (the holder is blocked), don't spin.
- Queueing theory: Under high contention, spinning is counterproductive. Detect the contention level and adjust the strategy.
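As an illustration of success-based adaptation, here is a minimal sketch that keeps a per-lock spin budget and grows or shrinks it depending on whether the spin phase paid off. The helpers `try_lock`, `kernel_mutex_lock`, and `cpu_pause`, and the `spin_limit` field on `mutex_t`, are assumed placeholders in the spirit of the pseudocode above, not a real library API.

```c
#define SPIN_MIN 32
#define SPIN_MAX 4096

// Assumed: mutex_t carries an int spin_limit field, initialized somewhere
// between SPIN_MIN and SPIN_MAX. Races on spin_limit are tolerated:
// it is only a heuristic, so an occasional lost update is harmless.
void adaptive_mutex_lock(mutex_t* m) {
    int budget = m->spin_limit;

    // Spin phase, bounded by the current budget.
    for (int i = 0; i < budget; i++) {
        if (try_lock(m)) {
            // Spinning paid off: be willing to spin longer next time.
            if (m->spin_limit < SPIN_MAX) m->spin_limit *= 2;
            return;
        }
        cpu_pause();
    }

    // Spinning failed: shrink the budget and fall back to blocking.
    if (m->spin_limit > SPIN_MIN) m->spin_limit /= 2;
    kernel_mutex_lock(m);
}
```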
Linux's futex (Fast Userspace muTEX) system call is the foundation of NPTL (Native POSIX Threads Library) mutexes. In the uncontended case, acquisition is purely in userspace using atomic operations—no kernel entry. Only when contention occurs does the futex enter the kernel for sleeping. This hybrid gives userspace-only fast paths with kernel-mediated blocking for fairness.
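To make the futex-based slow path shown later on this page concrete, here is a rough sketch of how `futex_wait` and `futex_wake` helpers could be implemented on Linux by invoking the raw futex system call (glibc exposes it only via `syscall`). This is a teaching sketch with error handling omitted.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep only if *addr still equals 'expected'; otherwise return immediately.
// The kernel re-checks the value atomically, which closes the race between
// "I saw the lock held" and "I went to sleep".
static void futex_wait(atomic_int* addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

// Wake up to 'count' threads currently sleeping on addr.
static void futex_wake(atomic_int* addr, int count) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, count, NULL, NULL, 0);
}
```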
Real mutex implementations combine atomic operations, spinning, and blocking in carefully tuned ways. Let's examine the major strategies.

Strategy 1: Pure Spinlock

Used in kernel code where sleeping is impossible (interrupt handlers, holding other spinlocks).
```c
// PURE SPINLOCK (e.g., kernel spinlock_t)

typedef struct {
    atomic_int locked;   // 0 = unlocked, 1 = locked
} spinlock_t;

void spin_lock(spinlock_t* lock) {
    while (atomic_exchange(&lock->locked, 1) == 1) {
        // Lock was already held, spin
        while (atomic_load(&lock->locked) == 1) {
            cpu_pause();   // Reduce power, help hyperthreading
        }
        // Lock appears free, try to acquire again
    }
    // atomic_exchange returned 0: we got the lock
    memory_barrier();   // Ensure subsequent reads see latest data
}

void spin_unlock(spinlock_t* lock) {
    memory_barrier();   // Ensure all writes complete before unlock
    atomic_store(&lock->locked, 0);
}

// Note: The inner while loop (read-only check) is an optimization.
// Spinning on atomic_exchange is expensive due to cache line bouncing.
// Spinning on atomic_load keeps the cache line in shared state until
// it actually changes, reducing bus traffic.
```

Strategy 2: Blocking Mutex (Sleep-Based)

Used in userspace when critical sections may be long.
```c
// BLOCKING MUTEX (conceptual implementation)

typedef struct {
    atomic_int   locked;
    wait_queue_t waiters;      // Queue of sleeping threads
    spinlock_t   queue_lock;   // Protects the wait queue
} blocking_mutex_t;

void blocking_mutex_lock(blocking_mutex_t* m) {
    // Fast path: try to acquire without blocking
    if (atomic_compare_exchange(&m->locked, 0, 1)) {
        return;   // Got the lock immediately
    }

    // Slow path: must block
    spin_lock(&m->queue_lock);

    // Double-check after acquiring queue lock
    if (atomic_compare_exchange(&m->locked, 0, 1)) {
        spin_unlock(&m->queue_lock);
        return;
    }

    // Add ourselves to wait queue and sleep
    add_to_wait_queue(&m->waiters, current_thread());
    spin_unlock(&m->queue_lock);

    // Go to sleep - kernel removes us from runqueue
    sleep_until_woken();

    // When we wake up, we own the lock
}

void blocking_mutex_unlock(blocking_mutex_t* m) {
    spin_lock(&m->queue_lock);

    thread_t* waiter = remove_from_wait_queue(&m->waiters);
    if (waiter != NULL) {
        // Transfer ownership to waiter before waking
        // (waiter will own lock when it wakes)
        wake_up(waiter);
    } else {
        // No waiters, just unlock
        atomic_store(&m->locked, 0);
    }

    spin_unlock(&m->queue_lock);
}
```

Strategy 3: Hybrid (Spin-Then-Block)

The typical userspace mutex implementation.
```c
// HYBRID MUTEX (Spin-Then-Block) - Similar to pthread_mutex

#define MAX_SPIN 100

typedef struct {
    atomic_int state;   // 0=unlocked, 1=locked, 2=locked+waiters
} hybrid_mutex_t;

void hybrid_mutex_lock(hybrid_mutex_t* m) {
    int expected = 0;

    // Ultra-fast path: lock was free
    if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
        return;
    }

    // Spin phase
    for (int i = 0; i < MAX_SPIN; i++) {
        if (atomic_load(&m->state) == 0) {
            expected = 0;
            if (atomic_compare_exchange_weak(&m->state, &expected, 1)) {
                return;
            }
        }
        cpu_pause();
    }

    // Slow path: set state to 2 (indicates waiters)
    // and enter kernel to sleep
    while (1) {
        // Try to acquire, or set waiter bit
        int old = atomic_exchange(&m->state, 2);
        if (old == 0) {
            return;   // We got it (was unlocked)
        }
        // Call kernel to sleep until woken
        // Kernel checks m->state and sleeps if still 2
        futex_wait(&m->state, 2);
    }
}

void hybrid_mutex_unlock(hybrid_mutex_t* m) {
    if (atomic_exchange(&m->state, 0) == 2) {
        // There were waiters, wake one up
        futex_wake(&m->state, 1);
    }
    // If state was 1 (no waiters), nothing more needed
}
```

Given the tradeoffs, how do you choose between spinning and blocking in practice? Here are guidelines based on real-world scenarios.
| Scenario | Recommendation | Reasoning |
|---|---|---|
| Critical section: few instructions | Spin | Lock held for nanoseconds; context switch would be overkill |
| Critical section: I/O operations | Block | I/O can take milliseconds; spinning would waste massive CPU |
| Critical section: unknown/variable | Hybrid | Spin briefly for fast path, block for slow path |
| Single-CPU system | Block (almost always) | Spinning prevents lock holder from running |
| Many-CPU system, low contention | Spin or light hybrid | Lock holder likely running; quick release expected |
| Many-CPU system, high contention | Block | Spinning exacerbates contention; threads pile up |
| Kernel context, interrupts disabled | Spin (only option) | Cannot sleep with interrupts disabled |
| Userspace, normal context | Standard pthread_mutex | Uses an optimized hybrid internally (see the sketch after this table) |
| Real-time system, predictable timing | Spin or priority-aware block | Context switch jitter may violate timing constraints |
| Battery-powered device | Block when possible | Spinning drains battery even when doing nothing |
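For the common userspace case, you normally just declare a `pthread_mutex_t` and let the library choose the waiting strategy. glibc additionally exposes an explicitly adaptive (spin-then-block) mutex type through a non-portable extension; the following is a minimal sketch assuming glibc on Linux.

```c
#define _GNU_SOURCE          // needed for PTHREAD_MUTEX_ADAPTIVE_NP (glibc extension)
#include <pthread.h>

static pthread_mutex_t counter_lock;
static long counter;

void init_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // The default type blocks on contention; the adaptive type spins for a
    // bounded number of iterations before falling back to blocking.
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&counter_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void increment(void) {
    pthread_mutex_lock(&counter_lock);   // short critical section: good spin candidate
    counter++;
    pthread_mutex_unlock(&counter_lock);
}
```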
The 'just use spinlocks for short critical sections' advice is a simplification. In practice, 'short' critical sections can become long due to: cache misses, page faults, preemption, or unexpected code paths. Always consider worst-case behavior, not just average-case.
Beyond the basic tradeoff, several advanced considerations affect the blocking vs. spinning decision.

Priority Inversion:

When a high-priority thread blocks waiting for a lock held by a low-priority thread, and a medium-priority thread preempts the low-priority holder, the high-priority thread is effectively delayed by the medium-priority thread. This is priority inversion.

- Blocking mutexes can implement priority inheritance or priority ceiling to address this
- Spinlocks don't suffer from priority inversion in the same way, since the spinner stays on the CPU

Lock Holder Preemption:

If a lock holder is preempted (removed from the CPU by the scheduler), spinning waiters waste cycles indefinitely. Solutions:

- Linux kernel: Disables preemption while holding spinlocks
- Userspace: Hybrid approach limits spin time before blocking
- Advanced: Check whether the lock holder is currently running before spinning
```c
// PRIORITY INVERSION SCENARIO

// Three threads with priorities: HIGH (3), MEDIUM (2), LOW (1)

void low_priority_thread() {
    mutex_lock(&shared_mutex);   // LOW acquires lock

    // ...doing work...

    // SUDDENLY: HIGH priority thread tries to acquire mutex
    // HIGH blocks waiting for mutex

    // PROBLEM: MEDIUM priority thread becomes runnable
    // Scheduler runs MEDIUM (higher priority than LOW)
    // LOW is preempted and cannot release the lock!
    // HIGH is blocked, but MEDIUM is running
    //
    // Result: HIGH waits for MEDIUM indirectly (priority inversion)

    // ...eventually LOW runs again...
    mutex_unlock(&shared_mutex);   // HIGH finally gets the lock
}

// SOLUTIONS FOR BLOCKING MUTEXES:

// 1. Priority Inheritance:
//    When HIGH blocks on a mutex held by LOW,
//    temporarily boost LOW's priority to HIGH's level.
//    Now LOW runs instead of MEDIUM until it releases.

// 2. Priority Ceiling:
//    Assign each mutex a "ceiling" priority = max of all users.
//    When any thread acquires the mutex, boost to the ceiling.
//    Prevents priority inversion entirely.

// Note: Spinlocks avoid this issue differently - the spinning
// thread keeps its CPU, so no preemption of the spinner occurs.
// But the lock holder can still be preempted, causing long spins.
```

NUMA Effects:

On NUMA (Non-Uniform Memory Access) systems, the memory location of the lock variable affects spin performance:

- Spinning on a lock in remote memory requires cross-node communication for every check
- Cache line bouncing is more expensive across NUMA nodes
- NUMA-aware locks try to keep the lock and waiters on the same node

Cache Line Bouncing:

When multiple threads spin on the same lock variable, each spin attempt causes the cache line to bounce between CPUs. This saturates the memory bus and slows down the lock holder, extending the critical section time and making the problem worse (a feedback loop).
Use 'local spinning' techniques: instead of all threads spinning on the same variable, have each thread spin on its own cache-local variable. The lock holder writes to each waiter's variable to wake them. This is how MCS locks and CLH locks work—they create a queue of spinners, each spinning locally.
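To illustrate local spinning, here is a minimal MCS-style lock sketch using C11 atomics. Each waiter enqueues its own node and spins only on that node's `locked` flag, so the spin traffic stays in the waiter's own cache line. This is a simplified teaching version, not production code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node*) next;
    atomic_bool               locked;   // true while this waiter must spin
} mcs_node_t;

typedef _Atomic(mcs_node_t*) mcs_lock_t;   // tail of the waiter queue

void mcs_lock(mcs_lock_t* lock, mcs_node_t* me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);

    // Atomically append ourselves to the queue.
    mcs_node_t* prev = atomic_exchange(lock, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);          // link behind the previous waiter
        while (atomic_load(&me->locked)) { }    // spin on OUR OWN node only
    }
    // prev == NULL means the lock was free: we own it immediately.
}

void mcs_unlock(mcs_lock_t* lock, mcs_node_t* me) {
    mcs_node_t* next = atomic_load(&me->next);
    if (next == NULL) {
        // No visible successor: try to swing the tail back to empty.
        mcs_node_t* expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL)) {
            return;   // nobody was waiting
        }
        // A new waiter is mid-enqueue; wait for it to link itself.
        while ((next = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&next->locked, false);   // hand the lock to the successor
}
```

Each thread passes in its own `mcs_node_t` (often stack-allocated), which is exactly what keeps the spinning local: the lock holder wakes its successor by writing that waiter's flag, instead of every waiter hammering one shared word.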
We have explored the fundamental tradeoff at the heart of mutex implementation. The key insights: spinning wins when the expected wait is shorter than a context switch, blocking wins when it is longer, and because wait times are rarely known in advance, practical mutexes use hybrid spin-then-block strategies.
What's Next:

Now that we understand the blocking vs. spinning tradeoff, we're ready to examine how mutexes are actually implemented in practice. The next page dives into mutex implementation—the internal data structures, atomic operations, and algorithms that make mutexes work correctly and efficiently.
You now understand the fundamental tradeoff between blocking and spinning in mutex implementations. You understand the costs involved, when each approach is appropriate, and how modern systems use hybrid approaches. Next, we'll see how these principles are applied in real mutex implementations.