A spinlock is fundamentally a waiting mechanism, and the question of how to wait is surprisingly deep. Do you spin at maximum speed, burning through CPU cycles as fast as possible? Do you pause between attempts? Do you back off exponentially when contention is high?
These questions might seem like micro-optimizations, but they have macro consequences. Poor spin behavior can cause cache coherence storms, destroy scalability, drain laptop batteries, and even lock up systems. Properly tuned spin behavior can improve throughput by 10x or more under contention.
This page explores the science and art of spin time tuning.
By the end of this page, you will understand: why spin speed matters for cache coherence; backoff strategies and their trade-offs; the role of exponential backoff in scalability; how to detect and respond to contention; and platform-specific spin tuning considerations.
The naive approach to spinning is to loop as fast as possible:
```c
while (lock->held) { }
```
On a modern 3 GHz processor, this loop can execute billions of iterations per second. Each iteration reads the lock variable, and when using atomic test-and-set (TAS), each iteration attempts a write. This aggressive spinning creates several problems.
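To make the discussion concrete, here is a minimal sketch of what such a naive TAS acquire loop looks like in C11. The `spinlock_t` type with a single `locked` field is an assumption that mirrors the structures used in the later examples.

```c
// Naive TAS spinlock: every loop iteration is an atomic read-modify-write,
// so every iteration demands exclusive ownership of the cache line.
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

void spin_lock_naive(spinlock_t *lock) {
    // atomic_exchange returns the previous value: 0 means we acquired the lock.
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        // No pause, no backoff - hammer the cache line as fast as possible.
    }
}

void spin_unlock_naive(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```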
When multiple CPUs spin on the same lock using TAS, each CPU's test-and-set operation requires exclusive ownership of the cache line. The cache coherence protocol (MESI/MOESI) must move the line to each requester in exclusive state, invalidating every other CPU's copy, and then repeat the transfer for the next spinner.
With N CPUs spinning, the line ping-pongs continuously among them, and every invalidated copy has to be refetched:
```
# Cache coherence traffic with aggressive TAS spinning
# 4 CPUs spinning on same lock

Time  | CPU0 (holder)  | CPU1 (spinner) | CPU2 (spinner) | CPU3 (spinner)
------+----------------+----------------+----------------+----------------
T1    | In critical    | TAS → get excl | (waiting)      | (waiting)
T2    | (working)      | Invalidate all | TAS → get excl | (waiting)
T3    | (working)      | Miss, refetch  | Invalidate all | TAS → get excl
T4    | (working)      | TAS → get excl | Miss, refetch  | Invalidate all
T5    | (working)      | Invalidate all | TAS → get excl | Miss, refetch
...   | (continuous coherence storm, even though holder works unimpeded)

# Result: Each TAS iteration takes 50-200 cycles instead of 3-4
# Interconnect bandwidth consumed by spinners, not useful work
# Lock holder's cache also suffers from the storm
```

Aggressive spinning keeps the core's execution units, load/store queues, and cache hierarchy fully busy.
This prevents the CPU from entering low-power idle states, reducing its clock frequency, or handing shared execution resources to a sibling hyperthread.
Result: Maximum power consumption, maximum heat generation, for zero useful work.
Aggressive spinning hits a scalability wall where adding more CPUs makes things WORSE. The coherence traffic from additional spinners overwhelms the benefits of more parallelism. This is why naive spinlocks fail spectacularly on many-core systems (32+ cores).
Modern processors provide a hint instruction specifically for spin-wait loops. On x86, this is PAUSE; on ARM, it's YIELD; on RISC-V, it's the pause instruction (ratified in 2021).
On Intel/AMD x86, PAUSE inserts a short, implementation-defined delay into the loop, signals the processor that this is a spin-wait loop, cuts the spinning core's power draw, and on SMT parts hands execution resources to the sibling hyperthread for the duration of the pause.
PAUSE also avoids a subtle pipeline penalty: without it, the core speculatively issues a long stream of loads of the lock variable; when another CPU finally writes it, the processor detects a possible memory-order violation and flushes the pipeline, adding a large delay at exactly the moment the lock becomes available.
```c
// Spin loop with PAUSE - proper implementation

void spin_loop_with_pause(volatile int *flag) {
    while (*flag == 0) {
        // Insert PAUSE hint
        // Each iteration waits ~40 cycles instead of ~1
        // Dramatically reduces cache/interconnect pressure
        #if defined(__x86_64__) || defined(__i386__)
            __asm__ volatile("pause" ::: "memory");
        #elif defined(__aarch64__)
            __asm__ volatile("yield" ::: "memory");
        #elif defined(__riscv)
            __asm__ volatile(".insn i 0x0f, 0, x0, x0, 0x010");  // pause
        #else
            // Generic fallback - compiler barrier at minimum
            __asm__ volatile("" ::: "memory");
        #endif
    }
}

// Using __builtin_ia32_pause() on GCC/Clang x86
void spin_loop_builtin(volatile int *flag) {
    while (*flag == 0) {
        __builtin_ia32_pause();
    }
}
```

The impact is dramatic, especially under contention:
| Metric | Without PAUSE | With PAUSE | Improvement |
|---|---|---|---|
| Throughput (acquisitions/sec) | ~500,000 | ~2,000,000 | 4× higher |
| Average wait time | 15 μs | 3 μs | 5× lower |
| Cache misses per acquisition | ~200 | ~20 | 10× fewer |
| Power consumption (watts) | 95W | 75W | 21% lower |
| 95th percentile latency | 100 μs | 8 μs | 12× lower |
There is essentially no scenario where PAUSE hurts performance in a spin loop. The ~40-cycle delay is trivial compared to context switch time (~5000+ cycles). The cache traffic reduction is always beneficial. Make PAUSE your default in any spin loop.
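If you want to see the effect on your own hardware, a rough microbenchmark of the kind sketched below is enough. This is a hypothetical harness (pthreads, GCC/Clang on x86), not the benchmark that produced the table above, and the absolute numbers will differ by machine.

```c
// Minimal contention benchmark: THREADS threads hammer one lock,
// with or without PAUSE in the spin loop.  Build: gcc -O2 -pthread bench.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define ITERS   200000

static atomic_int lock_word = 0;
static int use_pause = 0;          // Set once per run, before threads start

static void lock_acquire(void) {
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) != 0) {
        // TTAS inner loop: wait until the lock looks free before retrying
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0) {
            if (use_pause)
                __builtin_ia32_pause();   // x86-only spin hint
        }
    }
}

static void lock_release(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        lock_acquire();
        // Tiny critical section
        lock_release();
    }
    return NULL;
}

static double run(int pause_on) {
    use_pause = pause_on;
    pthread_t t[THREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    return (double)THREADS * ITERS / secs;   // acquisitions per second
}

int main(void) {
    printf("no PAUSE:   %.0f acq/s\n", run(0));
    printf("with PAUSE: %.0f acq/s\n", run(1));
    return 0;
}
```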
PAUSE helps, but under high contention, even PAUSE-slowed spinning can overwhelm the cache coherence system. Backoff strategies further reduce spinning frequency when contention is detected.
How do we know contention is high? Simple: if our lock acquisition attempt failed, someone else has the lock. If it fails repeatedly, multiple threads are competing.
Contention indicator: Number of consecutive failed acquisition attempts.
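As a minimal sketch of that indicator (again using a hypothetical `spinlock_t`): count consecutive failures during acquisition and let the backoff policy scale its delay from that count.

```c
// Contention signal: consecutive failed acquisition attempts.
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

// Acquires the lock and returns how many attempts failed first;
// the backoff strategies below scale their delay from exactly this kind of count.
int spin_lock_count_failures(spinlock_t *lock) {
    int failures = 0;
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        failures++;                  // Someone else holds the lock
        __builtin_ia32_pause();      // Placeholder wait; a backoff policy plugs in here
    }
    return failures;                 // 0 = uncontended; larger = heavier contention
}
```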
The simplest strategy: wait a fixed number of iterations between attempts.
```c
// Constant backoff spinlock

#define BACKOFF_ITERATIONS 100

void spin_lock_constant_backoff(spinlock_t *lock) {
    while (1) {
        // Check if lock looks free (TTAS)
        if (atomic_load(&lock->locked) == 0) {
            // Try to acquire
            if (atomic_exchange(&lock->locked, 1) == 0) {
                return;  // Got it!
            }
        }

        // Failed - back off before retrying
        for (int i = 0; i < BACKOFF_ITERATIONS; i++) {
            __builtin_ia32_pause();
        }
    }
}

// Pros: Simple, adds consistent delay
// Cons: Fixed delay is suboptimal for varying contention levels
//       Too short under high contention, too long under low contention
```

Exponential backoff adapts to contention levels. After each failed attempt, double the backoff time (up to a maximum). This approach keeps delays short when contention is light, quickly throttles coherence traffic as more threads compete, and requires no per-workload tuning.
```c
// Exponential backoff spinlock

#define MIN_BACKOFF 4
#define MAX_BACKOFF 256

void spin_lock_exponential_backoff(spinlock_t *lock) {
    int backoff = MIN_BACKOFF;

    while (1) {
        // TTAS: spin on read first
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;  // Success!
        }

        // Failed - execute backoff delay
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }

        // Double the backoff for next failure
        if (backoff < MAX_BACKOFF) {
            backoff *= 2;
        }
    }
}

void spin_unlock(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

A further refinement adds randomization to prevent synchronized retries. Without randomization, all threads might wake from backoff simultaneously, creating burst contention.
```c
// Randomized exponential backoff

#define MIN_BACKOFF 4
#define MAX_BACKOFF 256

// Simple pseudo-random number generator (fast, per-thread)
static __thread unsigned int random_state = 12345;

unsigned int fast_random(void) {
    random_state = random_state * 1103515245 + 12345;
    return random_state;
}

void spin_lock_randomized_backoff(spinlock_t *lock) {
    int limit = MIN_BACKOFF;

    while (1) {
        // TTAS spin
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;
        }

        // Random backoff in range [0, limit)
        int backoff = fast_random() % limit;
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }

        // Increase limit for next iteration
        if (limit < MAX_BACKOFF) {
            limit *= 2;
        }
    }
}
```

Without randomization: 8 threads fail, all wait 100 iterations, all retry at exactly the same time, and 7 fail again. With randomization: 8 threads fail, each waits a different number of iterations drawn from [0, limit), the retries spread out in time, and peak contention drops. The random jitter spreads retry attempts out instead of letting them arrive as a synchronized burst.
Exponential backoff has a problem: it's not aware of critical section duration. Ideally, backoff should be proportional to expected wait time.
If we can estimate the likely wait time, we can back off accordingly:
```c
// Proportional backoff based on observed contention

struct tracked_spinlock {
    atomic_int locked;
    atomic_int waiters;        // Count of waiting threads
    atomic_int avg_hold_time;  // Exponential moving average of hold time
};

void spin_lock_proportional(struct tracked_spinlock *lock) {
    // Register as a waiter
    atomic_fetch_add(&lock->waiters, 1);

    while (1) {
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                // Got it - unregister as waiter
                atomic_fetch_sub(&lock->waiters, 1);
                return;
            }
        }

        // Proportional backoff based on waiters and expected hold time
        int waiters  = atomic_load(&lock->waiters);
        int avg_hold = atomic_load(&lock->avg_hold_time);

        // Our expected wait ≈ waiters × avg_hold_time
        // Backoff proportional to this
        int backoff = waiters * (avg_hold / 100);  // Tune divisor
        backoff = min(backoff, MAX_BACKOFF);
        backoff = max(backoff, MIN_BACKOFF);

        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }
    }
}

void spin_unlock_proportional(struct tracked_spinlock *lock, int hold_time) {
    // Update moving average of hold time
    int old_avg = atomic_load(&lock->avg_hold_time);
    int new_avg = (old_avg * 7 + hold_time) / 8;  // 7/8 old, 1/8 new
    atomic_store(&lock->avg_hold_time, new_avg);

    atomic_store(&lock->locked, 0);
}
```

On NUMA systems, not all spins are equal. Accessing a lock on a remote NUMA node is more expensive than local access. Hierarchical backoff adjusts behavior based on the lock's location:
```c
// NUMA-aware backoff

int get_numa_node(void *addr);  // Get NUMA node of memory address
int my_numa_node(void);         // Get current CPU's NUMA node

void spin_lock_numa_aware(spinlock_t *lock) {
    int lock_node = get_numa_node(lock);
    int my_node   = my_numa_node();

    // Base backoff scales with NUMA distance
    int base_backoff;
    if (lock_node == my_node) {
        base_backoff = 4;   // Local - be aggressive
    } else {
        base_backoff = 64;  // Remote - back off more
        // Remote spinning is expensive and pollutes interconnect
    }

    int backoff = base_backoff;

    while (1) {
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;
        }

        // Backoff with NUMA-adjusted values
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }
        backoff = min(backoff * 2, MAX_BACKOFF);
    }
}
```

On NUMA systems, where you allocate the lock matters. A lock protecting per-node data should be allocated on that node. Locks for shared data might be replicated (one per node) or placed on the node with highest access frequency. This is an architecture-level decision, not just a tuning parameter.
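A sketch of the allocation side, assuming Linux with libnuma (`create_node_lock` and `destroy_node_lock` are hypothetical helpers; error handling is minimal):

```c
// Allocate a per-node lock on the node whose threads will use it most.
// Link with -lnuma.
#include <numa.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

spinlock_t *create_node_lock(int node) {
    if (numa_available() < 0)
        return NULL;                       // No NUMA support on this system

    // Back the lock's memory with pages physically on 'node', so the
    // common (local) acquire path avoids cross-node cache line transfers.
    spinlock_t *lock = numa_alloc_onnode(sizeof(*lock), node);
    if (lock)
        atomic_store(&lock->locked, 0);
    return lock;
}

void destroy_node_lock(spinlock_t *lock) {
    if (lock)
        numa_free(lock, sizeof(*lock));
}
```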
Pure spinlocks spin forever. In production systems, this is risky—if the lock holder is delayed unexpectedly (priority inversion, long critical section, bug), spinners burn CPU indefinitely.
Spin count limits bound the maximum spin time before falling back to a different strategy.
After spinning up to a limit, yield the CPU to other threads:
```c
// Spin with yield fallback

#define SPIN_LIMIT 1000

void spin_lock_with_yield(spinlock_t *lock) {
    int spin_count = 0;

    while (1) {
        while (atomic_load(&lock->locked) != 0) {
            if (spin_count < SPIN_LIMIT) {
                __builtin_ia32_pause();
                spin_count++;
            } else {
                // Spun enough - let other threads run
                sched_yield();
                spin_count = 0;  // Reset and try again
            }
        }

        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;
        }

        spin_count = 0;
    }
}

// sched_yield() moves this thread to the back of the run queue
// Other runnable threads get a chance to execute
// If the lock holder is on the same CPU, this helps it finish
```

For longer waits, blocking is more efficient than repeated yield cycles:
```c
// Adaptive spinlock: spin, then block

#define SPIN_THRESHOLD 1000

typedef struct {
    atomic_int locked;
    atomic_int waiters;          // Count of blocked waiters
    // Platform-specific: futex, mutex, semaphore, etc.
    blocking_primitive_t waiter;
} adaptive_spinlock_t;

void adaptive_lock(adaptive_spinlock_t *lock) {
    int spins = 0;

    // Phase 1: Optimistic spinning
    while (spins < SPIN_THRESHOLD) {
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                return;  // Got it without blocking!
            }
        }
        __builtin_ia32_pause();
        spins++;
    }

    // Phase 2: Spin exhausted - prepare to block
    atomic_fetch_add(&lock->waiters, 1);

    while (1) {
        // One more spin check before blocking
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                atomic_fetch_sub(&lock->waiters, 1);
                return;
            }
        }

        // Block until signaled
        block_wait(&lock->waiter);

        // Woken up - try again with brief spin
        for (int i = 0; i < 100; i++) {
            if (atomic_load(&lock->locked) == 0) {
                if (atomic_exchange(&lock->locked, 1) == 0) {
                    atomic_fetch_sub(&lock->waiters, 1);
                    return;
                }
            }
            __builtin_ia32_pause();
        }
    }
}

void adaptive_unlock(adaptive_spinlock_t *lock) {
    atomic_store(&lock->locked, 0);

    // If waiters exist, wake one
    if (atomic_load(&lock->waiters) > 0) {
        wake_one(&lock->waiter);
    }
}
```

How long to spin before giving up depends on the environment:

| System Type | Recommended Spin Limit | Rationale |
|---|---|---|
| Kernel (interrupt context) | Unlimited (must spin) | Cannot block; must complete |
| Kernel (process context) | ~1000-10000 iterations | Balance responsiveness and efficiency |
| Userspace high-perf | ~100-1000 iterations | Context switches are costlier |
| Userspace general | ~10-100 iterations | Yield to other apps quickly |
| Real-time | Based on deadline | Spin up to deadline, then fail |
| Power-sensitive | Very low (~10) | Minimize CPU active time |
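For the real-time row, "spin up to deadline, then fail" could be sketched like this (POSIX `clock_gettime`; the `spin_try_lock_until` name and the x86 PAUSE intrinsic are assumptions):

```c
// Deadline-bounded spinning: try until 'deadline_ns' (CLOCK_MONOTONIC),
// then report failure so the caller can degrade gracefully.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct { atomic_int locked; } spinlock_t;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

bool spin_try_lock_until(spinlock_t *lock, uint64_t deadline_ns) {
    while (now_ns() < deadline_ns) {
        // TTAS: only attempt the atomic write when the lock looks free
        if (atomic_load_explicit(&lock->locked, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0)
            return true;             // Acquired within the deadline
        __builtin_ia32_pause();      // Spin hint (x86; use YIELD on ARM)
    }
    return false;                    // Deadline passed without acquiring
}
```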
When a lock is released, all waiting threads become eligible to acquire it. In spin-based systems, they all try simultaneously. This thundering herd causes a burst of coherence traffic, a pile of failed atomic operations, and a latency spike that can disturb unrelated work on the machine:
```
# Thundering herd with 7 spinning threads when the lock is released

Time | CPU0 (holder) | CPU1-7 (spinners)
-----+---------------+------------------------------------------
T1   | unlock()      | All 7 see lock freed (cache invalidation)
T2   | -             | All 7 execute TAS simultaneously
T3   | -             | CPU1 wins (arbitrary), CPU2-7 get old value 1
T4   | -             | CPU2-7 all retry TAS
T5   | -             | Massive coherence traffic burst
T6   | -             | System stutters; unrelated work suffers

# The burst of 7 TAS operations + 6 failures + retries
# generates more traffic than the entire hold period
```

Several techniques reduce the stampede:

1. Randomized backoff (covered earlier): Threads wait random amounts before retrying, spreading out the retry storm.
2. Queued locks: Instead of all threads competing, threads form a queue. Only the front of the queue tries to acquire. Lock release hands off to the next in queue. No thundering herd because only one thread is ever 'active'.
3. Combining-based approaches: Under high contention, multiple threads combine their operations. One delegate executes for all, reducing coherence traffic.
```c
// Staggered wakeup to reduce thundering herd

// Instead of waking all waiters simultaneously,
// wake one at a time with delays

void smart_unlock(adaptive_spinlock_t *lock) {
    atomic_store(&lock->locked, 0);

    int waiters = atomic_load(&lock->waiters);
    if (waiters > 0) {
        // Wake just ONE waiter
        // That waiter will acquire and wake the next on release
        // Forms implicit serialization
        wake_one(&lock->waiter);
    }
}

// For spin-only locks, can't control wakeup order
// But can use ticket locks or MCS locks for inherent ordering
```

Ticket locks (covered in the next page) elegantly solve the thundering herd problem. Each thread takes a ticket number and spins until the 'now serving' counter reaches its number. Lock release increments the counter, allowing exactly one waiter to proceed. No stampede.
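As a preview of that mechanism, here is a minimal ticket-lock sketch on C11 atomics (the full treatment, including its drawbacks, is on the next page):

```c
// Minimal ticket lock: FIFO ordering, exactly one waiter proceeds per release.
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // Ticket dispenser
    atomic_uint now_serving;   // Ticket currently allowed to hold the lock
} ticket_lock_t;

void ticket_lock(ticket_lock_t *lock) {
    // Take a ticket; fetch_add gives each thread a unique number.
    unsigned my = atomic_fetch_add_explicit(&lock->next_ticket, 1,
                                            memory_order_relaxed);
    // Spin (read-only) until our number comes up.
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my)
        __builtin_ia32_pause();
}

void ticket_unlock(ticket_lock_t *lock) {
    // Hand off: after this store, exactly one waiter's ticket matches.
    unsigned next = atomic_load_explicit(&lock->now_serving,
                                         memory_order_relaxed) + 1;
    atomic_store_explicit(&lock->now_serving, next, memory_order_release);
}
```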
Optimal spin timing varies significantly across processor architectures and generations. What works on one platform may be suboptimal on another.
Intel processors have historically had longer PAUSE delays (~40 cycles on Skylake, ~140 cycles on Skylake-X). AMD processors often have shorter PAUSE delays (~10 cycles). This means an iteration count tuned on one vendor's parts can be several times too long or too short on another's, so backoff limits are better expressed in time than in raw loop counts (see the calibration sketch after the table):
| Processor | PAUSE Latency | Cache Line Transfer | Implications |
|---|---|---|---|
| Intel Skylake (client) | ~40 cycles | ~40 cycles | Well-matched to cache ops |
| Intel Skylake-X (HEDT) | ~140 cycles | ~50-100 cycles | Longer delay, adjust limits |
| Intel Ice Lake | ~10-15 cycles | ~40 cycles | Shorter PAUSE, more iterations |
| AMD Zen 3 | ~10 cycles | ~40 cycles | Aggressive PAUSE needed |
| ARM Cortex-A76 | YIELD varies | ~50 cycles | YIELD behavior differs |
| Apple M1/M2 | ~40 cycles equiv | ~30 cycles | Efficient micro-ops |
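Given that spread, one pragmatic option is to calibrate at startup instead of hard-coding iteration counts. A rough, x86-only sketch using the TSC follows; `calibrate_pause_cycles` is a hypothetical helper, the measurement is order-of-magnitude only, and the figures in the comment come from the table above.

```c
// Measure approximate PAUSE latency in TSC ticks at startup, then size
// spin/backoff limits in time rather than in raw iteration counts.
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_pause (GCC/Clang on x86)

static uint64_t calibrate_pause_cycles(void) {
    const int reps = 10000;
    uint64_t start = __rdtsc();
    for (int i = 0; i < reps; i++)
        _mm_pause();
    uint64_t end = __rdtsc();
    return (end - start) / reps;   // e.g. ~10 on Zen 3 vs ~140 on Skylake-X
}

// Convert a target backoff duration (in TSC ticks) into PAUSE iterations.
static int pause_iterations_for(uint64_t target_cycles, uint64_t pause_cycles) {
    if (pause_cycles == 0) pause_cycles = 1;
    return (int)(target_cycles / pause_cycles);
}
```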
On hyperthreaded (SMT) cores, two logical CPUs share the same physical execution resources. When one hyperthread spins aggressively, it consumes issue slots, load/store bandwidth, and other shared resources its sibling could otherwise use; PAUSE temporarily releases those resources to the sibling.
Optimization: Detect whether the lock holder is running on the sibling hyperthread. If so, back off harder: every shared resource the spinner gives up flows directly to the holder, helping it finish the critical section sooner.
```c
// SMT-aware spinning (conceptual)

// Get sibling CPU ID for SMT pairing
int get_sibling_cpu(int cpu);

void smt_aware_spin(spinlock_t *lock) {
    int my_cpu  = get_current_cpu();
    int sibling = get_sibling_cpu(my_cpu);

    while (atomic_load(&lock->locked) != 0) {
        int holder_cpu = atomic_load(&lock->holder_cpu);

        if (holder_cpu == sibling) {
            // Holder is our sibling hyperthread!
            // Pause in a longer burst - it can make progress
            // with the execution resources we release
            for (int i = 0; i < 10; i++) {
                __builtin_ia32_pause();
            }
        } else {
            // Holder is elsewhere - single PAUSE (yield to sibling)
            __builtin_ia32_pause();
        }
    }

    // Try acquire...
}
```

ARM processors have different memory models and spin primitives: YIELD as the spin-loop hint, plus WFE (wait for event) and SEV (send event) for low-power waiting on a signal.
Key difference: WFE/SEV provide more explicit signaling than x86's polling model.
```c
// ARM spin loop with WFE/SEV

void spin_lock_arm(atomic_int *lock) {
    while (1) {
        // Try to acquire
        int expected = 0;
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
            return;  // Got it
        }

        // Wait for event (low power spin)
        __asm__ volatile("wfe" ::: "memory");
    }
}

void spin_unlock_arm(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);

    // Send event to wake WFE waiters
    __asm__ volatile("sev" ::: "memory");
}
```

Use platform abstraction when possible. C11's <stdatomic.h>, C++'s <atomic>, Rust's std::sync::atomic, and similar provide portable primitives. The implementation handles platform-specific details. Manual assembly should be reserved for extreme performance requirements with known target platforms.
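For example, a portable TTAS spinlock can be written almost entirely against <stdatomic.h>, isolating the spin hint behind one macro; the `cpu_relax` name is a common convention, not a standard API.

```c
// Portable TTAS spinlock on C11 atomics; only cpu_relax() is platform-aware.
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause()
#elif defined(__aarch64__)
#  define cpu_relax() __asm__ volatile("yield" ::: "memory")
#else
#  define cpu_relax() ((void)0)   // Fallback: no hint, still correct
#endif

typedef struct { atomic_bool locked; } portable_spinlock_t;

void portable_lock(portable_spinlock_t *lock) {
    for (;;) {
        // Spin read-only until the lock looks free (TTAS).
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
            cpu_relax();
        // Single atomic write attempt; acquire ordering on success.
        if (!atomic_exchange_explicit(&lock->locked, true, memory_order_acquire))
            return;
    }
}

void portable_unlock(portable_spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}
```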
The details of spin timing profoundly impact spinlock performance. Let's consolidate the key principles:

- Always put a spin-hint instruction (PAUSE/YIELD) inside the loop; it is nearly free and sharply reduces coherence traffic and power.
- Spin on plain reads (TTAS) and attempt the atomic write only when the lock looks free.
- When acquisition attempts keep failing, back off, preferably exponentially and with random jitter.
- Bound the spin: after a platform-appropriate limit, fall back to sched_yield() or blocking.
- Tune for the platform: PAUSE latency, SMT siblings, and NUMA distance all change what a reasonable spin is, so prefer limits expressed in time over raw iteration counts.
What's next:
We've covered the fundamental spinlock and its timing considerations. The next page introduces ticket locks—a more sophisticated design that provides fairness guarantees and eliminates the thundering herd problem through deterministic ordering. Ticket locks represent the bridge between simple spinlocks and advanced queued lock designs.
You now understand the nuances of spin timing: why aggressive spinning fails, how PAUSE instruction helps, backoff strategies for contention adaptation, spin limits for safety, and platform-specific considerations. Next, we explore ticket locks for fair, ordered lock acquisition.