A spinlock is fundamentally a waiting mechanism, and the question of how to wait is surprisingly deep. Do you spin at maximum speed, burning through CPU cycles as fast as possible? Do you pause between attempts? Do you back off exponentially when contention is high?
These questions might seem like micro-optimizations, but they have macro consequences. Poor spin behavior can cause cache coherence storms, destroy scalability, drain laptop batteries, and even lock up systems. Properly tuned spin behavior can improve throughput by 10x or more under contention.
This page explores the science and art of spin time tuning.
By the end of this page, you will understand: why spin speed matters for cache coherence; backoff strategies and their trade-offs; the role of exponential backoff in scalability; how to detect and respond to contention; and platform-specific spin tuning considerations.
The naive approach to spinning is to loop as fast as possible:
```c
while (lock->held) { }
```
On a modern 3 GHz processor, this loop can execute billions of iterations per second. Each iteration reads the lock variable, and when using atomic test-and-set (TAS), each iteration attempts a write. This aggressive spinning creates several problems.
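To make the discussion concrete, here is a minimal sketch of what such a naive TAS acquire loop looks like in C11. The `spinlock_t` type with a single `locked` field is an assumption that mirrors the structures used in the later examples.

```c
// Naive TAS spinlock: every loop iteration is an atomic read-modify-write,
// so every iteration demands exclusive ownership of the cache line.
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

void spin_lock_naive(spinlock_t *lock) {
    // atomic_exchange returns the previous value: 0 means we acquired the lock.
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        // No pause, no backoff - hammer the cache line as fast as possible.
    }
}

void spin_unlock_naive(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```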
When multiple CPUs spin on the same lock using TAS, each CPU's test-and-set operation requires exclusive ownership of the cache line. The cache coherence protocol (MESI/MOESI) must move the line to each requester in exclusive state, invalidating every other CPU's copy, and then repeat the transfer for the next spinner.
With N CPUs spinning, the line ping-pongs continuously among them, and every invalidated copy has to be refetched:
```
# Cache coherence traffic with aggressive TAS spinning
# 4 CPUs spinning on same lock

Time  | CPU0 (holder)  | CPU1 (spinner) | CPU2 (spinner) | CPU3 (spinner)
------+----------------+----------------+----------------+----------------
T1    | In critical    | TAS → get excl | (waiting)      | (waiting)
T2    | (working)      | Invalidate all | TAS → get excl | (waiting)
T3    | (working)      | Miss, refetch  | Invalidate all | TAS → get excl
T4    | (working)      | TAS → get excl | Miss, refetch  | Invalidate all
T5    | (working)      | Invalidate all | TAS → get excl | Miss, refetch
...   | (continuous coherence storm, even though holder works unimpeded)

# Result: Each TAS iteration takes 50-200 cycles instead of 3-4
# Interconnect bandwidth consumed by spinners, not useful work
# Lock holder's cache also suffers from the storm
```

Aggressive spinning keeps the core's execution units, load/store queues, and cache hierarchy fully busy.
This prevents the CPU from entering low-power idle states, reducing its clock frequency, or handing shared execution resources to a sibling hyperthread.
Result: Maximum power consumption, maximum heat generation, for zero useful work.
Aggressive spinning hits a scalability wall where adding more CPUs makes things WORSE. The coherence traffic from additional spinners overwhelms the benefits of more parallelism. This is why naive spinlocks fail spectacularly on many-core systems (32+ cores).
Modern processors provide a hint instruction specifically for spin-wait loops. On x86, this is PAUSE; on ARM, it's YIELD; on RISC-V, it's the pause instruction (ratified in 2021).
On Intel/AMD x86, PAUSE inserts a short, implementation-defined delay into the loop, signals the processor that this is a spin-wait loop, cuts the spinning core's power draw, and on SMT parts hands execution resources to the sibling hyperthread for the duration of the pause.
PAUSE also avoids a subtle pipeline penalty: without it, the core speculatively issues a long stream of loads of the lock variable; when another CPU finally writes it, the processor detects a possible memory-order violation and flushes the pipeline, adding a large delay at exactly the moment the lock becomes available.
```c
// Spin loop with PAUSE - proper implementation

void spin_loop_with_pause(volatile int *flag) {
    while (*flag == 0) {
        // Insert PAUSE hint
        // Each iteration waits ~40 cycles instead of ~1
        // Dramatically reduces cache/interconnect pressure
        #if defined(__x86_64__) || defined(__i386__)
            __asm__ volatile("pause" ::: "memory");
        #elif defined(__aarch64__)
            __asm__ volatile("yield" ::: "memory");
        #elif defined(__riscv)
            __asm__ volatile(".insn i 0x0f, 0, x0, x0, 0x010");  // pause
        #else
            // Generic fallback - compiler barrier at minimum
            __asm__ volatile("" ::: "memory");
        #endif
    }
}

// Using __builtin_ia32_pause() on GCC/Clang x86
void spin_loop_builtin(volatile int *flag) {
    while (*flag == 0) {
        __builtin_ia32_pause();
    }
}
```

The impact is dramatic, especially under contention:
| Metric | Without PAUSE | With PAUSE | Improvement |
|---|---|---|---|
| Throughput (acquisitions/sec) | ~500,000 | ~2,000,000 | 4× higher |
| Average wait time | 15 μs | 3 μs | 5× lower |
| Cache misses per acquisition | ~200 | ~20 | 10× fewer |
| Power consumption (watts) | 95W | 75W | 21% lower |
| 95th percentile latency | 100 μs | 8 μs | 12× lower |
There is essentially no scenario where PAUSE hurts performance in a spin loop. The ~40-cycle delay is trivial compared to context switch time (~5000+ cycles). The cache traffic reduction is always beneficial. Make PAUSE your default in any spin loop.
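If you want to see the effect on your own hardware, a rough microbenchmark of the kind sketched below is enough. This is a hypothetical harness (pthreads, GCC/Clang on x86), not the benchmark that produced the table above, and the absolute numbers will differ by machine.

```c
// Minimal contention benchmark: THREADS threads hammer one lock,
// with or without PAUSE in the spin loop.  Build: gcc -O2 -pthread bench.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define ITERS   200000

static atomic_int lock_word = 0;
static int use_pause = 0;          // Set once per run, before threads start

static void lock_acquire(void) {
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) != 0) {
        // TTAS inner loop: wait until the lock looks free before retrying
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0) {
            if (use_pause)
                __builtin_ia32_pause();   // x86-only spin hint
        }
    }
}

static void lock_release(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        lock_acquire();
        // Tiny critical section
        lock_release();
    }
    return NULL;
}

static double run(int pause_on) {
    use_pause = pause_on;
    pthread_t t[THREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    return (double)THREADS * ITERS / secs;   // acquisitions per second
}

int main(void) {
    printf("no PAUSE:   %.0f acq/s\n", run(0));
    printf("with PAUSE: %.0f acq/s\n", run(1));
    return 0;
}
```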
PAUSE helps, but under high contention, even PAUSE-slowed spinning can overwhelm the cache coherence system. Backoff strategies further reduce spinning frequency when contention is detected.
How do we know contention is high? Simple: if our lock acquisition attempt failed, someone else has the lock. If it fails repeatedly, multiple threads are competing.
Contention indicator: Number of consecutive failed acquisition attempts.
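As a minimal sketch of that indicator (again using a hypothetical `spinlock_t`): count consecutive failures during acquisition and let the backoff policy scale its delay from that count.

```c
// Contention signal: consecutive failed acquisition attempts.
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

// Acquires the lock and returns how many attempts failed first;
// the backoff strategies below scale their delay from exactly this kind of count.
int spin_lock_count_failures(spinlock_t *lock) {
    int failures = 0;
    while (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) != 0) {
        failures++;                  // Someone else holds the lock
        __builtin_ia32_pause();      // Placeholder wait; a backoff policy plugs in here
    }
    return failures;                 // 0 = uncontended; larger = heavier contention
}
```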
The simplest strategy: wait a fixed number of iterations between attempts.
```c
// Constant backoff spinlock

#define BACKOFF_ITERATIONS 100

void spin_lock_constant_backoff(spinlock_t *lock) {
    while (1) {
        // Check if lock looks free (TTAS)
        if (atomic_load(&lock->locked) == 0) {
            // Try to acquire
            if (atomic_exchange(&lock->locked, 1) == 0) {
                return;  // Got it!
            }
        }

        // Failed - back off before retrying
        for (int i = 0; i < BACKOFF_ITERATIONS; i++) {
            __builtin_ia32_pause();
        }
    }
}

// Pros: Simple, adds consistent delay
// Cons: Fixed delay is suboptimal for varying contention levels
//       Too short under high contention, too long under low contention
```

Exponential backoff adapts to contention levels. After each failed attempt, double the backoff time (up to a maximum). This approach keeps delays short when contention is light, quickly throttles coherence traffic as more threads compete, and requires no per-workload tuning.
```c
// Exponential backoff spinlock

#define MIN_BACKOFF 4
#define MAX_BACKOFF 256

void spin_lock_exponential_backoff(spinlock_t *lock) {
    int backoff = MIN_BACKOFF;

    while (1) {
        // TTAS: spin on read first
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;  // Success!
        }

        // Failed - execute backoff delay
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }

        // Double the backoff for next failure
        if (backoff < MAX_BACKOFF) {
            backoff *= 2;
        }
    }
}

void spin_unlock(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

A further refinement adds randomization to prevent synchronized retries. Without randomization, all threads might wake from backoff simultaneously, creating burst contention.
```c
// Randomized exponential backoff

#define MIN_BACKOFF 4
#define MAX_BACKOFF 256

// Simple pseudo-random number generator (fast, per-thread)
static __thread unsigned int random_state = 12345;

unsigned int fast_random(void) {
    random_state = random_state * 1103515245 + 12345;
    return random_state;
}

void spin_lock_randomized_backoff(spinlock_t *lock) {
    int limit = MIN_BACKOFF;

    while (1) {
        // TTAS spin
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;
        }

        // Random backoff in range [0, limit)
        int backoff = fast_random() % limit;
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }

        // Increase limit for next iteration
        if (limit < MAX_BACKOFF) {
            limit *= 2;
        }
    }
}
```

Without randomization: 8 threads fail, all wait 100 iterations, all retry at exactly the same time, and 7 fail again. With randomization: 8 threads fail, each waits a different number of iterations drawn from [0, limit), the retries spread out in time, and peak contention drops. The random jitter spreads retry attempts out instead of letting them arrive as a synchronized burst.
Exponential backoff has a problem: it's not aware of critical section duration. Ideally, backoff should be proportional to expected wait time.
If we can estimate the likely wait time, we can back off accordingly:
```c
// Proportional backoff based on observed contention

struct tracked_spinlock {
    atomic_int locked;
    atomic_int waiters;        // Count of waiting threads
    atomic_int avg_hold_time;  // Exponential moving average of hold time
};

void spin_lock_proportional(struct tracked_spinlock *lock) {
    // Register as a waiter
    atomic_fetch_add(&lock->waiters, 1);

    while (1) {
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                // Got it - unregister as waiter
                atomic_fetch_sub(&lock->waiters, 1);
                return;
            }
        }

        // Proportional backoff based on waiters and expected hold time
        int waiters  = atomic_load(&lock->waiters);
        int avg_hold = atomic_load(&lock->avg_hold_time);

        // Our expected wait ≈ waiters × avg_hold_time
        // Backoff proportional to this
        int backoff = waiters * (avg_hold / 100);  // Tune divisor
        backoff = min(backoff, MAX_BACKOFF);
        backoff = max(backoff, MIN_BACKOFF);

        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }
    }
}

void spin_unlock_proportional(struct tracked_spinlock *lock, int hold_time) {
    // Update moving average of hold time
    int old_avg = atomic_load(&lock->avg_hold_time);
    int new_avg = (old_avg * 7 + hold_time) / 8;  // 7/8 old, 1/8 new
    atomic_store(&lock->avg_hold_time, new_avg);

    atomic_store(&lock->locked, 0);
}
```

On NUMA systems, not all spins are equal. Accessing a lock on a remote NUMA node is more expensive than local access. Hierarchical backoff adjusts behavior based on the lock's location:
```c
// NUMA-aware backoff

int get_numa_node(void *addr);  // Get NUMA node of memory address
int my_numa_node(void);         // Get current CPU's NUMA node

void spin_lock_numa_aware(spinlock_t *lock) {
    int lock_node = get_numa_node(lock);
    int my_node   = my_numa_node();

    // Base backoff scales with NUMA distance
    int base_backoff;
    if (lock_node == my_node) {
        base_backoff = 4;   // Local - be aggressive
    } else {
        base_backoff = 64;  // Remote - back off more
        // Remote spinning is expensive and pollutes interconnect
    }

    int backoff = base_backoff;

    while (1) {
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __builtin_ia32_pause();
        }

        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            return;
        }

        // Backoff with NUMA-adjusted values
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }
        backoff = min(backoff * 2, MAX_BACKOFF);
    }
}
```

On NUMA systems, where you allocate the lock matters. A lock protecting per-node data should be allocated on that node. Locks for shared data might be replicated (one per node) or placed on the node with highest access frequency. This is an architecture-level decision, not just a tuning parameter.
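A sketch of the allocation side, assuming Linux with libnuma (`create_node_lock` and `destroy_node_lock` are hypothetical helpers; error handling is minimal):

```c
// Allocate a per-node lock on the node whose threads will use it most.
// Link with -lnuma.
#include <numa.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

spinlock_t *create_node_lock(int node) {
    if (numa_available() < 0)
        return NULL;                       // No NUMA support on this system

    // Back the lock's memory with pages physically on 'node', so the
    // common (local) acquire path avoids cross-node cache line transfers.
    spinlock_t *lock = numa_alloc_onnode(sizeof(*lock), node);
    if (lock)
        atomic_store(&lock->locked, 0);
    return lock;
}

void destroy_node_lock(spinlock_t *lock) {
    if (lock)
        numa_free(lock, sizeof(*lock));
}
```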
Pure spinlocks spin forever. In production systems, this is risky—if the lock holder is delayed unexpectedly (priority inversion, long critical section, bug), spinners burn CPU indefinitely.
Spin count limits bound the maximum spin time before falling back to a different strategy.
After spinning up to a limit, yield the CPU to other threads:
```c
// Spin with yield fallback

#define SPIN_LIMIT 1000

void spin_lock_with_yield(spinlock_t *lock) {
    int spin_count = 0;

    while (1) {
        while (atomic_load(&lock->locked) != 0) {
            if (spin_count < SPIN_LIMIT) {
                __builtin_ia32_pause();
                spin_count++;
            } else {
                // Spun enough - let other threads run
                sched_yield();
                spin_count = 0;  // Reset and try again
            }
        }

        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;
        }

        spin_count = 0;
    }
}

// sched_yield() moves this thread to the back of the run queue
// Other runnable threads get a chance to execute
// If the lock holder is on the same CPU, this helps it finish
```

For longer waits, blocking is more efficient than repeated yield cycles:
```c
// Adaptive spinlock: spin, then block

#define SPIN_THRESHOLD 1000

typedef struct {
    atomic_int locked;
    atomic_int waiters;          // Count of blocked waiters
    // Platform-specific: futex, mutex, semaphore, etc.
    blocking_primitive_t waiter;
} adaptive_spinlock_t;

void adaptive_lock(adaptive_spinlock_t *lock) {
    int spins = 0;

    // Phase 1: Optimistic spinning
    while (spins < SPIN_THRESHOLD) {
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                return;  // Got it without blocking!
            }
        }
        __builtin_ia32_pause();
        spins++;
    }

    // Phase 2: Spin exhausted - prepare to block
    atomic_fetch_add(&lock->waiters, 1);

    while (1) {
        // One more spin check before blocking
        if (atomic_load(&lock->locked) == 0) {
            if (atomic_exchange(&lock->locked, 1) == 0) {
                atomic_fetch_sub(&lock->waiters, 1);
                return;
            }
        }

        // Block until signaled
        block_wait(&lock->waiter);

        // Woken up - try again with brief spin
        for (int i = 0; i < 100; i++) {
            if (atomic_load(&lock->locked) == 0) {
                if (atomic_exchange(&lock->locked, 1) == 0) {
                    atomic_fetch_sub(&lock->waiters, 1);
                    return;
                }
            }
            __builtin_ia32_pause();
        }
    }
}

void adaptive_unlock(adaptive_spinlock_t *lock) {
    atomic_store(&lock->locked, 0);

    // If waiters exist, wake one
    if (atomic_load(&lock->waiters) > 0) {
        wake_one(&lock->waiter);
    }
}
```

How long to spin before giving up depends on the environment:

| System Type | Recommended Spin Limit | Rationale |
|---|---|---|
| Kernel (interrupt context) | Unlimited (must spin) | Cannot block; must complete |
| Kernel (process context) | ~1000-10000 iterations | Balance responsiveness and efficiency |
| Userspace high-perf | ~100-1000 iterations | Context switches are costlier |
| Userspace general | ~10-100 iterations | Yield to other apps quickly |
| Real-time | Based on deadline | Spin up to deadline, then fail |
| Power-sensitive | Very low (~10) | Minimize CPU active time |
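For the real-time row, "spin up to deadline, then fail" could be sketched like this (POSIX `clock_gettime`; the `spin_try_lock_until` name and the x86 PAUSE intrinsic are assumptions):

```c
// Deadline-bounded spinning: try until 'deadline_ns' (CLOCK_MONOTONIC),
// then report failure so the caller can degrade gracefully.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct { atomic_int locked; } spinlock_t;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

bool spin_try_lock_until(spinlock_t *lock, uint64_t deadline_ns) {
    while (now_ns() < deadline_ns) {
        // TTAS: only attempt the atomic write when the lock looks free
        if (atomic_load_explicit(&lock->locked, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0)
            return true;             // Acquired within the deadline
        __builtin_ia32_pause();      // Spin hint (x86; use YIELD on ARM)
    }
    return false;                    // Deadline passed without acquiring
}
```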
When a lock is released, all waiting threads become eligible to acquire it. In spin-based systems, they all try simultaneously. This thundering herd causes a burst of coherence traffic, a pile of failed atomic operations, and a latency spike that can disturb unrelated work on the machine:
```
# Thundering herd with 7 spinning threads when the lock is released

Time | CPU0 (holder) | CPU1-7 (spinners)
-----+---------------+------------------------------------------
T1   | unlock()      | All 7 see lock freed (cache invalidation)
T2   | -             | All 7 execute TAS simultaneously
T3   | -             | CPU1 wins (arbitrary), CPU2-7 get old value 1
T4   | -             | CPU2-7 all retry TAS
T5   | -             | Massive coherence traffic burst
T6   | -             | System stutters; unrelated work suffers

# The burst of 7 TAS operations + 6 failures + retries
# generates more traffic than the entire hold period
```

Several techniques reduce the stampede:

1. Randomized backoff (covered earlier): Threads wait random amounts before retrying, spreading out the retry storm.
2. Queued locks: Instead of all threads competing, threads form a queue. Only the front of the queue tries to acquire. Lock release hands off to the next in queue. No thundering herd because only one thread is ever 'active'.
3. Combining-based approaches: Under high contention, multiple threads combine their operations. One delegate executes for all, reducing coherence traffic.
```c
// Staggered wakeup to reduce thundering herd

// Instead of waking all waiters simultaneously,
// wake one at a time with delays

void smart_unlock(adaptive_spinlock_t *lock) {
    atomic_store(&lock->locked, 0);

    int waiters = atomic_load(&lock->waiters);
    if (waiters > 0) {
        // Wake just ONE waiter
        // That waiter will acquire and wake the next on release
        // Forms implicit serialization
        wake_one(&lock->waiter);
    }
}

// For spin-only locks, can't control wakeup order
// But can use ticket locks or MCS locks for inherent ordering
```

Ticket locks (covered in the next page) elegantly solve the thundering herd problem. Each thread takes a ticket number and spins until the 'now serving' counter reaches its number. Lock release increments the counter, allowing exactly one waiter to proceed. No stampede.
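As a preview of that mechanism, here is a minimal ticket-lock sketch on C11 atomics (the full treatment, including its drawbacks, is on the next page):

```c
// Minimal ticket lock: FIFO ordering, exactly one waiter proceeds per release.
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // Ticket dispenser
    atomic_uint now_serving;   // Ticket currently allowed to hold the lock
} ticket_lock_t;

void ticket_lock(ticket_lock_t *lock) {
    // Take a ticket; fetch_add gives each thread a unique number.
    unsigned my = atomic_fetch_add_explicit(&lock->next_ticket, 1,
                                            memory_order_relaxed);
    // Spin (read-only) until our number comes up.
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my)
        __builtin_ia32_pause();
}

void ticket_unlock(ticket_lock_t *lock) {
    // Hand off: after this store, exactly one waiter's ticket matches.
    unsigned next = atomic_load_explicit(&lock->now_serving,
                                         memory_order_relaxed) + 1;
    atomic_store_explicit(&lock->now_serving, next, memory_order_release);
}
```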
Optimal spin timing varies significantly across processor architectures and generations. What works on one platform may be suboptimal on another.
Intel processors have historically had longer PAUSE delays (~40 cycles on Skylake, ~140 cycles on Skylake-X). AMD processors often have shorter PAUSE delays (~10 cycles). This means an iteration count tuned on one vendor's parts can be several times too long or too short on another's, so backoff limits are better expressed in time than in raw loop counts (see the calibration sketch after the table):
| Processor | PAUSE Latency | Cache Line Transfer | Implications |
|---|---|---|---|
| Intel Skylake (client) | ~40 cycles | ~40 cycles | Well-matched to cache ops |
| Intel Skylake-X (HEDT) | ~140 cycles | ~50-100 cycles | Longer delay, adjust limits |
| Intel Ice Lake | ~10-15 cycles | ~40 cycles | Shorter PAUSE, more iterations |
| AMD Zen 3 | ~10 cycles | ~40 cycles | Aggressive PAUSE needed |
| ARM Cortex-A76 | YIELD varies | ~50 cycles | YIELD behavior differs |
| Apple M1/M2 | ~40 cycles equiv | ~30 cycles | Efficient micro-ops |
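Given that spread, one pragmatic option is to calibrate at startup instead of hard-coding iteration counts. A rough, x86-only sketch using the TSC follows; `calibrate_pause_cycles` is a hypothetical helper, the measurement is order-of-magnitude only, and the figures in the comment come from the table above.

```c
// Measure approximate PAUSE latency in TSC ticks at startup, then size
// spin/backoff limits in time rather than in raw iteration counts.
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_pause (GCC/Clang on x86)

static uint64_t calibrate_pause_cycles(void) {
    const int reps = 10000;
    uint64_t start = __rdtsc();
    for (int i = 0; i < reps; i++)
        _mm_pause();
    uint64_t end = __rdtsc();
    return (end - start) / reps;   // e.g. ~10 on Zen 3 vs ~140 on Skylake-X
}

// Convert a target backoff duration (in TSC ticks) into PAUSE iterations.
static int pause_iterations_for(uint64_t target_cycles, uint64_t pause_cycles) {
    if (pause_cycles == 0) pause_cycles = 1;
    return (int)(target_cycles / pause_cycles);
}
```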
On hyperthreaded (SMT) cores, two logical CPUs share the same physical execution resources. When one hyperthread spins aggressively, it consumes issue slots, load/store bandwidth, and other shared resources its sibling could otherwise use; PAUSE temporarily releases those resources to the sibling.
Optimization: Detect whether the lock holder is running on the sibling hyperthread. If so, back off harder: every shared resource the spinner gives up flows directly to the holder, helping it finish the critical section sooner.
```c
// SMT-aware spinning (conceptual)

// Get sibling CPU ID for SMT pairing
int get_sibling_cpu(int cpu);

void smt_aware_spin(spinlock_t *lock) {
    int my_cpu  = get_current_cpu();
    int sibling = get_sibling_cpu(my_cpu);

    while (atomic_load(&lock->locked) != 0) {
        int holder_cpu = atomic_load(&lock->holder_cpu);

        if (holder_cpu == sibling) {
            // Holder is our sibling hyperthread!
            // Pause in a longer burst - it can make progress
            // with the execution resources we release
            for (int i = 0; i < 10; i++) {
                __builtin_ia32_pause();
            }
        } else {
            // Holder is elsewhere - single PAUSE (yield to sibling)
            __builtin_ia32_pause();
        }
    }

    // Try acquire...
}
```

ARM processors have different memory models and spin primitives: YIELD as the spin-loop hint, plus WFE (wait for event) and SEV (send event) for low-power waiting on a signal.
Key difference: WFE/SEV provide more explicit signaling than x86's polling model.
```c
// ARM spin loop with WFE/SEV

void spin_lock_arm(atomic_int *lock) {
    while (1) {
        // Try to acquire
        int expected = 0;
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
            return;  // Got it
        }

        // Wait for event (low power spin)
        __asm__ volatile("wfe" ::: "memory");
    }
}

void spin_unlock_arm(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);

    // Send event to wake WFE waiters
    __asm__ volatile("sev" ::: "memory");
}
```

Use platform abstraction when possible. C11's <stdatomic.h>, C++'s <atomic>, Rust's std::sync::atomic, and similar provide portable primitives. The implementation handles platform-specific details. Manual assembly should be reserved for extreme performance requirements with known target platforms.
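For example, a portable TTAS spinlock can be written almost entirely against <stdatomic.h>, isolating the spin hint behind one macro; the `cpu_relax` name is a common convention, not a standard API.

```c
// Portable TTAS spinlock on C11 atomics; only cpu_relax() is platform-aware.
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#  define cpu_relax() __builtin_ia32_pause()
#elif defined(__aarch64__)
#  define cpu_relax() __asm__ volatile("yield" ::: "memory")
#else
#  define cpu_relax() ((void)0)   // Fallback: no hint, still correct
#endif

typedef struct { atomic_bool locked; } portable_spinlock_t;

void portable_lock(portable_spinlock_t *lock) {
    for (;;) {
        // Spin read-only until the lock looks free (TTAS).
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
            cpu_relax();
        // Single atomic write attempt; acquire ordering on success.
        if (!atomic_exchange_explicit(&lock->locked, true, memory_order_acquire))
            return;
    }
}

void portable_unlock(portable_spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}
```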
The details of spin timing profoundly impact spinlock performance. Let's consolidate the key principles:

- Always put a spin-hint instruction (PAUSE/YIELD) inside the loop; it is nearly free and sharply reduces coherence traffic and power.
- Spin on plain reads (TTAS) and attempt the atomic write only when the lock looks free.
- When acquisition attempts keep failing, back off, preferably exponentially and with random jitter.
- Bound the spin: after a platform-appropriate limit, fall back to sched_yield() or blocking.
- Tune for the platform: PAUSE latency, SMT siblings, and NUMA distance all change what a reasonable spin is, so prefer limits expressed in time over raw iteration counts.
What's next:
We've covered the fundamental spinlock and its timing considerations. The next page introduces ticket locks—a more sophisticated design that provides fairness guarantees and eliminates the thundering herd problem through deterministic ordering. Ticket locks represent the bridge between simple spinlocks and advanced queued lock designs.
You now understand the nuances of spin timing: why aggressive spinning fails, how PAUSE instruction helps, backoff strategies for contention adaptation, spin limits for safety, and platform-specific considerations. Next, we explore ticket locks for fair, ordered lock acquisition.