Imagine you're at a locked door, waiting for someone inside to unlock it. You have two fundamental choices: you could sit on a nearby bench, perhaps nap, and wait to be notified when the door opens—or you could stand at the door, continuously trying the handle every few moments until it finally turns. The second approach, in computing terms, is busy waiting—the foundational concept that underlies spinlocks and represents one of the most fundamental trade-offs in systems programming.
Busy waiting is beautifully simple in concept yet remarkably nuanced in practice. Understanding when it helps versus when it hurts requires deep insight into processor architecture, cache behavior, scheduling dynamics, and the nature of the workloads you're synchronizing. This page provides that foundation.
By the end of this page, you will understand: the precise definition of busy waiting and how it differs from blocking; CPU behavior during spin loops; the fundamental trade-offs between spinning and sleeping; historical context and evolution of busy waiting in operating systems; and the CPU-cache interactions that make busy waiting viable in certain scenarios.
Busy waiting, also known as spinning or polling, is a synchronization technique where a process or thread repeatedly checks a condition in a tight loop until that condition becomes true. Unlike blocking—where a thread yields the CPU and enters a sleep state—busy waiting keeps the thread actively executing instructions, consuming CPU cycles while waiting for a lock, resource, or event.
The canonical example is a simple spin loop:
```c
// Busy waiting: the thread continuously polls 'lock_held'
// until it becomes false, consuming CPU cycles the entire time
volatile int lock_held = 1; // Lock is currently held by another thread

void wait_for_lock(void) {
    // This loop runs continuously, executing thousands or millions
    // of iterations per second, each checking the condition
    while (lock_held == 1) {
        // Do nothing - just keep checking
        // The CPU is fully occupied during this loop
    }
    // Only reach here when some other thread sets lock_held = 0
    // Now we can proceed with the critical section
}
```

In this example, the thread executes the while loop continuously—potentially millions of times per second on modern processors—until another thread modifies lock_held. Each iteration performs a memory read, a comparison, and a conditional branch. The thread never yields, never sleeps, and never allows another thread to run on this CPU core.
The volatile keyword is critical here. Without it, the compiler might optimize away the memory read, caching the value in a register and creating an infinite loop. The volatile qualifier tells the compiler 'this variable might change at any time; reload it from memory on every access.'
A busy-waiting thread consumes 100% of its CPU core's capacity. On a battery-powered device, this drains power rapidly. On a shared server, it steals time from other workloads. The CPU is doing real work—executing instructions, accessing cache, branching—but producing no useful output. This is pure overhead, which is why busy waiting is appropriate only in carefully considered scenarios.
Blocking takes the opposite approach. When a thread blocks, it voluntarily surrenders the CPU, its state is saved, and the scheduler picks another thread to run. The blocked thread is placed on a wait queue and only wakes when explicitly signaled.
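For contrast, here is a minimal sketch of the blocking approach using POSIX threads; the flag, mutex, and function names are illustrative, not from any established API. The waiter is parked on the condition variable's wait queue and consumes essentially no CPU until signaled.

```c
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

// The waiting side: pthread_cond_wait atomically releases the
// mutex and suspends the thread on the condition's wait queue
void wait_blocking(void) {
    pthread_mutex_lock(&lock);
    while (!ready) {
        pthread_cond_wait(&cond, &lock); // ~0% CPU while waiting
    }
    pthread_mutex_unlock(&lock);
}

// The signaling side: wakes one blocked waiter via the scheduler
void signal_ready(void) {
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}
```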
The fundamental trade-off:
| Characteristic | Busy Waiting (Spinning) | Blocking (Sleeping) |
|---|---|---|
| CPU usage while waiting | 100% - continuously executing | 0% - thread suspended |
| Response latency | Near-zero (next loop iteration) | Context switch time (~1-50μs) |
| Power consumption | Maximum (all CPU resources active) | Minimal (core may sleep) |
| Scheduler involvement | None while spinning | Active - must manage queues |
| Memory access pattern | Continuous polling of memory location | None until wakeup |
| Best use case | Very short waits (<context switch time) | Longer waits, contended resources |
To truly understand busy waiting, we must examine what happens at the processor level. Modern CPUs are extraordinarily sophisticated, and their behavior during spin loops reveals important insights about when spinning is efficient versus wasteful.
Consider this simple spin loop:
```c
while (lock != 0) { }
```
The compiler translates this into something like the following x86-64 assembly:
```asm
; Assuming 'lock' is stored at memory address [rdi]
spin_loop:
    mov  eax, [rdi]    ; Load the lock value from memory into eax
    test eax, eax      ; Set flags based on eax (check if zero)
    jnz  spin_loop     ; If not zero, jump back to spin_loop
                       ; If zero, fall through - lock is free

; This 3-instruction sequence executes billions of times per second.
; Each iteration:
; - Memory read (may hit L1 cache)
; - ALU operation (test)
; - Branch prediction (usually correctly predicts 'taken')
```

The memory access pattern during busy waiting is crucial for performance. When a thread spins on a memory location:
First access: The cache line containing the lock is fetched from main memory or another CPU's cache. On modern systems, this cache line is typically 64 bytes.
Subsequent accesses (while lock is held): If no one modifies the lock, the spinning thread reads from its local L1 cache—extremely fast, typically 1-4 cycles.
When the lock is released: Another CPU modifies the lock. Cache coherence protocols (MESI, MOESI, or MESIF) invalidate the spinning CPU's cache line. The next read triggers a cache miss, fetching the updated value.
This is why busy waiting can be efficient for short waits: If the lock is released quickly, the spinning thread was just reading from L1 cache—essentially free in performance terms. The moment the lock becomes available, the next cache miss delivers the updated value.
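One practical consequence of that 64-byte granularity: a lock should sit on its own cache line, so spinning on it does not drag unrelated data into the coherence traffic. A minimal sketch, assuming C11 alignas (the struct name is illustrative):

```c
#include <stdalign.h>

// Each spin_flag occupies a full 64-byte cache line, so two flags
// placed side by side in an array never share a line (no false
// sharing between threads spinning on different flags)
struct spin_flag {
    alignas(64) volatile int held;
};
```

Because the struct inherits the member's 64-byte alignment, its size rounds up to a full 64 bytes, guaranteeing the separation.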
Modern x86 processors provide the PAUSE instruction specifically for spin loops. It hints to the CPU that code is in a spin-wait loop, allowing the processor to reduce power consumption, avoid memory-order violations that could cause pipeline flushes, and prevent monopolizing shared resources. Well-written spinlocks always include PAUSE instructions.
```asm
; Optimized spin loop with PAUSE hint
spin_loop:
    mov  eax, [rdi]    ; Load lock value
    test eax, eax      ; Check if zero
    jz   acquired      ; If zero, we got the lock
    pause              ; Hint: we're in a spin-wait loop
    jmp  spin_loop     ; Try again

acquired:
    ; Proceed with critical section

; The PAUSE instruction:
; - Reduces power consumption during spinning
; - Prevents pipeline stalls on memory order violations
; - Adds a small delay, reducing bus contention
; - Takes about 40 cycles on modern Intel CPUs (vs ~10 without it)
```

Busy waiting requires careful attention to memory ordering. Modern CPUs and compilers reorder operations aggressively for performance. Without proper barriers, a thread might read a stale value of the flag, or memory accesses intended for the critical section might be reordered ahead of the lock check itself.
This is why practical spinlock implementations use atomic operations with appropriate memory ordering constraints—topics we'll explore deeply in subsequent pages.
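As a brief preview, here is a hedged sketch of what those constraints look like with C11 atomics: an acquire on the successful lock operation and a release on unlock, so critical-section accesses cannot leak outside the lock (the function names are illustrative):

```c
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void sketch_lock(void) {
    // test-and-set returns the previous value: spin while it was set.
    // memory_order_acquire prevents critical-section reads/writes
    // from being reordered before the acquisition.
    while (atomic_flag_test_and_set_explicit(&lock_flag,
                                             memory_order_acquire)) {
        // spin (a production loop would add a PAUSE hint here)
    }
}

void sketch_unlock(void) {
    // memory_order_release prevents critical-section reads/writes
    // from being reordered after the release
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```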
Busy waiting and blocking aren't the only options—they're endpoints on a spectrum. Understanding this spectrum helps you choose the right approach for different scenarios.
At one extreme, a thread spins continuously until the condition is satisfied. This provides the lowest possible response latency but wastes the most CPU resources.
```c
// Pure busy waiting - maximum responsiveness, maximum waste
void pure_spin(volatile int *flag) {
    while (*flag == 0) {
        __asm__ volatile("pause"); // CPU hint only
    }
}
```

A middle-ground approach: spin for a limited number of iterations, then yield the CPU to other threads if the lock hasn't become available. This balances responsiveness with fairness.
```c
// Spin for a while, then yield to other threads
#include <sched.h> // for sched_yield()

#define SPIN_LIMIT 1000

void spin_then_yield(volatile int *flag) {
    int spins = 0;
    while (*flag == 0) {
        if (spins < SPIN_LIMIT) {
            __asm__ volatile("pause");
            spins++;
        } else {
            // We've spun enough - let other threads run
            sched_yield(); // or pthread_yield()
            spins = 0;     // Reset and try spinning again
        }
    }
}
```

The most sophisticated approach: spin briefly for fast lock acquisitions, but block if the lock isn't available quickly. This is the strategy used by most high-performance lock implementations, including Linux's mutex.
```c
// Adaptive: spin briefly, then block
#define SPIN_THRESHOLD 100

void spin_then_block(lock_t *lock) {
    int spins = 0;

    // Phase 1: Spin on the lock value (check without modifying)
    while (lock->owner != NULL) {
        if (spins < SPIN_THRESHOLD) {
            __asm__ volatile("pause");
            spins++;
        } else {
            // Spinning didn't work - block on the wait queue
            // This involves kernel intervention
            add_to_wait_queue(lock->waiters, current_thread);

            // Set flag indicating there are waiters
            atomic_or(&lock->state, WAITERS_FLAG);

            // Go to sleep until signaled
            schedule(); // Yields CPU, thread is blocked

            // When we wake up, try again
            spins = 0;
        }
    }

    // Now try to actually acquire the lock with atomic operation
    // (omitted for clarity)
}
```

Busy waiting isn't just a synchronization technique—it appears naturally in many computing scenarios. Recognizing these patterns helps you understand when spinning is inherent to the problem versus when it's a design choice.
Before interrupts became the dominant I/O model, CPUs polled device status registers continuously. Even today, some scenarios favor polling:
```c
// Polling a device status register
// Common in embedded systems and high-performance I/O
#include <stdint.h>

#define STATUS_REG 0xFFFF0000 // Memory-mapped device status
#define READY_BIT  0x01

void wait_device_ready(void) {
    volatile uint32_t *status = (volatile uint32_t *)STATUS_REG;

    // Busy-wait until device signals ready
    while ((*status & READY_BIT) == 0) {
        // Device not ready - keep polling
    }

    // Device is now ready for next operation
}

// Why poll instead of using interrupts?
// - Latency: Polling can detect status changes faster than
//   interrupt delivery and handler execution
// - Predictability: No interrupt jitter - important for
//   real-time systems
// - Simplicity: No interrupt handler registration, priority
//   management, or reentrancy concerns
```

Operating system kernels frequently use busy waiting for short-duration locks because:
Interrupt handlers cannot block: When handling an interrupt, the kernel must not invoke the scheduler. Spinlocks are the only option.
Scheduler protection: The scheduler data structures themselves need locking. If scheduling requires the scheduler lock, you cannot block while waiting for it—circular dependency.
Real-time guarantees: Some kernel paths guarantee maximum latency. Context switches have variable latency; spinning has bounded latency (assuming the lock is released promptly).
```c
// Linux kernel-style spinlock usage in interrupt context
// (simplified for illustration)

spinlock_t device_lock;

// Interrupt handler - CANNOT call schedule()
irqreturn_t device_interrupt_handler(int irq, void *dev) {
    spin_lock(&device_lock); // Busy-wait if lock is held

    // Handle the interrupt
    // Access shared device state safely
    process_device_data();

    spin_unlock(&device_lock);
    return IRQ_HANDLED;
}

// In process context - could use mutex, but spinlock is fine for
// short critical sections. Because the same lock is taken in the
// interrupt handler, process context must disable local interrupts
// while holding it; otherwise the handler could fire on this CPU
// and spin forever on a lock this CPU already holds.
void device_ioctl(struct device *dev, int cmd) {
    unsigned long flags;

    spin_lock_irqsave(&device_lock, flags); // Same lock as handler

    // Modify device configuration
    configure_device(cmd);

    spin_unlock_irqrestore(&device_lock, flags);
}
```

Lock-free algorithms often contain busy-wait loops, though they're different from traditional spinlocks. The thread isn't waiting for a lock—it's retrying an operation that failed due to concurrent interference:
```c
// Lock-free counter increment with CAS retry loop
// This is a form of busy waiting - we spin until we succeed

#include <stdatomic.h>

void atomic_inc(atomic_int *counter) {
    int old_value, new_value;
    do {
        old_value = atomic_load(counter);
        new_value = old_value + 1;
        // Retry loop: if someone else modified the counter,
        // compare_exchange_weak fails and we try again
        // This is busy waiting - we spin until we succeed
    } while (!atomic_compare_exchange_weak(counter, &old_value,
                                           new_value));
}

// Unlike traditional spinlock waiting:
// - We're not waiting FOR something (a lock holder to release)
// - We're retrying UNTIL something (our operation succeeds)
// - Other threads making progress doesn't block us; it just
//   means we retry with updated values
```

Once you recognize busy waiting, you'll see it everywhere: in network drivers polling for packet arrival, in database engines checking commit status, in game engines waiting for frame synchronization, in real-time audio processing, and in countless kernel subsystems. It's not a flaw—it's a fundamental technique with specific appropriate use cases.
The decision between busy waiting and blocking comes down to a fundamental inequality. Let's derive when each approach wins.
Let:

- T_wait = the expected time until the condition becomes true
- T_switch = the cost of one context switch (blocking pays for a switch out and a switch back in)
- C_spin = the CPU cost per unit time while spinning (the core is fully busy)
- C_idle = the CPU cost per unit time while blocked (≈ 0)
Busy waiting total cost: C_spin × T_wait
Blocking total cost: 2 × T_switch + C_idle × T_wait ≈ 2 × T_switch
Busy waiting wins when: C_spin × T_wait < 2 × T_switch
Rearranging: T_wait < (2 × T_switch) / C_spin
On modern systems (circa 2024):
| Parameter | Typical Value | Notes |
|---|---|---|
| Context switch | 1-10 μs | Depends on OS, cache state |
| L1 cache read | 1-4 cycles (~1 ns) | Where most spin reads hit |
| Spin iteration | 10-50 cycles (~10-50 ns) | Including PAUSE |
If a context switch takes 5 μs and each spin iteration takes 50 ns, busy waiting is worthwhile for waits up to:
2 × 5 μs = 10 μs (assuming we're spinning on L1-cached data)
With 50 ns per spin, that's about 200 spin iterations before blocking becomes more efficient.
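That break-even arithmetic can be folded directly into the spin limit used by a spin-then-yield loop. A small sketch, assuming the 5 μs and 50 ns figures above (the macro names are illustrative):

```c
// Assumed costs from the worked example above
#define CTX_SWITCH_NS 5000 // One context switch: ~5 us
#define SPIN_ITER_NS    50 // One spin iteration (with PAUSE): ~50 ns

// Spin only while the expected wait is under two context switches:
// (2 * 5000) / 50 = 200 iterations before blocking becomes cheaper
#define SPIN_LIMIT ((2 * CTX_SWITCH_NS) / SPIN_ITER_NS)
```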
| Wait Duration | Spinning Cost | Blocking Cost | Winner |
|---|---|---|---|
| 100 ns (100 cycles) | 100 cycles | 2 context switches (~10,000+ cycles) | Spinning by 100× |
| 1 μs (1,000 cycles) | 1,000 cycles | ~10,000 cycles | Spinning by 10× |
| 10 μs | 10,000 cycles | ~10,000 cycles | Break-even point |
| 100 μs | 100,000 cycles | ~10,000 cycles | Blocking by 10× |
| 1 ms | 1,000,000 cycles | ~10,000 cycles | Blocking by 100× |
The best lock implementations don't force you to choose. They measure actual wait times and adapt: spin briefly for expected-short waits, then block if the wait extends beyond the break-even point. The Linux kernel's mutex and many userspace libraries like parking_lot in Rust implement this adaptive strategy.
The simple equation above ignores several real-world factors that influence the decision:
CPU Utilization: On a fully loaded system, spinning steals cycles from productive work. The opportunity cost matters.
Power Consumption: On laptops and mobile devices, spinning prevents the CPU from entering low-power states. A thread blocked on a mutex consumes nearly zero power.
Fairness: Spinning threads compete for cache lines and memory bandwidth. Heavy spinning can starve other threads of resources even if they're on different CPUs.
Lock Hold Time Predictability: If critical sections have highly variable duration, adaptive approaches that measure and adjust outperform fixed policies.
NUMA Effects: On NUMA systems, spinning on a memory location owned by a remote node is expensive. Blocking might actually be faster due to cache migration costs.
Understanding the history of busy waiting illuminates why it remains relevant and how its role has evolved with hardware advances.
On single-processor systems, busy waiting was generally wasteful. If thread A is spinning while waiting for thread B to release a lock, thread B cannot make progress because thread A holds the CPU. Pure busy waiting on a uniprocessor is almost always incorrect.
The solution was disabling interrupts (preventing scheduler preemption) for short critical sections, or using blocking synchronization for longer ones. Spinlocks made little sense without multiple processors.
Consider: Thread A acquires lock, gets preempted. Thread B runs, tries to acquire same lock, spins. Thread B has the only CPU. Thread A can never run to release the lock. Thread B spins forever. This is why uniprocessor spinlocks always disable preemption.
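This is why, on uniprocessor builds, "taking a spinlock" effectively degenerates to disabling preemption. A minimal kernel-style sketch in the spirit of Linux's preempt_disable()/preempt_enable() (the wrapper names are illustrative, and this assumes a kernel context where those primitives exist):

```c
// On a single CPU there is no other core to spin against, so the
// lock reduces to: forbid the scheduler from preempting this thread
// until the critical section completes
void up_spin_lock(void) {
    preempt_disable(); // No context switch can occur on this CPU
}

void up_spin_unlock(void) {
    preempt_enable(); // Preemption allowed again
}
```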
With symmetric multiprocessing (SMP), busy waiting became viable—indeed, essential: the lock holder now runs on a different CPU and can release the lock while the waiter spins, and for short critical sections a brief spin is cheaper than a context switch.
Operating systems developed sophisticated spinlock implementations during this era. Linux's spinlock evolved from a simple test-and-set loop to queued spinlocks with fairness guarantees.
With 64, 128, or more cores, the picture grows more complex: dozens of cores spinning on the same cache line generate storms of coherence traffic, simple spinlocks provide no fairness guarantees, and NUMA distances make some waiters systematically slower than others.
Modern implementations like MCS locks, CLH locks, and Linux's qspinlock address these challenges, as we'll explore in the ticket locks page.
Processor architects have co-evolved with software synchronization needs:
| Year | Hardware Development | Impact on Spinning |
|---|---|---|
| 1980s | Test-and-Set instruction | Enabled basic spinlocks |
| 1990s | Compare-and-Swap (CAS) | More sophisticated lock algorithms |
| 2000s | PAUSE instruction | Efficient spin loops on hyperthreading |
| 2010s | Transactional memory (TSX) | Lock elision, optimistic execution |
| 2020s | ARM LSE, RISC-V atomics | Better atomic operations for spinning |
Each hardware advance enabled new synchronization patterns and more efficient busy waiting.
Busy waiting is easy to get wrong. Even experienced systems programmers fall into these traps.
Forgetting volatile on the polled variable allows the compiler to cache it in a register. The loop becomes infinite, optimized into while(true). Always use volatile or proper atomic types.

```c
// BUG: Missing volatile leads to infinite loop

int lock_held = 1; // Should be: volatile int lock_held = 1;

void broken_wait(void) {
    while (lock_held == 1) {
        // Without volatile, the compiler may transform this into:
        //     int cached = lock_held;
        //     while (cached == 1) { }
        //
        // The register 'cached' is never updated, so even when
        // another thread sets lock_held = 0, we never see it.
    }
}

// Correct version
volatile int lock_held_correct = 1; // volatile forces memory read

void correct_wait(void) {
    while (lock_held_correct == 1) {
        // Each iteration re-reads from memory
    }
}
```

In modern code, prefer <stdatomic.h> (C11) or <atomic> (C++11) over volatile. Atomic types provide both visibility guarantees and proper ordering semantics, whereas volatile only prevents optimization but doesn't guarantee any ordering between threads.
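Here is a sketch of the same wait loop in the C11 form, a reasonable default in new code (the variable and function names are illustrative):

```c
#include <stdatomic.h>

atomic_int lock_state = 1; // 1 = held, 0 = free

void modern_wait(void) {
    // atomic_load_explicit guarantees a fresh read on every
    // iteration, and acquire ordering ensures critical-section
    // accesses cannot be hoisted above the load that observes 0
    while (atomic_load_explicit(&lock_state,
                                memory_order_acquire) == 1) {
        // spin
    }
}
```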
We've established the foundational concept of busy waiting—the engine that powers spinlocks and many other synchronization mechanisms. Let's consolidate what we've learned:

- Busy waiting (spinning) keeps a thread actively polling a condition, while blocking suspends it until explicitly signaled.
- During a spin loop, reads usually hit the local L1 cache; cache coherence delivers the updated value the moment the lock is released.
- The break-even rule: spinning wins only when the expected wait is shorter than roughly two context switches.
- Practical systems occupy a spectrum: pure spinning, spin-then-yield, and adaptive spin-then-block.
- Busy waiting appears naturally in device polling, kernel interrupt paths, and lock-free retry loops.
- The classic pitfall is omitting volatile (or atomics), which lets the compiler turn the poll into an infinite loop.
What's next:
Now that we understand busy waiting conceptually, the next page examines spinlock implementation in detail. We'll see how busy waiting combines with atomic operations to create practical mutual exclusion, examine the test-and-set and compare-and-swap approaches, and understand why naive implementations fail under contention.
The journey from concept to robust implementation reveals the deep interplay between hardware capabilities and software design—a hallmark of systems programming.
You now understand busy waiting at a fundamental level: what it is, how CPUs execute spin loops, the trade-offs involved, where it naturally appears, and the pitfalls to avoid. This foundation prepares you for understanding spinlock implementations in the next page.