The previous page established that busy waiting is the deliberate consumption of CPU cycles while waiting for a condition to become true. Now we face the practical question: How do we implement mutual exclusion using busy waiting?
A naive attempt might look promising but fails catastrophically under concurrent execution. The subtle bugs in spinlock implementation have consumed countless engineering hours and led to security vulnerabilities, data corruption, and system crashes. Understanding implementation deeply isn't academic—it's essential for writing correct concurrent code.
This page takes you from broken implementations to production-quality spinlocks, explaining why each refinement is necessary.
By the end of this page, you will understand: why naive lock implementations fail; how atomic test-and-set operations solve the fundamental problem; the test-and-test-and-set optimization and why it matters; memory ordering requirements for correct spinlocks; and the anatomy of production spinlock implementations from Linux and other systems.
Let's start with the most intuitive approach. A lock is just a boolean flag, right? If the flag is false, the lock is free; if true, someone holds it.
The broken implementation:
```c
// WARNING: This implementation is BROKEN - do not use!

typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} broken_spinlock_t;

void broken_lock(broken_spinlock_t *lock) {
    // Step 1: Wait until lock is free
    while (lock->locked == 1) {
        // Busy wait
    }
    // Step 2: Claim the lock
    lock->locked = 1;
    // BUG: Steps 1 and 2 are NOT atomic!
    // Another thread can observe locked==0 between
    // our check and our write.
}

void broken_unlock(broken_spinlock_t *lock) {
    lock->locked = 0;
}
```

Why does this fail?
The fundamental problem is the check-then-act race condition. Between the moment we observe locked == 0 and the moment we set locked = 1, another thread might do exactly the same thing. Consider this interleaving:
| Time | Thread A | Thread B | lock->locked |
|---|---|---|---|
| T1 | Read locked → sees 0 | | 0 |
| T2 | (about to write 1) | Read locked → sees 0 | 0 |
| T3 | Write locked = 1 | (about to write 1) | 1 |
| T4 | Enter critical section | Write locked = 1 | 1 |
| T5 | (in critical section) | Enter critical section | 1 |
| T6 | DISASTER: Both threads in critical section! | | 1 |
Both threads observed the lock as free, both claimed it, and both entered the critical section. Mutual exclusion is violated. Shared data structures are corrupted. The system exhibits seemingly random bugs that appear only under load.
The insight: We need an operation that atomically tests the current value AND sets a new value—as a single, indivisible action that no other CPU can interrupt or interleave with.
This is a classic Time-Of-Check to Time-Of-Use (TOCTOU) vulnerability. We CHECK the lock state, but by the time we USE that information (by setting the lock), the state has changed. The only solution is to make check and use happen atomically.
Test-and-Set (TAS) is a hardware-provided atomic operation that solves the check-then-act problem. It atomically:

- reads the current value at a memory location,
- writes 1 (the "locked" value) to that location, and
- returns the value it read.
All three steps happen as one indivisible operation. No other CPU can access that memory location during the TAS execution. Hardware ensures this through cache locking, bus locking, or memory controller arbitration.
```c
// Conceptual semantics of test-and-set (hardware provides this atomically)
// This is what TAS does, but as a single atomic operation
int atomic_test_and_set(volatile int *location) {
    // BEGIN ATOMIC
    int old_value = *location;   // Read current value
    *location = 1;               // Set to 1 (or any "locked" value)
    return old_value;            // Return what we read
    // END ATOMIC
    // Hardware guarantees no other core can access *location
    // between the read and write
}

// Hardware implementations:
// x86/x64:  XCHG instruction (exchange) or LOCK CMPXCHG
// ARM:      LDREX/STREX pair (load-linked/store-conditional)
// RISC-V:   AMOSWAP (atomic memory operation swap)
// PowerPC:  LWARX/STWCX (load-word-and-reserve/store-conditional)
```

With TAS, we can build a correct spinlock:
```c
// Correct spinlock using test-and-set

typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} spinlock_t;

// GCC intrinsic for atomic exchange
// __sync_lock_test_and_set atomically: reads *ptr, writes val, returns old value
#define atomic_exchange(ptr, val) __sync_lock_test_and_set(ptr, val)

void spin_lock(spinlock_t *lock) {
    // Try to acquire: atomically set locked=1 and check old value
    while (atomic_exchange(&lock->locked, 1) == 1) {
        // If old value was 1, lock was already held
        // Keep spinning - try again
        // Optional: Add pause for efficiency
        __asm__ volatile("pause");
    }
    // If old value was 0, we changed it from 0→1
    // We now hold the lock!
}

void spin_unlock(spinlock_t *lock) {
    // Release the lock by setting to 0
    // Memory barrier ensures all critical section writes
    // are visible before the unlock
    __sync_lock_release(&lock->locked);
}
```

Why does this work?
The key insight is that atomic_exchange returns the old value. If the old value was 0, we successfully acquired the lock (we were the one who changed it from 0 to 1). If the old value was 1, someone else already held it, and our exchange from 1 to 1 had no effect—we just need to try again.
Only one thread can ever get the 0.
If two threads call atomic_exchange simultaneously, hardware arbitrates. One thread's exchange completes first (reading 0, writing 1). The other thread's exchange then reads 1 (the value just written by the first thread). The serialization is guaranteed by hardware.
Atomic operations provide a fundamental contract: multiple CPUs attempting the same atomic operation on the same memory location are serialized. One completes entirely before another begins. This is the foundation of all lock-free and lock-based synchronization.
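To make that contract concrete, here is a minimal, hypothetical usage sketch: two threads increment a shared counter under the spin_lock/spin_unlock functions defined above (assumed to be compiled in the same file). The thread and iteration counts are arbitrary; without the lock, the final total would be unpredictable.

```c
// Hypothetical usage sketch of the TAS spinlock defined above.
// Assumes spinlock_t, spin_lock, and spin_unlock from the previous listing.
#include <pthread.h>
#include <stdio.h>

static spinlock_t counter_lock = { 0 };   // 0 = free
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock(&counter_lock);
        counter++;                       // Critical section: one thread at a time
        spin_unlock(&counter_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```

With the lock in place, every increment happens entirely inside the critical section, so the two threads' updates can never interleave and the final count is exact.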
The simple TAS spinlock works correctly but performs poorly under contention. The problem is cache coherence traffic.
When multiple CPUs spin using TAS on the same lock:

- Every TAS is an atomic read-modify-write, so the issuing CPU must pull the lock's cache line into exclusive state, invalidating every other CPU's copy.
- Each other spinner then misses on its next TAS and does the same thing, so the cache line ping-pongs continuously between cores.
- Even the lock holder has to fight for its own lock's cache line when it finally tries to release it.
On a 16-core system with 15 threads spinning on TAS, you get continuous cache line ping-pong. Performance collapses; even unrelated threads on the same cores suffer cache evictions.
```text
# Cache coherence traffic with pure TAS spinlock

Timeline when lock is held by CPU 0:

CPU1: TAS (get exclusive, invalidate all) → fails, retry
CPU2: TAS (get exclusive, invalidate all) → fails, retry
CPU3: TAS (get exclusive, invalidate all) → fails, retry
CPU1: TAS (get exclusive, invalidate all) → fails, retry
...

Each TAS triggers:
- Invalidation messages to all other CPUs
- Cache miss on next TAS by any other CPU
- Memory controller arbitration

With N spinning CPUs: O(N²) coherence messages per lock hold period!
```

Test-and-Test-and-Set (TTAS) reduces coherence traffic by spinning on a local cache read first, and only attempting the expensive atomic operation when the lock appears free:
```c
// Test-and-Test-and-Set spinlock

typedef struct {
    volatile int locked;
} spinlock_t;

void spin_lock_ttas(spinlock_t *lock) {
    while (1) {
        // OUTER LOOP: Test - spin on cached read (cheap)
        while (lock->locked == 1) {
            // This is a normal memory read - hits L1 cache
            // No cache invalidations while lock is unchanged
            // CPU does minimal work, bus is quiet
            __asm__ volatile("pause");
        }

        // Lock appears free! Try to acquire with TAS
        // INNER: Test-and-set (expensive but only when likely to succeed)
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success! We got the lock
            return;
        }
        // Failed - another CPU beat us
        // Go back to spinning on read
    }
}

void spin_unlock(spinlock_t *lock) {
    __sync_lock_release(&lock->locked);
}
```
While the lock is held: every waiter spins in the outer read loop on a shared copy of the lock's cache line in its local cache, so no invalidations are generated and the interconnect stays quiet.

When the lock is released: the releasing store invalidates all cached copies; every waiter misses, sees the lock free, and attempts the atomic exchange. Exactly one wins; the rest observe a 1 and return to read-spinning.
The improvement:
| Metric | Pure TAS | TTAS | Improvement |
|---|---|---|---|
| Cache traffic while locked | O(N²) continuous | O(1) (shared read) | Massively reduced |
| Bus bandwidth consumed | Maximum | Minimal while held | Near-zero during spin |
| Impact on other workloads | Severe (bus saturation) | Minimal | System-wide benefit |
| Power consumption | Maximum (cache invalidations) | Lower (local reads) | Notable power savings |
| Fairness | None (random) | None (random) | Same - needs further work |
TTAS has a problem at release time: ALL waiting threads see the lock become free simultaneously and ALL attempt TAS. This 'thundering herd' causes a burst of coherence traffic. Exponential backoff and queue-based locks (like ticket locks and MCS locks) address this issue.
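One common mitigation is TTAS with bounded exponential backoff: after each failed acquisition attempt, a thread waits an increasing number of pause iterations before retrying, which spreads the herd out in time. The sketch below reuses the spinlock_t layout and atomic_exchange macro from the earlier listings; the backoff constants are illustrative, not tuned values.

```c
// Sketch: TTAS spinlock with bounded exponential backoff (illustrative constants).
// spinlock_t and atomic_exchange match the earlier TAS/TTAS listings.
typedef struct {
    volatile int locked;
} spinlock_t;

#define atomic_exchange(ptr, val) __sync_lock_test_and_set(ptr, val)

#define BACKOFF_MIN 4
#define BACKOFF_MAX 1024

static inline void cpu_pause(void) {
    __asm__ volatile("pause");
}

void spin_lock_backoff(spinlock_t *lock) {
    unsigned int backoff = BACKOFF_MIN;

    while (1) {
        // Test: spin on a plain cached read while the lock is held
        while (lock->locked == 1) {
            cpu_pause();
        }

        // Lock looks free - attempt the atomic exchange
        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;   // Acquired
        }

        // Lost the race - back off before rejoining the herd
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_pause();
        }
        if (backoff < BACKOFF_MAX) {
            backoff *= 2;   // Exponential growth, capped at BACKOFF_MAX
        }
    }
}
```

Backoff trades a little latency for much less coherence traffic; queue-based locks such as ticket and MCS locks go further by also guaranteeing FIFO fairness.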
Compare-and-Swap (CAS) is a more flexible atomic operation that enables sophisticated lock algorithms. While TAS unconditionally writes a new value, CAS writes only if the current value matches an expected value.
CAS semantics:

```text
bool CAS(location, expected, desired):
    ATOMIC:
        if (*location == expected):
            *location = desired
            return true
        else:
            return false
```
This conditional write enables lock implementations that can't be built with TAS alone, particularly those with multiple states or owner tracking.
```c
// CAS-based spinlock implementation
#include <stdatomic.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock_cas(spinlock_t *lock) {
    int expected = 0;
    // Loop until we successfully change 0 → 1
    while (!atomic_compare_exchange_weak(&lock->locked, &expected, 1)) {
        // CAS failed - 'expected' now contains actual value
        // Reset expected for next attempt
        expected = 0;
        // Spin on read until it looks free (TTAS optimization)
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __asm__ volatile("pause");
        }
    }
    // Successfully changed 0 → 1, we hold the lock
}

void spin_unlock_cas(spinlock_t *lock) {
    atomic_store(&lock->locked, 0);
}
```

CAS enables locks that track which thread holds them—useful for debugging, recursive locking, and priority inheritance:
```c
// Spinlock with owner tracking using CAS
#include <stdatomic.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    atomic_uintptr_t owner;   // 0 = free, or thread ID
} tracked_spinlock_t;

#define NO_OWNER 0

void tracked_lock(tracked_spinlock_t *lock) {
    uintptr_t me = (uintptr_t)pthread_self();
    uintptr_t expected = NO_OWNER;

    while (!atomic_compare_exchange_weak(&lock->owner, &expected, me)) {
        // Failed to acquire - reset expected and retry
        expected = NO_OWNER;
        // TTAS spin
        while (atomic_load_explicit(&lock->owner, memory_order_relaxed) != NO_OWNER) {
            __asm__ volatile("pause");
        }
    }
    // We set owner to our thread ID - lock acquired
}

void tracked_unlock(tracked_spinlock_t *lock) {
    uintptr_t me = (uintptr_t)pthread_self();
    uintptr_t expected = me;

    // Only release if we're the owner (sanity check)
    if (!atomic_compare_exchange_strong(&lock->owner, &expected, NO_OWNER)) {
        // BUG: We don't own this lock!
        abort();
    }
}

bool tracked_is_locked_by_me(tracked_spinlock_t *lock) {
    return atomic_load(&lock->owner) == (uintptr_t)pthread_self();
}
```
CAS advantages: the conditional write supports trylock, owner tracking, recursive locks, and multi-state designs, because a thread takes ownership only when the lock is in exactly the state it expects.

TAS advantages: simpler semantics and a single unconditional exchange (XCHG on x86), which is all a basic spinlock needs and can be marginally cheaper.
In practice: Modern implementations often use CAS because its flexibility outweighs the tiny performance difference, and it enables advanced features like trylock, recursive locks, and debugging support.
The 'weak' variant may spuriously fail, returning false even when the stored value equals the expected value; this is an artifact of LL/SC implementations on some architectures (ARM, RISC-V). Weak is appropriate in loops where we'll retry anyway. Strong only reports failure when the values genuinely differ, but may be slightly slower. Use weak in spin loops, strong for single-try operations.
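As an example of where the strong variant fits, here is a hypothetical spin_trylock sketch for the C11 spinlock shown above: it makes exactly one acquisition attempt and returns immediately, so a spurious failure would be unacceptable and compare_exchange_strong is the right choice. The spinlock_t layout matches the CAS-based listing.

```c
// Sketch: single-attempt trylock for the C11 atomic spinlock shown above.
// Uses the strong CAS because there is no retry loop to absorb spurious failures.
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;   // Same layout as the CAS-based spinlock above
} spinlock_t;

bool spin_trylock_cas(spinlock_t *lock) {
    int expected = 0;
    // One attempt: succeed only if the lock is currently 0 (free).
    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1,
        memory_order_acquire,     // Acquire ordering on success
        memory_order_relaxed);    // No ordering needed on failure
}
```

A caller can use this to grab the lock opportunistically and fall back to other work, or to blocking, when the attempt fails.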
Correct spinlock implementation requires more than just atomic operations—it requires proper memory ordering. Without barriers, the compiler and CPU can reorder operations in ways that violate mutual exclusion.
Consider what could go wrong without proper ordering:
```c
// Potential ordering violation (conceptual)

shared_data = 42;        // (1) Write shared data
spin_unlock(&lock);      // (2) Release lock

// Without proper barriers, CPU might reorder to:
spin_unlock(&lock);      // (2) Release lock FIRST
shared_data = 42;        // (1) Write shared data AFTER
// Another thread acquiring the lock could see stale data!

// Similarly for acquire:
spin_lock(&lock);        // (1) Acquire lock
use(shared_data);        // (2) Read shared data

// Without barriers, CPU might reorder to:
temp = shared_data;      // (2) Read BEFORE lock acquired
spin_lock(&lock);        // (1) Actually acquire lock
use(temp);               // Uses possibly stale value!
```

Locks must enforce:
Acquire semantics: No memory operations from AFTER the lock acquisition can be reordered to BEFORE the acquisition. This ensures we see all writes from the previous lock holder.
Release semantics: No memory operations from BEFORE the unlock can be reordered to AFTER the unlock. This ensures our writes are visible to the next lock acquirer.
```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock(spinlock_t *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &lock->locked, &expected, 1,
               memory_order_acquire,   // Acquire barrier on success
               memory_order_relaxed    // No barrier on failure
           )) {
        expected = 0;
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __asm__ volatile("pause");
        }
    }
    // ACQUIRE BARRIER: all reads/writes after this point
    // cannot be reordered before the lock acquisition
}

void spin_unlock(spinlock_t *lock) {
    // RELEASE BARRIER: all reads/writes before this point
    // cannot be reordered after the store
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

x86/x64 provides strong ordering by default—most loads and stores are already ordered. ARM and RISC-V have weak memory models where explicit barriers are essential. Code that 'works' on x86 may be subtly broken on ARM. Always use proper atomic types and orderings; never assume the hardware will do what you expect.
Let's examine how spinlocks are implemented at the assembly level on x86/x64, the dominant architecture in servers and desktops.
On x86, atomic read-modify-write operations are achieved using the LOCK prefix. This prefix guarantees:

- The entire read-modify-write executes atomically; no other processor can access that memory location mid-operation.
- The CPU typically enforces this by holding the cache line exclusively for the duration, falling back to a much slower bus lock if the operand spans cache lines.
- On x86, a LOCK-prefixed instruction also acts as a full memory barrier.
Common LOCK instructions for spinlocks:

- XCHG - atomic exchange (the LOCK prefix is implicit)
- LOCK CMPXCHG - compare-and-swap
- LOCK BTS - bit test-and-set
- LOCK XADD - fetch-and-add (the basis of ticket locks)
```asm
; x86-64 spinlock implementation

section .data
lock_var: dd 0              ; 32-bit spinlock variable (0 = free, 1 = locked)

section .text

; void spin_lock(int *lock)
spin_lock:
    ; rdi contains pointer to lock variable
.try_lock:
    mov  eax, 1             ; Value to exchange
    xchg eax, [rdi]         ; Atomic exchange: eax <-> [rdi]
                            ; XCHG has implicit LOCK prefix
    test eax, eax           ; Was old value 0?
    jnz  .spin              ; If not 0, lock was held - spin
    ret                     ; Got the lock!

.spin:
    ; TTAS optimization: spin on regular read until lock looks free
    pause                   ; Hint to CPU: spin-wait loop
    mov  eax, [rdi]         ; Regular (non-atomic) read
    test eax, eax           ; Is it 0 (free)?
    jnz  .spin              ; Still locked - keep spinning
    jmp  .try_lock          ; Looks free - try atomic exchange again

; void spin_unlock(int *lock)
spin_unlock:
    ; rdi contains pointer to lock variable
    mov  dword [rdi], 0     ; Store 0 - release lock
                            ; x86 stores have release semantics
    ret
```

The CMPXCHG instruction provides CAS semantics:
```asm
; Spinlock using CMPXCHG

spin_lock_cmpxchg:
    ; rdi = pointer to lock
.try_acquire:
    xor  eax, eax           ; eax = 0 (expected value)
    mov  ecx, 1             ; ecx = 1 (desired value)
    lock cmpxchg [rdi], ecx ; if [rdi]==eax, then [rdi]=ecx
                            ; eax gets overwritten with actual [rdi]
    jz   .acquired          ; Zero flag set if exchange succeeded

    ; Failed - lock wasn't 0
.spin:
    pause
    cmp  dword [rdi], 0     ; TTAS: check without LOCK
    jne  .spin              ; Not free - keep spinning
    jmp  .try_acquire       ; Try again

.acquired:
    ret

; Alternative: Using BTS (Bit Test-and-Set) for single-bit lock
spin_lock_bts:
    ; rdi = pointer to byte containing lock bit
.try:
    lock bts dword [rdi], 0 ; Test bit 0, set it to 1
                            ; Carry flag = old value of bit 0
    jc   .spin              ; If carry set, bit was already 1
    ret                     ; Got the lock

.spin:
    pause
    test dword [rdi], 1     ; Check bit 0
    jnz  .spin              ; Still set - spin
    jmp  .try               ; Try again
```

XCHG has an implicit LOCK prefix—it's always atomic. It's simpler and often faster for basic spinlocks.
LOCK CMPXCHG is more flexible, allowing conditional updates. Essential for owner-tracking locks, lock-free algorithms, and complex state machines.
Performance notes:

- A LOCK-prefixed instruction typically costs tens of cycles even with no contention, far more than an ordinary load or store.
- Under contention, the dominant cost is moving the lock's cache line between cores, not the instruction itself.
- The unlock path is a plain store, since x86 stores already have release semantics.
- PAUSE in the spin loop saves power and avoids the pipeline flush that otherwise occurs when the spin loop exits.
x86 has a relatively strong memory model. Stores are not reordered with other stores; loads are not reordered with other loads. The main reordering allowed is store-load (a later load can 'pass' an earlier store to a different address). This makes x86 spinlocks simpler than on ARM/RISC-V, but modern C11/C++11 atomics abstract these differences.
Let's examine how real operating systems implement spinlocks, seeing how theory translates to production code.
The Linux kernel has evolved its spinlock implementation multiple times. The modern implementation uses queued spinlocks for fairness, but the basic concept is shown here:
```c
// Simplified Linux spinlock (actual kernel code is more complex)
// Based on arch/x86/include/asm/spinlock.h concepts

typedef struct {
    volatile unsigned int lock;
} arch_spinlock_t;

static inline void arch_spin_lock(arch_spinlock_t *lock) {
    // Disable preemption first - we cannot be scheduled out holding spinlock
    preempt_disable();

    // Try to acquire
    while (1) {
        // TTAS: spin on read first
        if (READ_ONCE(lock->lock) == 0) {
            // Lock appears free - try atomic exchange
            if (cmpxchg(&lock->lock, 0, 1) == 0) {
                break;   // Got it!
            }
        }
        // Still locked - keep spinning
        cpu_relax();     // PAUSE instruction wrapper
    }
}

static inline void arch_spin_unlock(arch_spinlock_t *lock) {
    // Release - simple store on x86 (has release semantics)
    WRITE_ONCE(lock->lock, 0);

    // Re-enable preemption
    preempt_enable();
}

// CPU relaxation hint - essential for efficiency
#define cpu_relax() __asm__ volatile("pause" ::: "memory")

// READ_ONCE / WRITE_ONCE prevent compiler optimization issues
#define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
```

preempt_disable(): Before taking a spinlock, the kernel disables preemption on the current CPU. This prevents the scheduler from switching to another thread while we hold the spinlock. If the holder were preempted, spinners on other CPUs would waste enormous amounts of time.
cpu_relax(): This wraps the PAUSE instruction (or equivalent on other architectures). It:

- hints to the processor that the code is a spin-wait loop, reducing power consumption;
- avoids the memory-order mis-speculation penalty (a pipeline flush) when the loop finally exits;
- on SMT/hyperthreaded cores, yields execution resources to the sibling hardware thread.
READ_ONCE / WRITE_ONCE: These prevent compiler optimizations that might:

- hoist the load out of the spin loop, so the loop never observes the unlock;
- tear one access into several smaller ones, or merge and eliminate accesses;
- keep the value cached in a register instead of re-reading memory each iteration.

They apply volatile semantics to a single access rather than to the whole variable, which is more targeted than declaring the field volatile and makes the intent explicit.
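To see why this matters, consider the hypothetical sketch below. With a plain non-volatile read, the compiler is allowed to load the flag once, hoist it out of the loop, and spin forever on a register; forcing a fresh volatile read each iteration, as READ_ONCE does, prevents that. (This addresses only compiler optimization; cross-CPU ordering still needs the atomics and barriers shown earlier.)

```c
// Sketch: why a plain read in a spin loop is unsafe without READ_ONCE/volatile.

int stop_flag;   // Written by another thread

void wait_plain(void) {
    // The compiler may legally transform this into:
    //   if (stop_flag == 0) for (;;) ;   // value hoisted, loop never sees the update
    while (stop_flag == 0) {
        /* spin */
    }
}

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))

void wait_read_once(void) {
    // Each iteration performs a fresh load from memory,
    // so the other thread's store is eventually observed.
    while (READ_ONCE(stop_flag) == 0) {
        /* spin */
    }
}
```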
```c
// Spinlock with interrupt handling - very common in kernel code

// spin_lock_irqsave: Acquires spinlock AND disables interrupts
// Returns previous interrupt state in 'flags'
static inline void spin_lock_irqsave(spinlock_t *lock, unsigned long *flags) {
    // Save current interrupt state and disable interrupts
    local_irq_save(*flags);

    // Disable preemption
    preempt_disable();

    // Acquire the spinlock
    arch_spin_lock(&lock->raw_lock);
}

// spin_unlock_irqrestore: Releases spinlock AND restores interrupt state
static inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) {
    // Release the spinlock
    arch_spin_unlock(&lock->raw_lock);

    // Re-enable preemption
    preempt_enable();

    // Restore previous interrupt state
    local_irq_restore(flags);
}

// Why disable interrupts?
// - If an interrupt handler tries to acquire the same spinlock
//   while we hold it, we deadlock (handler can't complete,
//   we can't continue)
// - spin_lock_irq* variants handle this common scenario
```

Linux provides multiple spinlock APIs: spin_lock() for process context where interrupts are not a concern, spin_lock_bh() which also disables softirqs, spin_lock_irq() which disables IRQs, and spin_lock_irqsave() which saves and restores IRQ state. Choosing the right variant is crucial for avoiding deadlocks.
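As an illustration of choosing a variant, here is a hypothetical kernel-style sketch (not real driver code, and only compilable against a kernel tree) of state shared between process context and a hardware interrupt handler. Note that in the real kernel, spin_lock_irqsave is a macro that takes flags by name rather than by pointer as in the simplified listing above; struct my_device_state and the function names are invented for illustration.

```c
// Hypothetical sketch of the classic pattern: data shared between
// process context and a hardware interrupt handler.
#include <linux/spinlock.h>
#include <linux/interrupt.h>

struct my_device_state {          // Hypothetical shared state
    int pending;
    int completed;
};

static DEFINE_SPINLOCK(dev_lock);
static struct my_device_state dev_state;

// Process context: the handler could fire on this CPU while we hold
// the lock, so interrupts must be disabled for the critical section.
static void update_from_process_context(int value) {
    unsigned long flags;

    spin_lock_irqsave(&dev_lock, flags);
    dev_state.pending = value;
    spin_unlock_irqrestore(&dev_lock, flags);
}

// Interrupt handler: runs with local interrupts disabled,
// so the plain variant is sufficient here.
static irqreturn_t my_irq_handler(int irq, void *dev_id) {
    spin_lock(&dev_lock);
    dev_state.completed++;
    spin_unlock(&dev_lock);
    return IRQ_HANDLED;
}
```

The asymmetry is the point: only the side that can be interrupted while holding the lock needs the irqsave variant.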
We've journeyed from a broken naive lock to production-quality spinlock implementations. Let's consolidate the key lessons:

- A plain check-then-act flag fails because the test and the write are separate steps that other threads can interleave with.
- Hardware atomic read-modify-write operations (TAS, CAS) make acquisition a single indivisible step; only one contender can ever observe the lock as free.
- Test-and-test-and-set spins on a cached read and attempts the atomic operation only when the lock looks free, cutting cache coherence traffic dramatically.
- Correct locks require acquire semantics on lock and release semantics on unlock, not just atomicity.
- Production spinlocks, such as Linux's, add preemption control, interrupt-aware variants, CPU relaxation hints, and fairness through queued designs.
What's next:
Now that we understand how spinlocks are implemented, the next page addresses a crucial question: When are spinlocks the right choice? We'll explore the trade-offs between spinlocks and blocking locks, the scenarios where spinning excels, and when it's actively harmful.
Understanding implementation gives us the vocabulary; understanding applicability gives us wisdom.
You now understand spinlock implementation from first principles: why naive locks fail, how atomic operations solve the problem, optimization with TTAS, memory ordering requirements, and how production systems like Linux implement spinlocks. Next, we examine when to choose spinlocks over alternatives.