Armed with our understanding of atomic Test-and-Set semantics and the hardware that makes atomicity possible, we now turn to the practical challenge: implementing actual locks. While the conceptual idea is simple—keep trying TestAndSet until it returns false—the details of implementation profoundly affect correctness, performance, and fairness.
In this page, we'll build progressively more sophisticated lock implementations, starting from the naive spinlock and advancing through optimizations that address cache performance, power consumption, and system responsiveness. We'll examine real code from production systems and understand why even 'simple' spinlocks require careful engineering.
The goal is not just to understand lock implementations intellectually, but to develop the intuition needed to choose, configure, and debug locks in real systems.
By the end of this page, you will understand: (1) The basic TAS spinlock and its limitations, (2) The Test-and-Test-and-Set optimization, (3) Exponential backoff strategies, (4) Platform-specific implementations in C, assembly, and with compiler intrinsics, and (5) Production considerations for spinlock deployment.
Let's start with the simplest possible implementation: a lock that repeatedly executes Test-and-Set until it acquires the lock. This is the basic TAS spinlock, and while it has serious limitations, it forms the conceptual foundation for all more sophisticated variants.
Implementation in Pseudocode:
```c
// Basic TAS Spinlock
// Lock is represented as an integer: 0 = unlocked, 1 = locked

typedef struct {
    volatile int locked;   // 'volatile' prevents compiler from caching
} spinlock_t;

#define SPINLOCK_INIT { .locked = 0 }

// Acquire: spin until TestAndSet returns 0 (was unlocked)
void spin_lock(spinlock_t *lock) {
    while (atomic_exchange(&lock->locked, 1) == 1) {
        // Spin - lock was already held
    }
    // Lock acquired: we set it from 0 to 1
}

// Release: simply set to 0
void spin_unlock(spinlock_t *lock) {
    atomic_store(&lock->locked, 0);
}

// Try to acquire without spinning
int spin_trylock(spinlock_t *lock) {
    // Returns 1 if lock acquired, 0 if failed
    return atomic_exchange(&lock->locked, 1) == 0;
}
```

Why This Works:
The correctness of this implementation follows directly from the atomicity of atomic_exchange:
Only one processor can observe the 0→1 transition. The exchange is atomic, so if two processors execute simultaneously, exactly one sees the old value as 0.
Once a processor exits the while loop, the lock is set to 1, preventing any other processor from acquiring it until explicit release.
The release operation simply stores 0. Any processor spinning in the while loop will eventually observe this 0 and attempt to acquire.
This simple implementation, while correct, is a performance disaster under contention. Every spinning processor continuously executes atomic exchange operations, each one generating cache coherence traffic and invalidating other processors' cached copies. On a system with 16 spinning processors, the memory interconnect becomes saturated with useless traffic.
Performance Analysis:
Let's quantify the problem. Each atomic_exchange on a shared variable must gain exclusive ownership of the cache line, which invalidates every other processor's cached copy and consumes a coherence transaction on the interconnect. With N processors spinning, each spinner's exchanges repeatedly steal the line from the other N−1, so the traffic scales roughly as follows:
| Processors | Invalidations/Second | Bus Bandwidth Consumed | Lock Throughput |
|---|---|---|---|
| 2 | ~1M | 64MB/s | Good |
| 4 | ~4M | 256MB/s | Moderate |
| 8 | ~16M | 1GB/s | Poor |
| 16 | ~64M | 4GB/s | Terrible |
| 32 | ~256M | 16GB/s | System stall |
The quadratic scaling (O(N²) traffic for N processors) makes basic TAS unusable for any system with significant lock contention. We need to reduce the rate of atomic operations.
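One way to observe this directly is a small contention microbenchmark: spawn several threads that repeatedly acquire the basic TAS lock, bump a shared counter, and release. The sketch below is a minimal, hypothetical harness (the thread count, iteration count, and build command are arbitrary assumptions), not a rigorous benchmark; absolute numbers will vary with hardware and compiler.

```c
// Minimal contention microbenchmark for the basic TAS spinlock (illustrative sketch).
// Assumed build command: gcc -O2 -pthread tas_bench.c -o tas_bench
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define NTHREADS 8          // number of contending threads (arbitrary)
#define ITERS    100000     // lock acquisitions per thread (arbitrary)

static atomic_int lock_word = 0;   // 0 = unlocked, 1 = locked
static long shared_counter = 0;    // data protected by the lock

static void tas_lock(void)   { while (atomic_exchange(&lock_word, 1) == 1) { /* spin */ } }
static void tas_unlock(void) { atomic_store(&lock_word, 0); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        tas_lock();
        shared_counter++;          // trivially short critical section
        tas_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d threads, %ld total acquisitions in %.3f s (%.0f locks/s)\n",
           NTHREADS, (long)NTHREADS * ITERS, secs,
           (double)NTHREADS * ITERS / secs);
    return 0;
}
```

Rerunning the same harness with the TTAS and backoff variants introduced below is a quick way to see the throughput differences summarized in the tables on this page.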
The key insight for improving TAS spinlock performance is recognizing that we don't need to execute atomic operations while spinning. We only need atomicity at the moment we attempt to acquire. This leads to the Test-and-Test-and-Set (TTAS) optimization—also called "spin on read" or "local spinning".
The TTAS Algorithm:
```c
// Test-and-Test-and-Set Spinlock
// Dramatically reduces cache coherence traffic

void spin_lock_ttas(spinlock_t *lock) {
    while (1) {
        // Phase 1: Spin locally, just reading
        // This read can be satisfied from local cache!
        while (lock->locked == 1) {
            // Local spinning - no bus traffic
            // Other CPUs' caches are unaffected
            cpu_relax();   // Hint for spin-wait loops
        }

        // Phase 2: Lock appears free, attempt acquisition
        // Now we need atomicity
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success! We acquired the lock
            return;
        }
        // Failed - someone else got it first
        // Back to spinning
    }
}

// cpu_relax() implementation varies by architecture
#if defined(__x86_64__) || defined(__i386__)
static inline void cpu_relax(void) {
    asm volatile("pause" ::: "memory");
}
#elif defined(__aarch64__)
static inline void cpu_relax(void) {
    asm volatile("yield" ::: "memory");
}
#elif defined(__powerpc__)
static inline void cpu_relax(void) {
    asm volatile("or 1,1,1" ::: "memory");   // Low-power hint
}
#endif
```

Why TTAS Dramatically Reduces Traffic:
During local spinning, each processor reads from its own L1 cache. After the initial miss (or invalidation when the lock was acquired), the lock variable sits in each spinning processor's cache in Shared state. Reading a Shared cache line generates zero bus traffic.
Traffic only occurs when the lock changes state: a release invalidates every spinner's cached copy, each spinner then re-fetches the line, and the subsequent atomic acquisition attempts briefly generate coherence transactions.
When the lock is released, ALL spinning processors simultaneously observe the change and attempt atomic acquisition. This creates a brief but intense burst of coherence traffic—the 'thundering herd.' Only one succeeds; the rest return to spinning. We'll address this with backoff strategies.
Comparative Performance:
| Metric | Basic TAS | TTAS | Improvement |
|---|---|---|---|
| Bus transactions while spinning | Continuous | Zero | ∞ |
| Cache invalidations | O(N) per CPU | O(1) per release | N× |
| Power consumption | Very high | Moderate | ~50% |
| Lock throughput at N=8 | ~10K/s | ~500K/s | 50× |
| Lock fairness | Poor | Poor | Same |
Implementation Note: The cpu_relax() Hint:
The cpu_relax() function (or the PAUSE instruction on x86) serves multiple purposes: it reduces power consumption during the spin, it frees pipeline and execution resources for a sibling hyperthread sharing the core, and it avoids the costly memory-order mis-speculation pipeline flush that x86 processors otherwise incur when the spin loop finally exits.
Omitting cpu_relax() does not make the lock incorrect, but it forfeits these benefits: on some microarchitectures the exit-from-spin pipeline flush is expensive, and a tight spin loop without the hint always wastes power and shared execution resources.
TTAS eliminates traffic during spinning but still suffers from the thundering herd when the lock is released. Exponential backoff addresses this by spreading out acquisition attempts over time.
The Backoff Strategy:
After a failed acquisition attempt, instead of immediately retrying, wait for a random period. If we fail again, double the maximum wait time. This exponentially increasing delay spreads acquisition attempts out in time, breaks up the thundering herd on each release, and automatically adapts to the level of contention: the more often a processor loses the race, the longer it waits before trying again.
```c
// TTAS with Exponential Backoff

#define MIN_DELAY 4            // Minimum backoff cycles
#define MAX_DELAY 4096         // Maximum backoff cycles
#define DELAY_MULTIPLIER 2

void spin_lock_backoff(spinlock_t *lock) {
    int delay = MIN_DELAY;

    while (1) {
        // Phase 1: Local spinning with backoff
        while (lock->locked == 1) {
            // Spin locally with delay
            for (int i = 0; i < delay; i++) {
                cpu_relax();
            }
        }

        // Phase 2: Attempt acquisition
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success!
            return;
        }

        // Failed: exponentially increase backoff
        // Add randomness to prevent synchronization
        delay = delay * DELAY_MULTIPLIER;
        if (delay > MAX_DELAY) {
            delay = MAX_DELAY;
        }

        // Random backoff within [MIN_DELAY, delay]
        int actual_delay = MIN_DELAY + (random() % (delay - MIN_DELAY + 1));
        for (int i = 0; i < actual_delay; i++) {
            cpu_relax();
        }
    }
}

// Alternative: Proportional backoff based on contention level
void spin_lock_proportional_backoff(spinlock_t *lock) {
    while (1) {
        int spin_count = 0;
        while (lock->locked == 1) {
            cpu_relax();
            spin_count++;
        }

        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;   // Success
        }

        // Backoff proportional to observed contention
        int backoff = spin_count < 1000 ? spin_count / 10 : 100;
        for (int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}
```

Tuning Backoff Parameters:
Backoff effectiveness depends heavily on system characteristics and workload. The optimal parameters vary with the number of cores and their topology (single socket vs. NUMA), the latency of a cache-line transfer, the typical critical-section length, and the expected level of contention:
Production systems often use adaptive backoff that monitors acquisition success rate and adjusts parameters dynamically. Linux's qspinlock and Windows' critical sections both employ adaptive strategies that tune themselves to workload characteristics.
| System Type | MIN_DELAY | MAX_DELAY | Notes |
|---|---|---|---|
| Desktop (4-8 cores) | 2-8 | 256-1K | Light contention typical |
| Server (16-64 cores) | 8-32 | 2K-16K | Contention common |
| NUMA (multi-socket) | 32-128 | 16K-64K | High remote latency |
| Critical section < 100ns | 2-8 | 128-512 | Short hold → quick retry |
| Critical section > 1μs | 32-128 | 4K-64K | Longer hold → more patience |
Backoff Tradeoffs:
Backoff is not free. Excessive backoff leaves the lock sitting idle while every waiter is still in its delay loop, inflating handoff latency and tail latency, and it can worsen unfairness: a newly arrived thread with a small delay can overtake threads that have been waiting far longer. Insufficient backoff fails to disperse the thundering herd, so the burst of coherence traffic and failed atomic attempts on every release remains.
The optimal balance is workload-dependent and often requires empirical tuning.
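As an illustration of the adaptive idea, the sketch below caps the backoff with a per-lock value that grows when acquisitions fail and shrinks when they succeed. This is a simplified, hypothetical scheme (the constants, the halving/doubling rule, and the names are assumptions), not the algorithm used by Linux's qspinlock or Windows critical sections.

```c
// Adaptive backoff sketch: the lock remembers a cap that tracks recent contention.
// Illustrative only; constants and adaptation rule are arbitrary.
#include <stdatomic.h>
#include <stdlib.h>     // rand()

#define ADAPT_MIN 4
#define ADAPT_MAX 4096

typedef struct {
    atomic_int locked;        // 0 = unlocked, 1 = locked
    atomic_int backoff_cap;   // adaptive upper bound on the backoff delay
} adaptive_spinlock_t;

#define ADAPTIVE_SPINLOCK_INIT { .locked = 0, .backoff_cap = ADAPT_MIN }

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void backoff_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void spin_lock_adaptive(adaptive_spinlock_t *lock) {
    int cap = atomic_load_explicit(&lock->backoff_cap, memory_order_relaxed);

    while (1) {
        // Local spinning: read-only, served from the local cache
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
            backoff_relax();

        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            // Acquired on this attempt: contention looks low, shrink the shared cap
            if (cap > ADAPT_MIN)
                atomic_store_explicit(&lock->backoff_cap, cap / 2, memory_order_relaxed);
            return;
        }

        // Lost the race: grow the cap and wait a random time below it
        if (cap < ADAPT_MAX)
            cap *= 2;
        atomic_store_explicit(&lock->backoff_cap, cap, memory_order_relaxed);

        int delay = ADAPT_MIN + rand() % cap;
        for (int i = 0; i < delay; i++)
            backoff_relax();
    }
}

void spin_unlock_adaptive(adaptive_spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

Whether the cap should live in the lock (shared by all waiters) or per thread is itself a design choice; production implementations avoid the question entirely by moving to queue-based locks.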
For maximum performance and control, critical spinlock code is often written directly in assembly. Let's examine real implementations for major architectures.
x86-64 Implementation:
```asm
; x86-64 TTAS Spinlock Implementation
; Input: RDI = address of lock variable
; Lock variable: 0 = unlocked, 1 = locked

.global spin_lock
.global spin_unlock

spin_lock:
    mov     eax, 1                  ; Value to exchange

.Lttas_loop:
    ; Phase 1: Local spinning (test)
.Lspin:
    pause                           ; Hint: spin-wait loop
    cmp     dword ptr [rdi], 0      ; Is lock free?
    jne     .Lspin                  ; No: keep spinning locally

    ; Phase 2: Attempt acquisition (test-and-set)
    xchg    eax, [rdi]              ; Atomic: EAX <-> [RDI]
    test    eax, eax                ; Was old value 0?
    jnz     .Lttas_loop             ; No: failed, retry

    ; Success: lock acquired
    ; XCHG provides implicit memory barrier
    ret

spin_unlock:
    ; Simple store with release semantics
    ; On x86, regular stores have release semantics
    mov     dword ptr [rdi], 0
    ret

; Trylock variant: returns 1 on success, 0 on failure
.global spin_trylock
spin_trylock:
    mov     eax, 1
    xchg    eax, [rdi]              ; Attempt to acquire
    xor     eax, 1                  ; Invert: 0->1 (success), 1->0 (fail)
    ret
```

ARM64 (AArch64) Implementation:
```asm
// ARM64 Spinlock using Load-Exclusive/Store-Exclusive
// and ARMv8.1 atomic instructions

.global spin_lock
.global spin_unlock

// Using LDAXR/STXR (Load-Acquire Exclusive, Store-Exclusive)
spin_lock:
    mov     w2, #1              // Value to store
1:
    // Acquire phase: try to take the lock with an exclusive pair
    ldaxr   w1, [x0]            // Load-Acquire Exclusive
    cbnz    w1, 2f              // If locked, go to local spinning
    stxr    w1, w2, [x0]        // Try to store 1
    cbnz    w1, 1b              // If store failed, retry
    ret                         // Success: lock acquired

2:
    // Local spinning without exclusive access
    wfe                         // Wait For Event (power saving)
    ldr     w1, [x0]            // Regular load
    cbnz    w1, 2b              // Still locked? Keep waiting
    b       1b                  // Lock appears free, try to acquire

spin_unlock:
    stlr    wzr, [x0]           // Store-Release zero
    sev                         // Send Event (wake spinning CPUs)
    ret

// ARMv8.1 with Large System Extensions (LSE)
// Uses direct atomic swap instruction
.global spin_lock_lse
spin_lock_lse:
    mov     w1, #1
1:
    ldr     w2, [x0]            // Local spinning
    cbnz    w2, 1b
    swpa    w1, w2, [x0]        // Atomic swap with acquire
    cbnz    w2, 1b              // If old value != 0, retry
    ret

spin_unlock_lse:
    stlr    wzr, [x0]           // Store-Release zero
    ret
```

Key Differences Between Architectures:
| Aspect | x86-64 | ARM64 (LL/SC) | ARM64 (LSE) |
|---|---|---|---|
| Atomic mechanism | XCHG (implicit lock) | LDXR/STXR pair | SWP instruction |
| Failure mode | Never fails | SC can fail spuriously | Never fails |
| Memory barriers | Implicit in XCHG | LDAXR/STLR explicit | Acquire/Release in swap |
| Spin hint | PAUSE | WFE (wait for event) | WFE |
| Wake mechanism | N/A | SEV (send event) | SEV |
| Code complexity | Low | Medium | Low |
ARM's Wait-For-Event (WFE) and Send-Event (SEV) instructions provide power-efficient spinning. WFE puts the core into a low-power state until an event arrives. SEV broadcasts an event to all cores. This is more efficient than spinning on PAUSE but requires unlock code to execute SEV.
Writing assembly for every architecture is impractical for most projects. Modern compilers provide atomic built-ins and intrinsics that generate optimal code for each target platform while maintaining source portability.
GCC/Clang Atomic Built-ins:
```c
// Portable spinlock using GCC/Clang atomic built-ins
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

#define SPINLOCK_INITIALIZER { .locked = 0 }

static inline void spin_lock(spinlock_t *lock) {
    while (1) {
        // Phase 1: Local spinning
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();
#elif defined(__aarch64__)
            asm volatile("yield" ::: "memory");
#endif
        }

        // Phase 2: Attempt acquisition with acquire semantics
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(
                &lock->locked, &expected, 1,
                memory_order_acquire,    // Success: acquire
                memory_order_relaxed     // Failure: relaxed
            )) {
            return;   // Lock acquired
        }
    }
}

static inline void spin_unlock(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

static inline bool spin_trylock(spinlock_t *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1,
        memory_order_acquire,
        memory_order_relaxed
    );
}
```

C++11 std::atomic Implementation:
```cpp
// C++11 Portable Spinlock
#include <atomic>
#if defined(_MSC_VER)
#include <intrin.h>   // _mm_pause
#endif

class Spinlock {
private:
    std::atomic<bool> locked{false};

public:
    void lock() noexcept {
        while (true) {
            // Optimistic read with relaxed ordering
            while (locked.load(std::memory_order_relaxed)) {
                // Platform-specific spin-wait hint
#if defined(_MSC_VER)
                _mm_pause();
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
                __builtin_ia32_pause();
#endif
            }

            // Attempt acquisition with acquire semantics
            if (!locked.exchange(true, std::memory_order_acquire)) {
                return;   // Success
            }
        }
    }

    bool try_lock() noexcept {
        // Don't spin - single attempt
        return !locked.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept {
        locked.store(false, std::memory_order_release);
    }
};

// Usage example with RAII guard
class SpinlockGuard {
private:
    Spinlock& lock_;
public:
    explicit SpinlockGuard(Spinlock& lock) : lock_(lock) {
        lock_.lock();
    }
    ~SpinlockGuard() {
        lock_.unlock();
    }
    // Non-copyable, non-movable
    SpinlockGuard(const SpinlockGuard&) = delete;
    SpinlockGuard& operator=(const SpinlockGuard&) = delete;
};
```

For lock acquisition, use memory_order_acquire; for lock release, use memory_order_release. This ensures that data protected by the lock is properly synchronized between the releasing and the acquiring thread. Using memory_order_seq_cst is correct but may be unnecessarily expensive, especially on ARM and POWER, while memory_order_relaxed provides no synchronization and is dangerous here.
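To see why acquire/release ordering matters, here is a minimal sketch in C that assumes the spinlock_t, spin_lock(), and spin_unlock() definitions from the GCC/Clang built-ins example above: the release store in spin_unlock() publishes the writer's update, and the acquire in a later spin_lock() guarantees the reader observes it.

```c
// Why acquire/release matters: a writer publishes data under the lock,
// a reader that later acquires the same lock is guaranteed to see it.
// Sketch assuming spinlock_t, SPINLOCK_INITIALIZER, spin_lock(), spin_unlock() from above.
#include <pthread.h>
#include <stdio.h>

static spinlock_t lock = SPINLOCK_INITIALIZER;
static int shared_value;          // ordinary (non-atomic) data protected by the lock

static void *writer(void *arg) {
    (void)arg;
    spin_lock(&lock);
    shared_value = 42;            // happens-before the release store in spin_unlock()
    spin_unlock(&lock);           // memory_order_release: publishes the write
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    spin_lock(&lock);             // memory_order_acquire: synchronizes with the release
    printf("observed %d\n", shared_value);   // 42 if the writer ran first, 0 otherwise
    spin_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, reader, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

With relaxed ordering on the lock word, the lock itself would still be mutually exclusive, but the reader could acquire the lock and yet miss the writer's update to shared_value.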
Windows Intrinsics:
```c
// Windows-specific spinlock using interlocked intrinsics
#include <Windows.h>
#include <intrin.h>

typedef volatile LONG spinlock_t;

#define SPINLOCK_INIT 0

static inline void spin_lock(spinlock_t *lock) {
    while (1) {
        // Local spinning
        while (*lock != 0) {
            _mm_pause();   // Intel pause intrinsic
        }

        // Attempt acquisition
        // InterlockedExchange returns the old value
        if (InterlockedExchange(lock, 1) == 0) {
            return;
        }
    }
}

static inline void spin_unlock(spinlock_t *lock) {
    // MemoryBarrier ensures all prior writes are visible
    MemoryBarrier();
    *lock = 0;
    // Or use InterlockedExchange for guaranteed atomicity:
    // InterlockedExchange(lock, 0);
}

static inline BOOL spin_trylock(spinlock_t *lock) {
    return InterlockedExchange(lock, 1) == 0;
}

// Alternative using InterlockedCompareExchange
static inline void spin_lock_cmpxchg(spinlock_t *lock) {
    while (InterlockedCompareExchange(lock, 1, 0) != 0) {
        while (*lock != 0) {
            _mm_pause();
        }
    }
}
```

Deploying spinlocks in production systems requires careful consideration of several factors that don't appear in textbook implementations.
When to Use Spinlocks:
Spinlocks are appropriate when critical sections are very short (ideally shorter than the cost of a context switch), when the lock holder cannot be preempted (for example, kernel code with preemption disabled or threads pinned to dedicated cores), and when waiters have nothing better to do with the CPU. If hold times are long, contention is heavy, or the holder can be descheduled, a blocking or hybrid lock is usually the better choice.
If a low-priority thread holds a spinlock and gets preempted, high-priority threads spinning for the lock can't make progress. On preemptible systems, spinlocks must disable preemption while held. POSIX spinlocks (pthread_spin_*) don't do this automatically—use with caution!
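For preemptible user-level code, one common mitigation is to bound the spin and then yield the CPU so a preempted lock holder gets a chance to run. The sketch below is a simplified illustration of that idea (SPIN_LIMIT and the names are arbitrary assumptions), not a substitute for a proper adaptive mutex.

```c
// Bounded spin, then yield: avoids burning a full timeslice when the lock
// holder has been preempted. Illustrative sketch; SPIN_LIMIT is arbitrary.
#include <stdatomic.h>
#include <sched.h>      // sched_yield (POSIX)

#define SPIN_LIMIT 1000

typedef struct { atomic_int locked; } yspinlock_t;

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void yield_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void spin_lock_yield(yspinlock_t *lock) {
    int spins = 0;
    while (1) {
        // Local spinning, bounded
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            if (++spins >= SPIN_LIMIT) {
                sched_yield();   // give the (possibly preempted) holder a chance to run
                spins = 0;
            } else {
                yield_relax();
            }
        }
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0)
            return;   // acquired
    }
}

void spin_unlock_yield(yspinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

Note that sched_yield() does not solve real-time priority inversion: a SCHED_FIFO spinner that yields still won't let a lower-priority holder run, which is one reason kernels disable preemption around spinlocks instead.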
Debugging and Instrumentation:
Production spinlocks often include debugging features:
```c
// Production spinlock with debugging support
typedef struct {
    atomic_int locked;
#ifdef DEBUG_SPINLOCK
    const char *file;        // File where lock was acquired
    int line;                // Line where lock was acquired
    pid_t owner;             // Thread/process holding lock
    uint64_t acquire_time;   // Timestamp of acquisition
    uint64_t spin_count;     // Total spins for contention analysis
#endif
} spinlock_t;

#ifdef DEBUG_SPINLOCK
void spin_lock_debug(spinlock_t *lock, const char *file, int line) {
    uint64_t spins = 0;

    while (1) {
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            cpu_relax();
            spins++;

            // Detect potential deadlock or excessive contention
            if (spins > 100000000) {
                fprintf(stderr, "SPINLOCK WARNING: Excessive spinning!\n");
                fprintf(stderr, "  Waiting at: %s:%d\n", file, line);
                fprintf(stderr, "  Held by: %s:%d (thread %d)\n",
                        lock->file, lock->line, lock->owner);
                spins = 0;   // Reset to continue warning
            }
        }

        if (atomic_exchange_explicit(&lock->locked, 1,
                                     memory_order_acquire) == 0) {
            lock->file = file;
            lock->line = line;
            lock->owner = gettid();
            lock->acquire_time = get_timestamp();
            lock->spin_count += spins;
            return;
        }
    }
}

#define spin_lock(lock) spin_lock_debug(lock, __FILE__, __LINE__)

#endif // DEBUG_SPINLOCK
```

Real-World Examples:
Production kernels and libraries use sophisticated spinlock implementations:
Linux Kernel (qspinlock): Since 4.2, Linux uses queued spinlocks that provide fairness while maintaining performance, combining ticket-lock fairness with MCS-style queueing (a minimal ticket-lock sketch illustrating the fairness idea follows this list).
FreeBSD: Uses adaptive mutexes that spin briefly then block if contention persists.
Windows (Critical Sections): Hybrid locks that spin for a configurable count before blocking on a kernel object.
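To make the fairness idea from the qspinlock item concrete, here is a minimal ticket-lock sketch. It is not the Linux implementation; it only illustrates how handing out tickets grants the lock in strict FIFO order, unlike TAS/TTAS where any spinner can win the race.

```c
// Minimal ticket lock: each acquirer takes a ticket and waits until
// "now serving" reaches it, giving strict FIFO ordering among waiters.
// Illustrative sketch only; the real qspinlock adds MCS queueing and more.
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;    // ticket dispenser
    atomic_uint now_serving;    // ticket currently allowed to hold the lock
} ticketlock_t;

#define TICKETLOCK_INIT { .next_ticket = 0, .now_serving = 0 }

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void ticket_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void ticket_lock(ticketlock_t *lock) {
    // Atomically take the next ticket (fetch-and-add instead of test-and-set)
    unsigned my_ticket =
        atomic_fetch_add_explicit(&lock->next_ticket, 1, memory_order_relaxed);

    // Spin until it is our turn; the load is read-only, like TTAS local spinning
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my_ticket)
        ticket_relax();
}

void ticket_unlock(ticketlock_t *lock) {
    // Pass the lock to the next ticket holder in FIFO order
    unsigned next =
        atomic_load_explicit(&lock->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&lock->now_serving, next, memory_order_release);
}
```

Fairness has a cost: on release only the thread holding the matching ticket may proceed, so a preempted waiter stalls everyone behind it — one reason production kernels layer queueing and other machinery on top of this basic idea.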
We have explored the full spectrum of Test-and-Set lock implementations, from the naive but correct basic spinlock through production-ready variants with debugging support. The key insights: basic TAS is correct but drowns the interconnect in coherence traffic under contention; TTAS fixes this by spinning on local cache reads and going atomic only when the lock looks free; exponential or proportional backoff tames the thundering herd at release time; acquire/release memory ordering is what makes the data protected by the lock safe to access; and spinlocks belong only where critical sections are short and holders are not preempted.
What's Next:
We've covered standalone TSL-based lock implementations. The next page will explore how to build complete lock abstractions using Test-and-Set—examining mutex implementations that combine spinning with blocking, and understanding the relationship between simple spinlocks and full-featured synchronization primitives.
You now have practical knowledge to implement, optimize, and deploy Test-and-Set based locks. From understanding why basic TAS fails under contention to writing platform-specific assembly, you can reason about spinlock performance and make informed implementation choices.