Armed with our understanding of atomic Test-and-Set semantics and the hardware that makes atomicity possible, we now turn to the practical challenge: implementing actual locks. While the conceptual idea is simple—keep trying TestAndSet until it returns false—the details of implementation profoundly affect correctness, performance, and fairness.
In this page, we'll build progressively more sophisticated lock implementations, starting from the naive spinlock and advancing through optimizations that address cache performance, power consumption, and system responsiveness. We'll examine real code from production systems and understand why even 'simple' spinlocks require careful engineering.
The goal is not just to understand lock implementations intellectually, but to develop the intuition needed to choose, configure, and debug locks in real systems.
By the end of this page, you will understand: (1) The basic TAS spinlock and its limitations, (2) The Test-and-Test-and-Set optimization, (3) Exponential backoff strategies, (4) Platform-specific implementations in C, assembly, and with compiler intrinsics, and (5) Production considerations for spinlock deployment.
Let's start with the simplest possible implementation: a lock that repeatedly executes Test-and-Set until it acquires the lock. This is the basic TAS spinlock, and while it has serious limitations, it forms the conceptual foundation for all more sophisticated variants.
Implementation in Pseudocode:
```c
// Basic TAS Spinlock
// Lock is represented as an integer: 0 = unlocked, 1 = locked

typedef struct {
    volatile int locked;   // 'volatile' prevents compiler from caching
} spinlock_t;

#define SPINLOCK_INIT { .locked = 0 }

// Acquire: spin until TestAndSet returns 0 (was unlocked)
void spin_lock(spinlock_t *lock) {
    while (atomic_exchange(&lock->locked, 1) == 1) {
        // Spin - lock was already held
    }
    // Lock acquired: we set it from 0 to 1
}

// Release: simply set to 0
void spin_unlock(spinlock_t *lock) {
    atomic_store(&lock->locked, 0);
}

// Try to acquire without spinning
int spin_trylock(spinlock_t *lock) {
    // Returns 1 if lock acquired, 0 if failed
    return atomic_exchange(&lock->locked, 1) == 0;
}
```

Why This Works:
The correctness of this implementation follows directly from the atomicity of atomic_exchange:
Only one processor can observe the 0→1 transition. The exchange is atomic, so if two processors execute simultaneously, exactly one sees the old value as 0.
Once a processor exits the while loop, the lock is set to 1, preventing any other processor from acquiring it until explicit release.
The release operation simply stores 0. Any processor spinning in the while loop will eventually observe this 0 and attempt to acquire.
This simple implementation, while correct, is a performance disaster under contention. Every spinning processor continuously executes atomic exchange operations, each one generating cache coherence traffic and invalidating other processors' cached copies. On a system with 16 spinning processors, the memory interconnect becomes saturated with useless traffic.
Performance Analysis:
Let's quantify the problem. Each atomic_exchange on a shared variable must gain exclusive ownership of the cache line, which invalidates every other processor's cached copy and consumes a coherence transaction on the interconnect. With N processors spinning, each spinner's exchanges repeatedly steal the line from the other N−1, so the traffic scales roughly as follows:
| Processors | Invalidations/Second | Bus Bandwidth Consumed | Lock Throughput |
|---|---|---|---|
| 2 | ~1M | 64MB/s | Good |
| 4 | ~4M | 256MB/s | Moderate |
| 8 | ~16M | 1GB/s | Poor |
| 16 | ~64M | 4GB/s | Terrible |
| 32 | ~256M | 16GB/s | System stall |
The quadratic scaling (O(N²) traffic for N processors) makes basic TAS unusable for any system with significant lock contention. We need to reduce the rate of atomic operations.
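One way to observe this directly is a small contention microbenchmark: spawn several threads that repeatedly acquire the basic TAS lock, bump a shared counter, and release. The sketch below is a minimal, hypothetical harness (the thread count, iteration count, and build command are arbitrary assumptions), not a rigorous benchmark; absolute numbers will vary with hardware and compiler.

```c
// Minimal contention microbenchmark for the basic TAS spinlock (illustrative sketch).
// Assumed build command: gcc -O2 -pthread tas_bench.c -o tas_bench
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define NTHREADS 8          // number of contending threads (arbitrary)
#define ITERS    100000     // lock acquisitions per thread (arbitrary)

static atomic_int lock_word = 0;   // 0 = unlocked, 1 = locked
static long shared_counter = 0;    // data protected by the lock

static void tas_lock(void)   { while (atomic_exchange(&lock_word, 1) == 1) { /* spin */ } }
static void tas_unlock(void) { atomic_store(&lock_word, 0); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        tas_lock();
        shared_counter++;          // trivially short critical section
        tas_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d threads, %ld total acquisitions in %.3f s (%.0f locks/s)\n",
           NTHREADS, (long)NTHREADS * ITERS, secs,
           (double)NTHREADS * ITERS / secs);
    return 0;
}
```

Rerunning the same harness with the TTAS and backoff variants introduced below is a quick way to see the throughput differences summarized in the tables on this page.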
The key insight for improving TAS spinlock performance is recognizing that we don't need to execute atomic operations while spinning. We only need atomicity at the moment we attempt to acquire. This leads to the Test-and-Test-and-Set (TTAS) optimization—also called "spin on read" or "local spinning".
The TTAS Algorithm:
```c
// Test-and-Test-and-Set Spinlock
// Dramatically reduces cache coherence traffic

void spin_lock_ttas(spinlock_t *lock) {
    while (1) {
        // Phase 1: Spin locally, just reading
        // This read can be satisfied from local cache!
        while (lock->locked == 1) {
            // Local spinning - no bus traffic
            // Other CPUs' caches are unaffected
            cpu_relax();   // Hint for spin-wait loops
        }

        // Phase 2: Lock appears free, attempt acquisition
        // Now we need atomicity
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success! We acquired the lock
            return;
        }
        // Failed - someone else got it first
        // Back to spinning
    }
}

// cpu_relax() implementation varies by architecture
#if defined(__x86_64__) || defined(__i386__)
static inline void cpu_relax(void) {
    asm volatile("pause" ::: "memory");
}
#elif defined(__aarch64__)
static inline void cpu_relax(void) {
    asm volatile("yield" ::: "memory");
}
#elif defined(__powerpc__)
static inline void cpu_relax(void) {
    asm volatile("or 1,1,1" ::: "memory");   // Low-power hint
}
#endif
```

Why TTAS Dramatically Reduces Traffic:
During local spinning, each processor reads from its own L1 cache. After the initial miss (or invalidation when the lock was acquired), the lock variable sits in each spinning processor's cache in Shared state. Reading a Shared cache line generates zero bus traffic.
Traffic only occurs when the lock changes state: a release invalidates every spinner's cached copy, each spinner then re-fetches the line, and the subsequent atomic acquisition attempts briefly generate coherence transactions.
When the lock is released, ALL spinning processors simultaneously observe the change and attempt atomic acquisition. This creates a brief but intense burst of coherence traffic—the 'thundering herd.' Only one succeeds; the rest return to spinning. We'll address this with backoff strategies.
Comparative Performance:
| Metric | Basic TAS | TTAS | Improvement |
|---|---|---|---|
| Bus transactions while spinning | Continuous | Zero | ∞ |
| Cache invalidations | O(N) per CPU | O(1) per release | N× |
| Power consumption | Very high | Moderate | ~50% |
| Lock throughput at N=8 | ~10K/s | ~500K/s | 50× |
| Lock fairness | Poor | Poor | Same |
Implementation Note: The cpu_relax() Hint:
The cpu_relax() function (or the PAUSE instruction on x86) serves multiple purposes: it reduces power consumption during the spin, it frees pipeline and execution resources for a sibling hyperthread sharing the core, and it avoids the costly memory-order mis-speculation pipeline flush that x86 processors otherwise incur when the spin loop finally exits.
Omitting cpu_relax() does not make the lock incorrect, but it forfeits these benefits: on some microarchitectures the exit-from-spin pipeline flush is expensive, and a tight spin loop without the hint always wastes power and shared execution resources.
TTAS eliminates traffic during spinning but still suffers from the thundering herd when the lock is released. Exponential backoff addresses this by spreading out acquisition attempts over time.
The Backoff Strategy:
After a failed acquisition attempt, instead of immediately retrying, wait for a random period. If we fail again, double the maximum wait time. This exponentially increasing delay spreads acquisition attempts out in time, breaks up the thundering herd on each release, and automatically adapts to the level of contention: the more often a processor loses the race, the longer it waits before trying again.
```c
// TTAS with Exponential Backoff

#define MIN_DELAY 4            // Minimum backoff cycles
#define MAX_DELAY 4096         // Maximum backoff cycles
#define DELAY_MULTIPLIER 2

void spin_lock_backoff(spinlock_t *lock) {
    int delay = MIN_DELAY;

    while (1) {
        // Phase 1: Local spinning with backoff
        while (lock->locked == 1) {
            // Spin locally with delay
            for (int i = 0; i < delay; i++) {
                cpu_relax();
            }
        }

        // Phase 2: Attempt acquisition
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success!
            return;
        }

        // Failed: exponentially increase backoff
        // Add randomness to prevent synchronization
        delay = delay * DELAY_MULTIPLIER;
        if (delay > MAX_DELAY) {
            delay = MAX_DELAY;
        }

        // Random backoff within [MIN_DELAY, delay]
        int actual_delay = MIN_DELAY + (random() % (delay - MIN_DELAY + 1));
        for (int i = 0; i < actual_delay; i++) {
            cpu_relax();
        }
    }
}

// Alternative: Proportional backoff based on contention level
void spin_lock_proportional_backoff(spinlock_t *lock) {
    while (1) {
        int spin_count = 0;
        while (lock->locked == 1) {
            cpu_relax();
            spin_count++;
        }

        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;   // Success
        }

        // Backoff proportional to observed contention
        int backoff = spin_count < 1000 ? spin_count / 10 : 100;
        for (int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}
```

Tuning Backoff Parameters:
Backoff effectiveness depends heavily on system characteristics and workload. The optimal parameters vary with the number of cores and their topology (single socket vs. NUMA), the latency of a cache-line transfer, the typical critical-section length, and the expected level of contention:
Production systems often use adaptive backoff that monitors acquisition success rate and adjusts parameters dynamically. Linux's qspinlock and Windows' critical sections both employ adaptive strategies that tune themselves to workload characteristics.
| System Type | MIN_DELAY | MAX_DELAY | Notes |
|---|---|---|---|
| Desktop (4-8 cores) | 2-8 | 256-1K | Light contention typical |
| Server (16-64 cores) | 8-32 | 2K-16K | Contention common |
| NUMA (multi-socket) | 32-128 | 16K-64K | High remote latency |
| Critical section < 100ns | 2-8 | 128-512 | Short hold → quick retry |
| Critical section > 1μs | 32-128 | 4K-64K | Longer hold → more patience |
Backoff Tradeoffs:
Backoff is not free. Excessive backoff leaves the lock sitting idle while every waiter is still in its delay loop, inflating handoff latency and tail latency, and it can worsen unfairness: a newly arrived thread with a small delay can overtake threads that have been waiting far longer. Insufficient backoff fails to disperse the thundering herd, so the burst of coherence traffic and failed atomic attempts on every release remains.
The optimal balance is workload-dependent and often requires empirical tuning.
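As an illustration of the adaptive idea, the sketch below caps the backoff with a per-lock value that grows when acquisitions fail and shrinks when they succeed. This is a simplified, hypothetical scheme (the constants, the halving/doubling rule, and the names are assumptions), not the algorithm used by Linux's qspinlock or Windows critical sections.

```c
// Adaptive backoff sketch: the lock remembers a cap that tracks recent contention.
// Illustrative only; constants and adaptation rule are arbitrary.
#include <stdatomic.h>
#include <stdlib.h>     // rand()

#define ADAPT_MIN 4
#define ADAPT_MAX 4096

typedef struct {
    atomic_int locked;        // 0 = unlocked, 1 = locked
    atomic_int backoff_cap;   // adaptive upper bound on the backoff delay
} adaptive_spinlock_t;

#define ADAPTIVE_SPINLOCK_INIT { .locked = 0, .backoff_cap = ADAPT_MIN }

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void backoff_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void spin_lock_adaptive(adaptive_spinlock_t *lock) {
    int cap = atomic_load_explicit(&lock->backoff_cap, memory_order_relaxed);

    while (1) {
        // Local spinning: read-only, served from the local cache
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
            backoff_relax();

        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0) {
            // Acquired on this attempt: contention looks low, shrink the shared cap
            if (cap > ADAPT_MIN)
                atomic_store_explicit(&lock->backoff_cap, cap / 2, memory_order_relaxed);
            return;
        }

        // Lost the race: grow the cap and wait a random time below it
        if (cap < ADAPT_MAX)
            cap *= 2;
        atomic_store_explicit(&lock->backoff_cap, cap, memory_order_relaxed);

        int delay = ADAPT_MIN + rand() % cap;
        for (int i = 0; i < delay; i++)
            backoff_relax();
    }
}

void spin_unlock_adaptive(adaptive_spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

Whether the cap should live in the lock (shared by all waiters) or per thread is itself a design choice; production implementations avoid the question entirely by moving to queue-based locks.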
For maximum performance and control, critical spinlock code is often written directly in assembly. Let's examine real implementations for major architectures.
x86-64 Implementation:
```asm
; x86-64 TTAS Spinlock Implementation
; Input: RDI = address of lock variable
; Lock variable: 0 = unlocked, 1 = locked

.global spin_lock
.global spin_unlock

spin_lock:
    mov     eax, 1                  ; Value to exchange

.Lttas_loop:
    ; Phase 1: Local spinning (test)
.Lspin:
    pause                           ; Hint: spin-wait loop
    cmp     dword ptr [rdi], 0      ; Is lock free?
    jne     .Lspin                  ; No: keep spinning locally

    ; Phase 2: Attempt acquisition (test-and-set)
    xchg    eax, [rdi]              ; Atomic: EAX <-> [RDI]
    test    eax, eax                ; Was old value 0?
    jnz     .Lttas_loop             ; No: failed, retry

    ; Success: lock acquired
    ; XCHG provides implicit memory barrier
    ret

spin_unlock:
    ; Simple store with release semantics
    ; On x86, regular stores have release semantics
    mov     dword ptr [rdi], 0
    ret

; Trylock variant: returns 1 on success, 0 on failure
.global spin_trylock
spin_trylock:
    mov     eax, 1
    xchg    eax, [rdi]              ; Attempt to acquire
    xor     eax, 1                  ; Invert: 0->1 (success), 1->0 (fail)
    ret
```

ARM64 (AArch64) Implementation:
```asm
// ARM64 Spinlock using Load-Exclusive/Store-Exclusive
// and ARMv8.1 atomic instructions

.global spin_lock
.global spin_unlock

// Using LDAXR/STXR (Load-Acquire Exclusive, Store-Exclusive)
spin_lock:
    mov     w2, #1              // Value to store
1:
    // Acquire phase: try to take the lock with an exclusive pair
    ldaxr   w1, [x0]            // Load-Acquire Exclusive
    cbnz    w1, 2f              // If locked, go to local spinning
    stxr    w1, w2, [x0]        // Try to store 1
    cbnz    w1, 1b              // If store failed, retry
    ret                         // Success: lock acquired

2:
    // Local spinning without exclusive access
    wfe                         // Wait For Event (power saving)
    ldr     w1, [x0]            // Regular load
    cbnz    w1, 2b              // Still locked? Keep waiting
    b       1b                  // Lock appears free, try to acquire

spin_unlock:
    stlr    wzr, [x0]           // Store-Release zero
    sev                         // Send Event (wake spinning CPUs)
    ret

// ARMv8.1 with Large System Extensions (LSE)
// Uses direct atomic swap instruction
.global spin_lock_lse
spin_lock_lse:
    mov     w1, #1
1:
    ldr     w2, [x0]            // Local spinning
    cbnz    w2, 1b
    swpa    w1, w2, [x0]        // Atomic swap with acquire
    cbnz    w2, 1b              // If old value != 0, retry
    ret

spin_unlock_lse:
    stlr    wzr, [x0]           // Store-Release zero
    ret
```

Key Differences Between Architectures:
| Aspect | x86-64 | ARM64 (LL/SC) | ARM64 (LSE) |
|---|---|---|---|
| Atomic mechanism | XCHG (implicit lock) | LDXR/STXR pair | SWP instruction |
| Failure mode | Never fails | SC can fail spuriously | Never fails |
| Memory barriers | Implicit in XCHG | LDAXR/STLR explicit | Acquire/Release in swap |
| Spin hint | PAUSE | WFE (wait for event) | WFE |
| Wake mechanism | N/A | SEV (send event) | SEV |
| Code complexity | Low | Medium | Low |
ARM's Wait-For-Event (WFE) and Send-Event (SEV) instructions provide power-efficient spinning. WFE puts the core into a low-power state until an event arrives. SEV broadcasts an event to all cores. This is more efficient than spinning on PAUSE but requires unlock code to execute SEV.
Writing assembly for every architecture is impractical for most projects. Modern compilers provide atomic built-ins and intrinsics that generate optimal code for each target platform while maintaining source portability.
GCC/Clang Atomic Built-ins:
```c
// Portable spinlock using GCC/Clang atomic built-ins
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

#define SPINLOCK_INITIALIZER { .locked = 0 }

static inline void spin_lock(spinlock_t *lock) {
    while (1) {
        // Phase 1: Local spinning
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();
#elif defined(__aarch64__)
            asm volatile("yield" ::: "memory");
#endif
        }

        // Phase 2: Attempt acquisition with acquire semantics
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(
                &lock->locked, &expected, 1,
                memory_order_acquire,    // Success: acquire
                memory_order_relaxed     // Failure: relaxed
            )) {
            return;   // Lock acquired
        }
    }
}

static inline void spin_unlock(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

static inline bool spin_trylock(spinlock_t *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1,
        memory_order_acquire,
        memory_order_relaxed
    );
}
```

C++11 std::atomic Implementation:
```cpp
// C++11 Portable Spinlock
#include <atomic>
#if defined(_MSC_VER)
#include <intrin.h>   // _mm_pause
#endif

class Spinlock {
private:
    std::atomic<bool> locked{false};

public:
    void lock() noexcept {
        while (true) {
            // Optimistic read with relaxed ordering
            while (locked.load(std::memory_order_relaxed)) {
                // Platform-specific spin-wait hint
#if defined(_MSC_VER)
                _mm_pause();
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
                __builtin_ia32_pause();
#endif
            }

            // Attempt acquisition with acquire semantics
            if (!locked.exchange(true, std::memory_order_acquire)) {
                return;   // Success
            }
        }
    }

    bool try_lock() noexcept {
        // Don't spin - single attempt
        return !locked.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept {
        locked.store(false, std::memory_order_release);
    }
};

// Usage example with RAII guard
class SpinlockGuard {
private:
    Spinlock& lock_;
public:
    explicit SpinlockGuard(Spinlock& lock) : lock_(lock) {
        lock_.lock();
    }
    ~SpinlockGuard() {
        lock_.unlock();
    }
    // Non-copyable, non-movable
    SpinlockGuard(const SpinlockGuard&) = delete;
    SpinlockGuard& operator=(const SpinlockGuard&) = delete;
};
```

For lock acquisition, use memory_order_acquire; for lock release, use memory_order_release. This ensures that data protected by the lock is properly synchronized between the releasing and the acquiring thread. Using memory_order_seq_cst is correct but may be unnecessarily expensive, especially on ARM and POWER, while memory_order_relaxed provides no synchronization and is dangerous here.
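To see why acquire/release ordering matters, here is a minimal sketch in C that assumes the spinlock_t, spin_lock(), and spin_unlock() definitions from the GCC/Clang built-ins example above: the release store in spin_unlock() publishes the writer's update, and the acquire in a later spin_lock() guarantees the reader observes it.

```c
// Why acquire/release matters: a writer publishes data under the lock,
// a reader that later acquires the same lock is guaranteed to see it.
// Sketch assuming spinlock_t, SPINLOCK_INITIALIZER, spin_lock(), spin_unlock() from above.
#include <pthread.h>
#include <stdio.h>

static spinlock_t lock = SPINLOCK_INITIALIZER;
static int shared_value;          // ordinary (non-atomic) data protected by the lock

static void *writer(void *arg) {
    (void)arg;
    spin_lock(&lock);
    shared_value = 42;            // happens-before the release store in spin_unlock()
    spin_unlock(&lock);           // memory_order_release: publishes the write
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    spin_lock(&lock);             // memory_order_acquire: synchronizes with the release
    printf("observed %d\n", shared_value);   // 42 if the writer ran first, 0 otherwise
    spin_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, reader, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

With relaxed ordering on the lock word, the lock itself would still be mutually exclusive, but the reader could acquire the lock and yet miss the writer's update to shared_value.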
Windows Intrinsics:
```c
// Windows-specific spinlock using interlocked intrinsics
#include <Windows.h>
#include <intrin.h>

typedef volatile LONG spinlock_t;

#define SPINLOCK_INIT 0

static inline void spin_lock(spinlock_t *lock) {
    while (1) {
        // Local spinning
        while (*lock != 0) {
            _mm_pause();   // Intel pause intrinsic
        }

        // Attempt acquisition
        // InterlockedExchange returns the old value
        if (InterlockedExchange(lock, 1) == 0) {
            return;
        }
    }
}

static inline void spin_unlock(spinlock_t *lock) {
    // MemoryBarrier ensures all prior writes are visible
    MemoryBarrier();
    *lock = 0;
    // Or use InterlockedExchange for guaranteed atomicity:
    // InterlockedExchange(lock, 0);
}

static inline BOOL spin_trylock(spinlock_t *lock) {
    return InterlockedExchange(lock, 1) == 0;
}

// Alternative using InterlockedCompareExchange
static inline void spin_lock_cmpxchg(spinlock_t *lock) {
    while (InterlockedCompareExchange(lock, 1, 0) != 0) {
        while (*lock != 0) {
            _mm_pause();
        }
    }
}
```

Deploying spinlocks in production systems requires careful consideration of several factors that don't appear in textbook implementations.
When to Use Spinlocks:
Spinlocks are appropriate when critical sections are very short (ideally shorter than the cost of a context switch), when the lock holder cannot be preempted (for example, kernel code with preemption disabled or threads pinned to dedicated cores), and when waiters have nothing better to do with the CPU. If hold times are long, contention is heavy, or the holder can be descheduled, a blocking or hybrid lock is usually the better choice.
If a low-priority thread holds a spinlock and gets preempted, high-priority threads spinning for the lock can't make progress. On preemptible systems, spinlocks must disable preemption while held. POSIX spinlocks (pthread_spin_*) don't do this automatically—use with caution!
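For preemptible user-level code, one common mitigation is to bound the spin and then yield the CPU so a preempted lock holder gets a chance to run. The sketch below is a simplified illustration of that idea (SPIN_LIMIT and the names are arbitrary assumptions), not a substitute for a proper adaptive mutex.

```c
// Bounded spin, then yield: avoids burning a full timeslice when the lock
// holder has been preempted. Illustrative sketch; SPIN_LIMIT is arbitrary.
#include <stdatomic.h>
#include <sched.h>      // sched_yield (POSIX)

#define SPIN_LIMIT 1000

typedef struct { atomic_int locked; } yspinlock_t;

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void yield_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void spin_lock_yield(yspinlock_t *lock) {
    int spins = 0;
    while (1) {
        // Local spinning, bounded
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            if (++spins >= SPIN_LIMIT) {
                sched_yield();   // give the (possibly preempted) holder a chance to run
                spins = 0;
            } else {
                yield_relax();
            }
        }
        if (atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0)
            return;   // acquired
    }
}

void spin_unlock_yield(yspinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

Note that sched_yield() does not solve real-time priority inversion: a SCHED_FIFO spinner that yields still won't let a lower-priority holder run, which is one reason kernels disable preemption around spinlocks instead.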
Debugging and Instrumentation:
Production spinlocks often include debugging features:
```c
// Production spinlock with debugging support
typedef struct {
    atomic_int locked;
#ifdef DEBUG_SPINLOCK
    const char *file;        // File where lock was acquired
    int line;                // Line where lock was acquired
    pid_t owner;             // Thread/process holding lock
    uint64_t acquire_time;   // Timestamp of acquisition
    uint64_t spin_count;     // Total spins for contention analysis
#endif
} spinlock_t;

#ifdef DEBUG_SPINLOCK
void spin_lock_debug(spinlock_t *lock, const char *file, int line) {
    uint64_t spins = 0;

    while (1) {
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            cpu_relax();
            spins++;

            // Detect potential deadlock or excessive contention
            if (spins > 100000000) {
                fprintf(stderr, "SPINLOCK WARNING: Excessive spinning!\n");
                fprintf(stderr, "  Waiting at: %s:%d\n", file, line);
                fprintf(stderr, "  Held by: %s:%d (thread %d)\n",
                        lock->file, lock->line, lock->owner);
                spins = 0;   // Reset to continue warning
            }
        }

        if (atomic_exchange_explicit(&lock->locked, 1,
                                     memory_order_acquire) == 0) {
            lock->file = file;
            lock->line = line;
            lock->owner = gettid();
            lock->acquire_time = get_timestamp();
            lock->spin_count += spins;
            return;
        }
    }
}

#define spin_lock(lock) spin_lock_debug(lock, __FILE__, __LINE__)

#endif // DEBUG_SPINLOCK
```

Real-World Examples:
Production kernels and libraries use sophisticated spinlock implementations:
Linux Kernel (qspinlock): Since 4.2, Linux uses queued spinlocks that provide fairness while maintaining performance, combining ticket-lock fairness with MCS-style queueing (a minimal ticket-lock sketch illustrating the fairness idea follows this list).
FreeBSD: Uses adaptive mutexes that spin briefly then block if contention persists.
Windows (Critical Sections): Hybrid locks that spin for a configurable count before blocking on a kernel object.
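To make the fairness idea from the qspinlock item concrete, here is a minimal ticket-lock sketch. It is not the Linux implementation; it only illustrates how handing out tickets grants the lock in strict FIFO order, unlike TAS/TTAS where any spinner can win the race.

```c
// Minimal ticket lock: each acquirer takes a ticket and waits until
// "now serving" reaches it, giving strict FIFO ordering among waiters.
// Illustrative sketch only; the real qspinlock adds MCS queueing and more.
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;    // ticket dispenser
    atomic_uint now_serving;    // ticket currently allowed to hold the lock
} ticketlock_t;

#define TICKETLOCK_INIT { .next_ticket = 0, .now_serving = 0 }

// Spin-wait hint (x86 PAUSE where available, plain compiler barrier otherwise)
static inline void ticket_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

void ticket_lock(ticketlock_t *lock) {
    // Atomically take the next ticket (fetch-and-add instead of test-and-set)
    unsigned my_ticket =
        atomic_fetch_add_explicit(&lock->next_ticket, 1, memory_order_relaxed);

    // Spin until it is our turn; the load is read-only, like TTAS local spinning
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire) != my_ticket)
        ticket_relax();
}

void ticket_unlock(ticketlock_t *lock) {
    // Pass the lock to the next ticket holder in FIFO order
    unsigned next =
        atomic_load_explicit(&lock->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&lock->now_serving, next, memory_order_release);
}
```

Fairness has a cost: on release only the thread holding the matching ticket may proceed, so a preempted waiter stalls everyone behind it — one reason production kernels layer queueing and other machinery on top of this basic idea.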
We have explored the full spectrum of Test-and-Set lock implementations, from the naive but correct basic spinlock through production-ready variants with debugging support. The key insights: basic TAS is correct but drowns the interconnect in coherence traffic under contention; TTAS fixes this by spinning on local cache reads and going atomic only when the lock looks free; exponential or proportional backoff tames the thundering herd at release time; acquire/release memory ordering is what makes the data protected by the lock safe to access; and spinlocks belong only where critical sections are short and holders are not preempted.
What's Next:
We've covered standalone TSL-based lock implementations. The next page will explore how to build complete lock abstractions using Test-and-Set—examining mutex implementations that combine spinning with blocking, and understanding the relationship between simple spinlocks and full-featured synchronization primitives.
You now have practical knowledge to implement, optimize, and deploy Test-and-Set based locks. From understanding why basic TAS fails under contention to writing platform-specific assembly, you can reason about spinlock performance and make informed implementation choices.