The previous page established that busy waiting is the deliberate consumption of CPU cycles while waiting for a condition to become true. Now we face the practical question: How do we implement mutual exclusion using busy waiting?
A naive attempt might look promising but fails catastrophically under concurrent execution. The subtle bugs in spinlock implementation have consumed countless engineering hours and led to security vulnerabilities, data corruption, and system crashes. Understanding implementation deeply isn't academic—it's essential for writing correct concurrent code.
This page takes you from broken implementations to production-quality spinlocks, explaining why each refinement is necessary.
By the end of this page, you will understand: why naive lock implementations fail; how atomic test-and-set operations solve the fundamental problem; the test-and-test-and-set optimization and why it matters; memory ordering requirements for correct spinlocks; and the anatomy of production spinlock implementations from Linux and other systems.
Let's start with the most intuitive approach. A lock is just a boolean flag, right? If the flag is false, the lock is free; if true, someone holds it.
The broken implementation:
```c
// WARNING: This implementation is BROKEN - do not use!

typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} broken_spinlock_t;

void broken_lock(broken_spinlock_t *lock) {
    // Step 1: Wait until lock is free
    while (lock->locked == 1) {
        // Busy wait
    }
    // Step 2: Claim the lock
    lock->locked = 1;
    // BUG: Steps 1 and 2 are NOT atomic!
    // Another thread can observe locked==0 between
    // our check and our write.
}

void broken_unlock(broken_spinlock_t *lock) {
    lock->locked = 0;
}
```

Why does this fail?
The fundamental problem is the check-then-act race condition. Between the moment we observe locked == 0 and the moment we set locked = 1, another thread might do exactly the same thing. Consider this interleaving:
| Time | Thread A | Thread B | lock->locked |
|---|---|---|---|
| T1 | Read locked → sees 0 | | 0 |
| T2 | (about to write 1) | Read locked → sees 0 | 0 |
| T3 | Write locked = 1 | (about to write 1) | 1 |
| T4 | Enter critical section | Write locked = 1 | 1 |
| T5 | (in critical section) | Enter critical section | 1 |
| T6 | DISASTER: Both threads in critical section! | | 1 |
Both threads observed the lock as free, both claimed it, and both entered the critical section. Mutual exclusion is violated. Shared data structures are corrupted. The system exhibits seemingly random bugs that appear only under load.
The insight: We need an operation that atomically tests the current value AND sets a new value—as a single, indivisible action that no other CPU can interrupt or interleave with.
This is a classic Time-Of-Check to Time-Of-Use (TOCTOU) vulnerability. We CHECK the lock state, but by the time we USE that information (by setting the lock), the state has changed. The only solution is to make check and use happen atomically.
Test-and-Set (TAS) is a hardware-provided atomic operation that solves the check-then-act problem. It atomically:

- reads the current value at a memory location,
- writes 1 (the "locked" value) to that location, and
- returns the value it read.
All three steps happen as one indivisible operation. No other CPU can access that memory location during the TAS execution. Hardware ensures this through cache locking, bus locking, or memory controller arbitration.
```c
// Conceptual semantics of test-and-set (hardware provides this atomically)
// This is what TAS does, but as a single atomic operation
int atomic_test_and_set(volatile int *location) {
    // BEGIN ATOMIC
    int old_value = *location;   // Read current value
    *location = 1;               // Set to 1 (or any "locked" value)
    return old_value;            // Return what we read
    // END ATOMIC
    // Hardware guarantees no other core can access *location
    // between the read and write
}

// Hardware implementations:
// x86/x64:  XCHG instruction (exchange) or LOCK CMPXCHG
// ARM:      LDREX/STREX pair (load-linked/store-conditional)
// RISC-V:   AMOSWAP (atomic memory operation swap)
// PowerPC:  LWARX/STWCX (load-word-and-reserve/store-conditional)
```

With TAS, we can build a correct spinlock:
```c
// Correct spinlock using test-and-set

typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} spinlock_t;

// GCC intrinsic for atomic exchange
// __sync_lock_test_and_set atomically: reads *ptr, writes val, returns old value
#define atomic_exchange(ptr, val) __sync_lock_test_and_set(ptr, val)

void spin_lock(spinlock_t *lock) {
    // Try to acquire: atomically set locked=1 and check old value
    while (atomic_exchange(&lock->locked, 1) == 1) {
        // If old value was 1, lock was already held
        // Keep spinning - try again
        // Optional: Add pause for efficiency
        __asm__ volatile("pause");
    }
    // If old value was 0, we changed it from 0→1
    // We now hold the lock!
}

void spin_unlock(spinlock_t *lock) {
    // Release the lock by setting to 0
    // Memory barrier ensures all critical section writes
    // are visible before the unlock
    __sync_lock_release(&lock->locked);
}
```

Why does this work?
The key insight is that atomic_exchange returns the old value. If the old value was 0, we successfully acquired the lock (we were the one who changed it from 0 to 1). If the old value was 1, someone else already held it, and our exchange from 1 to 1 had no effect—we just need to try again.
Only one thread can ever get the 0.
If two threads call atomic_exchange simultaneously, hardware arbitrates. One thread's exchange completes first (reading 0, writing 1). The other thread's exchange then reads 1 (the value just written by the first thread). The serialization is guaranteed by hardware.
Atomic operations provide a fundamental contract: multiple CPUs attempting the same atomic operation on the same memory location are serialized. One completes entirely before another begins. This is the foundation of all lock-free and lock-based synchronization.
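To make that contract concrete, here is a minimal, hypothetical usage sketch: two threads increment a shared counter under the spin_lock/spin_unlock functions defined above (assumed to be compiled in the same file). The thread and iteration counts are arbitrary; without the lock, the final total would be unpredictable.

```c
// Hypothetical usage sketch of the TAS spinlock defined above.
// Assumes spinlock_t, spin_lock, and spin_unlock from the previous listing.
#include <pthread.h>
#include <stdio.h>

static spinlock_t counter_lock = { 0 };   // 0 = free
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock(&counter_lock);
        counter++;                       // Critical section: one thread at a time
        spin_unlock(&counter_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```

With the lock in place, every increment happens entirely inside the critical section, so the two threads' updates can never interleave and the final count is exact.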
The simple TAS spinlock works correctly but performs poorly under contention. The problem is cache coherence traffic.
When multiple CPUs spin using TAS on the same lock:

- Every TAS is an atomic read-modify-write, so the issuing CPU must pull the lock's cache line into exclusive state, invalidating every other CPU's copy.
- Each other spinner then misses on its next TAS and does the same thing, so the cache line ping-pongs continuously between cores.
- Even the lock holder has to fight for its own lock's cache line when it finally tries to release it.
On a 16-core system with 15 threads spinning on TAS, you get continuous cache line ping-pong. Performance collapses; even unrelated threads on the same cores suffer cache evictions.
```text
# Cache coherence traffic with pure TAS spinlock

Timeline when lock is held by CPU 0:

CPU1: TAS (get exclusive, invalidate all) → fails, retry
CPU2: TAS (get exclusive, invalidate all) → fails, retry
CPU3: TAS (get exclusive, invalidate all) → fails, retry
CPU1: TAS (get exclusive, invalidate all) → fails, retry
...

Each TAS triggers:
- Invalidation messages to all other CPUs
- Cache miss on next TAS by any other CPU
- Memory controller arbitration

With N spinning CPUs: O(N²) coherence messages per lock hold period!
```

Test-and-Test-and-Set (TTAS) reduces coherence traffic by spinning on a local cache read first, and only attempting the expensive atomic operation when the lock appears free:
```c
// Test-and-Test-and-Set spinlock

typedef struct {
    volatile int locked;
} spinlock_t;

void spin_lock_ttas(spinlock_t *lock) {
    while (1) {
        // OUTER LOOP: Test - spin on cached read (cheap)
        while (lock->locked == 1) {
            // This is a normal memory read - hits L1 cache
            // No cache invalidations while lock is unchanged
            // CPU does minimal work, bus is quiet
            __asm__ volatile("pause");
        }

        // Lock appears free! Try to acquire with TAS
        // INNER: Test-and-set (expensive but only when likely to succeed)
        if (atomic_exchange(&lock->locked, 1) == 0) {
            // Success! We got the lock
            return;
        }
        // Failed - another CPU beat us
        // Go back to spinning on read
    }
}

void spin_unlock(spinlock_t *lock) {
    __sync_lock_release(&lock->locked);
}
```
While the lock is held: every waiter spins in the outer read loop on a shared copy of the lock's cache line in its local cache, so no invalidations are generated and the interconnect stays quiet.

When the lock is released: the releasing store invalidates all cached copies; every waiter misses, sees the lock free, and attempts the atomic exchange. Exactly one wins; the rest observe a 1 and return to read-spinning.
The improvement:
| Metric | Pure TAS | TTAS | Improvement |
|---|---|---|---|
| Cache traffic while locked | O(N²) continuous | O(1) (shared read) | Massively reduced |
| Bus bandwidth consumed | Maximum | Minimal while held | Near-zero during spin |
| Impact on other workloads | Severe (bus saturation) | Minimal | System-wide benefit |
| Power consumption | Maximum (cache invalidations) | Lower (local reads) | Notable power savings |
| Fairness | None (random) | None (random) | Same - needs further work |
TTAS has a problem at release time: ALL waiting threads see the lock become free simultaneously and ALL attempt TAS. This 'thundering herd' causes a burst of coherence traffic. Exponential backoff and queue-based locks (like ticket locks and MCS locks) address this issue.
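One common mitigation is TTAS with bounded exponential backoff: after each failed acquisition attempt, a thread waits an increasing number of pause iterations before retrying, which spreads the herd out in time. The sketch below reuses the spinlock_t layout and atomic_exchange macro from the earlier listings; the backoff constants are illustrative, not tuned values.

```c
// Sketch: TTAS spinlock with bounded exponential backoff (illustrative constants).
// spinlock_t and atomic_exchange match the earlier TAS/TTAS listings.
typedef struct {
    volatile int locked;
} spinlock_t;

#define atomic_exchange(ptr, val) __sync_lock_test_and_set(ptr, val)

#define BACKOFF_MIN 4
#define BACKOFF_MAX 1024

static inline void cpu_pause(void) {
    __asm__ volatile("pause");
}

void spin_lock_backoff(spinlock_t *lock) {
    unsigned int backoff = BACKOFF_MIN;

    while (1) {
        // Test: spin on a plain cached read while the lock is held
        while (lock->locked == 1) {
            cpu_pause();
        }

        // Lock looks free - attempt the atomic exchange
        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;   // Acquired
        }

        // Lost the race - back off before rejoining the herd
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_pause();
        }
        if (backoff < BACKOFF_MAX) {
            backoff *= 2;   // Exponential growth, capped at BACKOFF_MAX
        }
    }
}
```

Backoff trades a little latency for much less coherence traffic; queue-based locks such as ticket and MCS locks go further by also guaranteeing FIFO fairness.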
Compare-and-Swap (CAS) is a more flexible atomic operation that enables sophisticated lock algorithms. While TAS unconditionally writes a new value, CAS writes only if the current value matches an expected value.
CAS semantics:

```text
bool CAS(location, expected, desired):
    ATOMIC:
        if (*location == expected):
            *location = desired
            return true
        else:
            return false
```
This conditional write enables lock implementations that can't be built with TAS alone, particularly those with multiple states or owner tracking.
```c
// CAS-based spinlock implementation
#include <stdatomic.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock_cas(spinlock_t *lock) {
    int expected = 0;
    // Loop until we successfully change 0 → 1
    while (!atomic_compare_exchange_weak(&lock->locked, &expected, 1)) {
        // CAS failed - 'expected' now contains actual value
        // Reset expected for next attempt
        expected = 0;
        // Spin on read until it looks free (TTAS optimization)
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __asm__ volatile("pause");
        }
    }
    // Successfully changed 0 → 1, we hold the lock
}

void spin_unlock_cas(spinlock_t *lock) {
    atomic_store(&lock->locked, 0);
}
```

CAS enables locks that track which thread holds them—useful for debugging, recursive locking, and priority inheritance:
```c
// Spinlock with owner tracking using CAS
#include <stdatomic.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    atomic_uintptr_t owner;   // 0 = free, or thread ID
} tracked_spinlock_t;

#define NO_OWNER 0

void tracked_lock(tracked_spinlock_t *lock) {
    uintptr_t me = (uintptr_t)pthread_self();
    uintptr_t expected = NO_OWNER;

    while (!atomic_compare_exchange_weak(&lock->owner, &expected, me)) {
        // Failed to acquire - reset expected and retry
        expected = NO_OWNER;
        // TTAS spin
        while (atomic_load_explicit(&lock->owner, memory_order_relaxed) != NO_OWNER) {
            __asm__ volatile("pause");
        }
    }
    // We set owner to our thread ID - lock acquired
}

void tracked_unlock(tracked_spinlock_t *lock) {
    uintptr_t me = (uintptr_t)pthread_self();
    uintptr_t expected = me;

    // Only release if we're the owner (sanity check)
    if (!atomic_compare_exchange_strong(&lock->owner, &expected, NO_OWNER)) {
        // BUG: We don't own this lock!
        abort();
    }
}

bool tracked_is_locked_by_me(tracked_spinlock_t *lock) {
    return atomic_load(&lock->owner) == (uintptr_t)pthread_self();
}
```
CAS advantages: the conditional write supports trylock, owner tracking, recursive locks, and multi-state designs, because a thread takes ownership only when the lock is in exactly the state it expects.

TAS advantages: simpler semantics and a single unconditional exchange (XCHG on x86), which is all a basic spinlock needs and can be marginally cheaper.
In practice: Modern implementations often use CAS because its flexibility outweighs the tiny performance difference, and it enables advanced features like trylock, recursive locks, and debugging support.
The 'weak' variant may spuriously fail, returning false even when the stored value equals the expected value; this is an artifact of LL/SC implementations on some architectures (ARM, RISC-V). Weak is appropriate in loops where we'll retry anyway. Strong only reports failure when the values genuinely differ, but may be slightly slower. Use weak in spin loops, strong for single-try operations.
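As an example of where the strong variant fits, here is a hypothetical spin_trylock sketch for the C11 spinlock shown above: it makes exactly one acquisition attempt and returns immediately, so a spurious failure would be unacceptable and compare_exchange_strong is the right choice. The spinlock_t layout matches the CAS-based listing.

```c
// Sketch: single-attempt trylock for the C11 atomic spinlock shown above.
// Uses the strong CAS because there is no retry loop to absorb spurious failures.
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;   // Same layout as the CAS-based spinlock above
} spinlock_t;

bool spin_trylock_cas(spinlock_t *lock) {
    int expected = 0;
    // One attempt: succeed only if the lock is currently 0 (free).
    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1,
        memory_order_acquire,     // Acquire ordering on success
        memory_order_relaxed);    // No ordering needed on failure
}
```

A caller can use this to grab the lock opportunistically and fall back to other work, or to blocking, when the attempt fails.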
Correct spinlock implementation requires more than just atomic operations—it requires proper memory ordering. Without barriers, the compiler and CPU can reorder operations in ways that violate mutual exclusion.
Consider what could go wrong without proper ordering:
```c
// Potential ordering violation (conceptual)

shared_data = 42;        // (1) Write shared data
spin_unlock(&lock);      // (2) Release lock

// Without proper barriers, CPU might reorder to:
spin_unlock(&lock);      // (2) Release lock FIRST
shared_data = 42;        // (1) Write shared data AFTER
// Another thread acquiring the lock could see stale data!

// Similarly for acquire:
spin_lock(&lock);        // (1) Acquire lock
use(shared_data);        // (2) Read shared data

// Without barriers, CPU might reorder to:
temp = shared_data;      // (2) Read BEFORE lock acquired
spin_lock(&lock);        // (1) Actually acquire lock
use(temp);               // Uses possibly stale value!
```

Locks must enforce:
Acquire semantics: No memory operations from AFTER the lock acquisition can be reordered to BEFORE the acquisition. This ensures we see all writes from the previous lock holder.
Release semantics: No memory operations from BEFORE the unlock can be reordered to AFTER the unlock. This ensures our writes are visible to the next lock acquirer.
```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock(spinlock_t *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &lock->locked, &expected, 1,
               memory_order_acquire,   // Acquire barrier on success
               memory_order_relaxed    // No barrier on failure
           )) {
        expected = 0;
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0) {
            __asm__ volatile("pause");
        }
    }
    // ACQUIRE BARRIER: all reads/writes after this point
    // cannot be reordered before the lock acquisition
}

void spin_unlock(spinlock_t *lock) {
    // RELEASE BARRIER: all reads/writes before this point
    // cannot be reordered after the store
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

x86/x64 provides strong ordering by default—most loads and stores are already ordered. ARM and RISC-V have weak memory models where explicit barriers are essential. Code that 'works' on x86 may be subtly broken on ARM. Always use proper atomic types and orderings; never assume the hardware will do what you expect.
Let's examine how spinlocks are implemented at the assembly level on x86/x64, the dominant architecture in servers and desktops.
On x86, atomic read-modify-write operations are achieved using the LOCK prefix. This prefix guarantees:

- The entire read-modify-write executes atomically; no other processor can access that memory location mid-operation.
- The CPU typically enforces this by holding the cache line exclusively for the duration, falling back to a much slower bus lock if the operand spans cache lines.
- On x86, a LOCK-prefixed instruction also acts as a full memory barrier.
Common LOCK instructions for spinlocks:

- XCHG - atomic exchange (the LOCK prefix is implicit)
- LOCK CMPXCHG - compare-and-swap
- LOCK BTS - bit test-and-set
- LOCK XADD - fetch-and-add (the basis of ticket locks)
```asm
; x86-64 spinlock implementation

section .data
lock_var: dd 0              ; 32-bit spinlock variable (0 = free, 1 = locked)

section .text

; void spin_lock(int *lock)
spin_lock:
    ; rdi contains pointer to lock variable
.try_lock:
    mov  eax, 1             ; Value to exchange
    xchg eax, [rdi]         ; Atomic exchange: eax <-> [rdi]
                            ; XCHG has implicit LOCK prefix
    test eax, eax           ; Was old value 0?
    jnz  .spin              ; If not 0, lock was held - spin
    ret                     ; Got the lock!

.spin:
    ; TTAS optimization: spin on regular read until lock looks free
    pause                   ; Hint to CPU: spin-wait loop
    mov  eax, [rdi]         ; Regular (non-atomic) read
    test eax, eax           ; Is it 0 (free)?
    jnz  .spin              ; Still locked - keep spinning
    jmp  .try_lock          ; Looks free - try atomic exchange again

; void spin_unlock(int *lock)
spin_unlock:
    ; rdi contains pointer to lock variable
    mov  dword [rdi], 0     ; Store 0 - release lock
                            ; x86 stores have release semantics
    ret
```

The CMPXCHG instruction provides CAS semantics:
```asm
; Spinlock using CMPXCHG

spin_lock_cmpxchg:
    ; rdi = pointer to lock
.try_acquire:
    xor  eax, eax           ; eax = 0 (expected value)
    mov  ecx, 1             ; ecx = 1 (desired value)
    lock cmpxchg [rdi], ecx ; if [rdi]==eax, then [rdi]=ecx
                            ; eax gets overwritten with actual [rdi]
    jz   .acquired          ; Zero flag set if exchange succeeded

    ; Failed - lock wasn't 0
.spin:
    pause
    cmp  dword [rdi], 0     ; TTAS: check without LOCK
    jne  .spin              ; Not free - keep spinning
    jmp  .try_acquire       ; Try again

.acquired:
    ret

; Alternative: Using BTS (Bit Test-and-Set) for single-bit lock
spin_lock_bts:
    ; rdi = pointer to byte containing lock bit
.try:
    lock bts dword [rdi], 0 ; Test bit 0, set it to 1
                            ; Carry flag = old value of bit 0
    jc   .spin              ; If carry set, bit was already 1
    ret                     ; Got the lock

.spin:
    pause
    test dword [rdi], 1     ; Check bit 0
    jnz  .spin              ; Still set - spin
    jmp  .try               ; Try again
```

XCHG has an implicit LOCK prefix—it's always atomic. It's simpler and often faster for basic spinlocks.
LOCK CMPXCHG is more flexible, allowing conditional updates. Essential for owner-tracking locks, lock-free algorithms, and complex state machines.
Performance notes:

- A LOCK-prefixed instruction typically costs tens of cycles even with no contention, far more than an ordinary load or store.
- Under contention, the dominant cost is moving the lock's cache line between cores, not the instruction itself.
- The unlock path is a plain store, since x86 stores already have release semantics.
- PAUSE in the spin loop saves power and avoids the pipeline flush that otherwise occurs when the spin loop exits.
x86 has a relatively strong memory model. Stores are not reordered with other stores; loads are not reordered with other loads. The main reordering allowed is store-load (a later load can 'pass' an earlier store to a different address). This makes x86 spinlocks simpler than on ARM/RISC-V, but modern C11/C++11 atomics abstract these differences.
Let's examine how real operating systems implement spinlocks, seeing how theory translates to production code.
The Linux kernel has evolved its spinlock implementation multiple times. The modern implementation uses queued spinlocks for fairness, but the basic concept is shown here:
```c
// Simplified Linux spinlock (actual kernel code is more complex)
// Based on arch/x86/include/asm/spinlock.h concepts

typedef struct {
    volatile unsigned int lock;
} arch_spinlock_t;

static inline void arch_spin_lock(arch_spinlock_t *lock) {
    // Disable preemption first - we cannot be scheduled out holding spinlock
    preempt_disable();

    // Try to acquire
    while (1) {
        // TTAS: spin on read first
        if (READ_ONCE(lock->lock) == 0) {
            // Lock appears free - try atomic exchange
            if (cmpxchg(&lock->lock, 0, 1) == 0) {
                break;   // Got it!
            }
        }
        // Still locked - keep spinning
        cpu_relax();     // PAUSE instruction wrapper
    }
}

static inline void arch_spin_unlock(arch_spinlock_t *lock) {
    // Release - simple store on x86 (has release semantics)
    WRITE_ONCE(lock->lock, 0);

    // Re-enable preemption
    preempt_enable();
}

// CPU relaxation hint - essential for efficiency
#define cpu_relax() __asm__ volatile("pause" ::: "memory")

// READ_ONCE / WRITE_ONCE prevent compiler optimization issues
#define READ_ONCE(x)       (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
```

preempt_disable(): Before taking a spinlock, the kernel disables preemption on the current CPU. This prevents the scheduler from switching to another thread while we hold the spinlock. If the holder were preempted, spinners on other CPUs would waste enormous amounts of time.
cpu_relax(): This wraps the PAUSE instruction (or equivalent on other architectures). It:

- hints to the processor that the code is a spin-wait loop, reducing power consumption;
- avoids the memory-order mis-speculation penalty (a pipeline flush) when the loop finally exits;
- on SMT/hyperthreaded cores, yields execution resources to the sibling hardware thread.
READ_ONCE / WRITE_ONCE: These prevent compiler optimizations that might:

- hoist the load out of the spin loop, so the loop never observes the unlock;
- tear one access into several smaller ones, or merge and eliminate accesses;
- keep the value cached in a register instead of re-reading memory each iteration.

They apply volatile semantics to a single access rather than to the whole variable, which is more targeted than declaring the field volatile and makes the intent explicit.
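To see why this matters, consider the hypothetical sketch below. With a plain non-volatile read, the compiler is allowed to load the flag once, hoist it out of the loop, and spin forever on a register; forcing a fresh volatile read each iteration, as READ_ONCE does, prevents that. (This addresses only compiler optimization; cross-CPU ordering still needs the atomics and barriers shown earlier.)

```c
// Sketch: why a plain read in a spin loop is unsafe without READ_ONCE/volatile.

int stop_flag;   // Written by another thread

void wait_plain(void) {
    // The compiler may legally transform this into:
    //   if (stop_flag == 0) for (;;) ;   // value hoisted, loop never sees the update
    while (stop_flag == 0) {
        /* spin */
    }
}

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))

void wait_read_once(void) {
    // Each iteration performs a fresh load from memory,
    // so the other thread's store is eventually observed.
    while (READ_ONCE(stop_flag) == 0) {
        /* spin */
    }
}
```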
```c
// Spinlock with interrupt handling - very common in kernel code

// spin_lock_irqsave: Acquires spinlock AND disables interrupts
// Returns previous interrupt state in 'flags'
static inline void spin_lock_irqsave(spinlock_t *lock, unsigned long *flags) {
    // Save current interrupt state and disable interrupts
    local_irq_save(*flags);

    // Disable preemption
    preempt_disable();

    // Acquire the spinlock
    arch_spin_lock(&lock->raw_lock);
}

// spin_unlock_irqrestore: Releases spinlock AND restores interrupt state
static inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) {
    // Release the spinlock
    arch_spin_unlock(&lock->raw_lock);

    // Re-enable preemption
    preempt_enable();

    // Restore previous interrupt state
    local_irq_restore(flags);
}

// Why disable interrupts?
// - If an interrupt handler tries to acquire the same spinlock
//   while we hold it, we deadlock (handler can't complete,
//   we can't continue)
// - spin_lock_irq* variants handle this common scenario
```

Linux provides multiple spinlock APIs: spin_lock() for process context where interrupts are not a concern, spin_lock_bh() which also disables softirqs, spin_lock_irq() which disables IRQs, and spin_lock_irqsave() which saves and restores IRQ state. Choosing the right variant is crucial for avoiding deadlocks.
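As an illustration of choosing a variant, here is a hypothetical kernel-style sketch (not real driver code, and only compilable against a kernel tree) of state shared between process context and a hardware interrupt handler. Note that in the real kernel, spin_lock_irqsave is a macro that takes flags by name rather than by pointer as in the simplified listing above; struct my_device_state and the function names are invented for illustration.

```c
// Hypothetical sketch of the classic pattern: data shared between
// process context and a hardware interrupt handler.
#include <linux/spinlock.h>
#include <linux/interrupt.h>

struct my_device_state {          // Hypothetical shared state
    int pending;
    int completed;
};

static DEFINE_SPINLOCK(dev_lock);
static struct my_device_state dev_state;

// Process context: the handler could fire on this CPU while we hold
// the lock, so interrupts must be disabled for the critical section.
static void update_from_process_context(int value) {
    unsigned long flags;

    spin_lock_irqsave(&dev_lock, flags);
    dev_state.pending = value;
    spin_unlock_irqrestore(&dev_lock, flags);
}

// Interrupt handler: runs with local interrupts disabled,
// so the plain variant is sufficient here.
static irqreturn_t my_irq_handler(int irq, void *dev_id) {
    spin_lock(&dev_lock);
    dev_state.completed++;
    spin_unlock(&dev_lock);
    return IRQ_HANDLED;
}
```

The asymmetry is the point: only the side that can be interrupted while holding the lock needs the irqsave variant.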
We've journeyed from a broken naive lock to production-quality spinlock implementations. Let's consolidate the key lessons:

- A plain check-then-act flag fails because the test and the write are separate steps that other threads can interleave with.
- Hardware atomic read-modify-write operations (TAS, CAS) make acquisition a single indivisible step; only one contender can ever observe the lock as free.
- Test-and-test-and-set spins on a cached read and attempts the atomic operation only when the lock looks free, cutting cache coherence traffic dramatically.
- Correct locks require acquire semantics on lock and release semantics on unlock, not just atomicity.
- Production spinlocks, such as Linux's, add preemption control, interrupt-aware variants, CPU relaxation hints, and fairness through queued designs.
What's next:
Now that we understand how spinlocks are implemented, the next page addresses a crucial question: When are spinlocks the right choice? We'll explore the trade-offs between spinlocks and blocking locks, the scenarios where spinning excels, and when it's actively harmful.
Understanding implementation gives us the vocabulary; understanding applicability gives us wisdom.
You now understand spinlock implementation from first principles: why naive locks fail, how atomic operations solve the problem, optimization with TTAS, memory ordering requirements, and how production systems like Linux implement spinlocks. Next, we examine when to choose spinlocks over alternatives.