Despite the power of lock-free programming, traditional locks remain the workhorse of concurrent software. Most real-world concurrent code uses locks, and those locks are themselves built from the atomic primitives we've been studying—particularly CAS (Compare-and-Swap) and TAS (Test-and-Set).
Understanding how locks are implemented illuminates their performance characteristics, helps in choosing the right lock for each situation, and provides insight into the design tradeoffs that kernel and library developers face.
This page provides a comprehensive exploration of CAS-based lock implementations. We begin with the simplest spinlock, progressively improving upon it to address issues of performance, fairness, and scalability. By the end, you will understand the full spectrum of lock designs and the considerations that drive their use in practice.
By the end of this page, you will understand how locks are implemented using CAS and TAS, the evolution from simple spinlocks to sophisticated queue locks, the tradeoffs between spinning and blocking, and how to choose the right lock implementation for different scenarios. You will also gain an appreciation for the engineering that underlies the synchronization primitives you use daily.
The simplest lock implementation uses a single boolean variable and the Test-and-Set (TAS) atomic operation. This lock, known as a TAS spinlock, is the foundation for understanding more sophisticated designs.
Recall that TAS atomically reads a memory location and sets it to true (1), returning the previous value:
bool test_and_set(bool* lock) {
// ATOMIC OPERATION
bool old = *lock;
*lock = true;
return old;
}
Using TAS, we can implement a basic spinlock:
The convention: false = unlocked, true = locked. A thread has acquired the lock when its TAS returns false (meaning it changed the value from false to true); a return value of true means the lock was already held. Releasing the lock simply stores false.
// Basic Test-and-Set Spinlock

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} TASSpinlock;

void tas_spinlock_init(TASSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

void tas_spinlock_acquire(TASSpinlock* lock) {
    // Spin until we successfully change locked from false to true
    while (atomic_exchange(&lock->locked, true)) {
        // atomic_exchange is essentially TAS: atomically swap in 'true'
        // Returns: previous value
        // If previous was 'true', lock was held - spin and retry
        // If previous was 'false', we just acquired it - exit loop
    }
    // We now hold the lock
}

void tas_spinlock_release(TASSpinlock* lock) {
    atomic_store(&lock->locked, false);
    // Lock is now available for others
}

/*
 * Analysis:
 *
 * Correctness:
 * ✓ Mutual exclusion: Only thread that sees 'false' from TAS enters CS
 * ✓ Atomicity: TAS is hardware atomic, no race in acquire
 *
 * Performance Problems:
 * ✗ Cache line bouncing: Every TAS attempt writes to the cache line
 * ✗ Bus saturation: Continuous TAS generates heavy memory traffic
 * ✗ Unfairness: No ordering guarantees; starvation possible
 * ✗ No backoff: Threads hammer the lock at max CPU speed
 */

The TAS spinlock, while correct, has severe performance issues under contention:
1. Cache Line Thrashing
Every TAS operation is a write (it always writes true). On multi-processor systems, this means:
- Each attempt must pull the cache line into the writer's cache in exclusive (modified) state
- The line ping-pongs between the caches of all spinning processors
- Even the lock holder's accesses to that line are slowed by the constant invalidations
2. Memory Bus Saturation
With N processors spinning on the same lock:
- Every processor issues atomic read-modify-writes as fast as it can
- Coherence and memory bus traffic grows with N
- Unrelated memory operations, including those of the lock holder, are delayed behind lock traffic
3. No Fairness
When the lock is released:
- All spinning threads race to execute TAS, and whichever happens to land first wins
- Waiting time counts for nothing; a thread that just arrived can beat threads that have waited far longer
- Under sustained contention, an unlucky thread can be starved indefinitely
The basic TAS spinlock should almost never be used in production code. Its performance degrades catastrophically under contention, and it can saturate memory bandwidth even when only a few threads contend. The improvements in subsequent sections address these issues. If you see a TAS spinlock in code, it's usually a sign that the author prioritized simplicity over performance—which may be acceptable in very low-contention scenarios.
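For reference, C11's atomic_flag type provides the same test-and-set primitive in portable form (atomic_flag_test_and_set is guaranteed lock-free). The sketch below rebuilds the basic spinlock on top of it; the FlagSpinlock wrapper and function names are ours, not part of the standard.

#include <stdatomic.h>

typedef struct {
    atomic_flag flag;   // clear = unlocked, set = locked
} FlagSpinlock;

void flag_spinlock_init(FlagSpinlock* lock) {
    atomic_flag_clear(&lock->flag);
}

void flag_spinlock_acquire(FlagSpinlock* lock) {
    // test_and_set returns the previous value: false means we just acquired it
    while (atomic_flag_test_and_set_explicit(&lock->flag, memory_order_acquire)) {
        // lock was already held - keep spinning
    }
}

void flag_spinlock_release(FlagSpinlock* lock) {
    atomic_flag_clear_explicit(&lock->flag, memory_order_release);
}

It suffers from exactly the same contention problems as the atomic_bool version above.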
The first major improvement to the TAS spinlock is the Test-and-Test-and-Set (TTAS) lock. The key insight: reads are much cheaper than writes in cache-coherent systems.
In the TAS spinlock, we always execute an atomic write (TAS). But atomic writes are expensive because they require exclusive cache line ownership.
Solution: First, test the lock with a simple read. Only if the read suggests the lock might be free, attempt the TAS.
TTAS pattern:
1. READ lock (cheap, can be satisfied from shared cache line)
2. If lock appears free, attempt TAS (expensive, requires exclusive access)
3. If locked or TAS fails, go back to step 1
This dramatically reduces cache line traffic because spinning threads continuously read from their local cache rather than hammering the bus with writes.
// Test-and-Test-and-Set Spinlock

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} TTASSpinlock;

void ttas_spinlock_init(TTASSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

void ttas_spinlock_acquire(TTASSpinlock* lock) {
    while (true) {
        // FIRST TEST: Spin on local cache reading
        // This loop generates no bus traffic after first cache miss
        while (atomic_load(&lock->locked)) {
            // Lock is held - spin locally on cached value
            // This read can be satisfied from our L1 cache
            // Other processors' caches also hold shared copy

            // Optional: Yield or pause to reduce power consumption
            // __builtin_ia32_pause();  // x86 PAUSE instruction
        }

        // Lock appears free - try to acquire
        // SECOND TEST: Attempt atomic TAS
        if (!atomic_exchange(&lock->locked, true)) {
            // TAS returned false - we successfully acquired
            return;
        }
        // TAS returned true - someone else grabbed it first
        // Loop back to spinning on reads
    }
}

void ttas_spinlock_release(TTASSpinlock* lock) {
    atomic_store(&lock->locked, false);
    // This write invalidates all cached copies
    // Spinning threads will see lock == false on next iteration
}

/*
 * Performance Analysis:
 *
 * While lock held:
 * - All waiters spin on local cached reads
 * - NO memory bus traffic (after initial cache miss)
 * - Much better than TAS which writes on every iteration
 *
 * When lock released:
 * - Release write invalidates all cached copies
 * - All waiters see lock==false and exit their read loop
 * - All waiters simultaneously attempt TAS (brief traffic burst)
 * - One wins, others see true from TAS and return to reading
 *
 * Still unfair, but much less bus traffic during spinning
 */

TTAS improves performance during spinning, but it still has a problem at lock release: the thundering herd.
When the lock is released:
- The release write of locked = false invalidates all cached copies
- Every waiter misses in its cache, re-reads the line, and sees the lock free
- All waiters attempt TAS at nearly the same instant
- One wins; the rest see true and fall back to spinning on reads

This burst of traffic can be significant with many waiters. The next improvement addresses this with backoff.
| Aspect | TAS Spinlock | TTAS Spinlock |
|---|---|---|
| Spin behavior | Continuous atomic writes | Mostly local reads, occasional writes |
| Bus traffic while held | Constant, high | Minimal after cache fill |
| Cache line state | Exclusive, bouncing | Shared among waiters |
| Traffic at release | Continuous already | Burst as all threads TAS |
| Lock holder impact | Possibly slowed by traffic | Minimal interference |
| Fairness | None | None |
On x86 processors, the PAUSE instruction (the _mm_pause() intrinsic, or __builtin_ia32_pause() in GCC/Clang) is designed for spin loops. It hints to the processor that a spin-wait is in progress, allowing: power savings (the pipeline slows down), avoidance of memory-order mis-speculation penalties when the loop exits, and more execution resources for a hyperthreaded sibling. Always include PAUSE in production spinlocks.
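In practice, spin loops usually hide this hint behind a tiny helper so the lock code stays portable across architectures. A minimal sketch assuming GCC/Clang-style builtins and inline assembly; the cpu_relax name is our own:

// Hypothetical portable "relax" helper for spin-wait loops
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                  // x86 PAUSE
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");           // AArch64 hint: yield SMT resources
#else
    __asm__ __volatile__("" ::: "memory");   // fallback: compiler barrier only
#endif
}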
The thundering herd problem in TTAS can be mitigated with backoff—waiting for a random period after a failed lock attempt before retrying. This spreads out retry attempts over time, reducing contention.
The most common strategy is exponential backoff:
- Start with a small delay after the first failed acquisition attempt
- Double the delay after each subsequent failure, up to a fixed cap
- Randomize the actual wait within the current window (jitter) so threads do not retry in lockstep
This approach gradually spreads out retries, reducing contention while allowing quick acquisition under low contention.
// TTAS Spinlock with Exponential Backoff

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

#define MIN_BACKOFF 16
#define MAX_BACKOFF 8192

typedef struct {
    atomic_bool locked;
} BackoffSpinlock;

void backoff_spinlock_init(BackoffSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

// Spin for approximately 'iterations' cycles
static inline void spin_wait(int iterations) {
    for (int i = 0; i < iterations; i++) {
        __builtin_ia32_pause();  // x86 PAUSE instruction
    }
}

void backoff_spinlock_acquire(BackoffSpinlock* lock) {
    int backoff = MIN_BACKOFF;

    while (true) {
        // TTAS: spin on local reads first
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (!atomic_exchange_explicit(&lock->locked, true, memory_order_acquire)) {
            return;  // Got the lock!
        }

        // Failed to acquire - back off before retrying
        // Random delay within [0, backoff) to prevent synchronization
        int delay = rand() % backoff;
        spin_wait(delay);

        // Exponential increase (with cap)
        if (backoff < MAX_BACKOFF) {
            backoff *= 2;
        }
    }
}

void backoff_spinlock_release(BackoffSpinlock* lock) {
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}

/*
 * How backoff helps:
 *
 * Without backoff (TTAS release):
 *   T1→ TAS ────────────────────────────
 *   T2→ TAS ────────────────────────────
 *   T3→ TAS ────────────────────────────
 *   (All simultaneously, cache line thrash)
 *
 * With exponential backoff:
 *   T1→ TAS ─ wait(16) ─ TAS ─ wait(32) ─ TAS ──→ acquire
 *   T2→ TAS ─ wait(28) ─────── TAS ─ wait(56) ──→
 *   T3→ TAS ─ wait(10) ─ TAS ─ wait(45) ────────→
 *   (Spread out over time, less traffic)
 *
 * Key insight: Randomness prevents threads from synchronizing
 * their retry attempts, which would defeat the purpose.
 */

Backoff parameters significantly affect performance:
MIN_BACKOFF (initial delay):
- Too small: barely dampens the burst after a release
- Too large: adds needless latency when contention is light

MAX_BACKOFF (maximum delay):
- Too small: heavy contention still produces frequent collisions
- Too large: a freed lock can sit idle while every waiter is still sleeping out its delay

Backoff strategy:
- Exponential growth adapts across a wide range of contention levels with few failed attempts
- Linear or constant backoff is simpler, but is well tuned for only one level of contention

Randomization:
- Jitter keeps waiters from synchronizing their retries
- Without it, deterministic delays can recreate exactly the collisions backoff is meant to avoid
Optimal backoff parameters depend on hardware: CPU speed, cache latency, memory bandwidth, number of cores, and NUMA topology. Parameters that work well on a 4-core laptop may be wrong for a 128-core server. Serious lock implementations either auto-tune or provide platform-specific defaults. The Linux kernel, for example, uses carefully tuned constants for each supported architecture.
All the spinlocks we've seen so far are unfair—there's no guarantee about the order in which threads acquire the lock. A thread that requests the lock later might acquire it before threads that have been waiting longer. Under high contention, some threads can be starved.
Ticket locks solve this by introducing FIFO ordering. They work like the ticket system at a deli counter:
- Each arriving thread takes the next ticket number (an atomic fetch-and-increment)
- A separate 'now serving' counter indicates whose turn it is
- A thread spins until the serving counter equals its ticket
- Releasing the lock increments the serving counter, admitting the next thread in line
// Ticket Lock: FIFO-fair spinlock

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint ticket;   // Next ticket to hand out
    atomic_uint serving;  // Currently serving ticket number
} TicketLock;

void ticket_lock_init(TicketLock* lock) {
    atomic_store(&lock->ticket, 0);
    atomic_store(&lock->serving, 0);
}

void ticket_lock_acquire(TicketLock* lock) {
    // Take a ticket (atomic increment)
    // fetch_add returns the OLD value (our ticket number)
    unsigned int my_ticket = atomic_fetch_add(&lock->ticket, 1);

    // Wait until it's our turn
    while (atomic_load(&lock->serving) != my_ticket) {
        __builtin_ia32_pause();  // Spin-wait
    }

    // Our turn - acquire the lock
    // Memory barrier implicit in the acquire semantics
}

void ticket_lock_release(TicketLock* lock) {
    // Call the next number
    // No atomic needed if single writer (only holder writes serving)
    unsigned int current = atomic_load_explicit(&lock->serving, memory_order_relaxed);
    atomic_store(&lock->serving, current + 1);
}

/*
 * Properties:
 *
 * ✓ Fair (FIFO): Threads acquire in order of arrival
 * ✓ Bounded waiting: With N waiters, at most N-1 will go before you
 * ✓ Simple: Just two counters
 *
 * Performance:
 * - Acquire: O(1) for taking ticket + O(waiters) for spinning
 * - Release: O(1) increment
 * - All waiters spin on same 'serving' variable (cache contention)
 *
 * When lock released:
 * - New 'serving' value invalidates all caches
 * - ALL waiters read new value
 * - Only ONE waiter's condition matches
 * - Still generates cache traffic, but only for reads
 */

Advantages:
- FIFO order guarantees no starvation and bounded waiting
- Only two counters of state; acquire and release stay simple
- Low uncontended overhead: one fetch-and-add plus one load

Disadvantages:
- All waiters spin on the shared serving variable, so every release invalidates every waiter's cache line
- A preempted or delayed waiter stalls everyone queued behind it
- No way to give up your place in line without extra machinery (no simple trylock-style abort)

Since each waiter knows its position in line (my_ticket - serving), we can implement proportional backoff:
void ticket_lock_acquire_proportional(TicketLock* lock) {
unsigned int my_ticket = atomic_fetch_add(&lock->ticket, 1);
while (true) {
unsigned int now_serving = atomic_load(&lock->serving);
if (now_serving == my_ticket) {
return; // Our turn!
}
// Wait proportional to our distance from the front
unsigned int distance = my_ticket - now_serving;
spin_wait(distance * DELAY_PER_THREAD);  // DELAY_PER_THREAD: platform-tuned delay constant (assumed defined elsewhere)
}
}
A thread far back in the line waits longer between checks of the serving counter, reducing contention while threads near the front still poll frequently.
Ticket locks are widely used where fairness is important. The Linux kernel used ticket locks for years (before switching to queue spinlocks). They're particularly good when: contention exists but isn't extreme, fairness/bounded waiting is required, and simplicity is valued. For high-contention scenarios with many cores, queue locks (next section) offer better scalability.
The MCS lock (Mellor-Crummey and Scott, 1991) is a breakthrough in spinlock design. It addresses the fundamental scalability problem of all previous locks: all waiters spinning on the same memory location.
In ticket locks, every waiter spins on the shared serving counter. Each release invalidates all waiters' caches—O(N) cache invalidations.
The MCS lock gives each waiter its own private spin location:
- Waiters form an explicit queue of per-thread nodes
- Each waiter spins on a flag inside its own node, in its own cache line
- The releasing thread writes only to its immediate successor's node
Result: O(1) cache invalidations per release, regardless of queue length.
Each thread uses a queue node (often allocated on the stack):
// MCS Queue Lock: Scalable spinlock with local spinning

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct MCSNode {
    atomic_bool locked;            // Am I waiting?
    struct MCSNode* _Atomic next;  // Next waiter in queue
} MCSNode;

typedef struct {
    MCSNode* _Atomic tail;         // Tail of waiter queue
} MCSLock;

void mcs_lock_init(MCSLock* lock) {
    atomic_store(&lock->tail, NULL);
}

void mcs_lock_acquire(MCSLock* lock, MCSNode* my_node) {
    // Initialize my node
    my_node->next = NULL;
    my_node->locked = true;   // I'm waiting

    // Atomically append to queue, get predecessor
    MCSNode* predecessor = atomic_exchange(&lock->tail, my_node);

    if (predecessor != NULL) {
        // Queue wasn't empty - link myself and wait
        predecessor->next = my_node;

        // ════════════════════════════════════════════════════════
        // LOCAL SPINNING: I only spin on MY OWN node's flag
        // No cache coherence traffic from other waiters or releases
        // ════════════════════════════════════════════════════════
        while (atomic_load(&my_node->locked)) {
            __builtin_ia32_pause();
        }
    }
    // If predecessor was NULL, queue was empty - we have the lock
}

void mcs_lock_release(MCSLock* lock, MCSNode* my_node) {
    MCSNode* successor = my_node->next;

    if (successor == NULL) {
        // No known successor - try to atomically set tail to NULL
        MCSNode* expected = my_node;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL)) {
            // Successfully cleared tail - no one was waiting
            return;
        }
        // CAS failed: someone is in the process of linking themselves
        // Wait for them to finish setting my_node->next
        while ((successor = my_node->next) == NULL) {
            __builtin_ia32_pause();
        }
    }

    // ════════════════════════════════════════════════════════
    // NOTIFY ONLY NEXT WAITER: O(1) cache invalidations
    // Compare to ticket lock: O(N) invalidations
    // ════════════════════════════════════════════════════════
    atomic_store(&successor->locked, false);
}

/*
 * Queue visualization:
 *
 * Initially:           tail → NULL
 * After T1 acquires:   tail → [T1: locked=false, next=NULL]
 * After T2, T3 enqueue:
 *   [T1: locked=false, next→T2]  [T2: locked=true, next→T3]  [T3: locked=true, next=NULL] ← tail
 *
 * When T1 releases:
 * - T1 sets T2->locked = false
 * - ONLY T2's cache line is invalidated
 * - T3 never sees any traffic (still spinning on T3->locked)
 */

The key to the MCS lock's scalability is local spinning:
| Operation | Ticket Lock | MCS Lock |
|---|---|---|
| Acquire (no wait) | 1 fetch-and-add | 1 atomic exchange |
| Acquire (with wait) | Spin on global | Spin on local |
| Release | 1 write, N invalidations | 1 write, 1 invalidation |
| Memory per waiter | 0 | 1 node (2 words) |
Under high contention with many waiters, MCS dramatically reduces cache coherence traffic:
- Ticket lock: every release invalidates every waiter's cached copy of the shared counter (serving)
- MCS lock: every release invalidates exactly one cache line, the successor's node

Advantages:
- Local spinning keeps coherence traffic at O(1) per lock handoff, regardless of queue length
- FIFO fairness and bounded waiting, like the ticket lock
- NUMA-friendly: each waiter spins on memory it owns, which can be local to its node
Disadvantages:
- The caller must supply a queue node for every acquisition and keep it alive until release, which changes the lock's API
- More code and slightly higher uncontended cost than a ticket lock
- The release path must handle the race with a waiter that has swapped the tail but not yet linked itself
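To make the node-passing requirement concrete, here is a minimal usage sketch (the worker function is our own illustration):

void worker(MCSLock* lock) {
    MCSNode node;                    // per-acquisition queue node on this stack frame
    mcs_lock_acquire(lock, &node);
    // ... critical section ...
    mcs_lock_release(lock, &node);   // node must stay valid until release returns
}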
MCS locks are the basis for the Linux kernel's 'qspinlock' (queue spinlock), which became the default spinlock in Linux 4.2. The kernel's implementation adds optimizations for the common cases (no waiters or a single waiter) while falling back to full MCS queuing under contention. Many other high-performance lock implementations use MCS or similar queue-based designs.
Spinlocks are efficient for short critical sections but waste CPU cycles during long waits. Mutex locks (mutual exclusion locks) address this by putting waiting threads to sleep, allowing the CPU to do useful work.
A mutex combines CAS-based acquisition with OS-assisted blocking when contention is detected.
Modern mutexes typically use a two-phase strategy:
1. Spin phase: attempt the lock with CAS in user space, optionally spinning briefly, on the bet that the holder will release soon
2. Block phase: if the lock is still held, ask the kernel to put the thread to sleep until the releasing thread wakes it

This adaptive approach provides:
- A near-free uncontended path: one CAS, no system call
- Low latency for short waits, where the spin phase wins
- No wasted CPU for long waits, where the thread sleeps and the core does other work
// Simplified CAS-Based Mutex with Blocking

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Lock states
#define UNLOCKED   0
#define LOCKED     1
#define CONTENDED  2    // Locked with waiters

#define SPIN_COUNT 100  // How long to spin before sleeping (tuning constant; value illustrative)

typedef struct {
    atomic_int state;
} Mutex;

void mutex_init(Mutex* m) {
    atomic_store(&m->state, UNLOCKED);
}

void mutex_lock(Mutex* m) {
    int expected = UNLOCKED;

    // Fast path: Try to acquire uncontended lock
    if (atomic_compare_exchange_strong(&m->state, &expected, LOCKED)) {
        return;  // Got it immediately!
    }

    // Slow path: Lock is held
    // Try to spin briefly first
    for (int i = 0; i < SPIN_COUNT; i++) {
        expected = UNLOCKED;
        if (atomic_compare_exchange_weak(&m->state, &expected, LOCKED)) {
            return;  // Got it after spinning
        }
        __builtin_ia32_pause();
    }

    // Still can't get it - time to sleep
    while (true) {
        // Mark as contended (so holder knows to wake us)
        if (atomic_exchange(&m->state, CONTENDED) == UNLOCKED) {
            // Oops, it just became unlocked! We have it now.
            return;
        }

        // Sleep until state changes (using Linux futex)
        // FUTEX_WAIT: If state still equals CONTENDED, sleep
        syscall(SYS_futex, &m->state, FUTEX_WAIT, CONTENDED, NULL, NULL, 0);

        // Woken up - try to acquire again
        expected = UNLOCKED;
        if (atomic_compare_exchange_strong(&m->state, &expected, CONTENDED)) {
            return;  // Got it!
        }
        // Someone else got it first, go back to sleep
    }
}

void mutex_unlock(Mutex* m) {
    // Atomically check state and unlock
    int old_state = atomic_exchange(&m->state, UNLOCKED);

    if (old_state == CONTENDED) {
        // There are waiters - wake one up
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    // If old_state was LOCKED (not CONTENDED), no waiters - just return
}

/*
 * State transitions:
 *
 * UNLOCKED  ──CAS──→  LOCKED     (uncontended acquisition)
 * LOCKED    ──XCHG──→ CONTENDED  (waiter arriving)
 * CONTENDED ──CAS──→  CONTENDED  (waiter acquiring)
 * CONTENDED ──XCHG──→ UNLOCKED   (release, wake waiter)
 * LOCKED    ──XCHG──→ UNLOCKED   (release, no waiters)
 *
 * Key insight: CONTENDED state tells unlock() to wake waiters
 */

Futex (Fast Userspace muTEX) is the Linux system call that makes this possible: it lets a thread sleep on a user-space memory word and lets another thread wake the sleepers on that word, while the uncontended path stays entirely in user space.
Futex operations are atomic with respect to the memory location:
- FUTEX_WAIT(addr, val): if *addr still equals val, put the caller to sleep; otherwise return immediately
- FUTEX_WAKE(addr, n): wake up to n threads sleeping on addr

The combination of CAS (for the fast path) and futex (for the slow path) is how pthread_mutex_t is implemented on Linux.
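For readability, real implementations usually hide the raw system call behind small wrappers. A minimal sketch assuming Linux; the futex_wait and futex_wake helper names are ours:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep only if *addr still holds expected_val; returns when woken (or spuriously)
static inline void futex_wait(atomic_int* addr, int expected_val) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected_val, NULL, NULL, 0);
}

// Wake at most n threads currently sleeping on addr
static inline void futex_wake(atomic_int* addr, int n) {
    syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}

With these, the mutex_lock() slow path above reads as futex_wait(&m->state, CONTENDED), and the wake-up in mutex_unlock() as futex_wake(&m->state, 1).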
| Characteristic | Spinlock | Mutex (Blocking) |
|---|---|---|
| Wait mechanism | CPU spinning | OS-assisted sleep |
| CPU usage while waiting | 100% (one core) | 0% (sleeping) |
| Context switch | Never | On contention (expensive) |
| Best for | Short critical sections | Long critical sections |
| Uncontended overhead | 1-10 cycles | 10-100 cycles |
| Contended overhead | Spinning (cheap if short) | Sleep/wake (expensive but fair to CPU) |
Many mutex implementations use adaptive spinning: they spin for a while before blocking, hoping the lock becomes free quickly. The Linux kernel's mutex implementation tracks how long the lock holder typically holds the lock; if the holder is running and critical sections are short, it spins more. If the holder is sleeping or critical sections are long, it blocks immediately. This runtime adaptation can provide the best of both worlds.
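The sketch below illustrates the idea behind such optimistic spinning; it is not the kernel's actual code. The Thread handle, current_thread(), and thread_is_running() are hypothetical stand-ins for scheduler facilities (the kernel checks the owning task's on_cpu state):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Thread Thread;                       // opaque thread handle (hypothetical)
extern Thread* current_thread(void);                // hypothetical: the calling thread's handle
extern bool    thread_is_running(const Thread* t);  // hypothetical: is t on a CPU right now?

typedef struct {
    atomic_int       locked;   // 0 = free, 1 = held
    _Atomic(Thread*) owner;    // current holder, NULL when free
} AdaptiveMutex;

// Spin only while the holder is actually executing; returns true if we
// acquired the lock by spinning, false if the caller should block instead.
static bool adaptive_try_spin(AdaptiveMutex* m) {
    Thread* holder;
    while ((holder = atomic_load(&m->owner)) != NULL && thread_is_running(holder)) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&m->locked, &expected, 1)) {
            atomic_store(&m->owner, current_thread());
            return true;               // holder released while we were spinning
        }
        __builtin_ia32_pause();        // be polite to the sibling hyperthread
    }
    return false;                      // holder is off-CPU (or lock looks free): take the slow path
}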
Standard mutexes enforce exclusive access—only one thread at a time. But many workloads are read-heavy: many threads read shared data, but writes are infrequent. Reader-Writer locks (RWLocks) optimize for this pattern:
| Requesting | Current Holders | Result |
|---|---|---|
| Reader | None | Grant read lock |
| Reader | Readers | Grant read lock |
| Reader | Writer | Block |
| Writer | None | Grant write lock |
| Writer | Any | Block |
// Simple CAS-Based Reader-Writer Lock

#include <stdatomic.h>
#include <stdint.h>

// State encoding:
// Bits 0-30: Reader count
// Bit 31:    Writer active flag
#define WRITER_BIT  0x80000000
#define READER_MASK 0x7FFFFFFF

typedef struct {
    atomic_uint state;
} RWLock;

void rwlock_init(RWLock* rw) {
    atomic_store(&rw->state, 0);
}

void rwlock_read_lock(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);

        // Can't acquire read lock if a writer is active
        if (old_state & WRITER_BIT) {
            __builtin_ia32_pause();
            continue;
        }

        // Try to increment reader count
        unsigned int new_state = old_state + 1;
        if (atomic_compare_exchange_weak(&rw->state, &old_state, new_state)) {
            return;  // Got read lock
        }
        // CAS failed, retry
    }
}

void rwlock_read_unlock(RWLock* rw) {
    // Decrement reader count
    atomic_fetch_sub(&rw->state, 1);
}

void rwlock_write_lock(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);

        // Can only acquire if no readers and no writer
        if (old_state != 0) {
            __builtin_ia32_pause();
            continue;
        }

        // Try to set writer bit
        if (atomic_compare_exchange_weak(&rw->state, &old_state, WRITER_BIT)) {
            return;  // Got write lock
        }
        // CAS failed, retry
    }
}

void rwlock_write_unlock(RWLock* rw) {
    // Clear writer bit
    atomic_store(&rw->state, 0);
}

/*
 * Issues with this simple implementation:
 *
 * 1. Writer starvation: Continuous readers prevent writers forever
 * 2. No fairness: New readers can keep jumping ahead of waiting writers
 * 3. Scalability: All readers CAS on same location (contention)
 *
 * Production RWLocks are significantly more complex to address these
 */

Writer Starvation
The simple implementation above suffers from writer starvation: if readers continuously hold the lock, a waiting writer may never get access. Solutions:
- Writer preference: once a writer is waiting, block new readers, e.g. via a 'writer waiting' bit that read_lock checks (sketched below)
- Fair or phase-fair designs: alternate batches of readers and writers in arrival order
- Building the RWLock on top of a fair lock (such as a ticket lock) that orders readers and writers together
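A minimal sketch of the writer-preference idea under an assumed re-encoding of the state word (reader count in bits 0-29, a hypothetical WRITER_WAIT_BIT in bit 30); it handles a single waiting writer cleanly and is meant only to illustrate the mechanism:

// Hypothetical re-encoding: bits 0-29 = reader count,
// bit 30 = writer waiting, bit 31 = writer active (WRITER_BIT from above)
#define WRITER_WAIT_BIT   0x40000000u
#define READER_COUNT_MASK 0x3FFFFFFFu

// Readers back off while a writer is active OR waiting
void rwlock_read_lock_pref(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);
        if (old_state & (WRITER_BIT | WRITER_WAIT_BIT)) {
            __builtin_ia32_pause();
            continue;
        }
        if (atomic_compare_exchange_weak(&rw->state, &old_state, old_state + 1)) {
            return;
        }
    }
}

// Writer announces itself first, then waits for existing readers to drain
void rwlock_write_lock_pref(RWLock* rw) {
    atomic_fetch_or(&rw->state, WRITER_WAIT_BIT);   // stop new readers from entering
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);
        if ((old_state & READER_COUNT_MASK) == 0 && !(old_state & WRITER_BIT)) {
            // No readers and no active writer: claim the lock
            // (the CAS also clears WRITER_WAIT_BIT in the same step)
            if (atomic_compare_exchange_weak(&rw->state, &old_state, WRITER_BIT)) {
                return;
            }
        }
        __builtin_ia32_pause();
    }
}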
Scalability Issues

Even the read path performs an atomic read-modify-write on the single shared state word, so many concurrent readers still bounce one cache line between cores; scalable designs shard the reader count per CPU or per NUMA node. More fundamentally, RWLocks only help when:
- Reads heavily outnumber writes
- Read-side critical sections are long enough to amortize the extra bookkeeping
- Readers genuinely overlap in time (enough cores, little other serialization)
If reads and writes are balanced, or reads are very short, a simple mutex may perform better due to lower overhead.
RWLocks are often used incorrectly. Common mistakes: (1) Using RWLock when read/write ratio doesn't justify the overhead, (2) Ignoring writer starvation until production problems occur, (3) Upgrading read lock to write lock (causes deadlock if two readers try simultaneously), (4) Assuming RWLock always wins (it often doesn't for short critical sections). Profile carefully before adopting RWLocks.
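To illustrate mistake (3) with the simple RWLock above: the following "upgrade" attempt deadlocks (illustrative code, not something to copy):

// BUGGY: attempted read-to-write upgrade
void update_if_stale(RWLock* rw) {
    rwlock_read_lock(rw);
    // ... inspect shared data and decide it needs modification ...

    // Deadlock: rwlock_write_lock() waits for the reader count to reach zero,
    // but this thread's own read lock keeps it at >= 1. Even RWLocks that do
    // support upgrade requests deadlock when two readers attempt it at once.
    rwlock_write_lock(rw);

    // ... never reached ...
    rwlock_write_unlock(rw);
    rwlock_read_unlock(rw);
}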
We have journeyed from the simplest possible lock to sophisticated queue-based designs. Each step addressed specific performance problems while introducing new trade-offs. Understanding this evolution helps in choosing the right lock for each situation.
| Lock Type | Fairness | Scalability | Complexity | Best Use Case |
|---|---|---|---|---|
| TAS Spinlock | None | Poor | Minimal | Learning, very low contention |
| TTAS Spinlock | None | Moderate | Low | Short CS, few cores |
| Backoff Spinlock | None | Good | Medium | Short CS, moderate contention |
| Ticket Lock | FIFO | Moderate | Low | Fairness needed, moderate contention |
| MCS Lock | FIFO | Excellent | High | Many cores, high contention |
| Mutex | Varies | Good | Medium | Long CS, mixed workload |
The next page explores CAS loops—the fundamental programming pattern for lock-free and CAS-based code. We'll examine how to structure CAS retry loops correctly, common pitfalls, and optimization techniques.
We will cover:
- How to structure CAS retry loops correctly
- Common pitfalls that arise in retry-loop code
- Optimization techniques for reducing contention and wasted work inside the loop
Understanding CAS loops ties together everything we've learned about CAS operations, lock-free programming, and CAS-based locks.
You now understand the spectrum of CAS-based lock implementations, from simple spinlocks to sophisticated queue locks and blocking mutexes. This knowledge enables you to choose appropriate synchronization primitives for different scenarios and understand the trade-offs underlying the locks you use every day.