Despite the power of lock-free programming, traditional locks remain the workhorse of concurrent software. Most real-world concurrent code uses locks, and those locks are themselves built from the atomic primitives we've been studying—particularly CAS (Compare-and-Swap) and TAS (Test-and-Set).
Understanding how locks are implemented illuminates their performance characteristics, helps in choosing the right lock for each situation, and provides insight into the design tradeoffs that kernel and library developers face.
This page provides a comprehensive exploration of CAS-based lock implementations. We begin with the simplest spinlock, progressively improving upon it to address issues of performance, fairness, and scalability. By the end, you will understand the full spectrum of lock designs and the considerations that drive their use in practice.
By the end of this page, you will understand how locks are implemented using CAS and TAS, the evolution from simple spinlocks to sophisticated queue locks, the tradeoffs between spinning and blocking, and how to choose the right lock implementation for different scenarios. You will also gain an appreciation for the engineering that underlies the synchronization primitives you use daily.
The simplest lock implementation uses a single boolean variable and the Test-and-Set (TAS) atomic operation. This lock, known as a TAS spinlock, is the foundation for understanding more sophisticated designs.
Recall that TAS atomically reads a memory location and sets it to true (1), returning the previous value:
bool test_and_set(bool* lock) {
// ATOMIC OPERATION
bool old = *lock;
*lock = true;
return old;
}
Using TAS, we can implement a basic spinlock:
The convention: false = unlocked, true = locked. A thread has acquired the lock when its TAS returns false (meaning it changed the value from false to true); a return value of true means the lock was already held. Releasing the lock simply stores false.
// Basic Test-and-Set Spinlock

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} TASSpinlock;

void tas_spinlock_init(TASSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

void tas_spinlock_acquire(TASSpinlock* lock) {
    // Spin until we successfully change locked from false to true
    while (atomic_exchange(&lock->locked, true)) {
        // atomic_exchange is essentially TAS: atomically swap in 'true'
        // Returns: previous value
        // If previous was 'true', lock was held - spin and retry
        // If previous was 'false', we just acquired it - exit loop
    }
    // We now hold the lock
}

void tas_spinlock_release(TASSpinlock* lock) {
    atomic_store(&lock->locked, false);
    // Lock is now available for others
}

/*
 * Analysis:
 *
 * Correctness:
 * ✓ Mutual exclusion: Only thread that sees 'false' from TAS enters CS
 * ✓ Atomicity: TAS is hardware atomic, no race in acquire
 *
 * Performance Problems:
 * ✗ Cache line bouncing: Every TAS attempt writes to the cache line
 * ✗ Bus saturation: Continuous TAS generates heavy memory traffic
 * ✗ Unfairness: No ordering guarantees; starvation possible
 * ✗ No backoff: Threads hammer the lock at max CPU speed
 */

The TAS spinlock, while correct, has severe performance issues under contention:
1. Cache Line Thrashing
Every TAS operation is a write (it always writes true). On multi-processor systems, this means:
- Each attempt must pull the cache line into the writer's cache in exclusive (modified) state
- The line ping-pongs between the caches of all spinning processors
- Even the lock holder's accesses to that line are slowed by the constant invalidations
2. Memory Bus Saturation
With N processors spinning on the same lock:
- Every processor issues atomic read-modify-writes as fast as it can
- Coherence and memory bus traffic grows with N
- Unrelated memory operations, including those of the lock holder, are delayed behind lock traffic
3. No Fairness
When the lock is released:
- All spinning threads race to execute TAS, and whichever happens to land first wins
- Waiting time counts for nothing; a thread that just arrived can beat threads that have waited far longer
- Under sustained contention, an unlucky thread can be starved indefinitely
The basic TAS spinlock should almost never be used in production code. Its performance degrades catastrophically under contention, and it can saturate memory bandwidth even when only a few threads contend. The improvements in subsequent sections address these issues. If you see a TAS spinlock in code, it's usually a sign that the author prioritized simplicity over performance—which may be acceptable in very low-contention scenarios.
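For reference, C11's atomic_flag type provides the same test-and-set primitive in portable form (atomic_flag_test_and_set is guaranteed lock-free). The sketch below rebuilds the basic spinlock on top of it; the FlagSpinlock wrapper and function names are ours, not part of the standard.

#include <stdatomic.h>

typedef struct {
    atomic_flag flag;   // clear = unlocked, set = locked
} FlagSpinlock;

void flag_spinlock_init(FlagSpinlock* lock) {
    atomic_flag_clear(&lock->flag);
}

void flag_spinlock_acquire(FlagSpinlock* lock) {
    // test_and_set returns the previous value: false means we just acquired it
    while (atomic_flag_test_and_set_explicit(&lock->flag, memory_order_acquire)) {
        // lock was already held - keep spinning
    }
}

void flag_spinlock_release(FlagSpinlock* lock) {
    atomic_flag_clear_explicit(&lock->flag, memory_order_release);
}

It suffers from exactly the same contention problems as the atomic_bool version above.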
The first major improvement to the TAS spinlock is the Test-and-Test-and-Set (TTAS) lock. The key insight: reads are much cheaper than writes in cache-coherent systems.
In the TAS spinlock, we always execute an atomic write (TAS). But atomic writes are expensive because they require exclusive cache line ownership.
Solution: First, test the lock with a simple read. Only if the read suggests the lock might be free, attempt the TAS.
TTAS pattern:
1. READ lock (cheap, can be satisfied from shared cache line)
2. If lock appears free, attempt TAS (expensive, requires exclusive access)
3. If locked or TAS fails, go back to step 1
This dramatically reduces cache line traffic because spinning threads continuously read from their local cache rather than hammering the bus with writes.
// Test-and-Test-and-Set Spinlock

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} TTASSpinlock;

void ttas_spinlock_init(TTASSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

void ttas_spinlock_acquire(TTASSpinlock* lock) {
    while (true) {
        // FIRST TEST: Spin on local cache reading
        // This loop generates no bus traffic after first cache miss
        while (atomic_load(&lock->locked)) {
            // Lock is held - spin locally on cached value
            // This read can be satisfied from our L1 cache
            // Other processors' caches also hold shared copy

            // Optional: Yield or pause to reduce power consumption
            // __builtin_ia32_pause();  // x86 PAUSE instruction
        }

        // Lock appears free - try to acquire
        // SECOND TEST: Attempt atomic TAS
        if (!atomic_exchange(&lock->locked, true)) {
            // TAS returned false - we successfully acquired
            return;
        }
        // TAS returned true - someone else grabbed it first
        // Loop back to spinning on reads
    }
}

void ttas_spinlock_release(TTASSpinlock* lock) {
    atomic_store(&lock->locked, false);
    // This write invalidates all cached copies
    // Spinning threads will see lock == false on next iteration
}

/*
 * Performance Analysis:
 *
 * While lock held:
 * - All waiters spin on local cached reads
 * - NO memory bus traffic (after initial cache miss)
 * - Much better than TAS which writes on every iteration
 *
 * When lock released:
 * - Release write invalidates all cached copies
 * - All waiters see lock==false and exit their read loop
 * - All waiters simultaneously attempt TAS (brief traffic burst)
 * - One wins, others see true from TAS and return to reading
 *
 * Still unfair, but much less bus traffic during spinning
 */

TTAS improves performance during spinning, but it still has a problem at lock release: the thundering herd.
When the lock is released:
- The release write of locked = false invalidates all cached copies
- Every waiter misses in its cache, re-reads the line, and sees the lock free
- All waiters attempt TAS at nearly the same instant
- One wins; the rest see true and fall back to spinning on reads

This burst of traffic can be significant with many waiters. The next improvement addresses this with backoff.
| Aspect | TAS Spinlock | TTAS Spinlock |
|---|---|---|
| Spin behavior | Continuous atomic writes | Mostly local reads, occasional writes |
| Bus traffic while held | Constant, high | Minimal after cache fill |
| Cache line state | Exclusive, bouncing | Shared among waiters |
| Traffic at release | Continuous already | Burst as all threads TAS |
| Lock holder impact | Possibly slowed by traffic | Minimal interference |
| Fairness | None | None |
On x86 processors, the PAUSE instruction (the _mm_pause() intrinsic, or __builtin_ia32_pause() in GCC/Clang) is designed for spin loops. It hints to the processor that a spin-wait is in progress, allowing: power savings (the pipeline slows down), avoidance of memory-order mis-speculation penalties when the loop exits, and more execution resources for a hyperthreaded sibling. Always include PAUSE in production spinlocks.
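In practice, spin loops usually hide this hint behind a tiny helper so the lock code stays portable across architectures. A minimal sketch assuming GCC/Clang-style builtins and inline assembly; the cpu_relax name is our own:

// Hypothetical portable "relax" helper for spin-wait loops
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                  // x86 PAUSE
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");           // AArch64 hint: yield SMT resources
#else
    __asm__ __volatile__("" ::: "memory");   // fallback: compiler barrier only
#endif
}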
The thundering herd problem in TTAS can be mitigated with backoff—waiting for a random period after a failed lock attempt before retrying. This spreads out retry attempts over time, reducing contention.
The most common strategy is exponential backoff:
- Start with a small delay after the first failed acquisition attempt
- Double the delay after each subsequent failure, up to a fixed cap
- Randomize the actual wait within the current window (jitter) so threads do not retry in lockstep
This approach gradually spreads out retries, reducing contention while allowing quick acquisition under low contention.
// TTAS Spinlock with Exponential Backoff

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

#define MIN_BACKOFF 16
#define MAX_BACKOFF 8192

typedef struct {
    atomic_bool locked;
} BackoffSpinlock;

void backoff_spinlock_init(BackoffSpinlock* lock) {
    atomic_store(&lock->locked, false);
}

// Spin for approximately 'iterations' cycles
static inline void spin_wait(int iterations) {
    for (int i = 0; i < iterations; i++) {
        __builtin_ia32_pause();  // x86 PAUSE instruction
    }
}

void backoff_spinlock_acquire(BackoffSpinlock* lock) {
    int backoff = MIN_BACKOFF;

    while (true) {
        // TTAS: spin on local reads first
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed)) {
            __builtin_ia32_pause();
        }

        // Try to acquire
        if (!atomic_exchange_explicit(&lock->locked, true, memory_order_acquire)) {
            return;  // Got the lock!
        }

        // Failed to acquire - back off before retrying
        // Random delay within [0, backoff) to prevent synchronization
        int delay = rand() % backoff;
        spin_wait(delay);

        // Exponential increase (with cap)
        if (backoff < MAX_BACKOFF) {
            backoff *= 2;
        }
    }
}

void backoff_spinlock_release(BackoffSpinlock* lock) {
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}

/*
 * How backoff helps:
 *
 * Without backoff (TTAS release):
 *   T1→ TAS ────────────────────────────
 *   T2→ TAS ────────────────────────────
 *   T3→ TAS ────────────────────────────
 *   (All simultaneously, cache line thrash)
 *
 * With exponential backoff:
 *   T1→ TAS ─ wait(16) ─ TAS ─ wait(32) ─ TAS ──→ acquire
 *   T2→ TAS ─ wait(28) ─────── TAS ─ wait(56) ──→
 *   T3→ TAS ─ wait(10) ─ TAS ─ wait(45) ────────→
 *   (Spread out over time, less traffic)
 *
 * Key insight: Randomness prevents threads from synchronizing
 * their retry attempts, which would defeat the purpose.
 */

Backoff parameters significantly affect performance:
MIN_BACKOFF (initial delay):
- Too small: barely dampens the burst after a release
- Too large: adds needless latency when contention is light

MAX_BACKOFF (maximum delay):
- Too small: heavy contention still produces frequent collisions
- Too large: a freed lock can sit idle while every waiter is still sleeping out its delay

Backoff strategy:
- Exponential growth adapts across a wide range of contention levels with few failed attempts
- Linear or constant backoff is simpler, but is well tuned for only one level of contention

Randomization:
- Jitter keeps waiters from synchronizing their retries
- Without it, deterministic delays can recreate exactly the collisions backoff is meant to avoid
Optimal backoff parameters depend on hardware: CPU speed, cache latency, memory bandwidth, number of cores, and NUMA topology. Parameters that work well on a 4-core laptop may be wrong for a 128-core server. Serious lock implementations either auto-tune or provide platform-specific defaults. The Linux kernel, for example, uses carefully tuned constants for each supported architecture.
All the spinlocks we've seen so far are unfair—there's no guarantee about the order in which threads acquire the lock. A thread that requests the lock later might acquire it before threads that have been waiting longer. Under high contention, some threads can be starved.
Ticket locks solve this by introducing FIFO ordering. They work like the ticket system at a deli counter:
- Each arriving thread takes the next ticket number (an atomic fetch-and-increment)
- A separate 'now serving' counter indicates whose turn it is
- A thread spins until the serving counter equals its ticket
- Releasing the lock increments the serving counter, admitting the next thread in line
// Ticket Lock: FIFO-fair spinlock

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint ticket;   // Next ticket to hand out
    atomic_uint serving;  // Currently serving ticket number
} TicketLock;

void ticket_lock_init(TicketLock* lock) {
    atomic_store(&lock->ticket, 0);
    atomic_store(&lock->serving, 0);
}

void ticket_lock_acquire(TicketLock* lock) {
    // Take a ticket (atomic increment)
    // fetch_add returns the OLD value (our ticket number)
    unsigned int my_ticket = atomic_fetch_add(&lock->ticket, 1);

    // Wait until it's our turn
    while (atomic_load(&lock->serving) != my_ticket) {
        __builtin_ia32_pause();  // Spin-wait
    }

    // Our turn - acquire the lock
    // Memory barrier implicit in the acquire semantics
}

void ticket_lock_release(TicketLock* lock) {
    // Call the next number
    // No atomic needed if single writer (only holder writes serving)
    unsigned int current = atomic_load_explicit(&lock->serving, memory_order_relaxed);
    atomic_store(&lock->serving, current + 1);
}

/*
 * Properties:
 *
 * ✓ Fair (FIFO): Threads acquire in order of arrival
 * ✓ Bounded waiting: With N waiters, at most N-1 will go before you
 * ✓ Simple: Just two counters
 *
 * Performance:
 * - Acquire: O(1) for taking ticket + O(waiters) for spinning
 * - Release: O(1) increment
 * - All waiters spin on same 'serving' variable (cache contention)
 *
 * When lock released:
 * - New 'serving' value invalidates all caches
 * - ALL waiters read new value
 * - Only ONE waiter's condition matches
 * - Still generates cache traffic, but only for reads
 */

Advantages:
- FIFO order guarantees no starvation and bounded waiting
- Only two counters of state; acquire and release stay simple
- Low uncontended overhead: one fetch-and-add plus one load

Disadvantages:
- All waiters spin on the shared serving variable, so every release invalidates every waiter's cache line
- A preempted or delayed waiter stalls everyone queued behind it
- No way to give up your place in line without extra machinery (no simple trylock-style abort)

Since each waiter knows its position in line (my_ticket - serving), we can implement proportional backoff:
void ticket_lock_acquire_proportional(TicketLock* lock) {
unsigned int my_ticket = atomic_fetch_add(&lock->ticket, 1);
while (true) {
unsigned int now_serving = atomic_load(&lock->serving);
if (now_serving == my_ticket) {
return; // Our turn!
}
// Wait proportional to our distance from the front
unsigned int distance = my_ticket - now_serving;
spin_wait(distance * DELAY_PER_THREAD);  // DELAY_PER_THREAD: platform-tuned delay constant (assumed defined elsewhere)
}
}
A thread far back in the line waits longer between checks of the serving counter, reducing contention while threads near the front still poll frequently.
Ticket locks are widely used where fairness is important. The Linux kernel used ticket locks for years (before switching to queue spinlocks). They're particularly good when: contention exists but isn't extreme, fairness/bounded waiting is required, and simplicity is valued. For high-contention scenarios with many cores, queue locks (next section) offer better scalability.
The MCS lock (Mellor-Crummey and Scott, 1991) is a breakthrough in spinlock design. It addresses the fundamental scalability problem of all previous locks: all waiters spinning on the same memory location.
In ticket locks, every waiter spins on the shared serving counter. Each release invalidates all waiters' caches—O(N) cache invalidations.
The MCS lock gives each waiter its own private spin location:
- Waiters form an explicit queue of per-thread nodes
- Each waiter spins on a flag inside its own node, in its own cache line
- The releasing thread writes only to its immediate successor's node
Result: O(1) cache invalidations per release, regardless of queue length.
Each thread uses a queue node (often allocated on the stack):
// MCS Queue Lock: Scalable spinlock with local spinning

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct MCSNode {
    atomic_bool locked;            // Am I waiting?
    struct MCSNode* _Atomic next;  // Next waiter in queue
} MCSNode;

typedef struct {
    MCSNode* _Atomic tail;         // Tail of waiter queue
} MCSLock;

void mcs_lock_init(MCSLock* lock) {
    atomic_store(&lock->tail, NULL);
}

void mcs_lock_acquire(MCSLock* lock, MCSNode* my_node) {
    // Initialize my node
    my_node->next = NULL;
    my_node->locked = true;   // I'm waiting

    // Atomically append to queue, get predecessor
    MCSNode* predecessor = atomic_exchange(&lock->tail, my_node);

    if (predecessor != NULL) {
        // Queue wasn't empty - link myself and wait
        predecessor->next = my_node;

        // ════════════════════════════════════════════════════════
        // LOCAL SPINNING: I only spin on MY OWN node's flag
        // No cache coherence traffic from other waiters or releases
        // ════════════════════════════════════════════════════════
        while (atomic_load(&my_node->locked)) {
            __builtin_ia32_pause();
        }
    }
    // If predecessor was NULL, queue was empty - we have the lock
}

void mcs_lock_release(MCSLock* lock, MCSNode* my_node) {
    MCSNode* successor = my_node->next;

    if (successor == NULL) {
        // No known successor - try to atomically set tail to NULL
        MCSNode* expected = my_node;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL)) {
            // Successfully cleared tail - no one was waiting
            return;
        }
        // CAS failed: someone is in the process of linking themselves
        // Wait for them to finish setting my_node->next
        while ((successor = my_node->next) == NULL) {
            __builtin_ia32_pause();
        }
    }

    // ════════════════════════════════════════════════════════
    // NOTIFY ONLY NEXT WAITER: O(1) cache invalidations
    // Compare to ticket lock: O(N) invalidations
    // ════════════════════════════════════════════════════════
    atomic_store(&successor->locked, false);
}

/*
 * Queue visualization:
 *
 * Initially:           tail → NULL
 * After T1 acquires:   tail → [T1: locked=false, next=NULL]
 * After T2, T3 enqueue:
 *   [T1: locked=false, next→T2]  [T2: locked=true, next→T3]  [T3: locked=true, next=NULL] ← tail
 *
 * When T1 releases:
 * - T1 sets T2->locked = false
 * - ONLY T2's cache line is invalidated
 * - T3 never sees any traffic (still spinning on T3->locked)
 */

The key to the MCS lock's scalability is local spinning:
| Operation | Ticket Lock | MCS Lock |
|---|---|---|
| Acquire (no wait) | 1 fetch-and-add | 1 atomic exchange |
| Acquire (with wait) | Spin on global | Spin on local |
| Release | 1 write, N invalidations | 1 write, 1 invalidation |
| Memory per waiter | 0 | 1 node (2 words) |
Under high contention with many waiters, MCS dramatically reduces cache coherence traffic:
- Ticket lock: every release invalidates every waiter's cached copy of the shared counter (serving)
- MCS lock: every release invalidates exactly one cache line, the successor's node

Advantages:
- Local spinning keeps coherence traffic at O(1) per lock handoff, regardless of queue length
- FIFO fairness and bounded waiting, like the ticket lock
- NUMA-friendly: each waiter spins on memory it owns, which can be local to its node
Disadvantages:
- The caller must supply a queue node for every acquisition and keep it alive until release, which changes the lock's API
- More code and slightly higher uncontended cost than a ticket lock
- The release path must handle the race with a waiter that has swapped the tail but not yet linked itself
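To make the node-passing requirement concrete, here is a minimal usage sketch (the worker function is our own illustration):

void worker(MCSLock* lock) {
    MCSNode node;                    // per-acquisition queue node on this stack frame
    mcs_lock_acquire(lock, &node);
    // ... critical section ...
    mcs_lock_release(lock, &node);   // node must stay valid until release returns
}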
MCS locks are the basis for the Linux kernel's 'qspinlock' (queue spinlock), which became the default spinlock in Linux 4.2. The kernel's implementation adds optimizations for the common cases (no waiters or a single waiter) while falling back to full MCS queuing under contention. Many other high-performance lock implementations use MCS or similar queue-based designs.
Spinlocks are efficient for short critical sections but waste CPU cycles during long waits. Mutex locks (mutual exclusion locks) address this by putting waiting threads to sleep, allowing the CPU to do useful work.
A mutex combines CAS-based acquisition with OS-assisted blocking when contention is detected.
Modern mutexes typically use a two-phase strategy:
1. Spin phase: attempt the lock with CAS in user space, optionally spinning briefly, on the bet that the holder will release soon
2. Block phase: if the lock is still held, ask the kernel to put the thread to sleep until the releasing thread wakes it

This adaptive approach provides:
- A near-free uncontended path: one CAS, no system call
- Low latency for short waits, where the spin phase wins
- No wasted CPU for long waits, where the thread sleeps and the core does other work
// Simplified CAS-Based Mutex with Blocking

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Lock states
#define UNLOCKED   0
#define LOCKED     1
#define CONTENDED  2    // Locked with waiters

#define SPIN_COUNT 100  // How long to spin before sleeping (tuning constant; value illustrative)

typedef struct {
    atomic_int state;
} Mutex;

void mutex_init(Mutex* m) {
    atomic_store(&m->state, UNLOCKED);
}

void mutex_lock(Mutex* m) {
    int expected = UNLOCKED;

    // Fast path: Try to acquire uncontended lock
    if (atomic_compare_exchange_strong(&m->state, &expected, LOCKED)) {
        return;  // Got it immediately!
    }

    // Slow path: Lock is held
    // Try to spin briefly first
    for (int i = 0; i < SPIN_COUNT; i++) {
        expected = UNLOCKED;
        if (atomic_compare_exchange_weak(&m->state, &expected, LOCKED)) {
            return;  // Got it after spinning
        }
        __builtin_ia32_pause();
    }

    // Still can't get it - time to sleep
    while (true) {
        // Mark as contended (so holder knows to wake us)
        if (atomic_exchange(&m->state, CONTENDED) == UNLOCKED) {
            // Oops, it just became unlocked! We have it now.
            return;
        }

        // Sleep until state changes (using Linux futex)
        // FUTEX_WAIT: If state still equals CONTENDED, sleep
        syscall(SYS_futex, &m->state, FUTEX_WAIT, CONTENDED, NULL, NULL, 0);

        // Woken up - try to acquire again
        expected = UNLOCKED;
        if (atomic_compare_exchange_strong(&m->state, &expected, CONTENDED)) {
            return;  // Got it!
        }
        // Someone else got it first, go back to sleep
    }
}

void mutex_unlock(Mutex* m) {
    // Atomically check state and unlock
    int old_state = atomic_exchange(&m->state, UNLOCKED);

    if (old_state == CONTENDED) {
        // There are waiters - wake one up
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    // If old_state was LOCKED (not CONTENDED), no waiters - just return
}

/*
 * State transitions:
 *
 * UNLOCKED  ──CAS──→  LOCKED     (uncontended acquisition)
 * LOCKED    ──XCHG──→ CONTENDED  (waiter arriving)
 * CONTENDED ──CAS──→  CONTENDED  (waiter acquiring)
 * CONTENDED ──XCHG──→ UNLOCKED   (release, wake waiter)
 * LOCKED    ──XCHG──→ UNLOCKED   (release, no waiters)
 *
 * Key insight: CONTENDED state tells unlock() to wake waiters
 */

Futex (Fast Userspace muTEX) is the Linux system call that makes this possible: it lets a thread sleep on a user-space memory word and lets another thread wake the sleepers on that word, while the uncontended path stays entirely in user space.
Futex operations are atomic with respect to the memory location:
- FUTEX_WAIT(addr, val): if *addr still equals val, put the caller to sleep; otherwise return immediately
- FUTEX_WAKE(addr, n): wake up to n threads sleeping on addr

The combination of CAS (for the fast path) and futex (for the slow path) is how pthread_mutex_t is implemented on Linux.
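For readability, real implementations usually hide the raw system call behind small wrappers. A minimal sketch assuming Linux; the futex_wait and futex_wake helper names are ours:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep only if *addr still holds expected_val; returns when woken (or spuriously)
static inline void futex_wait(atomic_int* addr, int expected_val) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected_val, NULL, NULL, 0);
}

// Wake at most n threads currently sleeping on addr
static inline void futex_wake(atomic_int* addr, int n) {
    syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}

With these, the mutex_lock() slow path above reads as futex_wait(&m->state, CONTENDED), and the wake-up in mutex_unlock() as futex_wake(&m->state, 1).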
| Characteristic | Spinlock | Mutex (Blocking) |
|---|---|---|
| Wait mechanism | CPU spinning | OS-assisted sleep |
| CPU usage while waiting | 100% (one core) | 0% (sleeping) |
| Context switch | Never | On contention (expensive) |
| Best for | Short critical sections | Long critical sections |
| Uncontended overhead | 1-10 cycles | 10-100 cycles |
| Contended overhead | Spinning (cheap if short) | Sleep/wake (expensive but fair to CPU) |
Many mutex implementations use adaptive spinning: they spin for a while before blocking, hoping the lock becomes free quickly. The Linux kernel's mutex implementation tracks how long the lock holder typically holds the lock; if the holder is running and critical sections are short, it spins more. If the holder is sleeping or critical sections are long, it blocks immediately. This runtime adaptation can provide the best of both worlds.
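The sketch below illustrates the idea behind such optimistic spinning; it is not the kernel's actual code. The Thread handle, current_thread(), and thread_is_running() are hypothetical stand-ins for scheduler facilities (the kernel checks the owning task's on_cpu state):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Thread Thread;                       // opaque thread handle (hypothetical)
extern Thread* current_thread(void);                // hypothetical: the calling thread's handle
extern bool    thread_is_running(const Thread* t);  // hypothetical: is t on a CPU right now?

typedef struct {
    atomic_int       locked;   // 0 = free, 1 = held
    _Atomic(Thread*) owner;    // current holder, NULL when free
} AdaptiveMutex;

// Spin only while the holder is actually executing; returns true if we
// acquired the lock by spinning, false if the caller should block instead.
static bool adaptive_try_spin(AdaptiveMutex* m) {
    Thread* holder;
    while ((holder = atomic_load(&m->owner)) != NULL && thread_is_running(holder)) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&m->locked, &expected, 1)) {
            atomic_store(&m->owner, current_thread());
            return true;               // holder released while we were spinning
        }
        __builtin_ia32_pause();        // be polite to the sibling hyperthread
    }
    return false;                      // holder is off-CPU (or lock looks free): take the slow path
}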
Standard mutexes enforce exclusive access—only one thread at a time. But many workloads are read-heavy: many threads read shared data, but writes are infrequent. Reader-Writer locks (RWLocks) optimize for this pattern:
| Requesting | Current Holders | Result |
|---|---|---|
| Reader | None | Grant read lock |
| Reader | Readers | Grant read lock |
| Reader | Writer | Block |
| Writer | None | Grant write lock |
| Writer | Any | Block |
// Simple CAS-Based Reader-Writer Lock

#include <stdatomic.h>
#include <stdint.h>

// State encoding:
// Bits 0-30: Reader count
// Bit 31:    Writer active flag
#define WRITER_BIT  0x80000000
#define READER_MASK 0x7FFFFFFF

typedef struct {
    atomic_uint state;
} RWLock;

void rwlock_init(RWLock* rw) {
    atomic_store(&rw->state, 0);
}

void rwlock_read_lock(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);

        // Can't acquire read lock if a writer is active
        if (old_state & WRITER_BIT) {
            __builtin_ia32_pause();
            continue;
        }

        // Try to increment reader count
        unsigned int new_state = old_state + 1;
        if (atomic_compare_exchange_weak(&rw->state, &old_state, new_state)) {
            return;  // Got read lock
        }
        // CAS failed, retry
    }
}

void rwlock_read_unlock(RWLock* rw) {
    // Decrement reader count
    atomic_fetch_sub(&rw->state, 1);
}

void rwlock_write_lock(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);

        // Can only acquire if no readers and no writer
        if (old_state != 0) {
            __builtin_ia32_pause();
            continue;
        }

        // Try to set writer bit
        if (atomic_compare_exchange_weak(&rw->state, &old_state, WRITER_BIT)) {
            return;  // Got write lock
        }
        // CAS failed, retry
    }
}

void rwlock_write_unlock(RWLock* rw) {
    // Clear writer bit
    atomic_store(&rw->state, 0);
}

/*
 * Issues with this simple implementation:
 *
 * 1. Writer starvation: Continuous readers prevent writers forever
 * 2. No fairness: New readers can keep jumping ahead of waiting writers
 * 3. Scalability: All readers CAS on same location (contention)
 *
 * Production RWLocks are significantly more complex to address these
 */

Writer Starvation
The simple implementation above suffers from writer starvation: if readers continuously hold the lock, a waiting writer may never get access. Solutions:
- Writer preference: once a writer is waiting, block new readers, e.g. via a 'writer waiting' bit that read_lock checks (sketched below)
- Fair or phase-fair designs: alternate batches of readers and writers in arrival order
- Building the RWLock on top of a fair lock (such as a ticket lock) that orders readers and writers together
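A minimal sketch of the writer-preference idea under an assumed re-encoding of the state word (reader count in bits 0-29, a hypothetical WRITER_WAIT_BIT in bit 30); it handles a single waiting writer cleanly and is meant only to illustrate the mechanism:

// Hypothetical re-encoding: bits 0-29 = reader count,
// bit 30 = writer waiting, bit 31 = writer active (WRITER_BIT from above)
#define WRITER_WAIT_BIT   0x40000000u
#define READER_COUNT_MASK 0x3FFFFFFFu

// Readers back off while a writer is active OR waiting
void rwlock_read_lock_pref(RWLock* rw) {
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);
        if (old_state & (WRITER_BIT | WRITER_WAIT_BIT)) {
            __builtin_ia32_pause();
            continue;
        }
        if (atomic_compare_exchange_weak(&rw->state, &old_state, old_state + 1)) {
            return;
        }
    }
}

// Writer announces itself first, then waits for existing readers to drain
void rwlock_write_lock_pref(RWLock* rw) {
    atomic_fetch_or(&rw->state, WRITER_WAIT_BIT);   // stop new readers from entering
    while (true) {
        unsigned int old_state = atomic_load(&rw->state);
        if ((old_state & READER_COUNT_MASK) == 0 && !(old_state & WRITER_BIT)) {
            // No readers and no active writer: claim the lock
            // (the CAS also clears WRITER_WAIT_BIT in the same step)
            if (atomic_compare_exchange_weak(&rw->state, &old_state, WRITER_BIT)) {
                return;
            }
        }
        __builtin_ia32_pause();
    }
}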
Scalability Issues

Even the read path performs an atomic read-modify-write on the single shared state word, so many concurrent readers still bounce one cache line between cores; scalable designs shard the reader count per CPU or per NUMA node. More fundamentally, RWLocks only help when:
- Reads heavily outnumber writes
- Read-side critical sections are long enough to amortize the extra bookkeeping
- Readers genuinely overlap in time (enough cores, little other serialization)
If reads and writes are balanced, or reads are very short, a simple mutex may perform better due to lower overhead.
RWLocks are often used incorrectly. Common mistakes: (1) Using RWLock when read/write ratio doesn't justify the overhead, (2) Ignoring writer starvation until production problems occur, (3) Upgrading read lock to write lock (causes deadlock if two readers try simultaneously), (4) Assuming RWLock always wins (it often doesn't for short critical sections). Profile carefully before adopting RWLocks.
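To illustrate mistake (3) with the simple RWLock above: the following "upgrade" attempt deadlocks (illustrative code, not something to copy):

// BUGGY: attempted read-to-write upgrade
void update_if_stale(RWLock* rw) {
    rwlock_read_lock(rw);
    // ... inspect shared data and decide it needs modification ...

    // Deadlock: rwlock_write_lock() waits for the reader count to reach zero,
    // but this thread's own read lock keeps it at >= 1. Even RWLocks that do
    // support upgrade requests deadlock when two readers attempt it at once.
    rwlock_write_lock(rw);

    // ... never reached ...
    rwlock_write_unlock(rw);
    rwlock_read_unlock(rw);
}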
We have journeyed from the simplest possible lock to sophisticated queue-based designs. Each step addressed specific performance problems while introducing new trade-offs. Understanding this evolution helps in choosing the right lock for each situation.
| Lock Type | Fairness | Scalability | Complexity | Best Use Case |
|---|---|---|---|---|
| TAS Spinlock | None | Poor | Minimal | Learning, very low contention |
| TTAS Spinlock | None | Moderate | Low | Short CS, few cores |
| Backoff Spinlock | None | Good | Medium | Short CS, moderate contention |
| Ticket Lock | FIFO | Moderate | Low | Fairness needed, moderate contention |
| MCS Lock | FIFO | Excellent | High | Many cores, high contention |
| Mutex | Varies | Good | Medium | Long CS, mixed workload |
The next page explores CAS loops—the fundamental programming pattern for lock-free and CAS-based code. We'll examine how to structure CAS retry loops correctly, common pitfalls, and optimization techniques.
We will cover:
- How to structure CAS retry loops correctly
- Common pitfalls that arise in retry-loop code
- Optimization techniques for reducing contention and wasted work inside the loop
Understanding CAS loops ties together everything we've learned about CAS operations, lock-free programming, and CAS-based locks.
You now understand the spectrum of CAS-based lock implementations, from simple spinlocks to sophisticated queue locks and blocking mutexes. This knowledge enables you to choose appropriate synchronization primitives for different scenarios and understand the trade-offs underlying the locks you use every day.