In the landscape of concurrent programming, one atomic operation stands as the universal foundation upon which virtually all modern synchronization is built: Compare-and-Swap (CAS). This seemingly simple operation—compare a memory location to an expected value, and if they match, swap in a new value, all atomically—is the cornerstone of lock-free data structures, non-blocking algorithms, and high-performance concurrent systems.
From the Java AtomicInteger you use in everyday programming to the Linux kernel's spinlock implementations, from database transaction managers to garbage collectors, CAS is the hidden workhorse that makes concurrent software possible. Understanding CAS deeply—its semantics, its guarantees, its limitations, and its proper use—is essential knowledge for any systems programmer.
This page provides a rigorous, comprehensive treatment of the CAS operation itself. We will dissect its semantics, explore its hardware implementation across architectures, analyze memory ordering implications, and establish the formal foundation for reasoning about CAS-based algorithms.
By the end of this page, you will understand the precise semantics of Compare-and-Swap, how it is implemented in hardware, the memory ordering guarantees it provides, and why it is the preferred atomic primitive for modern concurrent programming. You will be equipped with the foundational knowledge to reason about CAS-based algorithms and understand their correctness properties.
Compare-and-Swap (CAS), also known as Compare-and-Exchange (CMPXCHG) on x86 architectures, is an atomic instruction that performs a conditional update of a memory location. The operation takes three operands:

- The memory address to operate on
- The expected value, which is compared against the current contents of that address
- The new value to write if the comparison succeeds
The CAS operation executes the following logic atomically (as a single, indivisible operation):
CAS(address, expected, new_value):
if (*address == expected):
*address = new_value
return true // Success: value was as expected, now updated
else:
return false // Failure: value was different, not modified
The critical property is atomicity: the entire compare-and-conditional-swap sequence occurs as one indivisible unit. No other operation on any processor can observe the memory location in an intermediate state (i.e., after the comparison but before the swap).
// Conceptual CAS implementation (NOT how hardware actually works)
// This illustrates the SEMANTICS of CAS
//
// The key insight: everything between the lines happens ATOMICALLY
// In reality, this atomicity is provided by hardware, not by code

bool compare_and_swap(int* address, int expected, int new_value) {
    // ═══════════════════════════════════════════════════════════
    // │ ATOMIC REGION START - No other operation can interleave │
    // ═══════════════════════════════════════════════════════════
    int current = *address;           // Read current value
    if (current == expected) {        // Compare
        *address = new_value;         // Swap if equal
        // ═════════════════════════════════════════════════════════
        // │ ATOMIC REGION END                                      │
        // ═════════════════════════════════════════════════════════
        return true;                  // Report success
    } else {
        // ═════════════════════════════════════════════════════════
        // │ ATOMIC REGION END                                      │
        // ═════════════════════════════════════════════════════════
        return false;                 // Report failure, value unchanged
    }
}

// Alternative formulation: Return the OLD value (common in many APIs)
// This is how CMPXCHG on x86 actually works
int compare_and_swap_return_old(int* address, int expected, int new_value) {
    // ═══════════════════════════════════════════════════════════
    // │ ATOMIC REGION - Single indivisible operation              │
    // ═══════════════════════════════════════════════════════════
    int old_value = *address;         // Capture current value
    if (old_value == expected) {
        *address = new_value;         // Swap if match
    }
    return old_value;                 // Always return what was there
    // Caller checks: (returned == expected) to determine success
}

CAS operations come in two common API styles: (1) return a boolean indicating success/failure, or (2) return the old value, leaving the caller to determine success by comparing the returned value to the expected value. The x86 CMPXCHG instruction follows the second style. Both provide equivalent information, but returning the old value is often more useful for retry loops, as it tells you what value to expect on the next attempt.
CAS has consensus number ∞ in the theory of concurrent objects, meaning it can solve the consensus problem for any number of processes. This theoretical property translates to practical power: by the universality of consensus, any concurrent object with a sequential specification can be implemented using CAS, making it a universal synchronization primitive.
In contrast, simpler atomic operations like Test-and-Set (TAS) have consensus number 2—they can only solve consensus for two processes. This is why CAS has become the dominant atomic primitive in modern architectures:
Every modern processor architecture provides some form of CAS or equivalent functionality, making CAS-based algorithms portable across platforms.
To appreciate CAS, we must understand how it differs from—and surpasses—other atomic operations. Let's compare CAS with the simpler atomic primitives.
Test-and-Set atomically reads a memory location and sets it to 1 (or true), returning the previous value.
TAS(address):
old = *address
*address = 1
return old
TAS is simpler than CAS but less powerful:

- The write is unconditional and always stores the fixed value 1; it cannot depend on the current contents
- Its consensus number is 2, so it cannot implement arbitrary lock-free objects for more than two threads
- It is enough for a busy-wait spinlock (see the sketch below), but not for general lock-free data structures
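A busy-wait spinlock is about the limit of what TAS can build on its own. Below is a minimal sketch, assuming C++11's std::atomic_flag, whose test_and_set is the standard TAS operation; spin_lock and spin_unlock are illustrative names, and a real lock would add backoff.

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;   // cleared = unlocked

void spin_lock() {
    // TAS: atomically set the flag and return its previous value.
    // If the previous value was true, someone else holds the lock; keep spinning.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait
    }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);  // release the lock
}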
Fetch-and-Add atomically reads a value and adds a delta to it, returning the previous value.
FAA(address, delta):
old = *address
*address = old + delta
return old
FAA is useful for counters but cannot implement general synchronization:

- The delta is always applied; FAA cannot decline to update based on the current value
- Like TAS, its consensus number is 2
- It does, however, support fair ticket locks, sketched after the comparison table below
| Primitive | Semantics | Consensus Number | Can Implement Locks? | Can Implement Lock-Free Structures? |
|---|---|---|---|---|
| Read/Write | Load or store single value | 1 | No | No |
| Test-and-Set | Set to 1, return old value | 2 | Yes (with busy-wait) | Limited |
| Fetch-and-Add | Add delta, return old value | 2 | Yes (ticket locks) | Limited |
| Swap (XCHG) | Replace value, return old | 2 | Yes (with busy-wait) | Limited |
| Compare-and-Swap | Conditional replace | ∞ | Yes (all types) | Yes (universal) |
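The table's "Yes (ticket locks)" entry for Fetch-and-Add can be made concrete. Below is a minimal sketch assuming C++ std::atomic; the TicketLock class and its members are illustrative names, not a standard API.

#include <atomic>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};   // FAA target: hands out ticket numbers
    std::atomic<unsigned> now_serving{0};   // the ticket currently allowed to proceed
public:
    void lock() {
        // Each caller atomically takes the next ticket number
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // ...then waits until it is being served (FIFO order)
        while (now_serving.load(std::memory_order_acquire) != my_ticket) {
            // busy-wait (a real lock would add a pause/backoff here)
        }
    }
    void unlock() {
        // Only the lock holder writes now_serving, so a plain increment suffices
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};

Note that FAA gives fairness (strict FIFO ordering of waiters), but the waiting itself is still a busy-wait; only CAS's conditional update enables the patterns discussed next.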
The key advantage of CAS is its conditional nature. CAS only modifies the memory location if it currently holds a specific expected value. This conditionality enables a powerful programming pattern:

1. Read the current value of the shared variable.
2. Compute a new value based on what was read.
3. Use CAS to install the new value only if the variable still holds the value read in step 1.
4. If the CAS fails, another thread intervened; re-read and retry.
This pattern—called optimistic concurrency control—allows multiple threads to make progress without blocking each other, as long as they don't actually conflict. In low-contention scenarios, this dramatically outperforms lock-based approaches.
With TAS or FAA, we cannot implement this pattern because the modification is unconditional. We must know in advance what value to write, independent of the current state. CAS allows the decision to depend on the current state, enabling rich state machines and complex data structures.
// Example: Implementing a bounded counter (0 to MAX)
// This demonstrates why CAS is more powerful than TAS

#define MAX_VALUE 100

// ═══════════════════════════════════════════════════════════════
// IMPOSSIBLE with Test-and-Set
// TAS always writes 1, cannot implement conditional increment
// ═══════════════════════════════════════════════════════════════

// ═══════════════════════════════════════════════════════════════
// POSSIBLE with CAS: Bounded lock-free counter
// ═══════════════════════════════════════════════════════════════

int bounded_counter = 0;

// Returns true if increment succeeded, false if at maximum
bool increment_bounded(void) {
    while (true) {
        int current = bounded_counter;   // 1. Read current state
        if (current >= MAX_VALUE) {
            return false;                // At maximum, cannot increment
        }
        int next = current + 1;          // 2. Compute next state

        // 3. Attempt atomic transition
        if (CAS(&bounded_counter, current, next)) {
            return true;                 // Success!
        }
        // 4. Someone else modified counter, retry
        // The loop continues with new 'current' value
    }
}

// The key insight: CAS allows us to:
// - Check a condition (current < MAX)
// - Compute result based on current value (current + 1)
// - Atomically update only if nothing changed
//
// TAS cannot do this because it cannot:
// - Make the write conditional on current value
// - Write a value that depends on current value

The fundamental power of CAS is enabling atomic read-modify-write operations where the modification depends on the current value. This is exactly what non-atomic operations fail to do safely—CAS makes it atomic. Any algorithm that needs to update state conditionally based on current state can use CAS; simpler primitives cannot express this pattern.
Understanding how CAS is implemented in hardware demystifies its behavior and explains its performance characteristics. Different processor architectures implement CAS through different mechanisms, but all provide the same fundamental guarantee of atomicity.
On Intel and AMD processors, CAS is implemented via the CMPXCHG family of instructions:
- CMPXCHG — Compare and exchange for 8, 16, or 32-bit operands (64-bit operands with a REX.W prefix on x86-64)
- CMPXCHG8B — Compare and exchange 8 bytes (64 bits)
- CMPXCHG16B — Compare and exchange 16 bytes (128 bits, x64 only)

The instruction works as follows:
; CMPXCHG destination, source
; Implicit operand: EAX (or RAX for 64-bit) holds expected value
CMPXCHG [memory], new_value
; If [memory] == EAX:
; [memory] = new_value
; ZF = 1 (zero flag set, indicating success)
; Else:
; EAX = [memory] (load actual value into EAX)
; ZF = 0 (zero flag clear, indicating failure)
For multi-processor systems, CMPXCHG must be preceded by the LOCK prefix to ensure atomicity across all CPUs:
LOCK CMPXCHG [global_var], ebx
The LOCK prefix instructs the processor to:

- Gain exclusive ownership of the affected cache line for the duration of the instruction (older processors asserted a bus lock; modern ones use cache locking, falling back to a bus lock only for split or uncacheable accesses)
- Prevent any other processor from reading or writing the location until the instruction completes
- Act as a full memory barrier, so no loads or stores are reordered across the instruction
; x86-64 assembly: CAS implementation using CMPXCHG
; Function: int cas(int* addr, int expected, int new_val)
; Returns: old value at *addr (caller checks if == expected)

; System V AMD64 ABI:
; RDI = addr (pointer)
; ESI = expected value
; EDX = new value
; Return in EAX

cas:
    mov eax, esi            ; Move expected value to EAX (implicit operand)

    ; The LOCK prefix is ESSENTIAL for multi-processor atomicity
    ; Without LOCK, the operation is only atomic on single-CPU systems
    lock cmpxchg [rdi], edx ; Atomic: if [rdi]==EAX then [rdi]=EDX

    ; After CMPXCHG:
    ; - If matched: [rdi] now contains EDX, EAX unchanged, ZF=1
    ; - If mismatched: EAX now contains actual [rdi] value, ZF=0

    ; Return value is in EAX (the old value)
    ret

; Example usage pattern in pseudo-C:
; int old = cas(&counter, expected, new_val);
; if (old == expected) { /* success */ }
; else { expected = old; /* retry */ }

ARM processors (prior to ARMv8.1) use a different approach called Load-Link/Store-Conditional (LL/SC), implemented as LDREX (Load Exclusive) and STREX (Store Exclusive): LDREX loads a value and marks the address in an exclusive monitor; STREX stores a new value only if the monitor is still valid, and returns a flag indicating whether the store took effect.
The exclusive monitor is invalidated if any other processor writes to the same cache line, or if certain events occur (interrupts, context switches). This provides atomicity without an explicit lock.
| Aspect | x86 CMPXCHG | ARM LDREX/STREX |
|---|---|---|
| Instruction Type | Single instruction does compare-and-swap | Two-instruction sequence (load, then conditional store) |
| Success Guarantee | Always succeeds if value matches | May spuriously fail even if value unchanged |
| Failure Semantics | Returns mismatched value | Returns success/failure indicator |
| Locking Mechanism | LOCK prefix (bus lock or cache lock) | Exclusive monitor (reservation-based) |
| Retry Overhead | Just retry the instruction | Must re-execute LDREX before retrying STREX |
| Spurious Failure | Never | Possible (context switch, other core's access) |
; ARM assembly: CAS using Load-Exclusive/Store-Exclusive
; Function: int cas(int* addr, int expected, int new_val)

; ARM calling convention:
; R0 = addr, R1 = expected, R2 = new_val
; Return in R0

cas:
    ; Outer loop: handles spurious failures
retry:
    LDREX R3, [R0]          ; Load exclusive: R3 = *addr, set monitor
    CMP R3, R1              ; Compare loaded value with expected
    BNE fail                ; If not equal, someone else changed it

    ; Values match, try to store new value
    STREX R4, R2, [R0]      ; Store exclusive: *addr = R2 if monitor valid
                            ; R4 = 0 if success, 1 if failed
    CMP R4, #0              ; Check if STREX succeeded
    BNE retry               ; Spurious failure: monitor invalidated, retry

    ; Success: return the expected value (caller knows success when ret == expected)
    MOV R0, R1
    BX LR

fail:
    ; CAS failed: return the actual value we found
    MOV R0, R3
    BX LR

; Key insight: ARM's LL/SC can fail spuriously!
; STREX may fail even if no one modified the value, because:
; - Context switch occurred between LDREX and STREX
; - Another core accessed the same cache line (even if different variable)
; - Interrupt occurred
; This is why ARM CAS loops must always retry on STREX failure

On ARM and other LL/SC-based architectures, the conditional store can fail even when the memory value hasn't been modified by another thread. This 'spurious failure' occurs due to various CPU events (interrupts, cache-line sharing). Code must always be written to retry on failure—you cannot assume failure means another thread modified the value. This is a key difference from x86's guaranteed non-spurious behavior.
Modern multi-processor systems maintain cache coherence using protocols like MESI (Modified, Exclusive, Shared, Invalid). Understanding how CAS interacts with cache coherence explains its performance characteristics:

- To perform a CAS, the executing core must hold the target cache line in an exclusive (Modified or Exclusive) state
- Acquiring that state invalidates every other core's copy of the line
- If another core then attempts a CAS on the same line, ownership must migrate back, invalidating the first core's copy in turn
This cache coherence activity is the primary source of CAS overhead in high-contention scenarios. When multiple processors contend for the same variable, cache lines bounce between processors, creating significant traffic on the memory bus or interconnect.
By default, atomic CAS operations in mainstream language APIs provide sequentially consistent semantics:

- The CAS is not reordered with surrounding loads and stores
- Writes made before a successful CAS are visible to any thread that subsequently observes the updated value
- All sequentially consistent atomic operations appear in a single total order that every thread agrees on
This makes CAS ideal for implementing synchronization primitives that require memory visibility guarantees.
Modern programming languages provide high-level abstractions for CAS operations, hiding the architecture-specific details while exposing the essential semantics. Understanding these APIs is crucial for writing portable concurrent code.
C11 and C++11 introduced standardized atomic operations, including CAS:
#include <stdatomic.h> // C11
// or
#include <atomic> // C++11
The primary CAS functions are:
- atomic_compare_exchange_strong: Never fails spuriously
- atomic_compare_exchange_weak: May fail spuriously, but can be more efficient in loops
#include <atomic>
#include <iostream>

// C++11 atomic CAS demonstration

std::atomic<int> counter(0);

// Using compare_exchange_strong (no spurious failures)
bool increment_if_below_max_strong(int max_value) {
    int expected = counter.load();      // Read current value
    while (expected < max_value) {
        int desired = expected + 1;
        // CAS: if counter == expected, set counter = desired
        // If it fails, expected is updated to the actual current value
        if (counter.compare_exchange_strong(expected, desired)) {
            return true;                // Successfully incremented
        }
        // expected now contains actual value; loop will check condition
    }
    return false;                       // Counter was at or above max
}

// Using compare_exchange_weak (can fail spuriously, use in loops)
bool increment_weak(void) {
    int expected = counter.load();
    int desired = expected + 1;
    // Weak CAS may fail spuriously on some architectures
    // More efficient in a loop because it maps directly to LL/SC
    while (!counter.compare_exchange_weak(expected, desired)) {
        // On failure, expected is updated to actual value
        desired = expected + 1;
    }
    return true;
}

// Memory ordering control
void increment_with_ordering() {
    int expected = counter.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = expected + 1;
    } while (!counter.compare_exchange_weak(
        expected, desired,
        std::memory_order_release,   // Memory order on success
        std::memory_order_relaxed    // Memory order on failure
    ));
}

/*
 * Key API differences:
 *
 * compare_exchange_strong:
 *   - Returns true only if CAS succeeded
 *   - Returns false only if value was different
 *   - Never fails spuriously
 *   - Use when you need to distinguish "value changed" from "retry needed"
 *
 * compare_exchange_weak:
 *   - May return false even if value matched (spurious failure)
 *   - More efficient on LL/SC architectures (ARM, RISC-V)
 *   - Use in loops where you'd retry anyway
 *   - May compile to fewer instructions on non-x86
 */

Java provides atomic classes since JDK 1.5 and more flexible atomic operations via VarHandle since JDK 9:
import java.util.concurrent.atomic.AtomicInteger;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class CASExample {
    // Classic AtomicInteger approach
    private AtomicInteger counter = new AtomicInteger(0);

    public boolean incrementIfBelowMax(int max) {
        while (true) {
            int current = counter.get();
            if (current >= max) {
                return false;
            }
            // compareAndSet returns true if successful
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
            // CAS failed, someone else modified counter, loop and retry
        }
    }

    // Using getAndUpdate (cleaner, internally uses CAS loop)
    public int incrementBounded(int max) {
        return counter.getAndUpdate(current -> {
            if (current >= max) return current;  // Don't increment
            return current + 1;                  // Increment
        });
    }

    // Modern VarHandle approach (JDK 9+) - more flexible
    private int value = 0;
    private static final VarHandle VALUE_HANDLE;

    static {
        try {
            VALUE_HANDLE = MethodHandles.lookup()
                .findVarHandle(CASExample.class, "value", int.class);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public boolean casWithVarHandle(int expected, int newValue) {
        // VarHandle.compareAndSet works on regular fields
        return VALUE_HANDLE.compareAndSet(this, expected, newValue);
    }

    // compareAndExchange returns the witness value (old value)
    public int casExchange(int expected, int newValue) {
        // Returns the value that was in the field
        // If return == expected, CAS succeeded
        return (int) VALUE_HANDLE.compareAndExchange(this, expected, newValue);
    }
}

/*
 * Java CAS Considerations:
 *
 * 1. AtomicInteger.compareAndSet: Simple boolean return
 *    - Returns true if CAS succeeded
 *    - Doesn't tell you what the actual value was on failure
 *
 * 2. AtomicInteger.compareAndExchange (JDK 9+): Returns witness value
 *    - Returns the value that was actually in the variable
 *    - More efficient for retry loops (no separate load needed)
 *
 * 3. VarHandle: Most flexible, works on any field
 *    - Can add atomic semantics to existing code
 *    - Supports various memory ordering modes
 */

CAS appears under different names across languages:

- C11: atomic_compare_exchange_strong/weak
- C++11: std::atomic<T>::compare_exchange_strong/weak
- Java: AtomicInteger.compareAndSet, VarHandle.compareAndSet
- Rust: AtomicUsize::compare_exchange
- Go: atomic.CompareAndSwapInt32/Int64/Pointer

When writing a CAS retry loop, prefer the 'weak' variant (compare_exchange_weak in C++; in Java, the ordinary compareAndSet is the strong form, though VarHandle also exposes weakCompareAndSet). On LL/SC architectures, weak CAS can be more efficient because it doesn't need to retry internally to eliminate spurious failures—your loop will retry anyway. On x86, weak and strong compile to the same code since CMPXCHG never fails spuriously.
CAS is not without pitfalls. The most famous is the ABA problem, a subtle bug that can occur when a value changes from A to B and back to A between a read and a subsequent CAS operation.
Consider a lock-free stack implemented with CAS:
1. Thread 1 reads top = A (pointing to node A) and prepares new_top = A->next = B.
2. Thread 1 is preempted before its CAS executes.
3. Thread 2 pops A, pops B, and then pushes A back onto the stack.
4. Thread 1 resumes and performs CAS(&top, A, B), which succeeds because top is still A!

But the stack structure has changed—perhaps B is now leaked, or the new A points to invalid memory. The CAS succeeded because it cannot detect that the value was modified and restored.
// Demonstrating the ABA Problem in a lock-free stack

typedef struct Node {
    int data;
    struct Node* next;
} Node;

Node* top = NULL;  // Stack top pointer

// Naive lock-free pop (VULNERABLE TO ABA)
Node* pop_naive(void) {
    while (true) {
        Node* old_top = top;              // Step 1: Read current top
        if (old_top == NULL) {
            return NULL;                  // Stack is empty
        }
        Node* new_top = old_top->next;    // Step 2: Read next pointer

        // DANGER ZONE: Between Step 1 and Step 3, another thread could:
        // - Pop 'old_top' (removing A)
        // - Pop the next node
        // - Push 'old_top' back (re-adding A)
        // Now 'top' equals 'old_top' again, but 'old_top->next' is different!

        // Step 3: CAS to update top
        if (CAS(&top, old_top, new_top)) {
            return old_top;               // Success, but may be corrupted!
        }
    }
}

/*
 * ABA Scenario Timeline:
 *
 * Initial state: top -> A -> B -> C -> NULL
 *
 * Thread 1:
 *   1. old_top = A
 *   2. new_top = A->next = B
 *   [SUSPENDED]
 *
 * Thread 2:
 *   1. pop() returns A, now top -> B -> C -> NULL
 *   2. pop() returns B, now top -> C -> NULL
 *   3. push(A) with A->next = C, now top -> A -> C -> NULL
 *
 * Thread 1 resumes:
 *   3. CAS(&top, A, B) -- SUCCEEDS because top == A !
 *   4. Returns A
 *
 * Result: top = B, but B was already removed and may be:
 *   - Freed/deallocated (use-after-free)
 *   - Reused for something else (data corruption)
 *   - Pointing to invalid memory (crash)
 */

Several techniques address the ABA problem:
1. Tagged Pointers (Version Counting)
Augment each pointer with a version counter. The CAS operates on the combined pointer+counter, and the counter is incremented on every modification. The ABA scenario now has A₁ → B → A₂, where A₁ ≠ A₂.
2. Double-Wide CAS (DCAS/CMPXCHG16B)
Use 128-bit CAS to atomically compare-and-swap both a pointer and a counter. x86-64's CMPXCHG16B supports this.
3. Hazard Pointers
Threads announce which nodes they're accessing. Other threads skip reclamation of announced nodes. Ensures nodes aren't recycled while in use.
4. Epoch-Based Reclamation
Nodes are only recycled after all threads have completed any operation that started before the node was removed.
5. Reference Counting
Maintain reference counts on nodes. Nodes are only freed when their reference count reaches zero.
// Solution 1: Tagged Pointers using CMPXCHG16B
// Combine pointer with a monotonic version counter

#include <stdint.h>
#include <stdbool.h>

// Tagged pointer: 64-bit pointer + 64-bit counter = 128 bits
typedef struct {
    Node* ptr;
    uint64_t tag;
} __attribute__((aligned(16))) TaggedPointer;

TaggedPointer top = {NULL, 0};

// Atomic 128-bit CAS (uses CMPXCHG16B on x86-64)
bool cas128(TaggedPointer* target, TaggedPointer expected, TaggedPointer desired) {
    // Compiler intrinsic or inline assembly for CMPXCHG16B
    return __atomic_compare_exchange(
        target, &expected, &desired,
        false,  // strong
        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST
    );
}

// ABA-safe pop using tagged pointers
Node* pop_safe(void) {
    while (true) {
        TaggedPointer old_top = top;   // Atomically read both ptr and tag
        if (old_top.ptr == NULL) {
            return NULL;
        }
        Node* next = old_top.ptr->next;

        // Create new tagged pointer with incremented tag
        TaggedPointer new_top = {
            .ptr = next,
            .tag = old_top.tag + 1     // Increment tag on every modification
        };

        // 128-bit CAS: both pointer AND tag must match
        if (cas128(&top, old_top, new_top)) {
            return old_top.ptr;
        }
        // If tag changed (even if ptr is same), CAS fails - ABA prevented!
    }
}

/*
 * How tagged pointers prevent ABA:
 *
 * Initial: top = {ptr: A, tag: 0}
 *
 * Thread 1:
 *   old_top = {A, 0}
 *   [SUSPENDED]
 *
 * Thread 2:
 *   pop()    -> top = {B, 1}   // tag incremented
 *   pop()    -> top = {C, 2}   // tag incremented
 *   push(A)  -> top = {A, 3}   // tag incremented
 *
 * Thread 1 resumes:
 *   CAS(top, {A, 0}, ...) -> FAILS!
 *   Because current top is {A, 3}, not {A, 0}
 *   Tag mismatch prevents false success!
 */

The ABA problem is often not caught by testing because it requires specific, rare timing conditions. Code with ABA vulnerabilities may work correctly for years before failing catastrophically. When implementing lock-free data structures, always explicitly consider and address ABA. The safest approach is to use established libraries (like Java's concurrent collections) that have been thoroughly vetted, rather than rolling your own lock-free code.
CAS operations interact with the memory model in important ways. Understanding memory ordering is essential for correct concurrent programming, as modern processors and compilers aggressively reorder memory operations for performance.
Without proper memory ordering, changes made by one thread may not be visible to other threads—or may become visible in a surprising order. Consider:
// Thread 1
data = 42; // (A)
ready = true; // (B)
// Thread 2
while (!ready); // (C)
assert(data == 42); // (D) — May fail!
The assertion can fail because:
- The compiler may reorder (A) and (B), which appear independent within Thread 1
- The CPU's store buffer may make ready = true globally visible before data = 42
- As a result, Thread 2 can observe ready = true before it sees data = 42

C11/C++11 define several memory orderings:
| Ordering | Guarantees | Use Case |
|---|---|---|
| memory_order_relaxed | Only atomicity, no ordering constraints | Counters where order doesn't matter |
| memory_order_acquire | Reads after this see writes released elsewhere | Reading a lock, loading a flag |
| memory_order_release | Writes before this are visible to acquires | Writing a lock, storing a flag |
| memory_order_acq_rel | Both acquire and release (for RMW ops) | Lock transfer, CAS in some protocols |
| memory_order_seq_cst | Full sequential consistency, total order | Default, safest, most intuitive (slowest) |
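Connecting the table back to the earlier data/ready example, here is a hedged sketch of how a release/acquire pair repairs it, assuming C++ std::atomic; producer and consumer are illustrative names.

#include <atomic>
#include <cassert>

int data = 0;
std::atomic<bool> ready{false};

void producer() {                                      // Thread 1
    data = 42;                                         // (A)
    ready.store(true, std::memory_order_release);      // (B) publishes (A)
}

void consumer() {                                      // Thread 2
    while (!ready.load(std::memory_order_acquire)) {}  // (C) synchronizes with (B)
    assert(data == 42);                                // (D) now guaranteed to hold
}

The release store and the acquire load form a happens-before edge: once Thread 2 observes ready == true, it is also guaranteed to observe data == 42.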
CAS is a read-modify-write operation, so it touches memory in two ways: it reads the current value and potentially writes a new value. This leads to an interesting question: what memory ordering does a CAS provide?
Default behavior (seq_cst): By default, most languages use sequentially consistent ordering for CAS. This provides the strongest guarantees:

- The CAS has both acquire and release semantics, so no loads or stores are reordered across it
- Writes made before the CAS are visible to any thread that later observes its result
- All sequentially consistent operations form a single total order observed identically by every thread
Custom ordering in C++: C++ allows specifying separate orderings for success and failure cases:
#include <atomic>

std::atomic<bool> lock(false);
int shared_data = 0;

void acquire_lock() {
    bool expected = false;
    while (!lock.compare_exchange_weak(
        expected, true,
        std::memory_order_acquire,  // On SUCCESS: acquire semantics
                                    // - All subsequent reads see prior writes
        std::memory_order_relaxed   // On FAILURE: no ordering needed
                                    // - We're just going to retry anyway
    )) {
        expected = false;           // Reset expected and retry
    }

    // CRITICAL SECTION
    // memory_order_acquire ensures we see all writes made before
    // the previous release of this lock
    shared_data++;                  // Safe to access shared_data
}

void release_lock() {
    // memory_order_release ensures all our writes are visible
    // to threads that subsequently acquire the lock
    lock.store(false, std::memory_order_release);
}

/*
 * Memory Ordering Reasoning:
 *
 * Thread 1 (lock holder):
 *   [writes to shared_data]
 *   lock.store(false, release)                   <-- RELEASE barrier
 *
 * Thread 2 (acquiring):
 *   lock.compare_exchange(..., acquire, relaxed) <-- ACQUIRE barrier
 *   [reads from shared_data]
 *
 * The release-acquire pair creates a "happens-before" relationship:
 * All writes by Thread 1 before the release ARE GUARANTEED to be
 * visible to Thread 2's reads after the acquire.
 *
 * Why relaxed on failure?
 * - If CAS fails, we didn't acquire the lock
 * - No synchronization is happening
 * - We're about to retry, so no ordering constraints needed
 * - Relaxed is free or nearly free on most architectures
 */

x86/x64: The x86 memory model is relatively strong (Total Store Order). LOCK-prefixed instructions like CMPXCHG act as full memory barriers, preventing any reordering around them. This means even memory_order_relaxed CAS on x86 gets acquire-release semantics "for free".
ARM/RISC-V: These architectures have weaker memory models. Memory barriers must be explicitly generated based on the requested memory order. A relaxed CAS truly provides only atomicity, with separate barrier instructions needed for ordering.
This architectural difference has implications:

- Code with incorrect (too-weak) memory orderings may happen to work on x86 and then fail on ARM or RISC-V
- Relaxed and acquire/release CAS can be meaningfully cheaper than seq_cst on weakly ordered architectures, while on x86 the difference is usually small
- Portable code must be reasoned about against the language memory model, not against the behavior of a single architecture
If you're unsure about memory ordering, use the default sequential consistency (seq_cst). It's the safest option and easiest to reason about. Only optimize to weaker orderings when you've proven correctness and profiling shows memory barriers are a bottleneck. Incorrect memory ordering leads to bugs that are extremely difficult to diagnose—they may only appear on specific architectures or under specific timing conditions.
While CAS is powerful, it has limitations that affect algorithm design and performance. Understanding these limitations helps in choosing appropriate synchronization strategies.
Standard CAS operates on a single memory word (typically 32 or 64 bits, or 128 bits with CMPXCHG16B). This creates challenges when atomically updating multiple locations.
The Multi-Word Update Problem:
// Cannot make this atomic with single CAS:
struct Account {
int balance;
int last_transaction;
};
// We want: atomically update both balance AND last_transaction
// But CAS can only compare-and-swap one field at a time
Solutions:

- Pack the related fields into a single machine word so one CAS covers both (sketched below)
- Use double-wide CAS (CMPXCHG16B) to cover two adjacent words, as in the tagged-pointer example earlier
- Build multi-word CAS (MCAS) in software from single-word CAS, at a substantial cost in complexity and performance
- Fall back to locks or transactional memory when several independent locations must change together
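Below is a hedged sketch of the first option, packing both 32-bit fields into one 64-bit atomic so that a single CAS updates them together; AccountState, pack, unpack, and apply_transaction are illustrative names.

#include <atomic>
#include <cstdint>

struct AccountState {
    int32_t balance;
    int32_t last_transaction;
};

std::atomic<uint64_t> account{0};  // both fields live in one 64-bit word

static uint64_t pack(AccountState s) {
    // high 32 bits = balance, low 32 bits = last_transaction
    return ((uint64_t)(uint32_t)s.balance << 32) | (uint32_t)s.last_transaction;
}

static AccountState unpack(uint64_t w) {
    return { (int32_t)(w >> 32), (int32_t)(uint32_t)w };
}

void apply_transaction(int32_t amount, int32_t txn_id) {
    uint64_t old_word = account.load();
    AccountState s;
    do {
        s = unpack(old_word);
        s.balance += amount;
        s.last_transaction = txn_id;
        // On failure, compare_exchange_weak refreshes old_word with the current value
    } while (!account.compare_exchange_weak(old_word, pack(s)));
}

This works only when the combined state fits in a word the hardware can CAS atomically; beyond that, one of the other options is needed.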
CAS performance degrades significantly under high contention:
The Contention Storm:
When many threads compete for the same CAS:

- Every thread reads the same current value and computes its own new value
- Exactly one CAS succeeds; all the others fail and must retry
- Each retry repeats the read-compute-CAS cycle, so the wasted work grows with the number of contending threads
Cache Line Bouncing:
Every CAS requires exclusive ownership of the cache line. Under contention:

- The line ping-pongs between cores as each attempted CAS invalidates the other copies
- Each ownership transfer costs tens to hundreds of cycles, far more than the CAS itself
- Coherence traffic on the interconnect grows with the number of contending cores
| Contention Level | CAS Success Rate | Cache Behavior | Effective Performance |
|---|---|---|---|
| None (single thread) | 100% | Cache line stable | Maximum throughput |
| Low (few threads, rare conflict) | ~90%+ | Occasional cache miss | Good throughput |
| Medium (frequent conflict) | ~50-80% | Frequent cache line bouncing | Degraded throughput |
| High (constant conflict) | <20% | Cache line thrashing | Near-zero effective throughput |
Several techniques can reduce CAS contention:
1. Exponential Backoff
When CAS fails, wait before retrying. Increase wait time exponentially on repeated failures:
int backoff = 1;
while (!CAS(...)) {
for (int i = 0; i < backoff; i++) {
cpu_pause(); // Yield CPU momentarily
}
backoff = min(backoff * 2, MAX_BACKOFF);
}
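For reference, a compilable variant of the same loop, assuming C++ std::atomic and the x86-specific _mm_pause() hint from <immintrin.h> (other architectures need a different pause instruction); the names are illustrative.

#include <atomic>
#include <algorithm>
#include <immintrin.h>   // _mm_pause (x86-specific)

std::atomic<int> shared_value{0};

void add_with_backoff(int delta) {
    int backoff = 1;
    const int MAX_BACKOFF = 1024;
    int expected = shared_value.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak refreshes 'expected' with the current value
    while (!shared_value.compare_exchange_weak(expected, expected + delta)) {
        for (int i = 0; i < backoff; i++)
            _mm_pause();                           // ease pressure on the contended line
        backoff = std::min(backoff * 2, MAX_BACKOFF);
    }
}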
2. Hierarchical Data Structures
Spread contention across multiple CAS targets. Instead of one central counter, use per-thread or per-shard counters:
// Instead of:
atomic<int> global_counter; // Massive contention
// Use:
atomic<int> counters[NUM_SHARDS]; // Spread load
int total() {
int sum = 0;
for (int i = 0; i < NUM_SHARDS; i++) sum += counters[i];
return sum;
}
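One refinement related to the cache-line bouncing discussed above: if the shards are packed densely in a single array, they can still falsely share cache lines. A hedged sketch that pads each shard to its own (assumed 64-byte) cache line, using C++ std::atomic with illustrative names:

#include <atomic>

constexpr int NUM_SHARDS = 16;
constexpr int CACHE_LINE = 64;   // assumed cache-line size

struct alignas(CACHE_LINE) PaddedCounter {
    std::atomic<int> value{0};
    // alignas pads each array element to a full cache line, so updates to
    // different shards do not invalidate each other's lines
};

PaddedCounter counters[NUM_SHARDS];

void add(int thread_id, int delta) {
    counters[thread_id % NUM_SHARDS].value.fetch_add(delta, std::memory_order_relaxed);
}

int total() {
    int sum = 0;
    for (int i = 0; i < NUM_SHARDS; i++)
        sum += counters[i].value.load(std::memory_order_relaxed);
    return sum;
}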
3. Elimination
For operations with inverses (e.g., push and pop), pair up concurrent threads so their operations cancel each other out: a push and a pop can exchange values directly through an elimination array without touching the shared structure at all.
4. Combining
Aggregate multiple operations: threads publish their requests to a shared buffer, and a single combining thread applies the whole batch in one synchronization step, replacing many contended CAS attempts with one.
CAS is not always the best choice. For high-contention scenarios with long critical sections, traditional blocking locks may perform better. Locks allow the holder to complete unimpeded while waiters sleep, rather than having all threads burn CPU in retry loops. The rule of thumb: use CAS for short, simple atomic updates with low-to-moderate contention; use locks for longer critical sections or under high contention.
We have thoroughly explored the Compare-and-Swap operation—the foundational atomic primitive that powers modern concurrent programming. With these foundations in place, we now turn to lock-free programming patterns.
Now that we understand the CAS primitive itself, the next page explores lock-free programming—a programming paradigm that uses CAS to build data structures and algorithms that never block, never deadlock, and provide guaranteed system-wide progress even when individual threads are delayed.
We will examine:
Lock-free programming represents one of the most powerful applications of CAS, enabling concurrent systems that are immune to many problems associated with traditional locking.
You now have a deep understanding of the Compare-and-Swap operation—its semantics, hardware implementation, language APIs, pitfalls like ABA, and performance characteristics. This knowledge forms the foundation for understanding lock-free programming and CAS-based synchronization constructs in the pages that follow.