In the landscape of concurrent programming, one atomic operation stands as the universal foundation upon which virtually all modern synchronization is built: Compare-and-Swap (CAS). This seemingly simple operation—compare a memory location to an expected value, and if they match, swap in a new value, all atomically—is the cornerstone of lock-free data structures, non-blocking algorithms, and high-performance concurrent systems.
From the Java AtomicInteger you use in everyday programming to the Linux kernel's spinlock implementations, from database transaction managers to garbage collectors, CAS is the hidden workhorse that makes concurrent software possible. Understanding CAS deeply—its semantics, its guarantees, its limitations, and its proper use—is essential knowledge for any systems programmer.
This page provides a rigorous, comprehensive treatment of the CAS operation itself. We will dissect its semantics, explore its hardware implementation across architectures, analyze memory ordering implications, and establish the formal foundation for reasoning about CAS-based algorithms.
By the end of this page, you will understand the precise semantics of Compare-and-Swap, how it is implemented in hardware, the memory ordering guarantees it provides, and why it is the preferred atomic primitive for modern concurrent programming. You will be equipped with the foundational knowledge to reason about CAS-based algorithms and understand their correctness properties.
Compare-and-Swap (CAS), also known as Compare-and-Exchange (CMPXCHG) on x86 architectures, is an atomic instruction that performs a conditional update of a memory location. The operation takes three operands:

- The memory address to operate on
- The expected value, which is compared against the current contents of that address
- The new value to write if the comparison succeeds
The CAS operation executes the following logic atomically (as a single, indivisible operation):
CAS(address, expected, new_value):
if (*address == expected):
*address = new_value
return true // Success: value was as expected, now updated
else:
return false // Failure: value was different, not modified
The critical property is atomicity: the entire compare-and-conditional-swap sequence occurs as one indivisible unit. No other operation on any processor can observe the memory location in an intermediate state (i.e., after the comparison but before the swap).
// Conceptual CAS implementation (NOT how hardware actually works)
// This illustrates the SEMANTICS of CAS
//
// The key insight: everything between the lines happens ATOMICALLY
// In reality, this atomicity is provided by hardware, not by code

bool compare_and_swap(int* address, int expected, int new_value) {
    // ═══════════════════════════════════════════════════════════
    // │ ATOMIC REGION START - No other operation can interleave │
    // ═══════════════════════════════════════════════════════════
    int current = *address;           // Read current value
    if (current == expected) {        // Compare
        *address = new_value;         // Swap if equal
        // ═════════════════════════════════════════════════════════
        // │ ATOMIC REGION END                                      │
        // ═════════════════════════════════════════════════════════
        return true;                  // Report success
    } else {
        // ═════════════════════════════════════════════════════════
        // │ ATOMIC REGION END                                      │
        // ═════════════════════════════════════════════════════════
        return false;                 // Report failure, value unchanged
    }
}

// Alternative formulation: Return the OLD value (common in many APIs)
// This is how CMPXCHG on x86 actually works
int compare_and_swap_return_old(int* address, int expected, int new_value) {
    // ═══════════════════════════════════════════════════════════
    // │ ATOMIC REGION - Single indivisible operation              │
    // ═══════════════════════════════════════════════════════════
    int old_value = *address;         // Capture current value
    if (old_value == expected) {
        *address = new_value;         // Swap if match
    }
    return old_value;                 // Always return what was there
    // Caller checks: (returned == expected) to determine success
}

CAS operations come in two common API styles: (1) return a boolean indicating success/failure, or (2) return the old value, leaving the caller to determine success by comparing the returned value to the expected value. The x86 CMPXCHG instruction follows the second style. Both provide equivalent information, but returning the old value is often more useful for retry loops, as it tells you what value to expect on the next attempt.
CAS has consensus number ∞ in the theory of concurrent objects, meaning it can solve the consensus problem for any number of processes. This theoretical property translates to practical power: by the universality of consensus, any concurrent object with a sequential specification can be implemented using CAS, making it a universal synchronization primitive.
In contrast, simpler atomic operations like Test-and-Set (TAS) have consensus number 2—they can only solve consensus for two processes. This is why CAS has become the dominant atomic primitive in modern architectures:
Every modern processor architecture provides some form of CAS or equivalent functionality, making CAS-based algorithms portable across platforms.
To appreciate CAS, we must understand how it differs from—and surpasses—other atomic operations. Let's compare CAS with the simpler atomic primitives.
Test-and-Set atomically reads a memory location and sets it to 1 (or true), returning the previous value.
TAS(address):
old = *address
*address = 1
return old
TAS is simpler than CAS but less powerful:

- The write is unconditional and always stores the fixed value 1; it cannot depend on the current contents
- Its consensus number is 2, so it cannot implement arbitrary lock-free objects for more than two threads
- It is enough for a busy-wait spinlock (see the sketch below), but not for general lock-free data structures
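A busy-wait spinlock is about the limit of what TAS can build on its own. Below is a minimal sketch, assuming C++11's std::atomic_flag, whose test_and_set is the standard TAS operation; spin_lock and spin_unlock are illustrative names, and a real lock would add backoff.

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;   // cleared = unlocked

void spin_lock() {
    // TAS: atomically set the flag and return its previous value.
    // If the previous value was true, someone else holds the lock; keep spinning.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait
    }
}

void spin_unlock() {
    lock_flag.clear(std::memory_order_release);  // release the lock
}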
Fetch-and-Add atomically reads a value and adds a delta to it, returning the previous value.
FAA(address, delta):
old = *address
*address = old + delta
return old
FAA is useful for counters but cannot implement general synchronization:

- The delta is always applied; FAA cannot decline to update based on the current value
- Like TAS, its consensus number is 2
- It does, however, support fair ticket locks, sketched after the comparison table below
| Primitive | Semantics | Consensus Number | Can Implement Locks? | Can Implement Lock-Free Structures? |
|---|---|---|---|---|
| Read/Write | Load or store single value | 1 | No | No |
| Test-and-Set | Set to 1, return old value | 2 | Yes (with busy-wait) | Limited |
| Fetch-and-Add | Add delta, return old value | 2 | Yes (ticket locks) | Limited |
| Swap (XCHG) | Replace value, return old | 2 | Yes (with busy-wait) | Limited |
| Compare-and-Swap | Conditional replace | ∞ | Yes (all types) | Yes (universal) |
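The table's "Yes (ticket locks)" entry for Fetch-and-Add can be made concrete. Below is a minimal sketch assuming C++ std::atomic; the TicketLock class and its members are illustrative names, not a standard API.

#include <atomic>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};   // FAA target: hands out ticket numbers
    std::atomic<unsigned> now_serving{0};   // the ticket currently allowed to proceed
public:
    void lock() {
        // Each caller atomically takes the next ticket number
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // ...then waits until it is being served (FIFO order)
        while (now_serving.load(std::memory_order_acquire) != my_ticket) {
            // busy-wait (a real lock would add a pause/backoff here)
        }
    }
    void unlock() {
        // Only the lock holder writes now_serving, so a plain increment suffices
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};

Note that FAA gives fairness (strict FIFO ordering of waiters), but the waiting itself is still a busy-wait; only CAS's conditional update enables the patterns discussed next.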
The key advantage of CAS is its conditional nature. CAS only modifies the memory location if it currently holds a specific expected value. This conditionality enables a powerful programming pattern:

1. Read the current value of the shared variable.
2. Compute a new value based on what was read.
3. Use CAS to install the new value only if the variable still holds the value read in step 1.
4. If the CAS fails, another thread intervened; re-read and retry.
This pattern—called optimistic concurrency control—allows multiple threads to make progress without blocking each other, as long as they don't actually conflict. In low-contention scenarios, this dramatically outperforms lock-based approaches.
With TAS or FAA, we cannot implement this pattern because the modification is unconditional. We must know in advance what value to write, independent of the current state. CAS allows the decision to depend on the current state, enabling rich state machines and complex data structures.
// Example: Implementing a bounded counter (0 to MAX)
// This demonstrates why CAS is more powerful than TAS

#define MAX_VALUE 100

// ═══════════════════════════════════════════════════════════════
// IMPOSSIBLE with Test-and-Set
// TAS always writes 1, cannot implement conditional increment
// ═══════════════════════════════════════════════════════════════

// ═══════════════════════════════════════════════════════════════
// POSSIBLE with CAS: Bounded lock-free counter
// ═══════════════════════════════════════════════════════════════

int bounded_counter = 0;

// Returns true if increment succeeded, false if at maximum
bool increment_bounded(void) {
    while (true) {
        int current = bounded_counter;   // 1. Read current state
        if (current >= MAX_VALUE) {
            return false;                // At maximum, cannot increment
        }
        int next = current + 1;          // 2. Compute next state

        // 3. Attempt atomic transition
        if (CAS(&bounded_counter, current, next)) {
            return true;                 // Success!
        }
        // 4. Someone else modified counter, retry
        // The loop continues with new 'current' value
    }
}

// The key insight: CAS allows us to:
// - Check a condition (current < MAX)
// - Compute result based on current value (current + 1)
// - Atomically update only if nothing changed
//
// TAS cannot do this because it cannot:
// - Make the write conditional on current value
// - Write a value that depends on current value

The fundamental power of CAS is enabling atomic read-modify-write operations where the modification depends on the current value. This is exactly what non-atomic operations fail to do safely—CAS makes it atomic. Any algorithm that needs to update state conditionally based on current state can use CAS; simpler primitives cannot express this pattern.
Understanding how CAS is implemented in hardware demystifies its behavior and explains its performance characteristics. Different processor architectures implement CAS through different mechanisms, but all provide the same fundamental guarantee of atomicity.
On Intel and AMD processors, CAS is implemented via the CMPXCHG family of instructions:
- CMPXCHG — Compare and exchange for 8, 16, or 32-bit operands (64-bit operands with a REX.W prefix on x86-64)
- CMPXCHG8B — Compare and exchange 8 bytes (64 bits)
- CMPXCHG16B — Compare and exchange 16 bytes (128 bits, x64 only)

The instruction works as follows:
; CMPXCHG destination, source
; Implicit operand: EAX (or RAX for 64-bit) holds expected value
CMPXCHG [memory], new_value
; If [memory] == EAX:
; [memory] = new_value
; ZF = 1 (zero flag set, indicating success)
; Else:
; EAX = [memory] (load actual value into EAX)
; ZF = 0 (zero flag clear, indicating failure)
For multi-processor systems, CMPXCHG must be preceded by the LOCK prefix to ensure atomicity across all CPUs:
LOCK CMPXCHG [global_var], ebx
The LOCK prefix instructs the processor to:

- Gain exclusive ownership of the affected cache line for the duration of the instruction (older processors asserted a bus lock; modern ones use cache locking, falling back to a bus lock only for split or uncacheable accesses)
- Prevent any other processor from reading or writing the location until the instruction completes
- Act as a full memory barrier, so no loads or stores are reordered across the instruction
; x86-64 assembly: CAS implementation using CMPXCHG
; Function: int cas(int* addr, int expected, int new_val)
; Returns: old value at *addr (caller checks if == expected)

; System V AMD64 ABI:
; RDI = addr (pointer)
; ESI = expected value
; EDX = new value
; Return in EAX

cas:
    mov eax, esi            ; Move expected value to EAX (implicit operand)

    ; The LOCK prefix is ESSENTIAL for multi-processor atomicity
    ; Without LOCK, the operation is only atomic on single-CPU systems
    lock cmpxchg [rdi], edx ; Atomic: if [rdi]==EAX then [rdi]=EDX

    ; After CMPXCHG:
    ; - If matched: [rdi] now contains EDX, EAX unchanged, ZF=1
    ; - If mismatched: EAX now contains actual [rdi] value, ZF=0

    ; Return value is in EAX (the old value)
    ret

; Example usage pattern in pseudo-C:
; int old = cas(&counter, expected, new_val);
; if (old == expected) { /* success */ }
; else { expected = old; /* retry */ }

ARM processors (prior to ARMv8.1) use a different approach called Load-Link/Store-Conditional (LL/SC), implemented as LDREX (Load Exclusive) and STREX (Store Exclusive): LDREX loads a value and marks the address in an exclusive monitor; STREX stores a new value only if the monitor is still valid, and returns a flag indicating whether the store took effect.
The exclusive monitor is invalidated if any other processor writes to the same cache line, or if certain events occur (interrupts, context switches). This provides atomicity without an explicit lock.
| Aspect | x86 CMPXCHG | ARM LDREX/STREX |
|---|---|---|
| Instruction Type | Single instruction does compare-and-swap | Two-instruction sequence (load, then conditional store) |
| Success Guarantee | Always succeeds if value matches | May spuriously fail even if value unchanged |
| Failure Semantics | Returns mismatched value | Returns success/failure indicator |
| Locking Mechanism | LOCK prefix (bus lock or cache lock) | Exclusive monitor (reservation-based) |
| Retry Overhead | Just retry the instruction | Must re-execute LDREX before retrying STREX |
| Spurious Failure | Never | Possible (context switch, other core's access) |
; ARM assembly: CAS using Load-Exclusive/Store-Exclusive
; Function: int cas(int* addr, int expected, int new_val)

; ARM calling convention:
; R0 = addr, R1 = expected, R2 = new_val
; Return in R0

cas:
    ; Outer loop: handles spurious failures
retry:
    LDREX R3, [R0]          ; Load exclusive: R3 = *addr, set monitor
    CMP R3, R1              ; Compare loaded value with expected
    BNE fail                ; If not equal, someone else changed it

    ; Values match, try to store new value
    STREX R4, R2, [R0]      ; Store exclusive: *addr = R2 if monitor valid
                            ; R4 = 0 if success, 1 if failed
    CMP R4, #0              ; Check if STREX succeeded
    BNE retry               ; Spurious failure: monitor invalidated, retry

    ; Success: return the expected value (caller knows success when ret == expected)
    MOV R0, R1
    BX LR

fail:
    ; CAS failed: return the actual value we found
    MOV R0, R3
    BX LR

; Key insight: ARM's LL/SC can fail spuriously!
; STREX may fail even if no one modified the value, because:
; - Context switch occurred between LDREX and STREX
; - Another core accessed the same cache line (even if different variable)
; - Interrupt occurred
; This is why ARM CAS loops must always retry on STREX failure

On ARM and other LL/SC-based architectures, the conditional store can fail even when the memory value hasn't been modified by another thread. This 'spurious failure' occurs due to various CPU events (interrupts, cache-line sharing). Code must always be written to retry on failure—you cannot assume failure means another thread modified the value. This is a key difference from x86's guaranteed non-spurious behavior.
Modern multi-processor systems maintain cache coherence using protocols like MESI (Modified, Exclusive, Shared, Invalid). Understanding how CAS interacts with cache coherence explains its performance characteristics:

- To perform a CAS, the executing core must hold the target cache line in an exclusive (Modified or Exclusive) state
- Acquiring that state invalidates every other core's copy of the line
- If another core then attempts a CAS on the same line, ownership must migrate back, invalidating the first core's copy in turn
This cache coherence activity is the primary source of CAS overhead in high-contention scenarios. When multiple processors contend for the same variable, cache lines bounce between processors, creating significant traffic on the memory bus or interconnect.
By default, atomic CAS operations in mainstream language APIs provide sequentially consistent semantics:

- The CAS is not reordered with surrounding loads and stores
- Writes made before a successful CAS are visible to any thread that subsequently observes the updated value
- All sequentially consistent atomic operations appear in a single total order that every thread agrees on
This makes CAS ideal for implementing synchronization primitives that require memory visibility guarantees.
Modern programming languages provide high-level abstractions for CAS operations, hiding the architecture-specific details while exposing the essential semantics. Understanding these APIs is crucial for writing portable concurrent code.
C11 and C++11 introduced standardized atomic operations, including CAS:
#include <stdatomic.h> // C11
// or
#include <atomic> // C++11
The primary CAS functions are:
- atomic_compare_exchange_strong: Never fails spuriously
- atomic_compare_exchange_weak: May fail spuriously, but can be more efficient in loops
#include <atomic>
#include <iostream>

// C++11 atomic CAS demonstration

std::atomic<int> counter(0);

// Using compare_exchange_strong (no spurious failures)
bool increment_if_below_max_strong(int max_value) {
    int expected = counter.load();      // Read current value
    while (expected < max_value) {
        int desired = expected + 1;
        // CAS: if counter == expected, set counter = desired
        // If it fails, expected is updated to the actual current value
        if (counter.compare_exchange_strong(expected, desired)) {
            return true;                // Successfully incremented
        }
        // expected now contains actual value; loop will check condition
    }
    return false;                       // Counter was at or above max
}

// Using compare_exchange_weak (can fail spuriously, use in loops)
bool increment_weak(void) {
    int expected = counter.load();
    int desired = expected + 1;
    // Weak CAS may fail spuriously on some architectures
    // More efficient in a loop because it maps directly to LL/SC
    while (!counter.compare_exchange_weak(expected, desired)) {
        // On failure, expected is updated to actual value
        desired = expected + 1;
    }
    return true;
}

// Memory ordering control
void increment_with_ordering() {
    int expected = counter.load(std::memory_order_relaxed);
    int desired;
    do {
        desired = expected + 1;
    } while (!counter.compare_exchange_weak(
        expected, desired,
        std::memory_order_release,   // Memory order on success
        std::memory_order_relaxed    // Memory order on failure
    ));
}

/*
 * Key API differences:
 *
 * compare_exchange_strong:
 *   - Returns true only if CAS succeeded
 *   - Returns false only if value was different
 *   - Never fails spuriously
 *   - Use when you need to distinguish "value changed" from "retry needed"
 *
 * compare_exchange_weak:
 *   - May return false even if value matched (spurious failure)
 *   - More efficient on LL/SC architectures (ARM, RISC-V)
 *   - Use in loops where you'd retry anyway
 *   - May compile to fewer instructions on non-x86
 */

Java provides atomic classes since JDK 1.5 and more flexible atomic operations via VarHandle since JDK 9:
import java.util.concurrent.atomic.AtomicInteger;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class CASExample {
    // Classic AtomicInteger approach
    private AtomicInteger counter = new AtomicInteger(0);

    public boolean incrementIfBelowMax(int max) {
        while (true) {
            int current = counter.get();
            if (current >= max) {
                return false;
            }
            // compareAndSet returns true if successful
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
            // CAS failed, someone else modified counter, loop and retry
        }
    }

    // Using getAndUpdate (cleaner, internally uses CAS loop)
    public int incrementBounded(int max) {
        return counter.getAndUpdate(current -> {
            if (current >= max) return current;  // Don't increment
            return current + 1;                  // Increment
        });
    }

    // Modern VarHandle approach (JDK 9+) - more flexible
    private int value = 0;
    private static final VarHandle VALUE_HANDLE;

    static {
        try {
            VALUE_HANDLE = MethodHandles.lookup()
                .findVarHandle(CASExample.class, "value", int.class);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public boolean casWithVarHandle(int expected, int newValue) {
        // VarHandle.compareAndSet works on regular fields
        return VALUE_HANDLE.compareAndSet(this, expected, newValue);
    }

    // compareAndExchange returns the witness value (old value)
    public int casExchange(int expected, int newValue) {
        // Returns the value that was in the field
        // If return == expected, CAS succeeded
        return (int) VALUE_HANDLE.compareAndExchange(this, expected, newValue);
    }
}

/*
 * Java CAS Considerations:
 *
 * 1. AtomicInteger.compareAndSet: Simple boolean return
 *    - Returns true if CAS succeeded
 *    - Doesn't tell you what the actual value was on failure
 *
 * 2. AtomicInteger.compareAndExchange (JDK 9+): Returns witness value
 *    - Returns the value that was actually in the variable
 *    - More efficient for retry loops (no separate load needed)
 *
 * 3. VarHandle: Most flexible, works on any field
 *    - Can add atomic semantics to existing code
 *    - Supports various memory ordering modes
 */

CAS appears under different names across languages:

- C11: atomic_compare_exchange_strong/weak
- C++11: std::atomic<T>::compare_exchange_strong/weak
- Java: AtomicInteger.compareAndSet, VarHandle.compareAndSet
- Rust: AtomicUsize::compare_exchange
- Go: atomic.CompareAndSwapInt32/Int64/Pointer

When writing a CAS retry loop, prefer the 'weak' variant (compare_exchange_weak in C++; in Java, the ordinary compareAndSet is the strong form, though VarHandle also exposes weakCompareAndSet). On LL/SC architectures, weak CAS can be more efficient because it doesn't need to retry internally to eliminate spurious failures—your loop will retry anyway. On x86, weak and strong compile to the same code since CMPXCHG never fails spuriously.
CAS is not without pitfalls. The most famous is the ABA problem, a subtle bug that can occur when a value changes from A to B and back to A between a read and a subsequent CAS operation.
Consider a lock-free stack implemented with CAS:
1. Thread 1 reads top = A (pointing to node A) and prepares new_top = A->next = B.
2. Thread 1 is preempted before its CAS executes.
3. Thread 2 pops A, pops B, and then pushes A back onto the stack.
4. Thread 1 resumes and performs CAS(&top, A, B), which succeeds because top is still A!

But the stack structure has changed—perhaps B is now leaked, or the new A points to invalid memory. The CAS succeeded because it cannot detect that the value was modified and restored.
// Demonstrating the ABA Problem in a lock-free stack

typedef struct Node {
    int data;
    struct Node* next;
} Node;

Node* top = NULL;  // Stack top pointer

// Naive lock-free pop (VULNERABLE TO ABA)
Node* pop_naive(void) {
    while (true) {
        Node* old_top = top;              // Step 1: Read current top
        if (old_top == NULL) {
            return NULL;                  // Stack is empty
        }
        Node* new_top = old_top->next;    // Step 2: Read next pointer

        // DANGER ZONE: Between Step 1 and Step 3, another thread could:
        // - Pop 'old_top' (removing A)
        // - Pop the next node
        // - Push 'old_top' back (re-adding A)
        // Now 'top' equals 'old_top' again, but 'old_top->next' is different!

        // Step 3: CAS to update top
        if (CAS(&top, old_top, new_top)) {
            return old_top;               // Success, but may be corrupted!
        }
    }
}

/*
 * ABA Scenario Timeline:
 *
 * Initial state: top -> A -> B -> C -> NULL
 *
 * Thread 1:
 *   1. old_top = A
 *   2. new_top = A->next = B
 *   [SUSPENDED]
 *
 * Thread 2:
 *   1. pop() returns A, now top -> B -> C -> NULL
 *   2. pop() returns B, now top -> C -> NULL
 *   3. push(A) with A->next = C, now top -> A -> C -> NULL
 *
 * Thread 1 resumes:
 *   3. CAS(&top, A, B) -- SUCCEEDS because top == A !
 *   4. Returns A
 *
 * Result: top = B, but B was already removed and may be:
 *   - Freed/deallocated (use-after-free)
 *   - Reused for something else (data corruption)
 *   - Pointing to invalid memory (crash)
 */

Several techniques address the ABA problem:
1. Tagged Pointers (Version Counting)
Augment each pointer with a version counter. The CAS operates on the combined pointer+counter, and the counter is incremented on every modification. The ABA scenario now has A₁ → B → A₂, where A₁ ≠ A₂.
2. Double-Wide CAS (DCAS/CMPXCHG16B)
Use 128-bit CAS to atomically compare-and-swap both a pointer and a counter. x86-64's CMPXCHG16B supports this.
3. Hazard Pointers
Threads announce which nodes they're accessing. Other threads skip reclamation of announced nodes. Ensures nodes aren't recycled while in use.
4. Epoch-Based Reclamation
Nodes are only recycled after all threads have completed any operation that started before the node was removed.
5. Reference Counting
Maintain reference counts on nodes. Nodes are only freed when their reference count reaches zero.
// Solution 1: Tagged Pointers using CMPXCHG16B
// Combine pointer with a monotonic version counter

#include <stdint.h>
#include <stdbool.h>

// Tagged pointer: 64-bit pointer + 64-bit counter = 128 bits
typedef struct {
    Node* ptr;
    uint64_t tag;
} __attribute__((aligned(16))) TaggedPointer;

TaggedPointer top = {NULL, 0};

// Atomic 128-bit CAS (uses CMPXCHG16B on x86-64)
bool cas128(TaggedPointer* target, TaggedPointer expected, TaggedPointer desired) {
    // Compiler intrinsic or inline assembly for CMPXCHG16B
    return __atomic_compare_exchange(
        target, &expected, &desired,
        false,  // strong
        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST
    );
}

// ABA-safe pop using tagged pointers
Node* pop_safe(void) {
    while (true) {
        TaggedPointer old_top = top;   // Atomically read both ptr and tag
        if (old_top.ptr == NULL) {
            return NULL;
        }
        Node* next = old_top.ptr->next;

        // Create new tagged pointer with incremented tag
        TaggedPointer new_top = {
            .ptr = next,
            .tag = old_top.tag + 1     // Increment tag on every modification
        };

        // 128-bit CAS: both pointer AND tag must match
        if (cas128(&top, old_top, new_top)) {
            return old_top.ptr;
        }
        // If tag changed (even if ptr is same), CAS fails - ABA prevented!
    }
}

/*
 * How tagged pointers prevent ABA:
 *
 * Initial: top = {ptr: A, tag: 0}
 *
 * Thread 1:
 *   old_top = {A, 0}
 *   [SUSPENDED]
 *
 * Thread 2:
 *   pop()    -> top = {B, 1}   // tag incremented
 *   pop()    -> top = {C, 2}   // tag incremented
 *   push(A)  -> top = {A, 3}   // tag incremented
 *
 * Thread 1 resumes:
 *   CAS(top, {A, 0}, ...) -> FAILS!
 *   Because current top is {A, 3}, not {A, 0}
 *   Tag mismatch prevents false success!
 */

The ABA problem is often not caught by testing because it requires specific, rare timing conditions. Code with ABA vulnerabilities may work correctly for years before failing catastrophically. When implementing lock-free data structures, always explicitly consider and address ABA. The safest approach is to use established libraries (like Java's concurrent collections) that have been thoroughly vetted, rather than rolling your own lock-free code.
CAS operations interact with the memory model in important ways. Understanding memory ordering is essential for correct concurrent programming, as modern processors and compilers aggressively reorder memory operations for performance.
Without proper memory ordering, changes made by one thread may not be visible to other threads—or may become visible in a surprising order. Consider:
// Thread 1
data = 42; // (A)
ready = true; // (B)
// Thread 2
while (!ready); // (C)
assert(data == 42); // (D) — May fail!
The assertion can fail because:
- The compiler may reorder (A) and (B), which appear independent within Thread 1
- The CPU's store buffer may make ready = true globally visible before data = 42
- As a result, Thread 2 can observe ready = true before it sees data = 42

C11/C++11 define several memory orderings:
| Ordering | Guarantees | Use Case |
|---|---|---|
| memory_order_relaxed | Only atomicity, no ordering constraints | Counters where order doesn't matter |
| memory_order_acquire | Reads after this see writes released elsewhere | Reading a lock, loading a flag |
| memory_order_release | Writes before this are visible to acquires | Writing a lock, storing a flag |
| memory_order_acq_rel | Both acquire and release (for RMW ops) | Lock transfer, CAS in some protocols |
| memory_order_seq_cst | Full sequential consistency, total order | Default, safest, most intuitive (slowest) |
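Connecting the table back to the earlier data/ready example, here is a hedged sketch of how a release/acquire pair repairs it, assuming C++ std::atomic; producer and consumer are illustrative names.

#include <atomic>
#include <cassert>

int data = 0;
std::atomic<bool> ready{false};

void producer() {                                      // Thread 1
    data = 42;                                         // (A)
    ready.store(true, std::memory_order_release);      // (B) publishes (A)
}

void consumer() {                                      // Thread 2
    while (!ready.load(std::memory_order_acquire)) {}  // (C) synchronizes with (B)
    assert(data == 42);                                // (D) now guaranteed to hold
}

The release store and the acquire load form a happens-before edge: once Thread 2 observes ready == true, it is also guaranteed to observe data == 42.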
CAS is a read-modify-write operation, so it touches memory in two ways: it reads the current value and potentially writes a new value. This leads to an interesting question: what memory ordering does a CAS provide?
Default behavior (seq_cst): By default, most languages use sequentially consistent ordering for CAS. This provides the strongest guarantees:

- The CAS has both acquire and release semantics, so no loads or stores are reordered across it
- Writes made before the CAS are visible to any thread that later observes its result
- All sequentially consistent operations form a single total order observed identically by every thread
Custom ordering in C++: C++ allows specifying separate orderings for success and failure cases:
#include <atomic>

std::atomic<bool> lock(false);
int shared_data = 0;

void acquire_lock() {
    bool expected = false;
    while (!lock.compare_exchange_weak(
        expected, true,
        std::memory_order_acquire,  // On SUCCESS: acquire semantics
                                    // - All subsequent reads see prior writes
        std::memory_order_relaxed   // On FAILURE: no ordering needed
                                    // - We're just going to retry anyway
    )) {
        expected = false;           // Reset expected and retry
    }

    // CRITICAL SECTION
    // memory_order_acquire ensures we see all writes made before
    // the previous release of this lock
    shared_data++;                  // Safe to access shared_data
}

void release_lock() {
    // memory_order_release ensures all our writes are visible
    // to threads that subsequently acquire the lock
    lock.store(false, std::memory_order_release);
}

/*
 * Memory Ordering Reasoning:
 *
 * Thread 1 (lock holder):
 *   [writes to shared_data]
 *   lock.store(false, release)                   <-- RELEASE barrier
 *
 * Thread 2 (acquiring):
 *   lock.compare_exchange(..., acquire, relaxed) <-- ACQUIRE barrier
 *   [reads from shared_data]
 *
 * The release-acquire pair creates a "happens-before" relationship:
 * All writes by Thread 1 before the release ARE GUARANTEED to be
 * visible to Thread 2's reads after the acquire.
 *
 * Why relaxed on failure?
 * - If CAS fails, we didn't acquire the lock
 * - No synchronization is happening
 * - We're about to retry, so no ordering constraints needed
 * - Relaxed is free or nearly free on most architectures
 */

x86/x64: The x86 memory model is relatively strong (Total Store Order). LOCK-prefixed instructions like CMPXCHG act as full memory barriers, preventing any reordering around them. This means even memory_order_relaxed CAS on x86 gets acquire-release semantics "for free".
ARM/RISC-V: These architectures have weaker memory models. Memory barriers must be explicitly generated based on the requested memory order. A relaxed CAS truly provides only atomicity, with separate barrier instructions needed for ordering.
This architectural difference has implications:

- Code with incorrect (too-weak) memory orderings may happen to work on x86 and then fail on ARM or RISC-V
- Relaxed and acquire/release CAS can be meaningfully cheaper than seq_cst on weakly ordered architectures, while on x86 the difference is usually small
- Portable code must be reasoned about against the language memory model, not against the behavior of a single architecture
If you're unsure about memory ordering, use the default sequential consistency (seq_cst). It's the safest option and easiest to reason about. Only optimize to weaker orderings when you've proven correctness and profiling shows memory barriers are a bottleneck. Incorrect memory ordering leads to bugs that are extremely difficult to diagnose—they may only appear on specific architectures or under specific timing conditions.
While CAS is powerful, it has limitations that affect algorithm design and performance. Understanding these limitations helps in choosing appropriate synchronization strategies.
Standard CAS operates on a single memory word (typically 32 or 64 bits, or 128 bits with CMPXCHG16B). This creates challenges when atomically updating multiple locations.
The Multi-Word Update Problem:
// Cannot make this atomic with single CAS:
struct Account {
int balance;
int last_transaction;
};
// We want: atomically update both balance AND last_transaction
// But CAS can only compare-and-swap one field at a time
Solutions:

- Pack the related fields into a single machine word so one CAS covers both (sketched below)
- Use double-wide CAS (CMPXCHG16B) to cover two adjacent words, as in the tagged-pointer example earlier
- Build multi-word CAS (MCAS) in software from single-word CAS, at a substantial cost in complexity and performance
- Fall back to locks or transactional memory when several independent locations must change together
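Below is a hedged sketch of the first option, packing both 32-bit fields into one 64-bit atomic so that a single CAS updates them together; AccountState, pack, unpack, and apply_transaction are illustrative names.

#include <atomic>
#include <cstdint>

struct AccountState {
    int32_t balance;
    int32_t last_transaction;
};

std::atomic<uint64_t> account{0};  // both fields live in one 64-bit word

static uint64_t pack(AccountState s) {
    // high 32 bits = balance, low 32 bits = last_transaction
    return ((uint64_t)(uint32_t)s.balance << 32) | (uint32_t)s.last_transaction;
}

static AccountState unpack(uint64_t w) {
    return { (int32_t)(w >> 32), (int32_t)(uint32_t)w };
}

void apply_transaction(int32_t amount, int32_t txn_id) {
    uint64_t old_word = account.load();
    AccountState s;
    do {
        s = unpack(old_word);
        s.balance += amount;
        s.last_transaction = txn_id;
        // On failure, compare_exchange_weak refreshes old_word with the current value
    } while (!account.compare_exchange_weak(old_word, pack(s)));
}

This works only when the combined state fits in a word the hardware can CAS atomically; beyond that, one of the other options is needed.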
CAS performance degrades significantly under high contention:
The Contention Storm:
When many threads compete for the same CAS:

- Every thread reads the same current value and computes its own new value
- Exactly one CAS succeeds; all the others fail and must retry
- Each retry repeats the read-compute-CAS cycle, so the wasted work grows with the number of contending threads
Cache Line Bouncing:
Every CAS requires exclusive ownership of the cache line. Under contention:

- The line ping-pongs between cores as each attempted CAS invalidates the other copies
- Each ownership transfer costs tens to hundreds of cycles, far more than the CAS itself
- Coherence traffic on the interconnect grows with the number of contending cores
| Contention Level | CAS Success Rate | Cache Behavior | Effective Performance |
|---|---|---|---|
| None (single thread) | 100% | Cache line stable | Maximum throughput |
| Low (few threads, rare conflict) | ~90%+ | Occasional cache miss | Good throughput |
| Medium (frequent conflict) | ~50-80% | Frequent cache line bouncing | Degraded throughput |
| High (constant conflict) | <20% | Cache line thrashing | Near-zero effective throughput |
Several techniques can reduce CAS contention:
1. Exponential Backoff
When CAS fails, wait before retrying. Increase wait time exponentially on repeated failures:
int backoff = 1;
while (!CAS(...)) {
for (int i = 0; i < backoff; i++) {
cpu_pause(); // Yield CPU momentarily
}
backoff = min(backoff * 2, MAX_BACKOFF);
}
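For reference, a compilable variant of the same loop, assuming C++ std::atomic and the x86-specific _mm_pause() hint from <immintrin.h> (other architectures need a different pause instruction); the names are illustrative.

#include <atomic>
#include <algorithm>
#include <immintrin.h>   // _mm_pause (x86-specific)

std::atomic<int> shared_value{0};

void add_with_backoff(int delta) {
    int backoff = 1;
    const int MAX_BACKOFF = 1024;
    int expected = shared_value.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak refreshes 'expected' with the current value
    while (!shared_value.compare_exchange_weak(expected, expected + delta)) {
        for (int i = 0; i < backoff; i++)
            _mm_pause();                           // ease pressure on the contended line
        backoff = std::min(backoff * 2, MAX_BACKOFF);
    }
}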
2. Hierarchical Data Structures
Spread contention across multiple CAS targets. Instead of one central counter, use per-thread or per-shard counters:
// Instead of:
atomic<int> global_counter; // Massive contention
// Use:
atomic<int> counters[NUM_SHARDS]; // Spread load
int total() {
int sum = 0;
for (int i = 0; i < NUM_SHARDS; i++) sum += counters[i];
return sum;
}
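One refinement related to the cache-line bouncing discussed above: if the shards are packed densely in a single array, they can still falsely share cache lines. A hedged sketch that pads each shard to its own (assumed 64-byte) cache line, using C++ std::atomic with illustrative names:

#include <atomic>

constexpr int NUM_SHARDS = 16;
constexpr int CACHE_LINE = 64;   // assumed cache-line size

struct alignas(CACHE_LINE) PaddedCounter {
    std::atomic<int> value{0};
    // alignas pads each array element to a full cache line, so updates to
    // different shards do not invalidate each other's lines
};

PaddedCounter counters[NUM_SHARDS];

void add(int thread_id, int delta) {
    counters[thread_id % NUM_SHARDS].value.fetch_add(delta, std::memory_order_relaxed);
}

int total() {
    int sum = 0;
    for (int i = 0; i < NUM_SHARDS; i++)
        sum += counters[i].value.load(std::memory_order_relaxed);
    return sum;
}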
3. Elimination
For operations with inverses (e.g., push and pop), pair up concurrent threads so their operations cancel each other out: a push and a pop can exchange values directly through an elimination array without touching the shared structure at all.
4. Combining
Aggregate multiple operations: threads publish their requests to a shared buffer, and a single combining thread applies the whole batch in one synchronization step, replacing many contended CAS attempts with one.
CAS is not always the best choice. For high-contention scenarios with long critical sections, traditional blocking locks may perform better. Locks allow the holder to complete unimpeded while waiters sleep, rather than having all threads burn CPU in retry loops. The rule of thumb: use CAS for short, simple atomic updates with low-to-moderate contention; use locks for longer critical sections or under high contention.
We have thoroughly explored the Compare-and-Swap operation—the foundational atomic primitive that powers modern concurrent programming. With these foundations in place, we now turn to lock-free programming patterns.
Now that we understand the CAS primitive itself, the next page explores lock-free programming—a programming paradigm that uses CAS to build data structures and algorithms that never block, never deadlock, and provide guaranteed system-wide progress even when individual threads are delayed.
We will examine:
Lock-free programming represents one of the most powerful applications of CAS, enabling concurrent systems that are immune to many problems associated with traditional locking.
You now have a deep understanding of the Compare-and-Swap operation—its semantics, hardware implementation, language APIs, pitfalls like ABA, and performance characteristics. This knowledge forms the foundation for understanding lock-free programming and CAS-based synchronization constructs in the pages that follow.