Every CAS-based algorithm—from simple counters to sophisticated lock-free data structures—relies on a common programming pattern: the CAS loop. This deceptively simple construct (read, compute, CAS, retry) is the fundamental idiom of atomic programming.
Getting CAS loops right requires understanding subtle issues: When should you retry? How do you avoid livelock? What memory ordering is sufficient? How do you handle spurious failures? What optimizations are safe?
This page provides a comprehensive treatment of CAS loop patterns, best practices, common pitfalls, and optimizations. Mastering these patterns is essential for writing correct and efficient concurrent code.
By the end of this page, you will understand the canonical CAS loop pattern, when to use weak vs strong CAS, how to handle contention with backoff strategies, common bugs in CAS loops and how to avoid them, and optimization techniques for high-performance CAS code. You will be equipped to write correct CAS-based algorithms confidently.
The basic CAS loop follows a simple template that applies universally across all CAS-based algorithms:
do {
old_value = read(location);
new_value = compute(old_value);
} while (!CAS(location, old_value, new_value));
Let's break down each step:
- Read: load the current value from the location.
- Compute: derive the desired new value from what was read.
- CAS and retry: attempt to swap the old value for the new one; if another thread changed the location in the meantime, the CAS fails and the loop repeats from the read.
This pattern implements optimistic concurrency control: we optimistically compute the new state as if no one else will interfere, then verify at commit time (CAS) that our assumption was correct.
#include <atomic>

std::atomic<int> counter{0};

// Pattern 1: Simple update (increment by delta)
void add_to_counter(int delta) {
    int old_value = counter.load(std::memory_order_relaxed);
    // When using compare_exchange_weak:
    // - 'old_value' is updated to the actual value on failure
    // - This means we don't need to call load() again!
    while (!counter.compare_exchange_weak(
        old_value,                  // Expected (updated on failure)
        old_value + delta,          // Desired
        std::memory_order_release,  // Success ordering
        std::memory_order_relaxed   // Failure ordering
    )) {
        // On failure, old_value now contains the current value
        // Loop naturally retries with the correct expected value
    }
}

// Pattern 2: Conditional update (only if condition met)
bool increment_if_below_max(int max_value) {
    int old_value = counter.load(std::memory_order_relaxed);
    while (old_value < max_value) {  // Check condition
        int new_value = old_value + 1;
        if (counter.compare_exchange_weak(
            old_value, new_value,
            std::memory_order_release,
            std::memory_order_relaxed
        )) {
            return true;  // Successfully incremented
        }
        // old_value updated to current; loop rechecks condition
    }
    return false;  // max_value reached
}

// Pattern 3: State transition (from specific state to another)
enum State { IDLE, RUNNING, STOPPED };
std::atomic<State> state{IDLE};

bool try_start() {
    State expected = IDLE;  // Only start from IDLE
    // Here we use compare_exchange_strong because:
    // - We only want ONE attempt from this specific state
    // - If we're not IDLE, we don't want to retry (return false)
    return state.compare_exchange_strong(
        expected, RUNNING,
        std::memory_order_acq_rel
    );
}
A crucial aspect of the CAS loop is that compare_exchange updates the expected value on failure. This is not just a convenience—it's essential for correct loop operation.
int expected = current_value;
if (!counter.compare_exchange_weak(expected, new_value)) {
// 'expected' now contains the ACTUAL value found
// We can use it directly without another load()
}
This behavior allows the loop to:
- Retry immediately with an up-to-date expected value, without issuing a separate load()
- Recheck any loop condition against the value the CAS actually observed
- Avoid the window in which the value could change between a reload and the next CAS
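A tiny single-threaded sketch makes the update-on-failure behavior concrete (the values here are arbitrary):
#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> value{42};
    int expected = 0;  // Deliberately wrong guess
    bool ok = value.compare_exchange_strong(expected, 100);
    // CAS failed (42 != 0), and 'expected' was overwritten with 42
    std::cout << "success=" << ok << " expected=" << expected << "\n";
    // Prints: success=0 expected=42
    return 0;
}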
There are two common structures for CAS loops, both equivalent:
Do-while style (try at least once):
do {
expected = location.load();
desired = compute(expected);
} while (!location.compare_exchange_weak(expected, desired));
While style (check exit condition first):
int expected = location.load();
while (!location.compare_exchange_weak(expected, compute(expected)));
The second style leverages the expected-value-update behavior more directly.
Never reload the value inside a CAS loop if you're using compare_exchange_weak/strong. These functions update the expected parameter on failure, giving you the current value automatically. An extra load() is not only unnecessary but may introduce a subtle bug (the value could change between your load and the next CAS).
C++ and other languages offer two variants of CAS: compare_exchange_strong and compare_exchange_weak. Understanding the distinction is crucial for writing correct and efficient code.
compare_exchange_strong guarantees that CAS fails only if the value didn't match: failure implies *addr != expected.
compare_exchange_weak may fail spuriously even if the value matched: failure implies *addr != expected, OR a spurious failure occurred (typical on LL/SC architectures when the reservation is lost).
| Use Case | Recommended | Reason |
|---|---|---|
| CAS in a retry loop | compare_exchange_weak | You're retrying anyway; spurious fail just causes another harmless iteration |
| Single CAS attempt | compare_exchange_strong | You need to know if it actually failed due to value mismatch |
| Conditional update (exit loop on condition) | Either | Depends on whether spurious failure affects correctness |
| State transition with side effects | compare_exchange_strong | Can't risk spurious failure if success has side effects |
| Hot loop, ARM/RISC-V platform | compare_exchange_weak | Avoids extra overhead of strong CAS implementation on LL/SC |
#include <atomic>
#include <cstdlib>  // free()

std::atomic<int> counter{0};
std::atomic<bool> flag{false};

// ═══════════════════════════════════════════════════════════════
// GOOD: Use 'weak' in a retry loop
// ═══════════════════════════════════════════════════════════════
void increment_counter() {
    int expected = counter.load();
    // Weak is fine here - we're in a loop that retries anyway
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // Spurious failure? Just retry. Same cost.
    }
}

// ═══════════════════════════════════════════════════════════════
// GOOD: Use 'strong' for single attempt / state transition
// ═══════════════════════════════════════════════════════════════
bool try_set_flag() {
    bool expected = false;
    // Strong because: if it fails, we want to know it REALLY failed
    // (i.e., flag was already set by someone else)
    return flag.compare_exchange_strong(expected, true);
}

// ═══════════════════════════════════════════════════════════════
// CAREFUL: Conditional single attempt with side effects
// ═══════════════════════════════════════════════════════════════
std::atomic<int*> global_ptr{nullptr};

bool try_install_pointer(int* new_ptr) {
    int* expected = nullptr;
    // Using 'weak' here is RISKY if there are side effects
    // Suppose: if we fail, we free new_ptr
    // WRONG with weak:
    //   if (!global_ptr.compare_exchange_weak(expected, new_ptr)) {
    //       free(new_ptr);  // BUG: a spurious failure means nothing was
    //       return false;   // installed, yet we give up and report failure
    //   }
    // The caller would believe another thread installed a pointer
    // when in fact the slot is still null.
    // CORRECT: use strong when failure semantics matter
    if (!global_ptr.compare_exchange_strong(expected, new_ptr)) {
        free(new_ptr);  // Really failed - someone else installed; safe to free
        return false;
    }
    return true;
}

// ═══════════════════════════════════════════════════════════════
// ALSO OK: weak in conditional loop if failure is re-checked
// ═══════════════════════════════════════════════════════════════
bool increment_if_positive() {
    int expected = counter.load();
    while (expected > 0) {  // Re-check condition on each iteration
        // Weak is OK: spurious failure -> rechecks condition with new value
        if (counter.compare_exchange_weak(expected, expected + 1)) {
            return true;
        }
        // expected updated to current value; loop condition revalidates
    }
    return false;  // counter is non-positive
}
On x86, there's no difference between weak and strong CAS in generated code—both map to LOCK CMPXCHG, which never fails spuriously. The distinction matters on ARM and similar LL/SC architectures, where weak CAS compiles to a simpler instruction sequence without retry logic. Writing portable code means using weak in loops (for ARM efficiency) and strong for single attempts (for correctness guarantees).
Under high contention, CAS loops can exhibit pathological behavior: many threads competing for the same CAS, most failing, all retrying immediately, creating a storm of cache coherence traffic. Effective contention management is essential for scalable performance.
When multiple threads fail CAS simultaneously and retry immediately:
- The cache line ping-pongs between cores, since every retry demands exclusive ownership
- Coherence traffic saturates the interconnect, slowing even the winning thread
- Most retries fail again, so the cycle repeats and throughput collapses
This is similar to the thundering herd problem in lock acquisition. The solution is also similar: backoff.
#include <atomic>
#include <algorithm>
#include <chrono>
#include <thread>
#include <random>

std::atomic<int> counter{0};

// ═══════════════════════════════════════════════════════════════
// Strategy 1: Exponential Backoff
// ═══════════════════════════════════════════════════════════════
void increment_with_exponential_backoff() {
    constexpr int MIN_BACKOFF = 1;
    constexpr int MAX_BACKOFF = 1024;
    int backoff = MIN_BACKOFF;

    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    )) {
        // Backoff: spin for 'backoff' iterations
        for (int i = 0; i < backoff; ++i) {
            __builtin_ia32_pause();  // x86 PAUSE to reduce power & pipeline stalls
        }
        // Exponential increase, capped at MAX_BACKOFF
        backoff = std::min(backoff * 2, MAX_BACKOFF);
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 2: Randomized Backoff
// ═══════════════════════════════════════════════════════════════
thread_local std::mt19937 rng(std::random_device{}());

void increment_with_random_backoff() {
    constexpr int MAX_BACKOFF = 512;
    int backoff = 16;

    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    )) {
        // Random delay to desynchronize retry attempts
        std::uniform_int_distribution<int> dist(0, backoff - 1);
        int delay = dist(rng);
        for (int i = 0; i < delay; ++i) {
            __builtin_ia32_pause();
        }
        // Increase range for next failure
        backoff = std::min(backoff * 2, MAX_BACKOFF);
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 3: Yielding Backoff (for long contention)
// ═══════════════════════════════════════════════════════════════
void increment_with_yield_backoff() {
    int spin_count = 0;
    constexpr int SPINS_BEFORE_YIELD = 100;

    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    )) {
        ++spin_count;
        if (spin_count < SPINS_BEFORE_YIELD) {
            // First few attempts: just pause
            __builtin_ia32_pause();
        } else {
            // Many failures: yield CPU to other threads
            std::this_thread::yield();
            spin_count = 0;  // Reset counter after yield
        }
    }
}
Exponential backoff is the most common choice because:
- It adapts automatically: light contention produces short waits, heavy contention produces long ones
- The cap (MAX_BACKOFF) bounds the worst-case delay any single retry can add
- It is simple to implement and costs almost nothing per iteration
Randomization is important to prevent synchronized retries:
- Threads that fail together would otherwise back off for identical durations and collide again on the next attempt
- Random jitter desynchronizes the retries, spreading them out over time
Yielding is appropriate when:
- Contention persists long enough that spinning would waste an entire timeslice
- There are more runnable threads than cores, so the CPU is better spent elsewhere
- Operation latency matters less than overall system throughput
In extremely hot paths (millions of CAS per second), even minimal backoff overhead can be significant. Some designs accept no backoff, relying on the self-regulating nature of CAS contention. However, this can lead to livelock-like behavior under extreme contention.
Backoff parameters (min, max, growth rate) significantly affect performance. Too aggressive backoff wastes time in low contention; too conservative backoff wastes CPU and bus bandwidth in high contention. Optimal values depend on workload, hardware, and core count. Production systems often auto-tune backoff parameters based on observed failure rates.
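As a sketch of such auto-tuning (the AdaptiveBackoff class, its thresholds, and the doubling/halving policy are all illustrative inventions, not a standard API): the spin window widens on CAS failure and shrinks after success, so each thread converges toward the contention level it actually observes.
#include <atomic>
#include <algorithm>

// Hypothetical adaptive backoff: tunes its window from observed failure rates.
// Names and constants here are illustrative choices, not a standard API.
class AdaptiveBackoff {
    int limit_ = 1;  // Current spin limit (iterations)
    static constexpr int MIN_LIMIT = 1;
    static constexpr int MAX_LIMIT = 4096;
public:
    void on_failure() {
        // Contention observed: widen the window, then wait
        limit_ = std::min(limit_ * 2, MAX_LIMIT);
        for (int i = 0; i < limit_; ++i) {
            __builtin_ia32_pause();  // x86/GCC-specific, as in the strategies above
        }
    }
    void on_success() {
        // Success suggests contention eased: shrink the window
        limit_ = std::max(limit_ / 2, MIN_LIMIT);
    }
};

std::atomic<int> tuned_counter{0};

void increment_adaptive() {
    // thread_local: tuning state persists across calls within each thread
    thread_local AdaptiveBackoff backoff;
    int expected = tuned_counter.load(std::memory_order_relaxed);
    while (!tuned_counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    )) {
        backoff.on_failure();
    }
    backoff.on_success();
}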
CAS loops are deceptively simple, and subtle bugs are common. Here are the most frequent mistakes and how to avoid them.
The most basic bug: not checking whether CAS succeeded.
#include <atomic>

std::atomic<int> counter{0};
std::atomic<int*> ptr{nullptr};

// ═══════════════════════════════════════════════════════════════
// BUG 1: Ignoring CAS result
// ═══════════════════════════════════════════════════════════════
void wrong_increment() {
    int old_val = counter.load();
    int new_val = old_val + 1;
    // BUG: CAS might fail if another thread updated counter!
    counter.compare_exchange_strong(old_val, new_val);
    // We proceed as if it succeeded, but it might not have!
}

void correct_increment() {
    int old_val = counter.load();
    while (!counter.compare_exchange_weak(old_val, old_val + 1)) {
        // Retry loop handles failures
    }
}

// ═══════════════════════════════════════════════════════════════
// BUG 2: Reloading after CAS failure
// ═══════════════════════════════════════════════════════════════
void wrong_reload() {
    int old_val = counter.load();
    while (!counter.compare_exchange_weak(old_val, old_val + 1)) {
        old_val = counter.load();  // BUG: Unnecessary reload!
        // compare_exchange already updated old_val on failure
        // This load might get a DIFFERENT value than what CAS saw
    }
}

void correct_no_reload() {
    int old_val = counter.load();
    while (!counter.compare_exchange_weak(old_val, old_val + 1)) {
        // old_val was already updated to the current value by the failed CAS
        // No reload needed!
    }
}

// ═══════════════════════════════════════════════════════════════
// BUG 3: Not updating expected between loop iterations
// ═══════════════════════════════════════════════════════════════
bool wrong_conditional() {
    int old_val = counter.load();
    while (old_val < 100) {
        int expected = old_val;  // BUG: CAS updates this copy, not old_val
        if (counter.compare_exchange_weak(expected, old_val + 1)) {
            return true;
        }
        // old_val is never refreshed: the loop condition keeps testing a
        // stale value and every retry presents the same wrong 'expected',
        // so this loop can spin forever under contention
    }
    return false;
}

bool correct_conditional() {
    int old_val = counter.load();
    while (old_val < 100) {
        // Pass old_val itself: a failed CAS writes the current value into it,
        // so both the retry and the loop condition see fresh data
        if (counter.compare_exchange_weak(old_val, old_val + 1)) {
            return true;
        }
    }
    return false;
}

// ═══════════════════════════════════════════════════════════════
// BUG 4: Missing memory ordering
// ═══════════════════════════════════════════════════════════════
std::atomic<bool> ready{false};
int data = 0;

void wrong_ordering() {
    data = 42;
    // BUG: relaxed ordering doesn't guarantee 'data' write is visible
    // before 'ready' write is seen by other threads
    ready.store(true, std::memory_order_relaxed);  // BUG!
}

void correct_ordering() {
    data = 42;
    // release ensures all prior writes (including 'data') are visible
    // to threads that see ready==true with acquire semantics
    ready.store(true, std::memory_order_release);  // Correct!
}

// ═══════════════════════════════════════════════════════════════
// BUG 5: Using local variable after CAS (stale reference)
// ═══════════════════════════════════════════════════════════════
struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

void wrong_stale_reference() {
    Node* old_head = head.load();
    if (old_head == nullptr) return;
    Node* new_head = old_head->next;  // BUG: old_head may already be freed here!
    // Between the load and this dereference, another thread could have
    // popped and freed the node -- a use-after-free hazard even before the CAS
    if (head.compare_exchange_strong(old_head, new_head)) {
        int value = old_head->value;
        // With ABA, even a successful CAS doesn't prove we own the node
        (void)value;
    }
}
// Solution: use hazard pointers, epoch-based reclamation, etc.
| Bug | Symptom | Fix |
|---|---|---|
| Ignoring CAS result | Lost updates, inconsistent state | Always check return value; use retry loop |
| Unnecessary reload | Subtle race between load and CAS | Let CAS update expected value on failure |
| Wrong memory ordering | Data visible before flag, or vice versa | Use appropriate acquire/release ordering |
| Use-after-free | Crash, corruption | Use memory reclamation (hazard pointers, etc.) |
| ABA vulnerability | Corrupted data structures | Use tagged pointers or reclamation schemes |
| Infinite loop on failure | Hang under contention | Add backoff or yield |
CAS bugs often don't manifest in testing because they require specific timing conditions. A CAS loop that works 99.99% of the time can fail catastrophically under load, causing hard-to-reproduce production issues. Careful code review, formal reasoning, and tools like ThreadSanitizer are essential. Never assume correctness from passing tests alone.
Once a CAS loop is correct, several optimizations can improve performance. These range from simple tweaks to significant structural changes.
Before attempting CAS on a lock or guarded variable, check if it's even worth trying:
#include <atomic>

std::atomic<bool> lock{false};
std::atomic<int> counter{0};

// ═══════════════════════════════════════════════════════════════
// Optimization 1: Read before writing (TTAS pattern)
// ═══════════════════════════════════════════════════════════════
void acquire_lock_ttas() {
    while (true) {
        // READ first (cheap, can use shared cache line)
        while (lock.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();
        }
        // Now try CAS (expensive, requires exclusive cache line)
        bool expected = false;
        if (lock.compare_exchange_weak(
            expected, true,
            std::memory_order_acquire,
            std::memory_order_relaxed
        )) {
            return;  // Got it!
        }
        // Someone else grabbed it first, back to reading
    }
}

// ═══════════════════════════════════════════════════════════════
// Optimization 2: Hoisting invariant checks out of loop
// ═══════════════════════════════════════════════════════════════
std::atomic<int> bounded_counter{0};
constexpr int MAX_VALUE = 100;

bool increment_bounded_optimized() {
    int expected = bounded_counter.load(std::memory_order_relaxed);
    // Check termination condition ONCE before potentially expensive loop
    if (expected >= MAX_VALUE) {
        return false;  // Fast path: already at max
    }
    do {
        if (expected >= MAX_VALUE) {
            return false;  // Rechecked condition after failure
        }
    } while (!bounded_counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    ));
    return true;
}

// ═══════════════════════════════════════════════════════════════
// Optimization 3: Batching operations to reduce contention
// ═══════════════════════════════════════════════════════════════
std::atomic<int> shared_counter{0};
thread_local int local_count = 0;

void increment_batched() {
    // Accumulate locally
    local_count++;
    // Flush to shared counter periodically
    if (local_count >= 100) {
        int old = shared_counter.load(std::memory_order_relaxed);
        while (!shared_counter.compare_exchange_weak(
            old, old + local_count,
            std::memory_order_release,
            std::memory_order_relaxed
        )) {}
        local_count = 0;
    }
}

int get_total() {
    // Note: reading from all thread-local counts would be needed
    // for an exact count; shared_counter is approximate
    return shared_counter.load(std::memory_order_acquire);
}

// ═══════════════════════════════════════════════════════════════
// Optimization 4: Using fetch_add when applicable
// ═══════════════════════════════════════════════════════════════
void increment_simple() {
    // For simple increment, fetch_add is more efficient than a CAS loop
    // Hardware can optimize fetch_add better than repeated CAS
    counter.fetch_add(1, std::memory_order_relaxed);
}

// CAS loop only needed when:
// - Update depends on current value in a complex way
// - Conditional update required
// - Multiple fields must change atomically (via pointer swap)
| Technique | Benefit | When to Use |
|---|---|---|
| Read-before-CAS (TTAS) | Reduces write traffic when contended | Lock acquisition, guarded variables |
| Hoist invariant checks | Avoids unnecessary CAS attempts | Conditional updates with fast-path exits |
| Batching | Dramatically reduces contention | Counters, statistics, non-critical aggregates |
| Use fetch_add/fetch_sub | More efficient than CAS for simple ops | Simple increments/decrements |
| Sharding | Spreads contention across multiple variables | High-contention counters, data structures |
| Weak CAS in loops | Avoids overhead of strong CAS on LL/SC | Any CAS retry loop |
For simple operations like increment/decrement, use specialized atomics (fetch_add, fetch_sub, fetch_or, etc.) instead of CAS loops. These map directly to hardware instructions (LOCK XADD on x86) and are more efficient than compare-and-swap. Reserve CAS loops for operations that truly require conditional updates based on current value.
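The table above also lists sharding, which the code examples don't show. Here is a minimal sketch, assuming a fixed shard count and a hash of the thread id as the shard index; the ShardedCounter name and its constants are invented for illustration:
#include <atomic>
#include <array>
#include <functional>
#include <thread>

// Hypothetical sharded counter: spreads increments across several
// cache-line-separated atomics so concurrent writers rarely collide.
class ShardedCounter {
    static constexpr size_t NUM_SHARDS = 16;
    struct alignas(64) Shard {          // 64: typical cache line size
        std::atomic<long> value{0};
    };
    std::array<Shard, NUM_SHARDS> shards_;

    size_t my_shard() const {
        // Hash the thread id to pick a shard; any stable per-thread index works
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % NUM_SHARDS;
    }
public:
    void increment() {
        shards_[my_shard()].value.fetch_add(1, std::memory_order_relaxed);
    }
    long read() const {
        // Sum of shards: a moment-in-time approximation under concurrency
        long total = 0;
        for (const auto& s : shards_) {
            total += s.value.load(std::memory_order_relaxed);
        }
        return total;
    }
};
Reads become more expensive (they must visit every shard), so sharding fits write-heavy counters whose exact value is read rarely.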
While CAS typically operates on single machine words (32 or 64 bits), many algorithms require atomic updates of more complex data. Several techniques address this challenge.
The most elegant solution: use CAS on a pointer, with the pointed-to structure immutable after creation.
#include <atomic>
#include <cstdint>
#include <cstring>  // memcpy for safe type punning
#include <string>

// ═══════════════════════════════════════════════════════════════
// Approach 1: CAS on pointer to immutable structure
// ═══════════════════════════════════════════════════════════════
struct Config {
    int timeout;
    int max_retries;
    std::string endpoint;
    // Config is immutable after construction
    // (no setters, all members const or effectively so)
};

std::atomic<Config*> current_config{nullptr};

void schedule_reclamation(Config* old_config);  // placeholder: hazard pointers, RCU, etc.

void update_config(int new_timeout) {
    Config* old_config = current_config.load(std::memory_order_acquire);
    while (true) {
        // Create new immutable config
        Config* new_config = new Config{
            new_timeout,
            old_config ? old_config->max_retries : 3,
            old_config ? old_config->endpoint : "default"
        };
        // CAS the pointer
        if (current_config.compare_exchange_weak(
            old_config, new_config,
            std::memory_order_release,
            std::memory_order_acquire
        )) {
            // Success! The old config must be reclaimed safely
            // (using hazard pointers, RCU, etc.)
            schedule_reclamation(old_config);
            return;
        }
        // Failed - old_config now points to the current config
        delete new_config;  // Discard our attempt
        // Loop retries with the new old_config value
    }
}

// Readers get a consistent snapshot automatically
Config* get_config() {
    return current_config.load(std::memory_order_acquire);
}

// ═══════════════════════════════════════════════════════════════
// Approach 2: Pack multiple values into one word
// ═══════════════════════════════════════════════════════════════
struct PackedState {
    uint32_t counter : 16;  // 16 bits
    uint32_t flags   : 8;   // 8 bits
    uint32_t version : 8;   // 8 bits (for ABA prevention)
};
static_assert(sizeof(PackedState) == sizeof(uint32_t));

std::atomic<uint32_t> packed_state{0};

void increment_with_flags(uint8_t new_flags) {
    uint32_t old_val = packed_state.load(std::memory_order_relaxed);
    do {
        // memcpy avoids the undefined behavior of pointer-based type punning
        PackedState old_state, new_state;
        std::memcpy(&old_state, &old_val, sizeof(old_state));
        new_state.counter = old_state.counter + 1;
        new_state.flags = new_flags;
        new_state.version = old_state.version + 1;
        uint32_t new_val;
        std::memcpy(&new_val, &new_state, sizeof(new_val));
        if (packed_state.compare_exchange_weak(
            old_val, new_val,
            std::memory_order_acq_rel
        )) {
            return;
        }
        // old_val updated to current value
    } while (true);
}

// ═══════════════════════════════════════════════════════════════
// Approach 3: Double-width CAS (DCAS / CMPXCHG16B)
// ═══════════════════════════════════════════════════════════════
struct alignas(16) DoubleWord {
    void* ptr;
    uint64_t tag;
};

std::atomic<DoubleWord> tagged_ptr{};

bool cas_tagged_ptr(void* expected_ptr, uint64_t expected_tag,
                    void* new_ptr, uint64_t new_tag) {
    DoubleWord expected{expected_ptr, expected_tag};
    DoubleWord desired{new_ptr, new_tag};
    return tagged_ptr.compare_exchange_strong(expected, desired);
}

// Note: Requires 16-byte aligned storage and CMPXCHG16B support
// C++ std::atomic<DoubleWord> should handle this on x86-64
| Technique | Data Size | Complexity | Best For |
|---|---|---|---|
| Pointer to immutable | Unlimited | Medium (needs reclamation) | Complex structures, COW updates |
| Packed bit fields | 1 word (64 bits) | Low | Small, related values |
| Double-width CAS | 2 words (128 bits) | Medium | Pointer + tag for ABA prevention |
| Split into parts | Varies | High | Multi-step protocols |
Support for double-width CAS varies: x86-64 provides CMPXCHG16B but requires 16-byte alignment; ARM and other platforms may not have native double-width CAS. C++ std::atomic for types larger than a word may fall back to locks internally. Always verify that your atomic operations are truly lock-free using std::atomic<T>::is_lock_free() if this matters for your application.
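A quick way to run that check, following the std::atomic<T>::is_lock_free() suggestion above (the DoubleWord struct mirrors the earlier example; results vary by platform and compiler flags, e.g. -mcx16 on x86-64 GCC/Clang):
#include <atomic>
#include <cstdint>
#include <iostream>

struct alignas(16) DoubleWord {
    void* ptr;
    uint64_t tag;
};

int main() {
    std::atomic<uint64_t> word{0};
    std::atomic<DoubleWord> dword{};
    // is_lock_free() reports whether this object avoids internal locks
    std::cout << "64-bit atomic lock-free:  " << word.is_lock_free() << "\n";
    std::cout << "128-bit atomic lock-free: " << dword.is_lock_free() << "\n";
    // Compile-time check when the answer must not vary at runtime (C++17):
    static_assert(std::atomic<uint64_t>::is_always_lock_free,
                  "64-bit atomics must be lock-free on this target");
    return 0;
}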
CAS-based code is notoriously difficult to test and debug because bugs may only manifest under specific timing conditions. Several tools and techniques help address this challenge.
ThreadSanitizer detects data races at runtime by tracking memory accesses and synchronization operations.
# Compile with ThreadSanitizer
g++ -fsanitize=thread -g mycode.cpp -o mycode
# Run - TSan reports races
./mycode
TSan catches:
- Data races: concurrent unsynchronized accesses where at least one is a write
- Races caused by insufficient ordering, e.g. publishing non-atomic data behind a relaxed flag
- Lock-order inversions that can lead to deadlock
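For example, a minimal racy program (a sketch; the iteration count is arbitrary) that TSan reliably flags when compiled with -fsanitize=thread as above:
#include <iostream>
#include <thread>

int unsafe_counter = 0;  // Plain int: concurrent increments race

void worker() {
    for (int i = 0; i < 100000; ++i) {
        ++unsafe_counter;  // TSan reports: data race (unsynchronized RMW)
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // Result is typically < 200000 and varies from run to run
    std::cout << unsafe_counter << "\n";
    return 0;
}
Running this typically produces a "data race" report naming both conflicting stack traces.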
Model checkers (like CDSChecker, GenMC, or CHESS) systematically explore all possible thread interleavings, finding bugs that random testing would miss.
// CDSChecker example - finds all interleavings
// (illustrative; exact headers and thread-API signatures vary by tool version)
#include "cds_model.h"
#include <atomic>
#include <cassert>

std::atomic<int> counter{0};
thrd_t t1, t2;

void thread1() { counter.fetch_add(1); }
void thread2() { counter.fetch_add(1); }

int main() {
    thrd_create(&t1, thread1, NULL);
    thrd_create(&t2, thread2, NULL);
    thrd_join(t1); thrd_join(t2);
    assert(counter.load() == 2); // Checked for all interleavings
}
Stress testing complements model checking: inject random sleep() or yield() calls in debug builds to perturb timing and surface interleavings that normal scheduling rarely produces.
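A minimal sketch of such timing perturbation; the helper name chaos_point and the 1-in-64 rate are arbitrary choices for illustration:
#include <random>
#include <thread>

// Hypothetical helper: call at interesting points in CAS-based code paths.
// In debug builds it randomly yields, shaking loose rare interleavings;
// in release builds (NDEBUG defined) it compiles away entirely.
inline void chaos_point() {
#ifndef NDEBUG
    thread_local std::mt19937 rng{std::random_device{}()};
    if (rng() % 64 == 0) {          // ~1 in 64 calls: illustrative rate
        std::this_thread::yield();  // Hand the CPU to another thread
    }
#endif
}

// Usage: sprinkle between the load and the CAS, where races hide
//   int expected = counter.load();
//   chaos_point();
//   counter.compare_exchange_weak(expected, expected + 1);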
#include <atomic>
#include <cassert>
#include <iostream>

std::atomic<int> counter{0};

// ═══════════════════════════════════════════════════════════════
// Defensive CAS loop with debugging support
// ═══════════════════════════════════════════════════════════════
void increment_defensive() {
    int attempts = 0;
    constexpr int MAX_ATTEMPTS = 1000000;

    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,
        std::memory_order_relaxed
    )) {
        attempts++;
        // Detect potential livelock / extreme contention
        if (attempts > MAX_ATTEMPTS) {
            std::cerr << "WARNING: CAS loop exceeded " << MAX_ATTEMPTS
                      << " attempts. Possible livelock." << std::endl;
            std::cerr << "Expected: " << expected
                      << ", Current: " << counter.load() << std::endl;
            // In production: maybe switch to lock-based fallback
            assert(false && "CAS loop appears livelocked");
        }
        // Logging for debug analysis
#ifdef DEBUG_CAS
        if (attempts % 100 == 0) {
            std::cerr << "CAS retry #" << attempts
                      << ", expected=" << expected << std::endl;
        }
#endif
    }
    // Post-condition: counter increased by exactly 1
    // (Hard to verify directly, but assertions on final state help)
}

// ═══════════════════════════════════════════════════════════════
// Invariant checking wrapper
// ═══════════════════════════════════════════════════════════════
class AtomicBoundedCounter {
    std::atomic<int> value_{0};
    int min_, max_;
public:
    AtomicBoundedCounter(int min, int max) : min_(min), max_(max) {}

    bool increment() {
        int expected = value_.load(std::memory_order_relaxed);
        while (expected < max_) {
            if (value_.compare_exchange_weak(
                expected, expected + 1,
                std::memory_order_release,
                std::memory_order_relaxed
            )) {
                // Assert invariant after successful modification
                int new_val = value_.load(std::memory_order_acquire);
                assert(new_val >= min_ && new_val <= max_ && "Invariant violated!");
                return true;
            }
        }
        return false;
    }
};
No amount of testing can guarantee that CAS-based code is correct. The number of possible interleavings grows exponentially with code size and thread count. Testing may pass for years before a specific interleaving triggers a bug. For critical code, combine testing with formal analysis, code review by concurrency experts, and use of well-tested libraries.
We have deeply explored CAS loops—the fundamental building block of all CAS-based algorithms. From the basic retry pattern to advanced optimizations and debugging techniques, these patterns are essential knowledge for concurrent programming.
The final page of this module examines CAS performance in depth. We will cover how CAS operations perform under various conditions, the impact of cache coherence and NUMA effects, and how to measure and optimize CAS-heavy code.
This performance perspective completes our understanding of CAS, enabling you to not only write correct CAS-based code but also reason about its efficiency.
You now understand CAS loops deeply—the patterns, the pitfalls, and the best practices. This knowledge is the key to writing correct CAS-based algorithms, whether you're implementing lock-free data structures, optimizing hot paths, or simply understanding the synchronization primitives you use every day.