Understanding CAS correctness is only half the battle. In performance-critical systems, the efficiency of CAS operations often determines whether a design succeeds or fails. A lock-free algorithm that is theoretically elegant but practically slow provides no benefit over simpler alternatives.
CAS performance is shaped by hardware realities: cache coherence protocols, memory bus bandwidth, NUMA topology, and CPU microarchitecture. These factors interact in complex ways, producing performance characteristics that can be counterintuitive.
This page provides a deep dive into CAS performance. We will examine latency and throughput under various conditions, analyze how cache coherence affects CAS-heavy code, explore NUMA considerations, and develop strategies for measuring and optimizing CAS performance.
By the end of this page, you will have a practitioner's understanding of what makes CAS fast or slow: latency and throughput characteristics, how cache coherence protocols affect CAS performance, the impact of contention, NUMA effects on cross-socket CAS, profiling techniques for CAS-heavy code, and optimization strategies for high-performance atomic operations.
The latency of a CAS operation—the time from instruction issue to completion—varies dramatically based on several factors. Understanding these factors is key to predicting and optimizing CAS performance.
In the best case (no contention, data in L1 cache), a CAS operation completes in just a few CPU cycles:
| Scenario | Typical Latency (cycles) | Notes |
|---|---|---|
| L1 cache hit, no contention | 5-15 | Best case; comparable to normal memory access |
| L2 cache hit, no contention | 15-30 | Slightly slower, still fast |
| L3 cache hit, no contention | 30-50 | Cross-core L3 access |
| Memory access (cache miss) | 100-300 | Main memory round-trip |
| Cross-socket (NUMA) | 150-400+ | Remote memory access |
CAS is more expensive than a simple load or store because:
- The core must first gain exclusive ownership of the target cache line.
- The read-modify-write must execute indivisibly (on x86, via the LOCK prefix), which serializes access to that line.
- Any memory ordering stronger than relaxed adds fence overhead on top of the operation itself.
```cpp
#include <atomic>
#include <chrono>
#include <iostream>

// Benchmark: Measure uncontended CAS latency

std::atomic<long> counter{0};

void benchmark_uncontended_cas(int iterations) {
    // Warm up cache
    for (int i = 0; i < 1000; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; i++) {
        long expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(
            expected, expected + 1,
            std::memory_order_relaxed
        )) {}
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    double ns_per_cas = static_cast<double>(duration.count()) / iterations;
    std::cout << "Uncontended CAS: " << ns_per_cas << " ns/op" << std::endl;

    // Typical results (varies by CPU):
    //   Modern x86: 10-20 ns/op uncontended
    //   This is ~30-60 CPU cycles at 3 GHz
}

// For comparison: simple store
void benchmark_store(int iterations) {
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; i++) {
        counter.store(i, std::memory_order_relaxed);
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    double ns_per_store = static_cast<double>(duration.count()) / iterations;
    std::cout << "Simple store: " << ns_per_store << " ns/op" << std::endl;

    // Typical: 1-2 ns/op (just a few cycles)
}

/*
 * Key insight: Uncontended CAS is ~10-20x slower than simple stores.
 * Still fast in absolute terms (~10-20 ns), but the difference
 * matters in hot loops.
 *
 * This is the "atomic tax" you pay for synchronization.
 */
```

Memory ordering affects CAS latency:
| Memory Order | Latency Impact | Why |
|---|---|---|
| relaxed | Lowest | No memory barriers needed |
| acquire / release | Low-moderate | One-directional barriers |
| seq_cst | Highest | Full memory fence, total ordering |
On x86, the difference is often minimal because the architecture has strong ordering already. On ARM or POWER, the difference can be substantial—seq_cst may require explicit fence instructions that stall the pipeline.
For performance-critical code, use the weakest memory ordering that is correct for your algorithm. In many CAS loops, relaxed ordering is sufficient for the load-and-retry pattern, with acquire or release only needed at the successful CAS. This can provide measurable speedups on weakly-ordered architectures.
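As a concrete sketch of that guideline, consider a lock-free stack push (the `Node` type and `head` pointer below are illustrative, not from any specific library): the failure ordering can be relaxed because the loop simply reloads and retries, while the success ordering carries the release semantics that publish the new node.

```cpp
#include <atomic>

// Illustrative Treiber-style push: weakest orderings that remain correct.
struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(
        n->next, n,
        std::memory_order_release,    // success: publish n (and n->next) to readers
        std::memory_order_relaxed)) { // failure: n->next was refreshed, just retry
    }
}
```

On x86 the two orderings compile to essentially the same code; on ARM or POWER the relaxed failure path avoids a barrier on every retry.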
Cache coherence protocols are the hidden machinery that determines CAS performance in multi-core systems. Understanding these protocols reveals why contention is so expensive.
Most modern CPUs use variants of the MESI protocol (Modified, Exclusive, Shared, Invalid) to maintain cache coherence:
| State | Meaning | CAS Behavior |
|---|---|---|
| Modified (M) | Only this cache has valid data; memory is stale | CAS can proceed immediately |
| Exclusive (E) | Only this cache has valid data; memory is current | CAS can proceed (transitions to M) |
| Shared (S) | Multiple caches hold a valid copy | Must signal others to invalidate (slow!) |
| Invalid (I) | Cache line is invalid | Must fetch from memory or another cache (slow!) |
For a CAS to execute, the cache line must be in Exclusive or Modified state. If it's in Shared or Invalid:
- Shared: the core must send an invalidation (request for ownership) so every other cache drops its copy before the line can be upgraded.
- Invalid: the core must first fetch the line from another cache or from main memory, again acquiring exclusive ownership.
Both transitions involve inter-core or memory bus traffic, taking 50-200+ cycles.
```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

// Demonstrate cache coherence impact on CAS performance

std::atomic<long> shared_counter{0};

// Test 1: Sequential CAS (best case - cache line stays local)
void benchmark_sequential() {
    const int ITERATIONS = 10'000'000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < ITERATIONS; i++) {
        shared_counter.fetch_add(1, std::memory_order_relaxed);
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Sequential: " << (double)ns / ITERATIONS << " ns/op" << std::endl;
}

// Test 2: Two threads alternating (cache line bouncing)
void benchmark_two_threads() {
    const int ITERATIONS = 5'000'000;  // 5M each = 10M total

    auto worker = [&]() {
        for (int i = 0; i < ITERATIONS; i++) {
            shared_counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::high_resolution_clock::now();
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Two threads: " << (double)ns / (2 * ITERATIONS) << " ns/op" << std::endl;
}

// Test 3: Many threads contending (cache line thrashing)
void benchmark_many_threads(int num_threads) {
    const int ITERATIONS_PER_THREAD = 1'000'000;
    shared_counter = 0;

    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&]() {
            for (int i = 0; i < ITERATIONS_PER_THREAD; i++) {
                shared_counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();

    auto end = std::chrono::high_resolution_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << num_threads << " threads: "
              << (double)ns / (num_threads * ITERATIONS_PER_THREAD) << " ns/op" << std::endl;
}

/*
 * Typical results on modern multi-core CPU:
 *
 *   Sequential:    10-15 ns/op
 *   Two threads:   50-80 ns/op   (4-6x slower!)
 *   Four threads:  80-150 ns/op  (8-10x slower!)
 *   Eight threads: 150-300 ns/op (15-20x slower!)
 *
 * Cache line bouncing dominates performance under contention.
 * Each CAS forces a cache line transfer between cores.
 */
```

False sharing occurs when unrelated variables share a cache line, causing coherence traffic even though threads access different data.
```cpp
// FALSE SHARING: counters share cache line
struct BadLayout {
    std::atomic<int> counter1;  // Core 1 writes here
    std::atomic<int> counter2;  // Core 2 writes here
    // Both on same 64-byte cache line!
    // Every write by one core invalidates the other's cache
};

// FIXED: Pad to separate cache lines
struct GoodLayout {
    alignas(64) std::atomic<int> counter1;
    alignas(64) std::atomic<int> counter2;
    // Each on its own cache line - no interference
};
```
False sharing can cause 10-100x performance degradation. Always align frequently-modified atomics to cache line boundaries.
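As a hedged refinement of the padding shown above: since C++17 the standard library exposes the destructive interference size so you do not have to hard-code 64, although compiler support varies, so a fallback constant is still common. The names `kCacheLine` and `PaddedCounters` below are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Prefer the implementation-reported cache line size; fall back to 64
// (a common value on x86 and ARM, but still an assumption) if unsupported.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

struct PaddedCounters {
    alignas(kCacheLine) std::atomic<int> counter1;
    alignas(kCacheLine) std::atomic<int> counter2;
};
static_assert(sizeof(PaddedCounters) >= 2 * kCacheLine,
              "counters would share a cache line");
```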
The dominant factor in CAS performance is contention. A single-threaded CAS taking 15ns can become a 200ns operation with 8 threads contending. Design to minimize contention: use thread-local storage where possible, shard data structures, or use combining techniques. A well-designed system with low contention will massively outperform a contended lock-free design.
How does CAS performance scale as contention increases? Understanding this scaling behavior is essential for capacity planning and design decisions.
Counter-intuitively, adding more threads to a heavily contended CAS operation can decrease total throughput:
| Threads | Per-Thread Ops/sec | Total Ops/sec | Efficiency |
|---|---|---|---|
| 1 | 100M | 100M | 100% |
| 2 | 25M | 50M | 50% |
| 4 | 8M | 32M | 32% |
| 8 | 3M | 24M | 24% |
| 16 | 1.5M | 24M | 24% |
| 32 | 0.75M | 24M | 24% |
Illustrative numbers—actual results vary by hardware
The pattern: throughput initially grows with threads, plateaus, and may even decline. Each additional thread adds more contention than it adds capacity.
Under high contention:
- Only one thread's CAS succeeds per round; every other attempt fails and must reload and retry.
- Each attempt, successful or not, still pulls the cache line to the attempting core, so the line bounces continuously between caches.
- The fraction of wasted work grows with thread count, which is why aggregate throughput plateaus and can even decline, as the analysis below shows.
```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
#include <iomanip>

// Analyze CAS success rate and throughput vs thread count

std::atomic<long> counter{0};
std::atomic<long> total_attempts{0};
std::atomic<long> total_successes{0};

void contended_increment(int iterations) {
    int attempts = 0;
    for (int i = 0; i < iterations; i++) {
        long expected;
        do {
            expected = counter.load(std::memory_order_relaxed);
            attempts++;
        } while (!counter.compare_exchange_weak(
            expected, expected + 1,
            std::memory_order_relaxed
        ));
    }
    total_attempts.fetch_add(attempts, std::memory_order_relaxed);
    total_successes.fetch_add(iterations, std::memory_order_relaxed);  // iterations = successes
}

void analyze_contention(int num_threads) {
    const int OPS_PER_THREAD = 100'000;
    counter = 0;
    total_attempts = 0;
    total_successes = 0;

    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back(contended_increment, OPS_PER_THREAD);
    }
    for (auto& t : threads) t.join();

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed_sec = std::chrono::duration<double>(end - start).count();

    long attempts = total_attempts.load();
    long successes = total_successes.load();
    double success_rate = 100.0 * successes / attempts;
    double throughput = successes / elapsed_sec;
    double wasted_ratio = (double)(attempts - successes) / attempts;

    std::cout << std::setw(3) << num_threads << " threads: "
              << std::fixed << std::setprecision(2)
              << "Success rate: " << std::setw(6) << success_rate << "% | "
              << "Throughput: " << std::setw(10) << (throughput / 1e6) << " Mops/s | "
              << "Wasted work: " << std::setw(6) << (wasted_ratio * 100) << "%"
              << std::endl;
}

int main() {
    std::cout << "CAS Contention Analysis\n\n";
    for (int threads : {1, 2, 4, 8, 16, 32}) {
        analyze_contention(threads);
    }
    return 0;
}

/*
 * Sample output (8-core CPU):
 *
 *   1 threads: Success rate: 100.00% | Throughput: 95.00 Mops/s | Wasted work:  0.00%
 *   2 threads: Success rate:  51.23% | Throughput: 48.50 Mops/s | Wasted work: 48.77%
 *   4 threads: Success rate:  26.15% | Throughput: 35.20 Mops/s | Wasted work: 73.85%
 *   8 threads: Success rate:  13.02% | Throughput: 28.10 Mops/s | Wasted work: 86.98%
 *  16 threads: Success rate:   6.45% | Throughput: 25.80 Mops/s | Wasted work: 93.55%
 *  32 threads: Success rate:   3.21% | Throughput: 24.50 Mops/s | Wasted work: 96.79%
 *
 * Key insights:
 * - Success rate drops as 1/N (one winner per round)
 * - Throughput saturates around core count
 * - Vast majority of work is wasted under high contention
 */
```

If your design requires high-frequency CAS on a single location, you will not scale beyond a handful of cores. Solutions: (1) Shard data across multiple atomics, accessed by hash or round-robin; (2) Use combining or batching to aggregate operations; (3) Accept eventual consistency and use thread-local storage with periodic synchronization; (4) Consider whether you actually need strict atomicity.
On NUMA (Non-Uniform Memory Access) systems, memory access latency depends on which processor is accessing which memory. This has significant implications for CAS-heavy code.
In NUMA systems:
- Each socket (node) has its own memory controller and locally attached DRAM.
- Accessing local memory is fast; reaching another node's memory crosses the inter-socket interconnect and is markedly slower.
- Caches remain coherent across sockets, but cross-socket coherence traffic is far more expensive than traffic within a single socket.
When threads on different nodes contend for a CAS, the cache line must cross the socket interconnect on every change of ownership:
| Scenario | Latency | Why |
|---|---|---|
| Same core (uncontended) | 15-20ns | L1 cache hit |
| Same node, different core | 50-80ns | L3 cache + coherence |
| Different nodes (cross-socket) | 200-400ns | Remote cache or memory access |
Cross-socket CAS is 10-20x slower than local CAS. This dramatically amplifies contention costs on NUMA systems.
```cpp
#include <atomic>
#include <vector>
#include <thread>
#include <numa.h>   // Linux NUMA library (link with -lnuma)
#include <sched.h>

// NUMA-aware design: per-node sharding

struct PerNodeCounter {
    // Each counter on its own cache line to prevent false sharing
    alignas(64) std::atomic<long> count{0};
};

class NUMAShardedCounter {
private:
    int num_nodes;
    std::vector<PerNodeCounter> node_counters;

public:
    // The vector is sized in the init list: PerNodeCounter holds a
    // std::atomic and cannot be moved, which vector::resize would require.
    NUMAShardedCounter()
        : num_nodes(numa_num_configured_nodes()),
          node_counters(num_nodes) {}

    void increment() {
        // Get current NUMA node
        int node = numa_node_of_cpu(sched_getcpu());
        // Increment local counter (fast - no cross-node traffic)
        node_counters[node].count.fetch_add(1, std::memory_order_relaxed);
    }

    long get_total() {
        long total = 0;
        for (int i = 0; i < num_nodes; i++) {
            total += node_counters[i].count.load(std::memory_order_relaxed);
        }
        return total;
    }
};

// Affinity helper: bind thread to specific node
void bind_to_node(int node) {
    struct bitmask* mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    numa_bind(mask);
    numa_bitmask_free(mask);
}

/*
 * NUMA-aware design principles:
 *
 * 1. Shard by node: Keep hot data node-local
 * 2. Allocate memory on the node that will use it (numa_alloc_local)
 * 3. Pin threads to nodes when possible
 * 4. Aggregate cross-node communication (batch updates)
 * 5. Monitor NUMA balance: 'numastat' and perf counters
 *
 * Naive shared counter:  200-400 ns/op cross-socket
 * NUMA-sharded counter:  15-20 ns/op (each thread hits local counter)
 *
 * This is a 10-20x speedup on multi-socket systems!
 */
```

| Strategy | Description | When to Use |
|---|---|---|
| Per-node sharding | Each NUMA node has its own copy of data | Counters, statistics, approximate aggregates |
| Memory affinity | Allocate memory on the node that uses it | Large allocations for thread-local work |
| Thread pinning | Bind threads to specific cores/nodes | Latency-sensitive, steady-state workloads |
| Hierarchical design | Local → node-local → global aggregation | Complex data structures with mixed access |
| NUMA-aware allocation | Use numa_alloc_onnode() or mbind() | Lock-free data structures with known access patterns |
In cloud environments, NUMA topology may be exposed to VMs (depending on hypervisor settings) or hidden. Code running in containers or VMs should either query NUMA topology at runtime (numa_available(), numa_num_configured_nodes()) or be designed to work efficiently without NUMA awareness. Making assumptions about topology can hurt performance when the assumption is wrong.
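A minimal sketch of that runtime check, assuming libnuma is available at build time (link with -lnuma); the helper name `effective_num_nodes` is hypothetical. If the topology is hidden, treat the machine as a single node and size the shard array accordingly.

```cpp
#include <numa.h>

// Degrade gracefully when NUMA information is unavailable
// (e.g., inside a container or VM that hides the topology).
int effective_num_nodes() {
    if (numa_available() == -1) {
        return 1;  // No NUMA support exposed: behave as a single-node machine
    }
    int nodes = numa_num_configured_nodes();
    return nodes > 0 ? nodes : 1;
}
```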
Different CPU architectures implement CAS differently, leading to varying performance characteristics. Understanding these differences is important for writing portable, high-performance code.
x86/x64 provides a native CAS instruction with predictable semantics:
- LOCK CMPXCHG provides atomic compare-and-swap
- A double-width form (CMPXCHG16B) is available for 128-bit operations

Performance characteristics:
- Uncontended latency of roughly 15-25 cycles
- No spurious failures, and the strong TSO memory model keeps barrier overhead low
ARM relied on Load-Linked/Store-Conditional prior to ARMv8.1:
- CAS is synthesized from an LDREX/STREX (exclusive load/store) retry loop
- The store-conditional can fail spuriously, so compare_exchange_weak maps onto it naturally, while compare_exchange_strong needs an extra internal retry loop
ARMv8.1+ adds CAS instructions:
- A native CAS instruction that never fails spuriously
- CASP (compare and swap pair) for 128-bit operations

Performance characteristics:
- Uncontended latency of roughly 20-30 cycles
- Lower overhead than an LL/SC loop, though barriers on the weak memory model still cost more than on x86
| Aspect | x86/x64 | ARM (pre-8.1) | ARM (8.1+) | RISC-V |
|---|---|---|---|---|
| CAS instruction | CMPXCHG | LDREX/STREX | CAS, CASP | LR/SC |
| Spurious failure | Never | Possible | Never (CAS) | Possible |
| Memory model | TSO (strong) | Weak | Weak | Weak (RVWMO) |
| 128-bit CAS | CMPXCHG16B | LDREXD/STREXD | CASP | LR/SC pair |
| Barrier overhead | Low (TSO) | High (explicit) | Medium | High |
| Uncontended latency | ~15-25 cycles | ~30-50 cycles | ~20-30 cycles | ~25-40 cycles |
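These differences are visible directly in compiler output. As a hedged illustration (the exact instructions depend on compiler version and flags), the same one-line atomic increment typically lowers to a single locked instruction on x86-64, a single LSE instruction on ARMv8.1+, and an LL/SC retry loop on older ARM cores; inspecting the assembly of a snippet like this is a quick way to confirm what your toolchain emits.

```cpp
#include <atomic>

// The same portable C++ lowers to very different instructions per target.
// Typical codegen (compiler- and flag-dependent):
//   x86-64:              LOCK XADD
//   AArch64 with LSE:    a single LDADDAL (e.g., -march=armv8.1-a or -moutline-atomics)
//   AArch64 without LSE: an LDAXR/STLXR retry loop
std::atomic<long> ticket{0};

long next_ticket() {
    return ticket.fetch_add(1, std::memory_order_acq_rel);
}
```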
```cpp
#include <atomic>

// Writing portable atomic code that performs well on all architectures

std::atomic<int> counter{0};

// ═══════════════════════════════════════════════════════════════
// Pattern: Use weak CAS in loops for ARM/RISC-V efficiency
// ═══════════════════════════════════════════════════════════════
void increment_portable() {
    int expected = counter.load(std::memory_order_relaxed);

    // compare_exchange_weak:
    // - On x86: compiles to same code as strong (no spurious failure)
    // - On ARM LL/SC: avoids retry loop inside the CAS primitive
    // - Best choice when in an outer retry loop anyway
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,  // Just need release on success
        std::memory_order_relaxed   // Relaxed on failure (retrying anyway)
    )) {
        // expected is updated to current value automatically
    }
}

// ═══════════════════════════════════════════════════════════════
// Pattern: Minimize memory ordering for weak architectures
// ═══════════════════════════════════════════════════════════════
std::atomic<bool> flag{false};
int data = 0;

void producer_portable() {
    data = 42;
    // release is sufficient - ensures data write is visible before flag
    // On x86: no extra cost (TSO)
    // On ARM: generates barrier instruction (necessary for correctness)
    flag.store(true, std::memory_order_release);
}

void consumer_portable() {
    // acquire is sufficient - ensures we see data after seeing flag
    while (!flag.load(std::memory_order_acquire)) {
        // spin
    }
    // Safe to read data
    int x = data;
}

// ═══════════════════════════════════════════════════════════════
// Anti-pattern: Assuming x86 memory model
// ═══════════════════════════════════════════════════════════════
void dangerous_on_arm() {
    data = 42;
    // BUG: Relaxed store doesn't guarantee ordering!
    // Works on x86 (TSO), FAILS on ARM (may see flag=true before data=42)
    flag.store(true, std::memory_order_relaxed);  // DON'T DO THIS
}

/*
 * Portability guidelines:
 *
 * 1. Use compare_exchange_weak in loops (no downside on x86)
 * 2. Use explicit memory orderings, not defaults (seq_cst)
 * 3. Never assume x86 memory model - code for weak ordering
 * 4. Test on ARM or use tools like CDSChecker to find ordering bugs
 * 5. Use relaxed only when you truly don't need ordering
 */
```

Code that works perfectly on x86 may be subtly broken on ARM due to weaker memory ordering. The x86 memory model 'covers up' many ordering bugs. Always test atomic code on the weakest-ordered architecture you'll deploy on, or use formal tools (ThreadSanitizer, CDSChecker) to find ordering issues.
Effective optimization requires accurate measurement. CAS performance profiling has unique challenges due to the timing sensitivities involved.
```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <algorithm>
#include <vector>
#include <numeric>
#include <iostream>
#include <iomanip>

// Technique 1: Batch timing (amortize measurement overhead)
void benchmark_batch_timing() {
    std::atomic<long> counter{0};
    const int BATCH_SIZE = 100'000;
    const int NUM_BATCHES = 100;

    std::vector<double> batch_times;
    batch_times.reserve(NUM_BATCHES);

    for (int batch = 0; batch < NUM_BATCHES; batch++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < BATCH_SIZE; i++) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
        auto end = std::chrono::high_resolution_clock::now();

        double ns_total = std::chrono::duration<double, std::nano>(end - start).count();
        batch_times.push_back(ns_total / BATCH_SIZE);
    }

    // Compute statistics
    std::sort(batch_times.begin(), batch_times.end());
    double median = batch_times[NUM_BATCHES / 2];
    double p99 = batch_times[NUM_BATCHES * 99 / 100];
    double mean = std::accumulate(batch_times.begin(), batch_times.end(), 0.0) / NUM_BATCHES;

    std::cout << "CAS Latency Statistics (ns/op):" << std::endl;
    std::cout << "  Mean:   " << std::fixed << std::setprecision(2) << mean << std::endl;
    std::cout << "  Median: " << median << std::endl;
    std::cout << "  P99:    " << p99 << std::endl;
}

// Technique 2: CPU cycle counting (finer granularity)
#ifdef __x86_64__
#include <x86intrin.h>

inline uint64_t rdtsc() {
    return __rdtsc();
}

void benchmark_rdtsc() {
    std::atomic<long> counter{0};
    const int ITERATIONS = 10'000'000;

    uint64_t start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    uint64_t end = rdtsc();

    double cycles_per_op = (double)(end - start) / ITERATIONS;
    // Note: RDTSC may not be constant rate on all CPUs
    // For accurate time, still need to calibrate against wall clock
    std::cout << "Cycles per CAS: " << cycles_per_op << std::endl;
}
#endif

// Technique 3: Using perf counters (Linux)
// Run with: perf stat -e cache-misses,L1-dcache-loads ./program
//
// Useful counters for CAS analysis:
// - cache-misses: High value indicates cache line bouncing
// - mem_load_retired.l3_miss: Remote memory access (NUMA)
// - offcore_response.all_data_rd.llc_miss.remote_hitm
//   ^ Measures cross-socket cache-to-cache transfers

/*
 * Profiling Best Practices:
 *
 * 1. Warm up before measurement (cache priming)
 * 2. Run multiple iterations and compute statistics
 * 3. Disable CPU frequency scaling (set to performance governor)
 * 4. Pin threads to specific cores to reduce variability
 * 5. Isolate benchmark process (nice, cgroups, dedicated cores)
 * 6. Profile under realistic contention levels
 * 7. Look at both latency and throughput
 */
```

| Metric | What It Tells You | How to Measure |
|---|---|---|
| Latency (ns/op) | Time per operation | Batch timing, rdtsc |
| Throughput (ops/sec) | Total operations per second | Aggregate timing |
| CAS success rate | Percentage of CAS attempts that succeed | Count successes vs attempts |
| Cache miss rate | How often CAS misses cache | perf stat, l1/l3 miss counters |
| Contention level | How many threads compete | Application-level counting |
| Fairness | Variance in per-thread throughput | Per-thread instrumentation |
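The last row of the table, fairness, is the one metric the earlier benchmarks do not report. Here is a minimal sketch of per-thread instrumentation under an assumed fixed-duration run: each worker counts its own completed operations, and a wide spread between the fastest and slowest thread signals unfair progress under contention. The thread count and run duration are arbitrary for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> shared{0};

int main() {
    const int THREADS = 4;
    const auto RUN_FOR = std::chrono::milliseconds(200);
    std::vector<long> per_thread_ops(THREADS, 0);
    std::atomic<bool> stop{false};
    std::vector<std::thread> workers;

    for (int t = 0; t < THREADS; t++) {
        workers.emplace_back([&, t]() {
            long ops = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                shared.fetch_add(1, std::memory_order_relaxed);
                ops++;
            }
            per_thread_ops[t] = ops;  // each thread writes only its own slot
        });
    }

    std::this_thread::sleep_for(RUN_FOR);
    stop.store(true, std::memory_order_relaxed);
    for (auto& w : workers) w.join();

    // Compare per-thread counts: large variance indicates unfairness
    for (int t = 0; t < THREADS; t++) {
        std::cout << "thread " << t << ": " << per_thread_ops[t] << " ops\n";
    }
}
```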
Microbenchmarks measure best-case or worst-case, rarely real-world. Profile your actual application under realistic load. A CAS that shows 15ns in isolation might effectively cost 200ns in your application due to cache pollution, contention, and other effects. Use application-level metrics (requests/sec, p99 latency) as the ultimate measure of success.
Bringing together everything we've learned, here are the key strategies for optimizing CAS-heavy code.
The single most effective optimization is reducing how many threads compete for the same CAS location:
```cpp
#include <atomic>
#include <vector>
#include <thread>

// ═══════════════════════════════════════════════════════════════
// Strategy 1: Sharding to reduce contention
// ═══════════════════════════════════════════════════════════════
template<int NUM_SHARDS = 16>
class ShardedCounter {
    struct alignas(64) Shard {  // Prevent false sharing
        std::atomic<long> count{0};
    };
    Shard shards[NUM_SHARDS];

public:
    void increment(int thread_hint) {
        // Map thread to shard (reduce contention by NUM_SHARDS factor)
        int shard = thread_hint % NUM_SHARDS;
        shards[shard].count.fetch_add(1, std::memory_order_relaxed);
    }

    long get_total() {
        long total = 0;
        for (int i = 0; i < NUM_SHARDS; i++) {
            total += shards[i].count.load(std::memory_order_relaxed);
        }
        return total;
    }
};

// ═══════════════════════════════════════════════════════════════
// Strategy 2: Thread-local accumulation with periodic sync
// ═══════════════════════════════════════════════════════════════
std::atomic<long> global_counter{0};
thread_local long local_counter = 0;
constexpr int SYNC_THRESHOLD = 1000;

void increment_batched() {
    local_counter++;
    if (local_counter >= SYNC_THRESHOLD) {
        // One CAS per 1000 increments instead of 1000 CAS
        global_counter.fetch_add(local_counter, std::memory_order_relaxed);
        local_counter = 0;
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 3: Read-before-write (TTAS pattern)
// ═══════════════════════════════════════════════════════════════
std::atomic<bool> lock{false};

void acquire_optimized() {
    while (true) {
        // Spin on read (cheap, uses shared cache line)
        while (lock.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();
        }
        // Then try CAS (expensive, but we have a good chance)
        bool expected = false;
        if (lock.compare_exchange_weak(expected, true, std::memory_order_acquire)) {
            return;
        }
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 4: Use specialized operations when possible
// ═══════════════════════════════════════════════════════════════
std::atomic<int> flags{0};

void set_flag(int bit) {
    // fetch_or is more efficient than a CAS loop for setting bits
    flags.fetch_or(1 << bit, std::memory_order_release);
}

void clear_flag(int bit) {
    // fetch_and for clearing
    flags.fetch_and(~(1 << bit), std::memory_order_release);
}

bool try_set_flag_if_clear(int bit) {
    // Still need CAS for conditional set
    int expected = flags.load(std::memory_order_relaxed);
    while (!(expected & (1 << bit))) {
        if (flags.compare_exchange_weak(
                expected, expected | (1 << bit),
                std::memory_order_acq_rel)) {
            return true;
        }
    }
    return false;  // Flag was already set
}
```

The best optimization depends on your specific workload, hardware, and correctness requirements. A technique that helps in one scenario may hurt in another. Always profile before and after optimizing, and ensure correctness is maintained. The fastest code that produces wrong results is worthless.
We have completed our deep dive into CAS performance. From hardware fundamentals to optimization strategies, this knowledge enables you to reason about, measure, and improve the performance of CAS-based code. It also brings our exploration of Compare-and-Swap to a close.
You now possess deep knowledge of the most important atomic primitive in modern computing. Whether you're implementing high-performance data structures, debugging concurrency issues, or simply understanding the synchronization primitives you use daily, this foundation will serve you well.
Congratulations on completing the Compare-and-Swap module! You've gained comprehensive understanding of CAS operations, lock-free programming, CAS-based synchronization constructs, and performance optimization. This knowledge places you among developers who truly understand the foundations of concurrent systems.