Understanding CAS correctness is only half the battle. In performance-critical systems, the efficiency of CAS operations often determines whether a design succeeds or fails. A lock-free algorithm that is theoretically elegant but practically slow provides no benefit over simpler alternatives.
CAS performance is shaped by hardware realities: cache coherence protocols, memory bus bandwidth, NUMA topology, and CPU microarchitecture. These factors interact in complex ways, producing performance characteristics that can be counterintuitive.
This page provides a deep dive into CAS performance. We will examine latency and throughput under various conditions, analyze how cache coherence affects CAS-heavy code, explore NUMA considerations, and develop strategies for measuring and optimizing CAS performance.
By the end of this page, you will have a practitioner's understanding of what makes CAS fast or slow: latency and throughput characteristics, how cache coherence protocols affect CAS performance, the impact of contention, NUMA effects on cross-socket CAS, profiling techniques for CAS-heavy code, and optimization strategies for high-performance atomic operations.
The latency of a CAS operation—the time from instruction issue to completion—varies dramatically based on several factors. Understanding these factors is key to predicting and optimizing CAS performance.
In the best case (no contention, data in L1 cache), a CAS operation completes in just a few CPU cycles:
| Scenario | Typical Latency (cycles) | Notes |
|---|---|---|
| L1 cache hit, no contention | 5-15 | Best case; comparable to normal memory access |
| L2 cache hit, no contention | 15-30 | Slightly slower, still fast |
| L3 cache hit, no contention | 30-50 | Cross-core L3 access |
| Memory access (cache miss) | 100-300 | Main memory round-trip |
| Cross-socket (NUMA) | 150-400+ | Remote memory access |
CAS is more expensive than a simple load or store because:
- The core must first gain exclusive ownership of the target cache line.
- The read-modify-write must execute indivisibly (on x86, via the LOCK prefix), which serializes access to that line.
- Any memory ordering stronger than relaxed adds fence overhead on top of the operation itself.
```cpp
#include <atomic>
#include <chrono>
#include <iostream>

// Benchmark: Measure uncontended CAS latency

std::atomic<long> counter{0};

void benchmark_uncontended_cas(int iterations) {
    // Warm up cache
    for (int i = 0; i < 1000; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; i++) {
        long expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(
            expected, expected + 1,
            std::memory_order_relaxed
        )) {}
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    double ns_per_cas = static_cast<double>(duration.count()) / iterations;
    std::cout << "Uncontended CAS: " << ns_per_cas << " ns/op" << std::endl;

    // Typical results (varies by CPU):
    //   Modern x86: 10-20 ns/op uncontended
    //   This is ~30-60 CPU cycles at 3 GHz
}

// For comparison: simple store
void benchmark_store(int iterations) {
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < iterations; i++) {
        counter.store(i, std::memory_order_relaxed);
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

    double ns_per_store = static_cast<double>(duration.count()) / iterations;
    std::cout << "Simple store: " << ns_per_store << " ns/op" << std::endl;

    // Typical: 1-2 ns/op (just a few cycles)
}

/*
 * Key insight: Uncontended CAS is ~10-20x slower than simple stores.
 * Still fast in absolute terms (~10-20 ns), but the difference
 * matters in hot loops.
 *
 * This is the "atomic tax" you pay for synchronization.
 */
```

Memory ordering affects CAS latency:
| Memory Order | Latency Impact | Why |
|---|---|---|
| relaxed | Lowest | No memory barriers needed |
| acquire / release | Low-moderate | One-directional barriers |
| seq_cst | Highest | Full memory fence, total ordering |
On x86, the difference is often minimal because the architecture has strong ordering already. On ARM or POWER, the difference can be substantial—seq_cst may require explicit fence instructions that stall the pipeline.
For performance-critical code, use the weakest memory ordering that is correct for your algorithm. In many CAS loops, relaxed ordering is sufficient for the load-and-retry pattern, with acquire or release only needed at the successful CAS. This can provide measurable speedups on weakly-ordered architectures.
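As a concrete sketch of that guideline, consider a lock-free stack push (the `Node` type and `head` pointer below are illustrative, not from any specific library): the failure ordering can be relaxed because the loop simply reloads and retries, while the success ordering carries the release semantics that publish the new node.

```cpp
#include <atomic>

// Illustrative Treiber-style push: weakest orderings that remain correct.
struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(
        n->next, n,
        std::memory_order_release,    // success: publish n (and n->next) to readers
        std::memory_order_relaxed)) { // failure: n->next was refreshed, just retry
    }
}
```

On x86 the two orderings compile to essentially the same code; on ARM or POWER the relaxed failure path avoids a barrier on every retry.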
Cache coherence protocols are the hidden machinery that determines CAS performance in multi-core systems. Understanding these protocols reveals why contention is so expensive.
Most modern CPUs use variants of the MESI protocol (Modified, Exclusive, Shared, Invalid) to maintain cache coherence:
| State | Meaning | CAS Behavior |
|---|---|---|
| Modified (M) | Only this cache has valid data; memory is stale | CAS can proceed immediately |
| Exclusive (E) | Only this cache has valid data; memory is current | CAS can proceed (transitions to M) |
| Shared (S) | Multiple caches hold a valid copy | Must signal others to invalidate (slow!) |
| Invalid (I) | Cache line is invalid | Must fetch from memory or another cache (slow!) |
For a CAS to execute, the cache line must be in Exclusive or Modified state. If it's in Shared or Invalid:
- Shared: the core must send an invalidation (request for ownership) so every other cache drops its copy before the line can be upgraded.
- Invalid: the core must first fetch the line from another cache or from main memory, again acquiring exclusive ownership.
Both transitions involve inter-core or memory bus traffic, taking 50-200+ cycles.
```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>

// Demonstrate cache coherence impact on CAS performance

std::atomic<long> shared_counter{0};

// Test 1: Sequential CAS (best case - cache line stays local)
void benchmark_sequential() {
    const int ITERATIONS = 10'000'000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < ITERATIONS; i++) {
        shared_counter.fetch_add(1, std::memory_order_relaxed);
    }
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Sequential: " << (double)ns / ITERATIONS << " ns/op" << std::endl;
}

// Test 2: Two threads alternating (cache line bouncing)
void benchmark_two_threads() {
    const int ITERATIONS = 5'000'000;  // 5M each = 10M total

    auto worker = [&]() {
        for (int i = 0; i < ITERATIONS; i++) {
            shared_counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::high_resolution_clock::now();
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    auto end = std::chrono::high_resolution_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Two threads: " << (double)ns / (2 * ITERATIONS) << " ns/op" << std::endl;
}

// Test 3: Many threads contending (cache line thrashing)
void benchmark_many_threads(int num_threads) {
    const int ITERATIONS_PER_THREAD = 1'000'000;
    shared_counter = 0;

    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back([&]() {
            for (int i = 0; i < ITERATIONS_PER_THREAD; i++) {
                shared_counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();

    auto end = std::chrono::high_resolution_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << num_threads << " threads: "
              << (double)ns / (num_threads * ITERATIONS_PER_THREAD) << " ns/op" << std::endl;
}

/*
 * Typical results on modern multi-core CPU:
 *
 *   Sequential:    10-15 ns/op
 *   Two threads:   50-80 ns/op   (4-6x slower!)
 *   Four threads:  80-150 ns/op  (8-10x slower!)
 *   Eight threads: 150-300 ns/op (15-20x slower!)
 *
 * Cache line bouncing dominates performance under contention.
 * Each CAS forces a cache line transfer between cores.
 */
```

False sharing occurs when unrelated variables share a cache line, causing coherence traffic even though threads access different data.
```cpp
// FALSE SHARING: counters share cache line
struct BadLayout {
    std::atomic<int> counter1;  // Core 1 writes here
    std::atomic<int> counter2;  // Core 2 writes here
    // Both on same 64-byte cache line!
    // Every write by one core invalidates the other's cache
};

// FIXED: Pad to separate cache lines
struct GoodLayout {
    alignas(64) std::atomic<int> counter1;
    alignas(64) std::atomic<int> counter2;
    // Each on its own cache line - no interference
};
```
False sharing can cause 10-100x performance degradation. Always align frequently-modified atomics to cache line boundaries.
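As a hedged refinement of the padding shown above: since C++17 the standard library exposes the destructive interference size so you do not have to hard-code 64, although compiler support varies, so a fallback constant is still common. The names `kCacheLine` and `PaddedCounters` below are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Prefer the implementation-reported cache line size; fall back to 64
// (a common value on x86 and ARM, but still an assumption) if unsupported.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

struct PaddedCounters {
    alignas(kCacheLine) std::atomic<int> counter1;
    alignas(kCacheLine) std::atomic<int> counter2;
};
static_assert(sizeof(PaddedCounters) >= 2 * kCacheLine,
              "counters would share a cache line");
```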
The dominant factor in CAS performance is contention. A single-threaded CAS taking 15ns can become a 200ns operation with 8 threads contending. Design to minimize contention: use thread-local storage where possible, shard data structures, or use combining techniques. A well-designed system with low contention will massively outperform a contended lock-free design.
How does CAS performance scale as contention increases? Understanding this scaling behavior is essential for capacity planning and design decisions.
Counter-intuitively, adding more threads to a heavily contended CAS operation can decrease total throughput:
| Threads | Per-Thread Ops/sec | Total Ops/sec | Efficiency |
|---|---|---|---|
| 1 | 100M | 100M | 100% |
| 2 | 25M | 50M | 50% |
| 4 | 8M | 32M | 32% |
| 8 | 3M | 24M | 24% |
| 16 | 1.5M | 24M | 24% |
| 32 | 0.75M | 24M | 24% |
Illustrative numbers—actual results vary by hardware
The pattern: throughput initially grows with threads, plateaus, and may even decline. Each additional thread adds more contention than it adds capacity.
Under high contention:
- Only one thread's CAS succeeds per round; every other attempt fails and must reload and retry.
- Each attempt, successful or not, still pulls the cache line to the attempting core, so the line bounces continuously between caches.
- The fraction of wasted work grows with thread count, which is why aggregate throughput plateaus and can even decline, as the analysis below shows.
```cpp
#include <atomic>
#include <thread>
#include <vector>
#include <chrono>
#include <iostream>
#include <iomanip>

// Analyze CAS success rate and throughput vs thread count

std::atomic<long> counter{0};
std::atomic<long> total_attempts{0};
std::atomic<long> total_successes{0};

void contended_increment(int iterations) {
    int attempts = 0;
    for (int i = 0; i < iterations; i++) {
        long expected;
        do {
            expected = counter.load(std::memory_order_relaxed);
            attempts++;
        } while (!counter.compare_exchange_weak(
            expected, expected + 1,
            std::memory_order_relaxed
        ));
    }
    total_attempts.fetch_add(attempts, std::memory_order_relaxed);
    total_successes.fetch_add(iterations, std::memory_order_relaxed);  // iterations = successes
}

void analyze_contention(int num_threads) {
    const int OPS_PER_THREAD = 100'000;
    counter = 0;
    total_attempts = 0;
    total_successes = 0;

    std::vector<std::thread> threads;
    auto start = std::chrono::high_resolution_clock::now();

    for (int t = 0; t < num_threads; t++) {
        threads.emplace_back(contended_increment, OPS_PER_THREAD);
    }
    for (auto& t : threads) t.join();

    auto end = std::chrono::high_resolution_clock::now();
    double elapsed_sec = std::chrono::duration<double>(end - start).count();

    long attempts = total_attempts.load();
    long successes = total_successes.load();
    double success_rate = 100.0 * successes / attempts;
    double throughput = successes / elapsed_sec;
    double wasted_ratio = (double)(attempts - successes) / attempts;

    std::cout << std::setw(3) << num_threads << " threads: "
              << std::fixed << std::setprecision(2)
              << "Success rate: " << std::setw(6) << success_rate << "% | "
              << "Throughput: " << std::setw(10) << (throughput / 1e6) << " Mops/s | "
              << "Wasted work: " << std::setw(6) << (wasted_ratio * 100) << "%"
              << std::endl;
}

int main() {
    std::cout << "CAS Contention Analysis\n\n";
    for (int threads : {1, 2, 4, 8, 16, 32}) {
        analyze_contention(threads);
    }
    return 0;
}

/*
 * Sample output (8-core CPU):
 *
 *   1 threads: Success rate: 100.00% | Throughput: 95.00 Mops/s | Wasted work:  0.00%
 *   2 threads: Success rate:  51.23% | Throughput: 48.50 Mops/s | Wasted work: 48.77%
 *   4 threads: Success rate:  26.15% | Throughput: 35.20 Mops/s | Wasted work: 73.85%
 *   8 threads: Success rate:  13.02% | Throughput: 28.10 Mops/s | Wasted work: 86.98%
 *  16 threads: Success rate:   6.45% | Throughput: 25.80 Mops/s | Wasted work: 93.55%
 *  32 threads: Success rate:   3.21% | Throughput: 24.50 Mops/s | Wasted work: 96.79%
 *
 * Key insights:
 * - Success rate drops as 1/N (one winner per round)
 * - Throughput saturates around core count
 * - Vast majority of work is wasted under high contention
 */
```

If your design requires high-frequency CAS on a single location, you will not scale beyond a handful of cores. Solutions: (1) Shard data across multiple atomics, accessed by hash or round-robin; (2) Use combining or batching to aggregate operations; (3) Accept eventual consistency and use thread-local storage with periodic synchronization; (4) Consider whether you actually need strict atomicity.
On NUMA (Non-Uniform Memory Access) systems, memory access latency depends on which processor is accessing which memory. This has significant implications for CAS-heavy code.
In NUMA systems:
- Each socket (node) has its own memory controller and locally attached DRAM.
- Accessing local memory is fast; reaching another node's memory crosses the inter-socket interconnect and is markedly slower.
- Caches remain coherent across sockets, but cross-socket coherence traffic is far more expensive than traffic within a single socket.
When threads on different nodes contend for a CAS, the cache line must cross the socket interconnect on every change of ownership:
| Scenario | Latency | Why |
|---|---|---|
| Same core (uncontended) | 15-20ns | L1 cache hit |
| Same node, different core | 50-80ns | L3 cache + coherence |
| Different nodes (cross-socket) | 200-400ns | Remote cache or memory access |
Cross-socket CAS is 10-20x slower than local CAS. This dramatically amplifies contention costs on NUMA systems.
```cpp
#include <atomic>
#include <vector>
#include <thread>
#include <numa.h>   // Linux NUMA library (link with -lnuma)
#include <sched.h>

// NUMA-aware design: per-node sharding

struct PerNodeCounter {
    // Each counter on its own cache line to prevent false sharing
    alignas(64) std::atomic<long> count{0};
};

class NUMAShardedCounter {
private:
    int num_nodes;
    std::vector<PerNodeCounter> node_counters;

public:
    // The vector is sized in the init list: PerNodeCounter holds a
    // std::atomic and cannot be moved, which vector::resize would require.
    NUMAShardedCounter()
        : num_nodes(numa_num_configured_nodes()),
          node_counters(num_nodes) {}

    void increment() {
        // Get current NUMA node
        int node = numa_node_of_cpu(sched_getcpu());
        // Increment local counter (fast - no cross-node traffic)
        node_counters[node].count.fetch_add(1, std::memory_order_relaxed);
    }

    long get_total() {
        long total = 0;
        for (int i = 0; i < num_nodes; i++) {
            total += node_counters[i].count.load(std::memory_order_relaxed);
        }
        return total;
    }
};

// Affinity helper: bind thread to specific node
void bind_to_node(int node) {
    struct bitmask* mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    numa_bind(mask);
    numa_bitmask_free(mask);
}

/*
 * NUMA-aware design principles:
 *
 * 1. Shard by node: Keep hot data node-local
 * 2. Allocate memory on the node that will use it (numa_alloc_local)
 * 3. Pin threads to nodes when possible
 * 4. Aggregate cross-node communication (batch updates)
 * 5. Monitor NUMA balance: 'numastat' and perf counters
 *
 * Naive shared counter:  200-400 ns/op cross-socket
 * NUMA-sharded counter:  15-20 ns/op (each thread hits local counter)
 *
 * This is a 10-20x speedup on multi-socket systems!
 */
```

| Strategy | Description | When to Use |
|---|---|---|
| Per-node sharding | Each NUMA node has its own copy of data | Counters, statistics, approximate aggregates |
| Memory affinity | Allocate memory on the node that uses it | Large allocations for thread-local work |
| Thread pinning | Bind threads to specific cores/nodes | Latency-sensitive, steady-state workloads |
| Hierarchical design | Local → node-local → global aggregation | Complex data structures with mixed access |
| NUMA-aware allocation | Use numa_alloc_onnode() or mbind() | Lock-free data structures with known access patterns |
In cloud environments, NUMA topology may be exposed to VMs (depending on hypervisor settings) or hidden. Code running in containers or VMs should either query NUMA topology at runtime (numa_available(), numa_num_configured_nodes()) or be designed to work efficiently without NUMA awareness. Making assumptions about topology can hurt performance when the assumption is wrong.
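A minimal sketch of that runtime check, assuming libnuma is available at build time (link with -lnuma); the helper name `effective_num_nodes` is hypothetical. If the topology is hidden, treat the machine as a single node and size the shard array accordingly.

```cpp
#include <numa.h>

// Degrade gracefully when NUMA information is unavailable
// (e.g., inside a container or VM that hides the topology).
int effective_num_nodes() {
    if (numa_available() == -1) {
        return 1;  // No NUMA support exposed: behave as a single-node machine
    }
    int nodes = numa_num_configured_nodes();
    return nodes > 0 ? nodes : 1;
}
```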
Different CPU architectures implement CAS differently, leading to varying performance characteristics. Understanding these differences is important for writing portable, high-performance code.
x86/x64 provides a native CAS instruction with predictable semantics:
- LOCK CMPXCHG provides atomic compare-and-swap
- A double-width form (CMPXCHG16B) is available for 128-bit operations

Performance characteristics:
- Uncontended latency of roughly 15-25 cycles
- No spurious failures, and the strong TSO memory model keeps barrier overhead low
ARM relied on Load-Linked/Store-Conditional prior to ARMv8.1:
- CAS is synthesized from an LDREX/STREX (exclusive load/store) retry loop
- The store-conditional can fail spuriously, so compare_exchange_weak maps onto it naturally, while compare_exchange_strong needs an extra internal retry loop
ARMv8.1+ adds CAS instructions:
- A native CAS instruction that never fails spuriously
- CASP (compare and swap pair) for 128-bit operations

Performance characteristics:
- Uncontended latency of roughly 20-30 cycles
- Lower overhead than an LL/SC loop, though barriers on the weak memory model still cost more than on x86
| Aspect | x86/x64 | ARM (pre-8.1) | ARM (8.1+) | RISC-V |
|---|---|---|---|---|
| CAS instruction | CMPXCHG | LDREX/STREX | CAS, CASP | LR/SC |
| Spurious failure | Never | Possible | Never (CAS) | Possible |
| Memory model | TSO (strong) | Weak | Weak | Weak (RVWMO) |
| 128-bit CAS | CMPXCHG16B | LDREXD/STREXD | CASP | LR/SC pair |
| Barrier overhead | Low (TSO) | High (explicit) | Medium | High |
| Uncontended latency | ~15-25 cycles | ~30-50 cycles | ~20-30 cycles | ~25-40 cycles |
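These differences are visible directly in compiler output. As a hedged illustration (the exact instructions depend on compiler version and flags), the same one-line atomic increment typically lowers to a single locked instruction on x86-64, a single LSE instruction on ARMv8.1+, and an LL/SC retry loop on older ARM cores; inspecting the assembly of a snippet like this is a quick way to confirm what your toolchain emits.

```cpp
#include <atomic>

// The same portable C++ lowers to very different instructions per target.
// Typical codegen (compiler- and flag-dependent):
//   x86-64:              LOCK XADD
//   AArch64 with LSE:    a single LDADDAL (e.g., -march=armv8.1-a or -moutline-atomics)
//   AArch64 without LSE: an LDAXR/STLXR retry loop
std::atomic<long> ticket{0};

long next_ticket() {
    return ticket.fetch_add(1, std::memory_order_acq_rel);
}
```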
```cpp
#include <atomic>

// Writing portable atomic code that performs well on all architectures

std::atomic<int> counter{0};

// ═══════════════════════════════════════════════════════════════
// Pattern: Use weak CAS in loops for ARM/RISC-V efficiency
// ═══════════════════════════════════════════════════════════════
void increment_portable() {
    int expected = counter.load(std::memory_order_relaxed);

    // compare_exchange_weak:
    // - On x86: compiles to same code as strong (no spurious failure)
    // - On ARM LL/SC: avoids retry loop inside the CAS primitive
    // - Best choice when in an outer retry loop anyway
    while (!counter.compare_exchange_weak(
        expected, expected + 1,
        std::memory_order_release,  // Just need release on success
        std::memory_order_relaxed   // Relaxed on failure (retrying anyway)
    )) {
        // expected is updated to current value automatically
    }
}

// ═══════════════════════════════════════════════════════════════
// Pattern: Minimize memory ordering for weak architectures
// ═══════════════════════════════════════════════════════════════
std::atomic<bool> flag{false};
int data = 0;

void producer_portable() {
    data = 42;
    // release is sufficient - ensures data write is visible before flag
    // On x86: no extra cost (TSO)
    // On ARM: generates barrier instruction (necessary for correctness)
    flag.store(true, std::memory_order_release);
}

void consumer_portable() {
    // acquire is sufficient - ensures we see data after seeing flag
    while (!flag.load(std::memory_order_acquire)) {
        // spin
    }
    // Safe to read data
    int x = data;
}

// ═══════════════════════════════════════════════════════════════
// Anti-pattern: Assuming x86 memory model
// ═══════════════════════════════════════════════════════════════
void dangerous_on_arm() {
    data = 42;
    // BUG: Relaxed store doesn't guarantee ordering!
    // Works on x86 (TSO), FAILS on ARM (may see flag=true before data=42)
    flag.store(true, std::memory_order_relaxed);  // DON'T DO THIS
}

/*
 * Portability guidelines:
 *
 * 1. Use compare_exchange_weak in loops (no downside on x86)
 * 2. Use explicit memory orderings, not defaults (seq_cst)
 * 3. Never assume x86 memory model - code for weak ordering
 * 4. Test on ARM or use tools like CDSChecker to find ordering bugs
 * 5. Use relaxed only when you truly don't need ordering
 */
```

Code that works perfectly on x86 may be subtly broken on ARM due to weaker memory ordering. The x86 memory model 'covers up' many ordering bugs. Always test atomic code on the weakest-ordered architecture you'll deploy on, or use formal tools (ThreadSanitizer, CDSChecker) to find ordering issues.
Effective optimization requires accurate measurement. CAS performance profiling has unique challenges due to the timing sensitivities involved.
```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <algorithm>
#include <vector>
#include <numeric>
#include <iostream>
#include <iomanip>

// Technique 1: Batch timing (amortize measurement overhead)
void benchmark_batch_timing() {
    std::atomic<long> counter{0};
    const int BATCH_SIZE = 100'000;
    const int NUM_BATCHES = 100;

    std::vector<double> batch_times;
    batch_times.reserve(NUM_BATCHES);

    for (int batch = 0; batch < NUM_BATCHES; batch++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < BATCH_SIZE; i++) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
        auto end = std::chrono::high_resolution_clock::now();

        double ns_total = std::chrono::duration<double, std::nano>(end - start).count();
        batch_times.push_back(ns_total / BATCH_SIZE);
    }

    // Compute statistics
    std::sort(batch_times.begin(), batch_times.end());
    double median = batch_times[NUM_BATCHES / 2];
    double p99 = batch_times[NUM_BATCHES * 99 / 100];
    double mean = std::accumulate(batch_times.begin(), batch_times.end(), 0.0) / NUM_BATCHES;

    std::cout << "CAS Latency Statistics (ns/op):" << std::endl;
    std::cout << "  Mean:   " << std::fixed << std::setprecision(2) << mean << std::endl;
    std::cout << "  Median: " << median << std::endl;
    std::cout << "  P99:    " << p99 << std::endl;
}

// Technique 2: CPU cycle counting (finer granularity)
#ifdef __x86_64__
#include <x86intrin.h>

inline uint64_t rdtsc() {
    return __rdtsc();
}

void benchmark_rdtsc() {
    std::atomic<long> counter{0};
    const int ITERATIONS = 10'000'000;

    uint64_t start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    uint64_t end = rdtsc();

    double cycles_per_op = (double)(end - start) / ITERATIONS;
    // Note: RDTSC may not be constant rate on all CPUs
    // For accurate time, still need to calibrate against wall clock
    std::cout << "Cycles per CAS: " << cycles_per_op << std::endl;
}
#endif

// Technique 3: Using perf counters (Linux)
// Run with: perf stat -e cache-misses,L1-dcache-loads ./program
//
// Useful counters for CAS analysis:
// - cache-misses: High value indicates cache line bouncing
// - mem_load_retired.l3_miss: Remote memory access (NUMA)
// - offcore_response.all_data_rd.llc_miss.remote_hitm
//   ^ Measures cross-socket cache-to-cache transfers

/*
 * Profiling Best Practices:
 *
 * 1. Warm up before measurement (cache priming)
 * 2. Run multiple iterations and compute statistics
 * 3. Disable CPU frequency scaling (set to performance governor)
 * 4. Pin threads to specific cores to reduce variability
 * 5. Isolate benchmark process (nice, cgroups, dedicated cores)
 * 6. Profile under realistic contention levels
 * 7. Look at both latency and throughput
 */
```

| Metric | What It Tells You | How to Measure |
|---|---|---|
| Latency (ns/op) | Time per operation | Batch timing, rdtsc |
| Throughput (ops/sec) | Total operations per second | Aggregate timing |
| CAS success rate | Percentage of CAS attempts that succeed | Count successes vs attempts |
| Cache miss rate | How often CAS misses cache | perf stat, l1/l3 miss counters |
| Contention level | How many threads compete | Application-level counting |
| Fairness | Variance in per-thread throughput | Per-thread instrumentation |
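The last row of the table, fairness, is the one metric the earlier benchmarks do not report. Here is a minimal sketch of per-thread instrumentation under an assumed fixed-duration run: each worker counts its own completed operations, and a wide spread between the fastest and slowest thread signals unfair progress under contention. The thread count and run duration are arbitrary for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> shared{0};

int main() {
    const int THREADS = 4;
    const auto RUN_FOR = std::chrono::milliseconds(200);
    std::vector<long> per_thread_ops(THREADS, 0);
    std::atomic<bool> stop{false};
    std::vector<std::thread> workers;

    for (int t = 0; t < THREADS; t++) {
        workers.emplace_back([&, t]() {
            long ops = 0;
            while (!stop.load(std::memory_order_relaxed)) {
                shared.fetch_add(1, std::memory_order_relaxed);
                ops++;
            }
            per_thread_ops[t] = ops;  // each thread writes only its own slot
        });
    }

    std::this_thread::sleep_for(RUN_FOR);
    stop.store(true, std::memory_order_relaxed);
    for (auto& w : workers) w.join();

    // Compare per-thread counts: large variance indicates unfairness
    for (int t = 0; t < THREADS; t++) {
        std::cout << "thread " << t << ": " << per_thread_ops[t] << " ops\n";
    }
}
```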
Microbenchmarks measure best-case or worst-case, rarely real-world. Profile your actual application under realistic load. A CAS that shows 15ns in isolation might effectively cost 200ns in your application due to cache pollution, contention, and other effects. Use application-level metrics (requests/sec, p99 latency) as the ultimate measure of success.
Bringing together everything we've learned, here are the key strategies for optimizing CAS-heavy code.
The single most effective optimization is reducing how many threads compete for the same CAS location:
```cpp
#include <atomic>
#include <vector>
#include <thread>

// ═══════════════════════════════════════════════════════════════
// Strategy 1: Sharding to reduce contention
// ═══════════════════════════════════════════════════════════════
template<int NUM_SHARDS = 16>
class ShardedCounter {
    struct alignas(64) Shard {  // Prevent false sharing
        std::atomic<long> count{0};
    };
    Shard shards[NUM_SHARDS];

public:
    void increment(int thread_hint) {
        // Map thread to shard (reduce contention by NUM_SHARDS factor)
        int shard = thread_hint % NUM_SHARDS;
        shards[shard].count.fetch_add(1, std::memory_order_relaxed);
    }

    long get_total() {
        long total = 0;
        for (int i = 0; i < NUM_SHARDS; i++) {
            total += shards[i].count.load(std::memory_order_relaxed);
        }
        return total;
    }
};

// ═══════════════════════════════════════════════════════════════
// Strategy 2: Thread-local accumulation with periodic sync
// ═══════════════════════════════════════════════════════════════
std::atomic<long> global_counter{0};
thread_local long local_counter = 0;
constexpr int SYNC_THRESHOLD = 1000;

void increment_batched() {
    local_counter++;
    if (local_counter >= SYNC_THRESHOLD) {
        // One CAS per 1000 increments instead of 1000 CAS
        global_counter.fetch_add(local_counter, std::memory_order_relaxed);
        local_counter = 0;
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 3: Read-before-write (TTAS pattern)
// ═══════════════════════════════════════════════════════════════
std::atomic<bool> lock{false};

void acquire_optimized() {
    while (true) {
        // Spin on read (cheap, uses shared cache line)
        while (lock.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();
        }
        // Then try CAS (expensive, but we have a good chance)
        bool expected = false;
        if (lock.compare_exchange_weak(expected, true, std::memory_order_acquire)) {
            return;
        }
    }
}

// ═══════════════════════════════════════════════════════════════
// Strategy 4: Use specialized operations when possible
// ═══════════════════════════════════════════════════════════════
std::atomic<int> flags{0};

void set_flag(int bit) {
    // fetch_or is more efficient than a CAS loop for setting bits
    flags.fetch_or(1 << bit, std::memory_order_release);
}

void clear_flag(int bit) {
    // fetch_and for clearing
    flags.fetch_and(~(1 << bit), std::memory_order_release);
}

bool try_set_flag_if_clear(int bit) {
    // Still need CAS for conditional set
    int expected = flags.load(std::memory_order_relaxed);
    while (!(expected & (1 << bit))) {
        if (flags.compare_exchange_weak(
                expected, expected | (1 << bit),
                std::memory_order_acq_rel)) {
            return true;
        }
    }
    return false;  // Flag was already set
}
```

The best optimization depends on your specific workload, hardware, and correctness requirements. A technique that helps in one scenario may hurt in another. Always profile before and after optimizing, and ensure correctness is maintained. The fastest code that produces wrong results is worthless.
We have completed our deep dive into CAS performance. From hardware fundamentals to optimization strategies, this knowledge enables you to reason about, measure, and improve the performance of CAS-based code. It also brings our exploration of Compare-and-Swap to a close.
You now possess deep knowledge of the most important atomic primitive in modern computing. Whether you're implementing high-performance data structures, debugging concurrency issues, or simply understanding the synchronization primitives you use daily, this foundation will serve you well.
Congratulations on completing the Compare-and-Swap module! You've gained comprehensive understanding of CAS operations, lock-free programming, CAS-based synchronization constructs, and performance optimization. This knowledge places you among developers who truly understand the foundations of concurrent systems.