We've mastered atomic operations and CAS. Now we turn to their ultimate application: lock-free programming—writing concurrent code that never blocks, never deadlocks, and guarantees system-wide progress even when individual threads are delayed or suspended.
Lock-free programming is simultaneously one of the most powerful and most dangerous tools in a systems programmer's arsenal. When done correctly, it enables extraordinary performance and reliability. When done incorrectly, it creates bugs that are nearly impossible to diagnose—bugs that manifest once in a million runs, only in production, only under load.
This page provides the theoretical foundation and practical guidance you need to approach lock-free programming with appropriate respect and competence.
This page covers the progress guarantees (lock-free, wait-free, obstruction-free), how to design lock-free algorithms, common patterns and pitfalls, performance considerations, memory ordering requirements, and when lock-free is worth the complexity.
Concurrent algorithms are classified by the progress guarantees they provide. These guarantees describe what happens when multiple threads compete for a resource:
Blocking Algorithms (Locks)
With locks, if one thread holds the lock and is delayed (preempted, suspended, or crashes), all other threads waiting for that lock make no progress. A single slow/dead thread can block the entire system.
Non-Blocking Algorithms
Non-blocking algorithms guarantee progress even when some threads are delayed. They come in three flavors, each with stronger guarantees:
| Guarantee | Definition | Starvation? | Use Case |
|---|---|---|---|
| Obstruction-Free | A thread completes if all other threads pause | Possible | Simplest to implement; rarely used alone |
| Lock-Free | At least one thread makes progress at all times | Possible | Most common non-blocking guarantee |
| Wait-Free | Every thread makes progress in bounded steps | Impossible | Strongest guarantee; hardest to achieve |
Obstruction-Free
An algorithm is obstruction-free if any thread, when executing in isolation (all other threads paused), will complete its operation in a finite number of steps.
This is the weakest non-blocking guarantee. It says nothing about what happens under contention—threads might livelock (repeatedly interfering with each other without any making progress).
Lock-Free
An algorithm is lock-free if, when any number of threads execute operations, at least one thread completes its operation in a finite number of steps.
This guarantees system-wide progress: the system as a whole is never stuck. However, individual threads can starve—repeatedly losing the race to complete their operation while others succeed.
Wait-Free
An algorithm is wait-free if every thread completes its operation in a bounded number of steps, regardless of other threads' behavior.
This is the strongest guarantee: no starvation, no livelocks, predictable worst-case latency. But wait-free algorithms are significantly harder to design and often slower in practice due to the bookkeeping overhead for fairness.
```typescript
// LOCK-FREE: At least one thread always succeeds
// The CAS loop is lock-free: if CAS fails, another thread succeeded
class LockFreeCounter {
  private value: Atomic<number> = new Atomic(0);

  increment(): void {
    while (true) {
      const current = this.value.load();
      if (this.value.compareAndSet(current, current + 1)) {
        return; // Success!
      }
      // If we failed, someone else succeeded → progress was made
    }
  }
}

// WAIT-FREE: Every thread completes in bounded steps
// fetch_add is wait-free: completes in O(1) regardless of contention
class WaitFreeCounter {
  private value: Atomic<number> = new Atomic(0);

  increment(): number {
    // fetch_add always completes in one atomic operation
    // No retry loop, no possibility of spinning indefinitely
    return this.value.fetchAdd(1);
  }
}

// BLOCKING (for comparison): One slow thread blocks all
class BlockingCounter {
  private value: number = 0;
  private mutex: Mutex = new Mutex();

  increment(): void {
    this.mutex.lock(); // If holder is delayed, we wait here forever
    try {
      this.value++;
    } finally {
      this.mutex.unlock();
    }
  }
}
```

Lock-free algorithms can starve individual threads. Under heavy contention, a fast thread might repeatedly win the CAS race, leaving slow threads spinning indefinitely. If fairness matters, you need either wait-free algorithms or lock-free algorithms with added fairness mechanisms.
Designing lock-free algorithms requires a fundamentally different mindset from lock-based programming. Here are the core principles:
```typescript
// HELPING: Other threads complete pending operations

class TwoPhaseSet<T> {
  // Pointer to current state (immutable once published)
  private state: AtomicReference<SetState<T>>;

  add(item: T): boolean {
    while (true) {
      // 1. Read current state
      const current = this.state.load();

      // 2. Check if another thread has a pending operation
      if (current.pendingOp !== null) {
        // HELP: Complete the pending operation first
        this.helpComplete(current.pendingOp);
        continue; // Retry with fresh state
      }

      // 3. Check if item already exists
      if (current.items.has(item)) {
        return false; // Already present
      }

      // 4. Create new state with our addition
      const newState = new SetState(
        new Set([...current.items, item]),
        null // No pending operation
      );

      // 5. Try to CAS our new state in
      if (this.state.compareAndSet(current, newState)) {
        return true;
      }
      // CAS failed; someone else modified; retry
    }
  }

  private helpComplete(op: PendingOp): void {
    // Execute the pending operation on behalf of the original thread
    // This ensures the system makes progress even if the original
    // thread is delayed or crashed
  }
}

// Key insight: By helping other threads, we guarantee that
// the system never stalls waiting for any single thread.
```

The Linearization Point
Every operation in a lock-free data structure has a linearization point—the instant at which the operation appears to take effect. For correctness, operations must appear to execute atomically at their linearization points.
For a CAS-based push onto a stack:
1. Load current head ← Could see old head
2. Set newNode.next = head ← Preparing, not yet visible
3. CAS(head, oldHead, newNode) ← LINEARIZATION POINT (if successful)
If CAS succeeds, the push 'happened' at that instant.
If CAS fails, the push hasn't happened yet; retry.
Identifying linearization points is essential for proving correctness and understanding the observable behavior of concurrent operations.
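To make the linearization point concrete, here is a minimal C++ sketch of a Treiber-stack push (illustrative, not from a specific library; `pop` and safe memory reclamation are omitted since they need extra machinery such as hazard pointers):

```cpp
#include <atomic>

// The successful CAS on head_ is the linearization point: the push becomes
// visible to all threads at exactly that instant.
struct Node {
    int value;
    Node* next;
};

class TreiberStack {
    std::atomic<Node*> head_{nullptr};

public:
    void push(int value) {
        Node* node = new Node{value, nullptr};
        Node* old_head = head_.load(std::memory_order_relaxed); // Step 1: may see an old head
        do {
            node->next = old_head;                               // Step 2: prepare; not yet visible
        } while (!head_.compare_exchange_weak(
            old_head, node,
            std::memory_order_release,    // Step 3: LINEARIZATION POINT on success
            std::memory_order_relaxed));  // On failure, old_head is refreshed; retry
    }
};
```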
Lock-free programming requires careful attention to memory ordering. Modern CPUs and compilers reorder memory operations for performance, which can cause lock-free algorithms to behave incorrectly if ordering isn't specified.
The Problem:
```cpp
// Thread 1
data = 42;           // Store A
ready.store(true);   // Store B

// Thread 2
if (ready.load()) {  // Load B
    use(data);       // Load A
}
```
Without proper ordering, Thread 2 might see ready == true but data != 42: the compiler or CPU may move Store A after Store B, or hoist Load A before Load B. Memory orderings constrain which reorderings are allowed:
| Order | Prevents | Cost | Use Case |
|---|---|---|---|
| Relaxed | Nothing | Lowest | Statistics counters where order doesn't matter |
| Consume | Data-dependent reads moved before (compilers promote it to Acquire in practice) | Low | Following pointer chains (rarely used) |
| Acquire | Later reads/writes moved before | Medium | Reading synchronization variables |
| Release | Earlier reads/writes moved after | Medium | Writing synchronization variables |
| Acq_Rel | Both acquire and release | Medium | Read-modify-write on sync variables |
| Seq_Cst | All reordering, plus total order | Highest | Simple correctness, sequential reasoning |
```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// WRONG: No ordering guarantees
void publish_wrong() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_relaxed);
    // Compiler/CPU might reorder these!
}

// CORRECT: Release/Acquire ordering
void publish_correct() {
    data.store(42, std::memory_order_relaxed);    // Ordinary store is fine...
    ready.store(true, std::memory_order_release); // ...because release prevents reordering
}

bool consume_correct() {
    if (ready.load(std::memory_order_acquire)) {            // Acquire matches release
        return data.load(std::memory_order_relaxed) == 42;  // Guaranteed true!
    }
    return false;
}

// SIMPLEST: Sequential consistency (default)
void publish_simple() {
    data.store(42);    // Default: seq_cst
    ready.store(true); // Default: seq_cst
    // No reordering possible; easiest to reason about
}

bool consume_simple() {
    if (ready.load()) {           // Default: seq_cst
        return data.load() == 42; // Guaranteed true!
    }
    return false;
}
```

When writing lock-free code, start with sequential consistency (seq_cst) for all atomics. It's the easiest to reason about. Only switch to weaker orderings after profiling proves they're necessary AND you fully understand the implications. An incorrect memory order causes bugs that are virtually impossible to reproduce or debug.
The Release/Acquire Pattern
The most common pattern for synchronization is release-acquire: one thread releases data (makes it available), another acquires it (takes ownership or reads it).
This creates a "synchronizes-with" relationship between the threads.
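A small runnable sketch of this handoff in C++ (the `payload`/`flag` names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag{false};
int payload = 0; // Plain int: made safe by the release/acquire pair

int main() {
    std::thread producer([] {
        payload = 42;                                // Happens-before the release store
        flag.store(true, std::memory_order_release); // RELEASE: publish
    });
    std::thread consumer([] {
        while (!flag.load(std::memory_order_acquire)) {} // ACQUIRE: synchronizes-with the release
        assert(payload == 42); // Everything before the release is visible after the acquire
    });
    producer.join();
    consumer.join();
}
```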
Several patterns appear repeatedly in lock-free programming. Recognizing them accelerates understanding and design.
```typescript
// READ-COPY-UPDATE: Updates create new versions; readers see consistent snapshots

class RCUStyle<T> {
  // Pointer to current state (immutable once published)
  private current: AtomicReference<T>;

  // Readers are lock-free and fast
  read(): T {
    return this.current.load(); // Always sees a consistent version
  }

  // Writers create new version and swap
  update(modifyFn: (old: T) => T): void {
    while (true) {
      const oldVersion = this.current.load();
      const newVersion = modifyFn(oldVersion); // Copy + modify
      if (this.current.compareAndSet(oldVersion, newVersion)) {
        // Old version can be garbage collected when safe
        return;
      }
      // Contention: another writer; retry
    }
  }
}

// RCU is ideal for read-heavy workloads:
// - Readers never block
// - Writers pay the cost of copying
// - Old versions remain valid until readers finish
```
```typescript
// STATE MACHINE: Atomic state transitions

enum State {
  IDLE,
  STARTING,
  RUNNING,
  STOPPING,
  STOPPED
}

class LifecycleManager {
  private state: Atomic<State> = new Atomic(State.IDLE);

  start(): boolean {
    // Only IDLE → STARTING transition is valid
    if (this.state.compareAndSet(State.IDLE, State.STARTING)) {
      this.doStart();
      this.state.store(State.RUNNING);
      return true;
    }
    return false; // Already started or stopping
  }

  stop(): boolean {
    // Only RUNNING → STOPPING transition is valid
    if (this.state.compareAndSet(State.RUNNING, State.STOPPING)) {
      this.doStop();
      this.state.store(State.STOPPED);
      return true;
    }
    return false; // Not running
  }

  getState(): State {
    return this.state.load();
  }

  private doStart(): void { /* initialization */ }
  private doStop(): void { /* cleanup */ }
}

// CAS ensures exactly one thread performs each transition
```
```typescript
// SEQLOCK: Optimistic read for data that's rarely written

class SeqLock<T> {
  private sequence: Atomic<number> = new Atomic(0);
  private data: T; // NOT atomic; protected by sequence

  // Writer: Signal exclusive access via odd sequence
  // NOTE: assumes a single writer (or an external writer lock); with
  // multiple writers, claim the odd sequence with CAS instead of fetchAdd
  write(newData: T): void {
    // 1. Increment to odd (signals write in progress)
    this.sequence.fetchAdd(1);
    // 2. Write the data (non-atomic is fine; writes are exclusive)
    this.data = newData;
    // 3. Increment to even (signals write complete)
    this.sequence.fetchAdd(1);
  }

  // Reader: Optimistic read with retry
  read(): T {
    while (true) {
      // 1. Read sequence before
      const seq1 = this.sequence.load();
      // 2. If odd, write in progress; spin
      if (seq1 % 2 !== 0) continue;
      // 3. Read the data
      const result = this.data;
      // 4. Read sequence after
      const seq2 = this.sequence.load();
      // 5. If sequence unchanged, the read is consistent
      if (seq1 === seq2) {
        return result; // Consistent read!
      }
      // Sequence changed; data might be torn; retry
    }
  }
}

// SeqLock is ideal when:
// - Writes are rare
// - Reads are frequent
// - Data is small (fits in cache line)
```

Lock-free algorithms promise better performance, but the reality is nuanced. Several factors determine whether lock-free is actually faster:
Contention Level
Low contention: Locks often win. An uncontended lock acquisition is just 20-50ns. Lock-free overhead (more complex code, memory barriers) may exceed this.
High contention: Lock-free shines. Threads don't block on each other; they retry. System makes continuous progress.
Extreme contention: Neither wins. CAS retries burn cycles; locks queue threads. Consider restructuring to reduce shared state.
| Factor | Favors Lock-Free | Favors Locks |
|---|---|---|
| Contention | High (many threads competing) | Low (rare conflicts) |
| Critical section | Short (few instructions) | Long (complex logic) |
| Read/Write ratio | Read-heavy | Write-heavy |
| Thread count | Many threads | Few threads |
| Latency requirement | Bounded worst-case | Average-case focus |
| Priority inversion | Must be avoided | Acceptable |
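When contention is extreme, one common restructuring is to split the hot variable into per-thread shards. A minimal C++ sketch (the `ShardedCounter` type and shard count are illustrative assumptions, not from the text):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

constexpr int kShards = 16;

// Each shard sits on its own cache line, so increments from different
// threads don't fight over a single line.
struct alignas(64) Shard {
    std::atomic<long> value{0};
};

class ShardedCounter {
    Shard shards_[kShards];

public:
    void increment() {
        // Hash the thread id onto a shard; collisions merely share a slot
        std::size_t i =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
        shards_[i].value.fetch_add(1, std::memory_order_relaxed);
    }

    long read() const {
        long sum = 0;
        for (const Shard& s : shards_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum; // Approximate while writers run; exact once they stop
    }
};
```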
Cache Effects
Atomic operations cause significant cache traffic:
Cache Line Bouncing: When multiple cores access the same cache line, it bounces between L1 caches (MESI protocol transitions). This is expensive—hundreds of cycles.
False Sharing: Unrelated variables on the same cache line interfere. Padding to separate cache lines is common.
```cpp
#include <atomic>
#include <new> // std::hardware_destructive_interference_size

// BAD: These atomics share a cache line (likely 64 bytes)
struct BadCounters {
    std::atomic<long> counter1; // Offset 0
    std::atomic<long> counter2; // Offset 8
    // Both on same 64-byte cache line!
    // Updating counter1 invalidates counter2 in other caches
};

// GOOD: Pad each counter to its own cache line
struct alignas(64) PaddedCounter {
    std::atomic<long> value;
    char padding[64 - sizeof(std::atomic<long>)];
};

struct GoodCounters {
    PaddedCounter counter1; // Offset 0, own cache line
    PaddedCounter counter2; // Offset 64, own cache line
    // Updating one doesn't affect the other
};

// C++17 provides std::hardware_destructive_interference_size
struct ModernPaddedCounter {
    alignas(std::hardware_destructive_interference_size)
    std::atomic<long> value;
};
```

Under contention, naive CAS loops spin tightly, wasting CPU. Backoff strategies help: pause/yield after failures, exponential backoff (wait longer after repeated failures), or adaptive backoff (adjust based on observed contention). These improve throughput and reduce power consumption.
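A minimal sketch of exponential backoff around a CAS loop (the helper and its constants are illustrative assumptions):

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// After each failed CAS, sleep roughly twice as long (capped), so
// contending threads spread out instead of hammering the cache line.
void increment_with_backoff(std::atomic<long>& counter) {
    int delay_us = 1; // Start around one microsecond
    long current = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(current, current + 1,
                                          std::memory_order_relaxed)) {
        std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
        delay_us = std::min(delay_us * 2, 1000); // Cap the wait at ~1 ms
        // compare_exchange_weak refreshed `current`; just retry
    }
}
```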
Lock-free programming is a powerful tool, but it's not always the right choice. As the comparison table above suggests, locks are preferable when contention is low, critical sections are long or complex, thread counts are small, or average-case performance matters more than bounded worst-case latency.
Lock-free code is 2-10x more complex than lock-based equivalents. It's extremely easy to introduce subtle bugs that only manifest under specific timing conditions. Never go lock-free without profiling evidence that locks are the bottleneck AND confidence that you can verify correctness.
Lock-free bugs are notoriously hard to reproduce. A race condition might manifest once in a billion executions—but that one time might be in production. Thorough testing is essential.
```typescript
// STRESS TEST: Hammer the data structure from many threads

async function stressTestStack(
  stack: LockFreeStack<number>,
  numThreads: number,
  opsPerThread: number
): Promise<void> {
  const promises: Promise<void>[] = [];
  const pushedCounts: number[] = new Array(numThreads).fill(0);
  const poppedCounts: number[] = new Array(numThreads).fill(0);

  // Each thread pushes and pops randomly
  for (let t = 0; t < numThreads; t++) {
    promises.push((async () => {
      for (let i = 0; i < opsPerThread; i++) {
        if (Math.random() < 0.5) {
          const value = t * opsPerThread + i;
          stack.push(value);
          pushedCounts[t]++;
        } else {
          const value = stack.pop();
          if (value !== null) {
            poppedCounts[t]++;
          }
        }
      }
    })());
  }

  await Promise.all(promises);

  // Drain remaining items
  let remaining = 0;
  while (stack.pop() !== null) {
    remaining++;
  }

  // Verify: total pushed == total popped + remaining
  const totalPushed = pushedCounts.reduce((a, b) => a + b, 0);
  const totalPopped = poppedCounts.reduce((a, b) => a + b, 0);
  console.assert(
    totalPushed === totalPopped + remaining,
    "Every pushed item must be popped or drained"
  );
  console.assert(stack.isEmpty(), "Stack should be empty after draining");
  console.log(`Stress test: ${totalPushed} pushed, ${totalPopped} popped, ${remaining} drained`);
}

// Run with: stressTestStack(new LockFreeStack(), 16, 100000);
```

No amount of testing can prove a lock-free algorithm is correct. Tests can find bugs, but absence of failures doesn't mean absence of bugs. For critical code, combine testing with formal verification or model checking.
We've covered the foundations of lock-free programming: the non-blocking progress hierarchy, linearization points, memory ordering, recurring design patterns, performance trade-offs, and testing strategies.
What You've Accomplished
You've completed the Atomic Operations module. You now understand atomicity, compare-and-swap, the progress guarantees (obstruction-free, lock-free, wait-free), and the memory ordering rules that make lock-free code correct. This knowledge positions you to understand how concurrent containers work, design lock-free algorithms when appropriate, and make informed decisions about when lock-free complexity is justified.
Congratulations! You've mastered atomic operations—the indivisible building blocks of concurrent programming. From basic atomicity concepts through CAS and lock-free programming, you now have the knowledge to reason about and build correct, high-performance concurrent systems.