We've mastered atomic operations and CAS. Now we turn to their ultimate application: lock-free programming—writing concurrent code that never blocks, never deadlocks, and guarantees system-wide progress even when individual threads are delayed or suspended.
Lock-free programming is simultaneously one of the most powerful and most dangerous tools in a systems programmer's arsenal. When done correctly, it enables extraordinary performance and reliability. When done incorrectly, it creates bugs that are nearly impossible to diagnose—bugs that manifest once in a million runs, only in production, only under load.
This page provides the theoretical foundation and practical guidance you need to approach lock-free programming with appropriate respect and competence.
This page covers the progress guarantees (lock-free, wait-free, obstruction-free), how to design lock-free algorithms, common patterns and pitfalls, performance considerations, memory ordering requirements, and when lock-free is worth the complexity.
Concurrent algorithms are classified by the progress guarantees they provide. These guarantees describe what happens when multiple threads compete for a resource:
Blocking Algorithms (Locks)
With locks, if one thread holds the lock and is delayed (preempted, suspended, or crashes), all other threads waiting for that lock make no progress. A single slow/dead thread can block the entire system.
Non-Blocking Algorithms
Non-blocking algorithms guarantee progress even when some threads are delayed. They come in three flavors, each with stronger guarantees:
| Guarantee | Definition | Starvation? | Use Case |
|---|---|---|---|
| Obstruction-Free | A thread completes if all other threads pause | Possible | Simplest to implement; rarely used alone |
| Lock-Free | At least one thread makes progress at all times | Possible | Most common non-blocking guarantee |
| Wait-Free | Every thread makes progress in bounded steps | Impossible | Strongest guarantee; hardest to achieve |
Obstruction-Free
An algorithm is obstruction-free if any thread, when executing in isolation (all other threads paused), will complete its operation in a finite number of steps.
This is the weakest non-blocking guarantee. It says nothing about what happens under contention—threads might livelock (repeatedly interfering with each other without any making progress).
Lock-Free
An algorithm is lock-free if, when any number of threads execute operations, at least one thread completes its operation in a finite number of steps.
This guarantees system-wide progress: the system as a whole is never stuck. However, individual threads can starve—repeatedly losing the race to complete their operation while others succeed.
Wait-Free
An algorithm is wait-free if every thread completes its operation in a bounded number of steps, regardless of other threads' behavior.
This is the strongest guarantee: no starvation, no livelocks, predictable worst-case latency. But wait-free algorithms are significantly harder to design and often slower in practice due to the bookkeeping overhead for fairness.
```typescript
// LOCK-FREE: At least one thread always succeeds
// The CAS loop is lock-free: if CAS fails, another thread succeeded
class LockFreeCounter {
  private value: Atomic<number> = new Atomic(0);

  increment(): void {
    while (true) {
      const current = this.value.load();
      if (this.value.compareAndSet(current, current + 1)) {
        return; // Success!
      }
      // If we failed, someone else succeeded → progress was made
    }
  }
}

// WAIT-FREE: Every thread completes in bounded steps
// fetch_add is wait-free: completes in O(1) regardless of contention
class WaitFreeCounter {
  private value: Atomic<number> = new Atomic(0);

  increment(): number {
    // fetch_add always completes in one atomic operation
    // No retry loop, no possibility of spinning indefinitely
    return this.value.fetchAdd(1);
  }
}

// BLOCKING (for comparison): One slow thread blocks all
class BlockingCounter {
  private value: number = 0;
  private mutex: Mutex = new Mutex();

  increment(): void {
    this.mutex.lock(); // If holder is delayed, we wait here forever
    try {
      this.value++;
    } finally {
      this.mutex.unlock();
    }
  }
}
```

Lock-free algorithms can starve individual threads. Under heavy contention, a fast thread might repeatedly win the CAS race, leaving slow threads spinning indefinitely. If fairness matters, you need either wait-free algorithms or lock-free algorithms with added fairness mechanisms.
Designing lock-free algorithms requires a fundamentally different mindset from lock-based programming. Here are the core principles:
```typescript
// HELPING: Other threads complete pending operations

class TwoPhaseSet<T> {
  // Pointer to current state (immutable once published)
  private state: AtomicReference<SetState<T>>;

  add(item: T): boolean {
    while (true) {
      // 1. Read current state
      const current = this.state.load();

      // 2. Check if another thread has a pending operation
      if (current.pendingOp !== null) {
        // HELP: Complete the pending operation first
        this.helpComplete(current.pendingOp);
        continue; // Retry with fresh state
      }

      // 3. Check if item already exists
      if (current.items.has(item)) {
        return false; // Already present
      }

      // 4. Create new state with our addition
      const newState = new SetState(
        new Set([...current.items, item]),
        null // No pending operation
      );

      // 5. Try to CAS our new state in
      if (this.state.compareAndSet(current, newState)) {
        return true;
      }
      // CAS failed; someone else modified; retry
    }
  }

  private helpComplete(op: PendingOp): void {
    // Execute the pending operation on behalf of the original thread
    // This ensures the system makes progress even if the original
    // thread is delayed or crashed
  }
}

// Key insight: By helping other threads, we guarantee that
// the system never stalls waiting for any single thread.
```

The Linearization Point
Every operation in a lock-free data structure has a linearization point—the instant at which the operation appears to take effect. For correctness, operations must appear to execute atomically at their linearization points.
For a CAS-based push onto a stack:
1. Load current head ← Could see old head
2. Set newNode.next = head ← Preparing, not yet visible
3. CAS(head, oldHead, newNode) ← LINEARIZATION POINT (if successful)
If CAS succeeds, the push 'happened' at that instant.
If CAS fails, the push hasn't happened yet; retry.
Identifying linearization points is essential for proving correctness and understanding the observable behavior of concurrent operations.
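To make the linearization point concrete, here is a minimal C++ sketch of a Treiber-stack push (illustrative, not from a specific library; `pop` and safe memory reclamation are omitted since they need extra machinery such as hazard pointers):

```cpp
#include <atomic>

// The successful CAS on head_ is the linearization point: the push becomes
// visible to all threads at exactly that instant.
struct Node {
    int value;
    Node* next;
};

class TreiberStack {
    std::atomic<Node*> head_{nullptr};

public:
    void push(int value) {
        Node* node = new Node{value, nullptr};
        Node* old_head = head_.load(std::memory_order_relaxed); // Step 1: may see an old head
        do {
            node->next = old_head;                               // Step 2: prepare; not yet visible
        } while (!head_.compare_exchange_weak(
            old_head, node,
            std::memory_order_release,    // Step 3: LINEARIZATION POINT on success
            std::memory_order_relaxed));  // On failure, old_head is refreshed; retry
    }
};
```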
Lock-free programming requires careful attention to memory ordering. Modern CPUs and compilers reorder memory operations for performance, which can cause lock-free algorithms to behave incorrectly if ordering isn't specified.
The Problem:
```cpp
// Thread 1
data = 42;           // Store A
ready.store(true);   // Store B

// Thread 2
if (ready.load()) {  // Load B
    use(data);       // Load A
}
```
Without proper ordering, Thread 2 might see ready == true but data != 42: the compiler or CPU may move Store A after Store B, or hoist Load A before Load B. Memory orderings constrain which reorderings are allowed:
| Order | Prevents | Cost | Use Case |
|---|---|---|---|
| Relaxed | Nothing | Lowest | Statistics counters where order doesn't matter |
| Consume | Data-dependent reads moved before (compilers promote it to Acquire in practice) | Low | Following pointer chains (rarely used) |
| Acquire | Later reads/writes moved before | Medium | Reading synchronization variables |
| Release | Earlier reads/writes moved after | Medium | Writing synchronization variables |
| Acq_Rel | Both acquire and release | Medium | Read-modify-write on sync variables |
| Seq_Cst | All reordering, plus total order | Highest | Simple correctness, sequential reasoning |
```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// WRONG: No ordering guarantees
void publish_wrong() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_relaxed);
    // Compiler/CPU might reorder these!
}

// CORRECT: Release/Acquire ordering
void publish_correct() {
    data.store(42, std::memory_order_relaxed);    // Ordinary store is fine...
    ready.store(true, std::memory_order_release); // ...because release prevents reordering
}

bool consume_correct() {
    if (ready.load(std::memory_order_acquire)) {            // Acquire matches release
        return data.load(std::memory_order_relaxed) == 42;  // Guaranteed true!
    }
    return false;
}

// SIMPLEST: Sequential consistency (default)
void publish_simple() {
    data.store(42);    // Default: seq_cst
    ready.store(true); // Default: seq_cst
    // No reordering possible; easiest to reason about
}

bool consume_simple() {
    if (ready.load()) {           // Default: seq_cst
        return data.load() == 42; // Guaranteed true!
    }
    return false;
}
```

When writing lock-free code, start with sequential consistency (seq_cst) for all atomics. It's the easiest to reason about. Only switch to weaker orderings after profiling proves they're necessary AND you fully understand the implications. An incorrect memory order causes bugs that are virtually impossible to reproduce or debug.
The Release/Acquire Pattern
The most common pattern for synchronization is release-acquire: one thread releases data (makes it available), another acquires it (takes ownership or reads it).
This creates a "synchronizes-with" relationship between the threads.
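A small runnable sketch of this handoff in C++ (the `payload`/`flag` names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag{false};
int payload = 0; // Plain int: made safe by the release/acquire pair

int main() {
    std::thread producer([] {
        payload = 42;                                // Happens-before the release store
        flag.store(true, std::memory_order_release); // RELEASE: publish
    });
    std::thread consumer([] {
        while (!flag.load(std::memory_order_acquire)) {} // ACQUIRE: synchronizes-with the release
        assert(payload == 42); // Everything before the release is visible after the acquire
    });
    producer.join();
    consumer.join();
}
```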
Several patterns appear repeatedly in lock-free programming. Recognizing them accelerates understanding and design.
```typescript
// READ-COPY-UPDATE: Updates create new versions; readers see consistent snapshots

class RCUStyle<T> {
  // Pointer to current state (immutable once published)
  private current: AtomicReference<T>;

  // Readers are lock-free and fast
  read(): T {
    return this.current.load(); // Always sees a consistent version
  }

  // Writers create new version and swap
  update(modifyFn: (old: T) => T): void {
    while (true) {
      const oldVersion = this.current.load();
      const newVersion = modifyFn(oldVersion); // Copy + modify
      if (this.current.compareAndSet(oldVersion, newVersion)) {
        // Old version can be garbage collected when safe
        return;
      }
      // Contention: another writer; retry
    }
  }
}

// RCU is ideal for read-heavy workloads:
// - Readers never block
// - Writers pay the cost of copying
// - Old versions remain valid until readers finish
```
```typescript
// STATE MACHINE: Atomic state transitions

enum State {
  IDLE,
  STARTING,
  RUNNING,
  STOPPING,
  STOPPED
}

class LifecycleManager {
  private state: Atomic<State> = new Atomic(State.IDLE);

  start(): boolean {
    // Only IDLE → STARTING transition is valid
    if (this.state.compareAndSet(State.IDLE, State.STARTING)) {
      this.doStart();
      this.state.store(State.RUNNING);
      return true;
    }
    return false; // Already started or stopping
  }

  stop(): boolean {
    // Only RUNNING → STOPPING transition is valid
    if (this.state.compareAndSet(State.RUNNING, State.STOPPING)) {
      this.doStop();
      this.state.store(State.STOPPED);
      return true;
    }
    return false; // Not running
  }

  getState(): State {
    return this.state.load();
  }

  private doStart(): void { /* initialization */ }
  private doStop(): void { /* cleanup */ }
}

// CAS ensures exactly one thread performs each transition
```
```typescript
// SEQLOCK: Optimistic read for data that's rarely written

class SeqLock<T> {
  private sequence: Atomic<number> = new Atomic(0);
  private data: T; // NOT atomic; protected by sequence

  // Writer: Signal exclusive access via odd sequence
  // NOTE: assumes a single writer (or an external writer lock); with
  // multiple writers, claim the odd sequence with CAS instead of fetchAdd
  write(newData: T): void {
    // 1. Increment to odd (signals write in progress)
    this.sequence.fetchAdd(1);
    // 2. Write the data (non-atomic is fine; writes are exclusive)
    this.data = newData;
    // 3. Increment to even (signals write complete)
    this.sequence.fetchAdd(1);
  }

  // Reader: Optimistic read with retry
  read(): T {
    while (true) {
      // 1. Read sequence before
      const seq1 = this.sequence.load();
      // 2. If odd, write in progress; spin
      if (seq1 % 2 !== 0) continue;
      // 3. Read the data
      const result = this.data;
      // 4. Read sequence after
      const seq2 = this.sequence.load();
      // 5. If sequence unchanged, the read is consistent
      if (seq1 === seq2) {
        return result; // Consistent read!
      }
      // Sequence changed; data might be torn; retry
    }
  }
}

// SeqLock is ideal when:
// - Writes are rare
// - Reads are frequent
// - Data is small (fits in cache line)
```

Lock-free algorithms promise better performance, but the reality is nuanced. Several factors determine whether lock-free is actually faster:
Contention Level
Low contention: Locks often win. An uncontended lock acquisition is just 20-50ns. Lock-free overhead (more complex code, memory barriers) may exceed this.
High contention: Lock-free shines. Threads don't block on each other; they retry. System makes continuous progress.
Extreme contention: Neither wins. CAS retries burn cycles; locks queue threads. Consider restructuring to reduce shared state.
| Factor | Favors Lock-Free | Favors Locks |
|---|---|---|
| Contention | High (many threads competing) | Low (rare conflicts) |
| Critical section | Short (few instructions) | Long (complex logic) |
| Read/Write ratio | Read-heavy | Write-heavy |
| Thread count | Many threads | Few threads |
| Latency requirement | Bounded worst-case | Average-case focus |
| Priority inversion | Must be avoided | Acceptable |
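When contention is extreme, one common restructuring is to split the hot variable into per-thread shards. A minimal C++ sketch (the `ShardedCounter` type and shard count are illustrative assumptions, not from the text):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

constexpr int kShards = 16;

// Each shard sits on its own cache line, so increments from different
// threads don't fight over a single line.
struct alignas(64) Shard {
    std::atomic<long> value{0};
};

class ShardedCounter {
    Shard shards_[kShards];

public:
    void increment() {
        // Hash the thread id onto a shard; collisions merely share a slot
        std::size_t i =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
        shards_[i].value.fetch_add(1, std::memory_order_relaxed);
    }

    long read() const {
        long sum = 0;
        for (const Shard& s : shards_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum; // Approximate while writers run; exact once they stop
    }
};
```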
Cache Effects
Atomic operations cause significant cache traffic:
Cache Line Bouncing: When multiple cores access the same cache line, it bounces between L1 caches (MESI protocol transitions). This is expensive—hundreds of cycles.
False Sharing: Unrelated variables on the same cache line interfere. Padding to separate cache lines is common.
```cpp
#include <atomic>
#include <new> // std::hardware_destructive_interference_size

// BAD: These atomics share a cache line (likely 64 bytes)
struct BadCounters {
    std::atomic<long> counter1; // Offset 0
    std::atomic<long> counter2; // Offset 8
    // Both on same 64-byte cache line!
    // Updating counter1 invalidates counter2 in other caches
};

// GOOD: Pad each counter to its own cache line
struct alignas(64) PaddedCounter {
    std::atomic<long> value;
    char padding[64 - sizeof(std::atomic<long>)];
};

struct GoodCounters {
    PaddedCounter counter1; // Offset 0, own cache line
    PaddedCounter counter2; // Offset 64, own cache line
    // Updating one doesn't affect the other
};

// C++17 provides std::hardware_destructive_interference_size
struct ModernPaddedCounter {
    alignas(std::hardware_destructive_interference_size)
    std::atomic<long> value;
};
```

Under contention, naive CAS loops spin tightly, wasting CPU. Backoff strategies help: pause/yield after failures, exponential backoff (wait longer after repeated failures), or adaptive backoff (adjust based on observed contention). These improve throughput and reduce power consumption.
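A minimal sketch of exponential backoff around a CAS loop (the helper and its constants are illustrative assumptions):

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// After each failed CAS, sleep roughly twice as long (capped), so
// contending threads spread out instead of hammering the cache line.
void increment_with_backoff(std::atomic<long>& counter) {
    int delay_us = 1; // Start around one microsecond
    long current = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(current, current + 1,
                                          std::memory_order_relaxed)) {
        std::this_thread::sleep_for(std::chrono::microseconds(delay_us));
        delay_us = std::min(delay_us * 2, 1000); // Cap the wait at ~1 ms
        // compare_exchange_weak refreshed `current`; just retry
    }
}
```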
Lock-free programming is a powerful tool, but it's not always the right choice. As the comparison table above suggests, locks are preferable when contention is low, critical sections are long or complex, thread counts are small, or average-case performance matters more than bounded worst-case latency.
Lock-free code is 2-10x more complex than lock-based equivalents. It's extremely easy to introduce subtle bugs that only manifest under specific timing conditions. Never go lock-free without profiling evidence that locks are the bottleneck AND confidence that you can verify correctness.
Lock-free bugs are notoriously hard to reproduce. A race condition might manifest once in a billion executions—but that one time might be in production. Thorough testing is essential.
```typescript
// STRESS TEST: Hammer the data structure from many threads

async function stressTestStack(
  stack: LockFreeStack<number>,
  numThreads: number,
  opsPerThread: number
): Promise<void> {
  const promises: Promise<void>[] = [];
  const pushedCounts: number[] = new Array(numThreads).fill(0);
  const poppedCounts: number[] = new Array(numThreads).fill(0);

  // Each thread pushes and pops randomly
  for (let t = 0; t < numThreads; t++) {
    promises.push((async () => {
      for (let i = 0; i < opsPerThread; i++) {
        if (Math.random() < 0.5) {
          const value = t * opsPerThread + i;
          stack.push(value);
          pushedCounts[t]++;
        } else {
          const value = stack.pop();
          if (value !== null) {
            poppedCounts[t]++;
          }
        }
      }
    })());
  }

  await Promise.all(promises);

  // Drain remaining items
  let remaining = 0;
  while (stack.pop() !== null) {
    remaining++;
  }

  // Verify: total pushed == total popped + remaining
  const totalPushed = pushedCounts.reduce((a, b) => a + b, 0);
  const totalPopped = poppedCounts.reduce((a, b) => a + b, 0);
  console.assert(
    totalPushed === totalPopped + remaining,
    "Every pushed item must be popped or drained"
  );
  console.assert(stack.isEmpty(), "Stack should be empty after draining");
  console.log(`Stress test: ${totalPushed} pushed, ${totalPopped} popped, ${remaining} drained`);
}

// Run with: stressTestStack(new LockFreeStack(), 16, 100000);
```

No amount of testing can prove a lock-free algorithm is correct. Tests can find bugs, but absence of failures doesn't mean absence of bugs. For critical code, combine testing with formal verification or model checking.
We've covered the foundations of lock-free programming: the non-blocking progress hierarchy, linearization points, memory ordering, recurring design patterns, performance trade-offs, and testing strategies.
What You've Accomplished
You've completed the Atomic Operations module. You now understand atomicity, compare-and-swap, the progress guarantees (obstruction-free, lock-free, wait-free), and the memory ordering rules that make lock-free code correct. This knowledge positions you to understand how concurrent containers work, design lock-free algorithms when appropriate, and make informed decisions about when lock-free complexity is justified.
Congratulations! You've mastered atomic operations—the indivisible building blocks of concurrent programming. From basic atomicity concepts through CAS and lock-free programming, you now have the knowledge to reason about and build correct, high-performance concurrent systems.