Locks have been the traditional mechanism for managing concurrent access to shared resources. But locks come with fundamental problems: deadlock, priority inversion, convoying, and blocking. When a thread holding a lock is delayed—by scheduling, page faults, or any other reason—all threads waiting for that lock are also blocked, creating a chain reaction of delays.
Lock-free programming offers an alternative paradigm. Using Compare-and-Swap (CAS) and other atomic primitives, we can build concurrent data structures where threads can always make progress, even if some threads are arbitrarily delayed. No thread can block another; if one thread stalls, others continue unimpeded.
This page provides a rigorous exploration of lock-free programming. We will formally define what "lock-free" means, distinguish it from related concepts like "wait-free" and "obstruction-free," examine why these guarantees matter, and establish the foundational principles for designing lock-free algorithms.
By the end of this page, you will understand the formal definitions of non-blocking progress guarantees (obstruction-free, lock-free, wait-free), why lock-free programming matters for system reliability and performance, the fundamental challenges and tradeoffs of lock-free design, and the principles that guide correct lock-free algorithm construction.
Before exploring lock-free alternatives, we must understand why locks are problematic. While locks are conceptually simple and widely used, they introduce several fundamental issues:
When a thread acquires a lock, all other threads that want that lock must wait. This waiting creates a dependency chain: the progress of waiting threads depends entirely on the progress of the lock holder.
This might seem acceptable—surely the lock holder will finish quickly? But consider these scenarios:
The lock holder is descheduled: The OS scheduler decides to run another process. Waiting threads spin or block until the holder is scheduled again—possibly a full time quantum or more.
The lock holder triggers a page fault: The thread accesses memory that's been swapped out. Disk I/O takes millions of CPU cycles; waiting threads are blocked the entire time.
The lock holder receives a signal: It might pause for signal handling, blocking all waiters.
The lock holder crashes or is killed: All waiters may deadlock permanently.
In all these cases, the performance or even correctness of non-faulty threads depends on factors outside their control. This violates the principle of fault isolation.
| Problem | Description | Impact |
|---|---|---|
| Deadlock | Circular wait for locks: A holds L1 waiting for L2, B holds L2 waiting for L1 | Complete system halt among involved threads |
| Priority Inversion | High-priority thread waits for low-priority thread holding a lock | Real-time guarantees violated, system unresponsiveness |
| Convoying | When the lock holder is slow, all waiters queue up creating long convoys | Latency spikes, reduced throughput under contention |
| Blocking | Thread waits (spinning or sleeping) for lock with no guarantee of progress | CPU waste or context switch overhead |
| Failure Dependency | If lock holder crashes, all waiters may hang indefinitely | System reliability compromised |
Priority inversion deserves special attention because it nearly caused a mission failure on Mars Pathfinder in 1997.
The Scenario:
1. A low-priority task acquires a mutex guarding a shared resource (on Pathfinder, the information bus).
2. A high-priority task blocks trying to acquire that same mutex.
3. Medium-priority tasks preempt the low-priority task, which therefore never runs long enough to release the mutex.
The result: the highest-priority task in the system cannot run, even though it's blocked only on a low-priority task. The system's priority scheme is inverted.
On Mars Pathfinder, priority inversion caused the spacecraft's watchdog timer to trigger repeated system resets. NASA engineers had to diagnose and patch the issue from 190 million kilometers away. The solution involved enabling priority inheritance—a lock-based mitigation where a low-priority lock holder temporarily inherits the priority of the highest-priority waiter.
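On POSIX systems, this mitigation can be requested per mutex. A minimal sketch, assuming a platform that supports the `_POSIX_THREAD_PRIO_INHERIT` option:

```c
#include <pthread.h>

// Initialize a mutex whose holder temporarily inherits the priority of
// its highest-priority waiter. Availability depends on the platform
// supporting the POSIX _POSIX_THREAD_PRIO_INHERIT option.
int init_pi_mutex(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    int rc = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```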
But here's the key insight: lock-free data structures don't have priority inversion because no thread ever waits for another thread. High-priority threads can always make progress immediately.
The core problem with locks is that they create dependencies between threads. Thread A's ability to make progress depends on Thread B releasing the lock. This dependency couples the performance and reliability of all threads that share locks. Lock-free programming removes the dependency: a delayed or failed thread can never prevent the rest of the system from making progress.
Computer scientists have formalized different levels of progress guarantees for concurrent algorithms. Understanding these definitions is essential for evaluating and designing concurrent data structures.
From weakest to strongest:
1. Obstruction-Free (Weakest)
A method is obstruction-free if any thread that eventually runs in isolation (without interference from other threads) completes its operation in a bounded number of steps.
2. Lock-Free (Medium)
A method is lock-free if, at any point, at least one thread is guaranteed to complete its operation in a bounded number of system-wide steps. Individual threads may starve, but the system as a whole always makes progress.
3. Wait-Free (Strongest)
A method is wait-free if every thread completes its operation in a bounded number of steps, regardless of other threads' behavior.
```c
// Illustrating the difference between progress guarantees

// ═══════════════════════════════════════════════════════════════════
// BLOCKING (Traditional Lock-Based)
// ═══════════════════════════════════════════════════════════════════
void blocking_increment(Lock* lock, int* counter) {
    lock_acquire(lock);   // May wait indefinitely if holder is delayed
    *counter += 1;
    lock_release(lock);
}
// Progress guarantee: NONE if lock holder fails

// ═══════════════════════════════════════════════════════════════════
// OBSTRUCTION-FREE
// ═══════════════════════════════════════════════════════════════════
// NOTE: shown with the same CAS loop for comparison. For a single-word
// counter, a failed CAS implies another thread succeeded, so this loop
// is in fact lock-free too. Genuinely obstruction-free-only algorithms
// are multi-step operations that roll back when they detect
// interference; two such threads can abort each other forever (livelock).
void obstruction_free_increment(int* counter) {
    while (true) {
        int old = *counter;
        // In isolation, this completes quickly
        if (CAS(counter, old, old + 1)) {
            return;  // Success
        }
    }
}
// Progress guarantee: Completes if run alone; the obstruction-free
// contract promises nothing stronger under contention

// ═══════════════════════════════════════════════════════════════════
// LOCK-FREE
// ═══════════════════════════════════════════════════════════════════
void lock_free_increment(int* counter) {
    while (true) {
        int old = *counter;
        if (CAS(counter, old, old + 1)) {
            return;  // Success
        }
        // Here's the insight: if our CAS fails, it's because ANOTHER
        // thread's CAS SUCCEEDED. So system-wide progress is guaranteed.
    }
}
// Progress guarantee: At least one thread always completes

// ═══════════════════════════════════════════════════════════════════
// WAIT-FREE
// ═══════════════════════════════════════════════════════════════════
void wait_free_increment(int* counters, int thread_id, int num_threads) {
    // Each thread has its own counter - no contention possible
    counters[thread_id] += 1;
    // Total is the sum of all counters (read lazily or periodically)
}
// Progress guarantee: Every thread completes immediately
// Tradeoff: Aggregation required to get total, design is more complex
```

In practice, lock-free is the most commonly targeted guarantee for high-performance concurrent data structures. Here's why:
Compared to Blocking (Locks):
- No deadlock, no priority inversion, no convoying
- A preempted, paused, or crashed thread cannot stall the others
- System-wide throughput is preserved even under adverse scheduling
Compared to Wait-Free:
- Substantially simpler algorithms with lower per-operation overhead
- Comparable throughput in practice, since per-thread starvation is rare
The pragmatic observation is that lock-free provides excellent real-world performance while being significantly easier to implement than wait-free. Individual-thread starvation is theoretically possible but extremely rare with well-designed algorithms and backoff strategies.
A lock-free algorithm based on CAS has an elegant property: when a CAS fails, it's because another thread's CAS succeeded. Unlike locks where all threads can be stuck waiting, with CAS at least one thread is always making progress. This is the fundamental reason CAS enables lock-free programming.
Lock-free (and lock-based) concurrent data structures must satisfy a correctness criterion. The gold standard is linearizability—a strong consistency model that ensures concurrent operations appear to take effect instantaneously at some point during their execution.
A concurrent object is linearizable if:
1. Every operation appears to take effect atomically at a single instant between its invocation and its response.
2. The resulting sequential order of operations is valid according to the object's sequential specification.
3. If one operation completes before another begins (in real time), it appears earlier in that order.
Intuitively, linearizability means: even though operations overlap in time, we can construct a sequential history where each operation happens instantaneously, and this history respects the real-time ordering of non-overlapping operations.
The linearization point of an operation is the instant at which the operation "takes effect." For lock-free algorithms using CAS, the linearization point is typically the successful CAS instruction.
```
Time →
────────────────────────────────────────────────────────────────
Thread A:  |-------- push(X) --------|          (invocation to response)
                         ↑
                         Linearization point: successful CAS

Thread B:         |-------- pop() --------|     (overlapping operation)
                               ↑
                               Linearization point: successful CAS

Linearizable history: push(X) → pop() returns X
```
```cpp
// Lock-free stack: identifying linearization points

typedef struct Node {
    int value;
    struct Node* next;
} Node;

atomic<Node*> top = NULL;

// PUSH operation
void push(int value) {
    Node* new_node = malloc(sizeof(Node));
    new_node->value = value;

    while (true) {
        Node* old_top = top.load();     // Read current top
        new_node->next = old_top;       // Link new node to current stack

        // ═══════════════════════════════════════════════════════
        // LINEARIZATION POINT: If this CAS succeeds, the push
        // "takes effect" at this instant. Before this point,
        // the new node is not visible. After, it's on the stack.
        // ═══════════════════════════════════════════════════════
        if (top.compare_exchange_weak(old_top, new_node)) {
            return;  // Push complete
        }
        // CAS failed, retry
    }
}

// POP operation
Node* pop() {
    while (true) {
        Node* old_top = top.load();     // Read current top

        if (old_top == NULL) {
            // ═══════════════════════════════════════════════════
            // LINEARIZATION POINT (empty case): The load that
            // returned NULL. The stack was empty at this instant.
            // ═══════════════════════════════════════════════════
            return NULL;  // Stack is empty
        }

        Node* new_top = old_top->next;  // Read next pointer

        // ═══════════════════════════════════════════════════════
        // LINEARIZATION POINT (success case): If this CAS succeeds,
        // the pop "takes effect" at this instant. The node is
        // atomically removed from the stack.
        // ═══════════════════════════════════════════════════════
        if (top.compare_exchange_weak(old_top, new_top)) {
            return old_top;  // Return removed node
        }
        // CAS failed, retry
    }
}

/*
 * Why is this linearizable?
 *
 * 1. Each successful CAS is an atomic, indivisible operation
 * 2. The effect (changing 'top') happens at that exact instant
 * 3. No partial states are ever visible
 * 4. If push A linearizes before pop B, pop B sees A's effect
 * 5. The sequential spec (LIFO stack) is preserved
 */
```

Composability: Linearizable objects can be composed. If two data structures are independently linearizable, using them together in a larger system preserves intuitive sequential behavior.
Reasoning: We can reason about the concurrent system as if operations happened one at a time, in some sequential order. This dramatically simplifies understanding and verification.
Local Reasoning: We only need to examine the data structure's implementation to verify linearizability—we don't need to know how clients use it.
Linearizability is stronger than sequential consistency:
- Sequential consistency requires only that operations appear in some total order consistent with each thread's program order.
- Linearizability additionally requires that this order respect real time across threads.
The additional real-time constraint means linearizability preserves intuitive causality: if I complete an operation before you start yours, my operation must appear before yours in the history.
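A classic worked example of a history that is sequentially consistent but not linearizable, using a single shared register initially holding 0:

```
Time →
────────────────────────────────────────────────────────────────
Thread A:  |-- write(x, 1) --|
Thread B:                        |-- read(x) returns 0 --|

Sequentially consistent: we may reorder B's read before A's write,
  giving the valid sequential history read(x)=0 → write(x,1).
Not linearizable: A's write completed before B's read began, so the
  real-time order forces write(x,1) → read(x), which must return 1.
```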
Identifying linearization points requires careful analysis. The point must be within the operation's execution span, and the system state at that point must reflect the operation's effect. For CAS-based algorithms, the successful CAS is usually the linearization point. For read-only operations, it's often the load that determines the return value. Proving linearizability involves showing that consistent linearization points exist for all possible executions.
Designing lock-free algorithms requires a different mindset than lock-based programming. Several principles guide the construction of correct and efficient lock-free data structures.
The core idea of lock-free programming is to express state changes as single atomic operations. Instead of:
```c
lock();
read state;
modify state;
write state;
unlock();
```
We use:
```c
do {
    old_state = read();
    new_state = compute(old_state);
} while (!CAS(&state, old_state, new_state));
```
The entire state transition happens atomically via CAS. If another thread modifies the state between our read and our CAS attempt, the CAS fails and we retry with the new state.
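As a concrete, self-contained instance of this loop, here is a sketch of an "atomic maximum" update using C++ `std::atomic` (the function name is illustrative):

```cpp
#include <atomic>

// Atomically raise shared_max to at least 'candidate' using the
// read-compute-CAS retry loop described above.
void update_max(std::atomic<int>& shared_max, int candidate) {
    int observed = shared_max.load();
    while (candidate > observed) {
        // On failure, compare_exchange_weak reloads 'observed' with the
        // current value, so the loop retries against fresh state.
        if (shared_max.compare_exchange_weak(observed, candidate)) {
            return;  // Our value was installed
        }
    }
    // Another thread already stored a value >= candidate: nothing to do.
}
```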
Make data structures (or parts of them) immutable after publication. Instead of modifying an existing node, create a new node with the desired modifications and atomically swap the pointer.
This eliminates races on the data itself—the only race is on the pointer to the data, which can be handled with a single CAS.
```cpp
// Copy-on-Write pattern for lock-free updates

typedef struct Config {
    int timeout_ms;
    int max_connections;
    char* server_name;
} Config;

atomic<Config*> current_config;

// Lock-free config update using copy-on-write
void update_timeout(int new_timeout) {
    while (true) {
        Config* old_config = current_config.load();

        // Create a NEW config object - never modify existing
        Config* new_config = malloc(sizeof(Config));

        // Copy existing values
        new_config->timeout_ms = new_timeout;  // Updated value
        new_config->max_connections = old_config->max_connections;
        new_config->server_name = strdup(old_config->server_name);

        // Atomically swap the pointer
        if (current_config.compare_exchange_strong(old_config, new_config)) {
            // Success! Schedule old_config for eventual deallocation
            // (using hazard pointers, RCU, or epoch-based reclamation)
            schedule_free(old_config);
            return;
        }

        // CAS failed - someone else updated config
        // Free our new_config and retry
        free(new_config->server_name);
        free(new_config);
    }
}

// Lock-free read - always sees a consistent config snapshot
Config* get_config_snapshot() {
    return current_config.load();
    // Returns a pointer to a complete, immutable config
    // Caller sees either old or new config, never a mix
}

/*
 * Why this works:
 * 1. Config objects are immutable after publication
 * 2. Readers see complete, consistent snapshots
 * 3. Updates atomically swap the pointer
 * 4. No locks, no blocking, no races on config contents
 * 5. Only complexity: memory reclamation (who frees old config?)
 */
```

In lock-free algorithms, if one thread's operation is incomplete and visible to other threads, those other threads should help complete the operation rather than waiting or failing. This "helping" mechanism is crucial for ensuring lock-free progress.
Example: In a lock-free linked list, if Thread A starts removing a node but gets delayed, Thread B should recognize the partially-complete removal and help finish it, rather than blocking or failing its own operation.
Lock-free operations may be observed in intermediate states. Design the data structure so that these partial states are:
- Detectable: other threads can recognize that an operation is in progress
- Completable: any thread can finish (or safely step around) the pending operation
- Invariant-preserving: no observer ever sees the structure in an invalid state
This often involves using flag bits in pointers or separate status fields to mark operations in progress.
```cpp
// Using marked pointers for lock-free list deletion

// Pointer tagging: use the low bit of a pointer as a "marked" flag.
// This works because malloc'd addresses are aligned (low bits are 0).

#define MARK(p)      ((Node*)((uintptr_t)(p) | 1))
#define UNMARK(p)    ((Node*)((uintptr_t)(p) & ~1))
#define IS_MARKED(p) ((uintptr_t)(p) & 1)

typedef struct Node {
    int key;
    atomic<Node*> next;   // Low bit used as mark flag
} Node;

// Lock-free deletion with helping
bool delete_node(Node* pred, Node* curr) {
    Node* succ = curr->next.load();

    // Step 1: Logically delete by marking curr's next pointer.
    // This signals to other threads: "curr is being deleted"
    while (!IS_MARKED(succ)) {
        if (curr->next.compare_exchange_weak(succ, MARK(succ))) {
            // Successfully marked - curr is logically deleted
            break;
        }
        succ = curr->next.load();
    }

    // Step 2: Physically unlink by swinging pred's pointer past curr.
    // Note: ANY thread that sees the marked node can do this!
    Node* unmarked_succ = UNMARK(succ);
    if (pred->next.compare_exchange_strong(curr, unmarked_succ)) {
        // Successfully unlinked - schedule curr for reclamation
        retire(curr);
        return true;
    }

    // CAS failed - another thread already unlinked curr, that's OK
    return false;
}

/*
 * How helping works here:
 *
 * Thread A starts deletion:
 *   1. Marks curr->next (logical deletion)
 *   2. Gets descheduled before physical unlink
 *
 * Thread B traverses list:
 *   1. Sees marked curr->next
 *   2. Recognizes: "curr is being deleted, let me help"
 *   3. Attempts to CAS pred->next from curr to UNMARK(curr->next)
 *   4. If it succeeds, Thread B has completed Thread A's deletion
 *
 * Thread A resumes:
 *   1. Its CAS to unlink will fail (curr already unlinked)
 *   2. Thread A returns - deletion is complete (thanks to B)
 *
 * System-wide progress guaranteed: someone always completes
 */
```

The 'helping' pattern is what transforms obstruction-free algorithms into lock-free ones. Without helping, a delayed thread could prevent others from completing their operations. With helping, other threads can complete the delayed operation, ensuring system-wide progress. This is why lock-free data structures are often more complex than their blocking counterparts—they must encode enough information for any thread to complete any operation.
One of the most challenging aspects of lock-free programming is memory reclamation—safely freeing memory that has been removed from a data structure. The problem is uniquely difficult in lock-free contexts because:
- There is no lock to delimit when a node can still be observed; readers may hold pointers to it at any moment.
- Readers do not announce what they are accessing unless the algorithm forces them to.
- A thread can be suspended indefinitely while holding a pointer to a node that another thread has already removed.
Consider this scenario:
```
Time →
────────────────────────────────────────────────────────────────
Thread A:  read(node)   [suspended] ............ dereference(node) ← CRASH!
                                                 ↑
                                                 node was freed!
Thread B:               remove(node)  free(node)
```
Thread A reads a pointer to a node. Thread B removes and frees that node. When Thread A resumes and tries to use the pointer, it accesses freed memory—a use-after-free bug that can cause crashes or corruption.
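The same race in code form, as an illustrative sketch (the names are hypothetical, and this is exactly the pattern NOT to use):

```cpp
#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> head;

// UNSAFE: demonstrates the reclamation race, not a pattern to copy.
int read_front_value() {
    Node* n = head.load();   // Thread A reads the pointer...
    // ...if Thread A is preempted HERE, Thread B may remove the node
    // and free it.
    return n->value;         // use-after-free if n was reclaimed
}
```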
Several techniques address this problem, each with different tradeoffs:
| Technique | Concept | Pros | Cons |
|---|---|---|---|
| Hazard Pointers | Threads announce which nodes they're accessing | Precise, bounded memory | Per-access overhead, limited protected pointers |
| Epoch-Based Reclamation (EBR) | Track which 'epoch' each thread is in; reclaim when all advance | Low overhead, amortized | Unbounded memory if a thread stalls |
| Reference Counting | Track number of references to each object | Immediate reclamation | High overhead, contention on ref counts |
| Read-Copy-Update (RCU) | Defer reclamation until all readers complete | Very low read-side overhead | Kernel integration, delayed reclamation |
| Quiescent States | Reclaim when all threads pass through quiescent points | Simple concept | Requires application cooperation |
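For a taste of the RCU row, here is a sketch of the canonical read/update pattern using the userspace liburcu API (an assumption: the default "urcu" flavor, with each thread having called `rcu_register_thread()`; the `config` type and a single updater are illustrative):

```c
#include <urcu.h>     /* userspace RCU (liburcu); kernel RCU is analogous */
#include <stdlib.h>

struct config { int timeout_ms; };
struct config *global_config;   /* RCU-protected pointer */

/* Reader: extremely cheap and never blocks the updater */
int read_timeout(void) {
    rcu_read_lock();
    struct config *c = rcu_dereference(global_config);
    int t = c->timeout_ms;
    rcu_read_unlock();
    return t;
}

/* Updater (assumed single): publish a new version, wait, then free */
void set_timeout(int ms) {
    struct config *fresh = malloc(sizeof(*fresh));
    fresh->timeout_ms = ms;
    struct config *old = global_config;
    rcu_assign_pointer(global_config, fresh);
    synchronize_rcu();   /* returns after all pre-existing readers exit */
    free(old);           /* now safe: no reader can still hold 'old' */
}
```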
Hazard pointers, introduced by Maged Michael, are one of the most practical techniques. The idea:
1. Before dereferencing a shared pointer, a thread publishes it in one of its hazard-pointer slots, which all threads can read.
2. A thread that removes a node does not free it immediately; it places it on a private retired list.
3. Periodically, the retiring thread scans every hazard pointer and frees only those retired nodes that no slot protects.
```cpp
// Hazard Pointers for safe memory reclamation

#include <atomic>
#include <set>
#include <vector>
using namespace std;

#define MAX_THREADS 64
#define HP_PER_THREAD 2
#define RECLAIM_THRESHOLD 64   // scan once this many nodes are retired

// Per-thread hazard pointers (visible to all threads)
atomic<void*> hazard_pointers[MAX_THREADS][HP_PER_THREAD];

// Per-thread retired list (to-be-freed nodes)
struct RetiredNode {
    void* ptr;
    void (*free_func)(void*);
};
thread_local vector<RetiredNode> retired_list;

void scan_and_reclaim();

// Read side: Protect a pointer before dereferencing
void* protect_pointer(int thread_id, int hp_index, atomic<void*>* addr) {
    void* ptr;
    do {
        ptr = addr->load();                               // Read pointer
        hazard_pointers[thread_id][hp_index].store(ptr);  // Publish to HP
        // Re-read to ensure the pointer didn't change between read and publish
    } while (addr->load() != ptr);
    return ptr;  // Now safe to dereference
}

// Read side: Done with protected pointer
void release_pointer(int thread_id, int hp_index) {
    hazard_pointers[thread_id][hp_index].store(nullptr);
}

// Write side: Retire a node (schedule for later reclamation)
void retire_node(void* ptr, void (*free_func)(void*)) {
    retired_list.push_back({ptr, free_func});

    // Periodically scan retired list and reclaim safe nodes
    if (retired_list.size() > RECLAIM_THRESHOLD) {
        scan_and_reclaim();
    }
}

// Scan all hazard pointers and reclaim unprotected nodes
void scan_and_reclaim() {
    // Collect all active hazard pointers
    set<void*> protected_ptrs;
    for (int t = 0; t < MAX_THREADS; t++) {
        for (int h = 0; h < HP_PER_THREAD; h++) {
            void* hp = hazard_pointers[t][h].load();
            if (hp) protected_ptrs.insert(hp);
        }
    }

    // Remove and free unprotected retired nodes
    auto it = retired_list.begin();
    while (it != retired_list.end()) {
        if (protected_ptrs.find(it->ptr) == protected_ptrs.end()) {
            // Not protected - safe to free
            it->free_func(it->ptr);
            it = retired_list.erase(it);
        } else {
            ++it;  // Keep in retired list, still protected
        }
    }
}

/*
 * Usage pattern:
 *
 * void* dequeue(int tid) {
 *     Node* head = protect_pointer(tid, 0, &queue_head);  // Protect
 *     if (!head) {
 *         release_pointer(tid, 0);
 *         return NULL;
 *     }
 *
 *     Node* next = protect_pointer(tid, 1, &head->next);  // Protect next too
 *
 *     if (CAS(&queue_head, head, next)) {
 *         void* result = head->data;
 *         release_pointer(tid, 0);
 *         release_pointer(tid, 1);
 *         retire_node(head, free);  // Schedule for reclamation
 *         return result;
 *     }
 *
 *     release_pointer(tid, 0);
 *     release_pointer(tid, 1);
 *     // Retry...
 * }
 */
```

Memory reclamation is often the most complex part of lock-free data structures. Incorrect reclamation leads to use-after-free bugs, which are subtle and dangerous. Many production systems use garbage-collected languages (Java, Go) partly to avoid manual lock-free memory management. If you must manage memory manually, use well-tested libraries rather than implementing reclamation from scratch.
Lock-free programming is not a universal solution. It excels in specific scenarios and may be counterproductive in others. Understanding when to use (and not use) lock-free techniques is crucial.
1. Real-Time Systems
When worst-case latency matters more than average throughput:
- Lock-free operations have analyzable worst-case step counts; no operation waits on a preempted lock holder.
- No priority inversion: a high-priority task never blocks behind a low-priority one.
2. Signal Handler / Interrupt Context
When code may execute in restricted contexts:
- Taking a lock in a signal handler or interrupt handler can self-deadlock if the interrupted code already holds that lock.
- Lock-free operations never wait, so they are safe to call from these contexts.
3. High-Contention Bottlenecks
When a single resource is accessed by many threads:
- A short CAS-based operation avoids lock convoys and the context-switch overhead of blocking.
- Throughput tends to degrade more gracefully than with a contended mutex.
4. Failure Tolerance Requirements
When thread/process failures must not block others:
- A thread that crashes mid-operation cannot leave a lock held; other threads can complete or bypass its partial work.
- This matters for shared-memory IPC, where one process's death must not hang the others.
| Factor | Favor Lock-Free | Favor Lock-Based |
|---|---|---|
| Critical section length | Very short (few instructions) | Long (I/O, computation) |
| Contention level | Low to medium | Any, especially high |
| Latency requirements | Hard real-time, bounded latency | Average latency matters more |
| Failure tolerance | Must tolerate thread failures | Threads rarely fail |
| Complexity budget | Can invest in complex design | Need simple, maintainable code |
| Memory management | GC available, or reclamation solved | Stack-local or simple lifetime |
1. Long Critical Sections
If the critical section involves significant work, lock-free provides no advantage:
- A failed CAS discards the whole computation, so retries repeat expensive work.
- Copying large state on every attempt is costly; a lock amortizes far better over long sections.
2. High Contention with Low Throughput
When many threads fight for one resource continuously:
- CAS retry storms burn CPU cycles and cache-coherence bandwidth without making extra progress.
- Backoff, combining, or a queue-based lock may serialize access more efficiently.
3. Simple Correctness Requirements
When the team doesn't have lock-free expertise:
- Lock-free bugs (ABA, reordering, reclamation races) are subtle, hard to reproduce, and hard to review.
- A simple, correct lock-based design is better than a clever, broken lock-free one.
4. Composing Multiple Objects
When an operation must atomically update multiple data structures:
- CAS works on a single word; linearizable objects do not compose into atomic multi-object updates.
- Such operations usually need locks, transactions, or far more complex multi-word constructions.
Start with locks. Profile under realistic load. If locks are the bottleneck (contention, latency spikes), consider lock-free alternatives for that specific data structure. Use well-tested libraries (java.util.concurrent, C++ folly, crossbeam-rs) rather than implementing from scratch. Lock-free programming is a precision tool—use it surgically where it provides clear benefits.
Several lock-free data structures have been developed and refined over decades of research. Understanding the landscape helps in choosing the right tool for each job.
1. Lock-Free Stack (Treiber Stack)
The simplest lock-free structure. Push and pop both CAS on the top pointer.
2. Lock-Free Queue (Michael-Scott Queue)
Enqueue and dequeue operate on different ends (tail and head), allowing enqueuers and dequeuers to proceed concurrently. A compact sketch appears after the Treiber stack code below.
3. Lock-Free Linked List
Ordered list with lock-free insertion and deletion.
```cpp
// Treiber Stack: Classic lock-free stack
// Named after R. Kent Treiber (IBM, 1986)

#include <atomic>
#include <memory>
#include <optional>

template<typename T>
class TreiberStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(T d) : data(d), next(nullptr) {}
    };

    std::atomic<Node*> top{nullptr};

public:
    void push(T value) {
        Node* new_node = new Node(value);

        // Load current top and point new node to it
        new_node->next = top.load(std::memory_order_relaxed);

        // CAS loop: try to make new_node the new top
        while (!top.compare_exchange_weak(
                   new_node->next,             // Expected (updated on failure)
                   new_node,                   // Desired
                   std::memory_order_release,  // Success ordering
                   std::memory_order_relaxed   // Failure ordering
               )) {
            // On failure, new_node->next is updated to the current top.
            // The loop automatically retries with the correct value.
        }
    }

    std::optional<T> pop() {
        Node* old_top = top.load(std::memory_order_acquire);

        while (old_top != nullptr) {
            Node* new_top = old_top->next;

            // CAS: try to make the next node the new top
            if (top.compare_exchange_weak(
                    old_top,                    // Expected (updated on failure)
                    new_top,                    // Desired
                    std::memory_order_acquire,  // Success ordering
                    std::memory_order_relaxed   // Failure ordering
                )) {
                T result = old_top->data;
                // Memory reclamation: old_top is now out of the stack.
                // In production, use hazard pointers or epoch-based reclamation.
                delete old_top;  // DANGER: may cause use-after-free!
                return result;
            }
            // On failure, old_top is updated to the current top, retry
        }

        return std::nullopt;  // Stack was empty
    }
};

/*
 * Properties:
 * - Lock-free: Every failed CAS means another thread succeeded
 * - Linearizable: Linearization point is the successful CAS
 * - Memory ordering: release-acquire ensures visibility
 *
 * Caveats:
 * - ABA problem: Use tagged pointers or hazard pointers
 * - Memory reclamation: The delete in pop() is unsafe in general
 * - Starvation possible: One thread may repeatedly lose CAS
 */
```
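For comparison, here is a compact sketch of the Michael-Scott queue from item 2 above, following the published pseudocode. It is illustrative rather than production-ready: memory reclamation is omitted, so dequeued sentinel nodes are leaked where real code would retire them through hazard pointers or an epoch scheme.

```cpp
#include <atomic>
#include <optional>

// Michael-Scott lock-free queue (sketch; reclamation omitted).
template <typename T>
class MSQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head;   // points at a sentinel node
    std::atomic<Node*> tail;   // points at the last or second-to-last node

public:
    MSQueue() {
        Node* dummy = new Node();   // initial sentinel
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(T v) {
        Node* node = new Node();
        node->value = v;
        while (true) {
            Node* last = tail.load();
            Node* next = last->next.load();
            if (last != tail.load()) continue;   // stale snapshot: retry
            if (next == nullptr) {
                // Tail points at the true last node: try to link ours.
                if (last->next.compare_exchange_weak(next, node)) {
                    // The successful link above is the linearization point.
                    tail.compare_exchange_strong(last, node);  // best effort
                    return;
                }
            } else {
                // Tail is lagging: HELP advance it, then retry.
                tail.compare_exchange_strong(last, next);
            }
        }
    }

    std::optional<T> dequeue() {
        while (true) {
            Node* first = head.load();
            Node* last = tail.load();
            Node* next = first->next.load();
            if (first != head.load()) continue;  // stale snapshot: retry
            if (first == last) {
                if (next == nullptr) return std::nullopt;  // queue empty
                tail.compare_exchange_strong(last, next);  // help lagging tail
            } else {
                T v = next->value;   // read before detaching the node
                if (head.compare_exchange_weak(first, next)) {
                    // 'first' (the old sentinel) is leaked here; real code
                    // would retire it for safe reclamation.
                    return v;
                }
            }
        }
    }
};
```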
4. Lock-Free Hash Maps
Hash maps are more complex due to resizing: the classic solution, split-ordered lists (Shalev and Shavit), keeps all items in a single lock-free list ordered by bit-reversed hash, so the bucket array can grow without relocating any nodes.
5. Lock-Free Skip Lists
Alternative to balanced trees: probabilistic balancing eliminates global rebalancing (rotations), so insertions and deletions touch only a few pointers per level. This is the basis of Java's ConcurrentSkipListMap.
6. Lock-Free B-Trees
For database indexes: the Bw-tree (Microsoft Research) avoids in-place updates by prepending delta records to nodes and installing them with a single CAS in a mapping table.
Production implementations include java.util.concurrent — ConcurrentLinkedQueue, ConcurrentSkipListMap, and the Atomic classes.

Lock-free data structures are notoriously difficult to implement correctly. The algorithms are subtle, edge cases are numerous, and testing cannot cover all interleavings. In practice, prefer established, well-tested libraries over custom implementations. Reserve custom lock-free code for situations where libraries don't meet specific requirements—and then subject that code to rigorous verification.
We have explored the principles and practices of lock-free programming—a powerful paradigm that uses CAS to achieve concurrent algorithms with strong progress guarantees. The key concepts to carry forward: the hierarchy of progress guarantees (obstruction-free, lock-free, wait-free), linearizability as the correctness criterion, the design principles of single-CAS state transitions, immutability, and helping, and the central difficulty of safe memory reclamation.
Ironically, one of the most common uses of CAS is to implement locks themselves! The next page explores CAS-based lock implementations—from simple spinlocks to more sophisticated ticket locks and queue locks.
We will examine:
- Simple spinlocks built on test-and-set or CAS
- Ticket locks, which add FIFO fairness
- Queue locks, which scale under contention by giving each waiter its own spin location
Understanding CAS-based locks bridges the gap between lock-free algorithms and traditional locking, showing how the same primitive underlies both approaches.
You now understand the principles of lock-free programming—what it means, why it matters, how to design lock-free algorithms, and when to use this approach. This knowledge is essential for understanding high-performance concurrent systems and for making informed decisions about synchronization strategies.