Locks have been the traditional mechanism for managing concurrent access to shared resources. But locks come with fundamental problems: deadlock, priority inversion, convoying, and blocking. When a thread holding a lock is delayed—by scheduling, page faults, or any other reason—all threads waiting for that lock are also blocked, creating a chain reaction of delays.
Lock-free programming offers an alternative paradigm. Using Compare-and-Swap (CAS) and other atomic primitives, we can build concurrent data structures where threads can always make progress, even if some threads are arbitrarily delayed. No thread can block another; if one thread stalls, others continue unimpeded.
This page provides a rigorous exploration of lock-free programming. We will formally define what "lock-free" means, distinguish it from related concepts like "wait-free" and "obstruction-free," examine why these guarantees matter, and establish the foundational principles for designing lock-free algorithms.
By the end of this page, you will understand the formal definitions of non-blocking progress guarantees (obstruction-free, lock-free, wait-free), why lock-free programming matters for system reliability and performance, the fundamental challenges and tradeoffs of lock-free design, and the principles that guide correct lock-free algorithm construction.
Before exploring lock-free alternatives, we must understand why locks are problematic. While locks are conceptually simple and widely used, they introduce several fundamental issues:
When a thread acquires a lock, all other threads that want that lock must wait. This waiting creates a dependency chain: the progress of waiting threads depends entirely on the progress of the lock holder.
This might seem acceptable—surely the lock holder will finish quickly? But consider these scenarios:
The lock holder is descheduled: The OS scheduler decides to run another process. Waiting threads spin or block until the holder is scheduled again—possibly a full time quantum or more.
The lock holder triggers a page fault: The thread accesses memory that's been swapped out. Disk I/O takes millions of CPU cycles; waiting threads are blocked the entire time.
The lock holder receives a signal: It might pause for signal handling, blocking all waiters.
The lock holder crashes or is killed: All waiters may deadlock permanently.
In all these cases, the performance or even correctness of non-faulty threads depends on factors outside their control. This violates the principle of fault isolation.
| Problem | Description | Impact |
|---|---|---|
| Deadlock | Circular wait for locks: A holds L1 waiting for L2, B holds L2 waiting for L1 | Complete system halt among involved threads |
| Priority Inversion | High-priority thread waits for low-priority thread holding a lock | Real-time guarantees violated, system unresponsiveness |
| Convoying | When the lock holder is slow, all waiters queue up creating long convoys | Latency spikes, reduced throughput under contention |
| Blocking | Thread waits (spinning or sleeping) for lock with no guarantee of progress | CPU waste or context switch overhead |
| Failure Dependency | If lock holder crashes, all waiters may hang indefinitely | System reliability compromised |
Priority inversion deserves special attention because it nearly caused a mission failure on Mars Pathfinder in 1997.
The Scenario:
1. A low-priority task acquires a mutex guarding a shared resource (on Pathfinder, the information bus).
2. A high-priority task blocks trying to acquire that same mutex.
3. Medium-priority tasks preempt the low-priority task, which therefore never runs long enough to release the mutex.
The result: the highest-priority task in the system cannot run, even though it's blocked only on a low-priority task. The system's priority scheme is inverted.
On Mars Pathfinder, priority inversion caused the spacecraft's watchdog timer to trigger repeated system resets. NASA engineers had to diagnose and patch the issue from 190 million kilometers away. The solution involved enabling priority inheritance—a lock-based mitigation where a low-priority lock holder temporarily inherits the priority of the highest-priority waiter.
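On POSIX systems, this mitigation can be requested per mutex. A minimal sketch, assuming a platform that supports the `_POSIX_THREAD_PRIO_INHERIT` option:

```c
#include <pthread.h>

// Initialize a mutex whose holder temporarily inherits the priority of
// its highest-priority waiter. Availability depends on the platform
// supporting the POSIX _POSIX_THREAD_PRIO_INHERIT option.
int init_pi_mutex(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    int rc = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```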
But here's the key insight: lock-free data structures don't have priority inversion because no thread ever waits for another thread. High-priority threads can always make progress immediately.
The core problem with locks is that they create dependencies between threads. Thread A's ability to make progress depends on Thread B releasing the lock. This dependency couples the performance and reliability of all threads that share locks. Lock-free programming removes the dependency: a delayed or failed thread can never prevent the rest of the system from making progress.
Computer scientists have formalized different levels of progress guarantees for concurrent algorithms. Understanding these definitions is essential for evaluating and designing concurrent data structures.
From weakest to strongest:
1. Obstruction-Free (Weakest)
A method is obstruction-free if any thread that eventually runs in isolation (without interference from other threads) completes its operation in a bounded number of steps.
2. Lock-Free (Medium)
A method is lock-free if, at any point, at least one thread is guaranteed to complete its operation in a bounded number of system-wide steps. Individual threads may starve, but the system as a whole always makes progress.
3. Wait-Free (Strongest)
A method is wait-free if every thread completes its operation in a bounded number of steps, regardless of other threads' behavior.
```c
// Illustrating the difference between progress guarantees

// ═══════════════════════════════════════════════════════════════════
// BLOCKING (Traditional Lock-Based)
// ═══════════════════════════════════════════════════════════════════
void blocking_increment(Lock* lock, int* counter) {
    lock_acquire(lock);   // May wait indefinitely if holder is delayed
    *counter += 1;
    lock_release(lock);
}
// Progress guarantee: NONE if lock holder fails

// ═══════════════════════════════════════════════════════════════════
// OBSTRUCTION-FREE
// ═══════════════════════════════════════════════════════════════════
// NOTE: shown with the same CAS loop for comparison. For a single-word
// counter, a failed CAS implies another thread succeeded, so this loop
// is in fact lock-free too. Genuinely obstruction-free-only algorithms
// are multi-step operations that roll back when they detect
// interference; two such threads can abort each other forever (livelock).
void obstruction_free_increment(int* counter) {
    while (true) {
        int old = *counter;
        // In isolation, this completes quickly
        if (CAS(counter, old, old + 1)) {
            return;  // Success
        }
    }
}
// Progress guarantee: Completes if run alone; the obstruction-free
// contract promises nothing stronger under contention

// ═══════════════════════════════════════════════════════════════════
// LOCK-FREE
// ═══════════════════════════════════════════════════════════════════
void lock_free_increment(int* counter) {
    while (true) {
        int old = *counter;
        if (CAS(counter, old, old + 1)) {
            return;  // Success
        }
        // Here's the insight: if our CAS fails, it's because ANOTHER
        // thread's CAS SUCCEEDED. So system-wide progress is guaranteed.
    }
}
// Progress guarantee: At least one thread always completes

// ═══════════════════════════════════════════════════════════════════
// WAIT-FREE
// ═══════════════════════════════════════════════════════════════════
void wait_free_increment(int* counters, int thread_id, int num_threads) {
    // Each thread has its own counter - no contention possible
    counters[thread_id] += 1;
    // Total is the sum of all counters (read lazily or periodically)
}
// Progress guarantee: Every thread completes immediately
// Tradeoff: Aggregation required to get total, design is more complex
```

In practice, lock-free is the most commonly targeted guarantee for high-performance concurrent data structures. Here's why:
Compared to Blocking (Locks):
- No deadlock, no priority inversion, no convoying
- A preempted, paused, or crashed thread cannot stall the others
- System-wide throughput is preserved even under adverse scheduling
Compared to Wait-Free:
- Substantially simpler algorithms with lower per-operation overhead
- Comparable throughput in practice, since per-thread starvation is rare
The pragmatic observation is that lock-free provides excellent real-world performance while being significantly easier to implement than wait-free. Individual-thread starvation is theoretically possible but extremely rare with well-designed algorithms and backoff strategies.
A lock-free algorithm based on CAS has an elegant property: when a CAS fails, it's because another thread's CAS succeeded. Unlike locks where all threads can be stuck waiting, with CAS at least one thread is always making progress. This is the fundamental reason CAS enables lock-free programming.
Lock-free (and lock-based) concurrent data structures must satisfy a correctness criterion. The gold standard is linearizability—a strong consistency model that ensures concurrent operations appear to take effect instantaneously at some point during their execution.
A concurrent object is linearizable if:
1. Every operation appears to take effect atomically at a single instant between its invocation and its response.
2. The resulting sequential order of operations is valid according to the object's sequential specification.
3. If one operation completes before another begins (in real time), it appears earlier in that order.
Intuitively, linearizability means: even though operations overlap in time, we can construct a sequential history where each operation happens instantaneously, and this history respects the real-time ordering of non-overlapping operations.
The linearization point of an operation is the instant at which the operation "takes effect." For lock-free algorithms using CAS, the linearization point is typically the successful CAS instruction.
```
Time →
────────────────────────────────────────────────────────────────
Thread A:  |-------- push(X) --------|          (invocation to response)
                         ↑
                         Linearization point: successful CAS

Thread B:         |-------- pop() --------|     (overlapping operation)
                               ↑
                               Linearization point: successful CAS

Linearizable history: push(X) → pop() returns X
```
```cpp
// Lock-free stack: identifying linearization points

typedef struct Node {
    int value;
    struct Node* next;
} Node;

atomic<Node*> top = NULL;

// PUSH operation
void push(int value) {
    Node* new_node = malloc(sizeof(Node));
    new_node->value = value;

    while (true) {
        Node* old_top = top.load();     // Read current top
        new_node->next = old_top;       // Link new node to current stack

        // ═══════════════════════════════════════════════════════
        // LINEARIZATION POINT: If this CAS succeeds, the push
        // "takes effect" at this instant. Before this point,
        // the new node is not visible. After, it's on the stack.
        // ═══════════════════════════════════════════════════════
        if (top.compare_exchange_weak(old_top, new_node)) {
            return;  // Push complete
        }
        // CAS failed, retry
    }
}

// POP operation
Node* pop() {
    while (true) {
        Node* old_top = top.load();     // Read current top

        if (old_top == NULL) {
            // ═══════════════════════════════════════════════════
            // LINEARIZATION POINT (empty case): The load that
            // returned NULL. The stack was empty at this instant.
            // ═══════════════════════════════════════════════════
            return NULL;  // Stack is empty
        }

        Node* new_top = old_top->next;  // Read next pointer

        // ═══════════════════════════════════════════════════════
        // LINEARIZATION POINT (success case): If this CAS succeeds,
        // the pop "takes effect" at this instant. The node is
        // atomically removed from the stack.
        // ═══════════════════════════════════════════════════════
        if (top.compare_exchange_weak(old_top, new_top)) {
            return old_top;  // Return removed node
        }
        // CAS failed, retry
    }
}

/*
 * Why is this linearizable?
 *
 * 1. Each successful CAS is an atomic, indivisible operation
 * 2. The effect (changing 'top') happens at that exact instant
 * 3. No partial states are ever visible
 * 4. If push A linearizes before pop B, pop B sees A's effect
 * 5. The sequential spec (LIFO stack) is preserved
 */
```

Composability: Linearizable objects can be composed. If two data structures are independently linearizable, using them together in a larger system preserves intuitive sequential behavior.
Reasoning: We can reason about the concurrent system as if operations happened one at a time, in some sequential order. This dramatically simplifies understanding and verification.
Local Reasoning: We only need to examine the data structure's implementation to verify linearizability—we don't need to know how clients use it.
Linearizability is stronger than sequential consistency:
- Sequential consistency requires only that operations appear in some total order consistent with each thread's program order.
- Linearizability additionally requires that this order respect real time across threads.
The additional real-time constraint means linearizability preserves intuitive causality: if I complete an operation before you start yours, my operation must appear before yours in the history.
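A classic worked example of a history that is sequentially consistent but not linearizable, using a single shared register initially holding 0:

```
Time →
────────────────────────────────────────────────────────────────
Thread A:  |-- write(x, 1) --|
Thread B:                        |-- read(x) returns 0 --|

Sequentially consistent: we may reorder B's read before A's write,
  giving the valid sequential history read(x)=0 → write(x,1).
Not linearizable: A's write completed before B's read began, so the
  real-time order forces write(x,1) → read(x), which must return 1.
```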
Identifying linearization points requires careful analysis. The point must be within the operation's execution span, and the system state at that point must reflect the operation's effect. For CAS-based algorithms, the successful CAS is usually the linearization point. For read-only operations, it's often the load that determines the return value. Proving linearizability involves showing that consistent linearization points exist for all possible executions.
Designing lock-free algorithms requires a different mindset than lock-based programming. Several principles guide the construction of correct and efficient lock-free data structures.
The core idea of lock-free programming is to express state changes as single atomic operations. Instead of:
```c
lock();
read state;
modify state;
write state;
unlock();
```
We use:
```c
do {
    old_state = read();
    new_state = compute(old_state);
} while (!CAS(&state, old_state, new_state));
```
The entire state transition happens atomically via CAS. If another thread modifies the state between our read and our CAS attempt, the CAS fails and we retry with the new state.
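As a concrete, self-contained instance of this loop, here is a sketch of an "atomic maximum" update using C++ `std::atomic` (the function name is illustrative):

```cpp
#include <atomic>

// Atomically raise shared_max to at least 'candidate' using the
// read-compute-CAS retry loop described above.
void update_max(std::atomic<int>& shared_max, int candidate) {
    int observed = shared_max.load();
    while (candidate > observed) {
        // On failure, compare_exchange_weak reloads 'observed' with the
        // current value, so the loop retries against fresh state.
        if (shared_max.compare_exchange_weak(observed, candidate)) {
            return;  // Our value was installed
        }
    }
    // Another thread already stored a value >= candidate: nothing to do.
}
```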
Make data structures (or parts of them) immutable after publication. Instead of modifying an existing node, create a new node with the desired modifications and atomically swap the pointer.
This eliminates races on the data itself—the only race is on the pointer to the data, which can be handled with a single CAS.
```cpp
// Copy-on-Write pattern for lock-free updates

typedef struct Config {
    int timeout_ms;
    int max_connections;
    char* server_name;
} Config;

atomic<Config*> current_config;

// Lock-free config update using copy-on-write
void update_timeout(int new_timeout) {
    while (true) {
        Config* old_config = current_config.load();

        // Create a NEW config object - never modify existing
        Config* new_config = malloc(sizeof(Config));

        // Copy existing values
        new_config->timeout_ms = new_timeout;  // Updated value
        new_config->max_connections = old_config->max_connections;
        new_config->server_name = strdup(old_config->server_name);

        // Atomically swap the pointer
        if (current_config.compare_exchange_strong(old_config, new_config)) {
            // Success! Schedule old_config for eventual deallocation
            // (using hazard pointers, RCU, or epoch-based reclamation)
            schedule_free(old_config);
            return;
        }

        // CAS failed - someone else updated config
        // Free our new_config and retry
        free(new_config->server_name);
        free(new_config);
    }
}

// Lock-free read - always sees a consistent config snapshot
Config* get_config_snapshot() {
    return current_config.load();
    // Returns a pointer to a complete, immutable config
    // Caller sees either old or new config, never a mix
}

/*
 * Why this works:
 * 1. Config objects are immutable after publication
 * 2. Readers see complete, consistent snapshots
 * 3. Updates atomically swap the pointer
 * 4. No locks, no blocking, no races on config contents
 * 5. Only complexity: memory reclamation (who frees old config?)
 */
```

In lock-free algorithms, if one thread's operation is incomplete and visible to other threads, those other threads should help complete the operation rather than waiting or failing. This "helping" mechanism is crucial for ensuring lock-free progress.
Example: In a lock-free linked list, if Thread A starts removing a node but gets delayed, Thread B should recognize the partially-complete removal and help finish it, rather than blocking or failing its own operation.
Lock-free operations may be observed in intermediate states. Design the data structure so that these partial states are:
- Detectable: other threads can recognize that an operation is in progress
- Completable: any thread can finish (or safely step around) the pending operation
- Invariant-preserving: no observer ever sees the structure in an invalid state
This often involves using flag bits in pointers or separate status fields to mark operations in progress.
```cpp
// Using marked pointers for lock-free list deletion

// Pointer tagging: use the low bit of a pointer as a "marked" flag.
// This works because malloc'd addresses are aligned (low bits are 0).

#define MARK(p)      ((Node*)((uintptr_t)(p) | 1))
#define UNMARK(p)    ((Node*)((uintptr_t)(p) & ~1))
#define IS_MARKED(p) ((uintptr_t)(p) & 1)

typedef struct Node {
    int key;
    atomic<Node*> next;   // Low bit used as mark flag
} Node;

// Lock-free deletion with helping
bool delete_node(Node* pred, Node* curr) {
    Node* succ = curr->next.load();

    // Step 1: Logically delete by marking curr's next pointer.
    // This signals to other threads: "curr is being deleted"
    while (!IS_MARKED(succ)) {
        if (curr->next.compare_exchange_weak(succ, MARK(succ))) {
            // Successfully marked - curr is logically deleted
            break;
        }
        succ = curr->next.load();
    }

    // Step 2: Physically unlink by swinging pred's pointer past curr.
    // Note: ANY thread that sees the marked node can do this!
    Node* unmarked_succ = UNMARK(succ);
    if (pred->next.compare_exchange_strong(curr, unmarked_succ)) {
        // Successfully unlinked - schedule curr for reclamation
        retire(curr);
        return true;
    }

    // CAS failed - another thread already unlinked curr, that's OK
    return false;
}

/*
 * How helping works here:
 *
 * Thread A starts deletion:
 *   1. Marks curr->next (logical deletion)
 *   2. Gets descheduled before physical unlink
 *
 * Thread B traverses list:
 *   1. Sees marked curr->next
 *   2. Recognizes: "curr is being deleted, let me help"
 *   3. Attempts to CAS pred->next from curr to UNMARK(curr->next)
 *   4. If it succeeds, Thread B has completed Thread A's deletion
 *
 * Thread A resumes:
 *   1. Its CAS to unlink will fail (curr already unlinked)
 *   2. Thread A returns - deletion is complete (thanks to B)
 *
 * System-wide progress guaranteed: someone always completes
 */
```

The 'helping' pattern is what transforms obstruction-free algorithms into lock-free ones. Without helping, a delayed thread could prevent others from completing their operations. With helping, other threads can complete the delayed operation, ensuring system-wide progress. This is why lock-free data structures are often more complex than their blocking counterparts—they must encode enough information for any thread to complete any operation.
One of the most challenging aspects of lock-free programming is memory reclamation—safely freeing memory that has been removed from a data structure. The problem is uniquely difficult in lock-free contexts because:
- There is no lock to delimit when a node can still be observed; readers may hold pointers to it at any moment.
- Readers do not announce what they are accessing unless the algorithm forces them to.
- A thread can be suspended indefinitely while holding a pointer to a node that another thread has already removed.
Consider this scenario:
```
Time →
────────────────────────────────────────────────────────────────
Thread A:  read(node)   [suspended] ............ dereference(node) ← CRASH!
                                                 ↑
                                                 node was freed!
Thread B:               remove(node)  free(node)
```
Thread A reads a pointer to a node. Thread B removes and frees that node. When Thread A resumes and tries to use the pointer, it accesses freed memory—a use-after-free bug that can cause crashes or corruption.
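The same race in code form, as an illustrative sketch (the names are hypothetical, and this is exactly the pattern NOT to use):

```cpp
#include <atomic>

struct Node { int value; Node* next; };
std::atomic<Node*> head;

// UNSAFE: demonstrates the reclamation race, not a pattern to copy.
int read_front_value() {
    Node* n = head.load();   // Thread A reads the pointer...
    // ...if Thread A is preempted HERE, Thread B may remove the node
    // and free it.
    return n->value;         // use-after-free if n was reclaimed
}
```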
Several techniques address this problem, each with different tradeoffs:
| Technique | Concept | Pros | Cons |
|---|---|---|---|
| Hazard Pointers | Threads announce which nodes they're accessing | Precise, bounded memory | Per-access overhead, limited protected pointers |
| Epoch-Based Reclamation (EBR) | Track which 'epoch' each thread is in; reclaim when all advance | Low overhead, amortized | Unbounded memory if a thread stalls |
| Reference Counting | Track number of references to each object | Immediate reclamation | High overhead, contention on ref counts |
| Read-Copy-Update (RCU) | Defer reclamation until all readers complete | Very low read-side overhead | Kernel integration, delayed reclamation |
| Quiescent States | Reclaim when all threads pass through quiescent points | Simple concept | Requires application cooperation |
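For a taste of the RCU row, here is a sketch of the canonical read/update pattern using the userspace liburcu API (an assumption: the default "urcu" flavor, with each thread having called `rcu_register_thread()`; the `config` type and a single updater are illustrative):

```c
#include <urcu.h>     /* userspace RCU (liburcu); kernel RCU is analogous */
#include <stdlib.h>

struct config { int timeout_ms; };
struct config *global_config;   /* RCU-protected pointer */

/* Reader: extremely cheap and never blocks the updater */
int read_timeout(void) {
    rcu_read_lock();
    struct config *c = rcu_dereference(global_config);
    int t = c->timeout_ms;
    rcu_read_unlock();
    return t;
}

/* Updater (assumed single): publish a new version, wait, then free */
void set_timeout(int ms) {
    struct config *fresh = malloc(sizeof(*fresh));
    fresh->timeout_ms = ms;
    struct config *old = global_config;
    rcu_assign_pointer(global_config, fresh);
    synchronize_rcu();   /* returns after all pre-existing readers exit */
    free(old);           /* now safe: no reader can still hold 'old' */
}
```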
Hazard pointers, introduced by Maged Michael, are one of the most practical techniques. The idea:
1. Before dereferencing a shared pointer, a thread publishes it in one of its hazard-pointer slots, which all threads can read.
2. A thread that removes a node does not free it immediately; it places it on a private retired list.
3. Periodically, the retiring thread scans every hazard pointer and frees only those retired nodes that no slot protects.
```cpp
// Hazard Pointers for safe memory reclamation

#include <atomic>
#include <set>
#include <vector>
using namespace std;

#define MAX_THREADS 64
#define HP_PER_THREAD 2
#define RECLAIM_THRESHOLD 64   // scan once this many nodes are retired

// Per-thread hazard pointers (visible to all threads)
atomic<void*> hazard_pointers[MAX_THREADS][HP_PER_THREAD];

// Per-thread retired list (to-be-freed nodes)
struct RetiredNode {
    void* ptr;
    void (*free_func)(void*);
};
thread_local vector<RetiredNode> retired_list;

void scan_and_reclaim();

// Read side: Protect a pointer before dereferencing
void* protect_pointer(int thread_id, int hp_index, atomic<void*>* addr) {
    void* ptr;
    do {
        ptr = addr->load();                               // Read pointer
        hazard_pointers[thread_id][hp_index].store(ptr);  // Publish to HP
        // Re-read to ensure the pointer didn't change between read and publish
    } while (addr->load() != ptr);
    return ptr;  // Now safe to dereference
}

// Read side: Done with protected pointer
void release_pointer(int thread_id, int hp_index) {
    hazard_pointers[thread_id][hp_index].store(nullptr);
}

// Write side: Retire a node (schedule for later reclamation)
void retire_node(void* ptr, void (*free_func)(void*)) {
    retired_list.push_back({ptr, free_func});

    // Periodically scan retired list and reclaim safe nodes
    if (retired_list.size() > RECLAIM_THRESHOLD) {
        scan_and_reclaim();
    }
}

// Scan all hazard pointers and reclaim unprotected nodes
void scan_and_reclaim() {
    // Collect all active hazard pointers
    set<void*> protected_ptrs;
    for (int t = 0; t < MAX_THREADS; t++) {
        for (int h = 0; h < HP_PER_THREAD; h++) {
            void* hp = hazard_pointers[t][h].load();
            if (hp) protected_ptrs.insert(hp);
        }
    }

    // Remove and free unprotected retired nodes
    auto it = retired_list.begin();
    while (it != retired_list.end()) {
        if (protected_ptrs.find(it->ptr) == protected_ptrs.end()) {
            // Not protected - safe to free
            it->free_func(it->ptr);
            it = retired_list.erase(it);
        } else {
            ++it;  // Keep in retired list, still protected
        }
    }
}

/*
 * Usage pattern:
 *
 * void* dequeue(int tid) {
 *     Node* head = protect_pointer(tid, 0, &queue_head);  // Protect
 *     if (!head) {
 *         release_pointer(tid, 0);
 *         return NULL;
 *     }
 *
 *     Node* next = protect_pointer(tid, 1, &head->next);  // Protect next too
 *
 *     if (CAS(&queue_head, head, next)) {
 *         void* result = head->data;
 *         release_pointer(tid, 0);
 *         release_pointer(tid, 1);
 *         retire_node(head, free);  // Schedule for reclamation
 *         return result;
 *     }
 *
 *     release_pointer(tid, 0);
 *     release_pointer(tid, 1);
 *     // Retry...
 * }
 */
```

Memory reclamation is often the most complex part of lock-free data structures. Incorrect reclamation leads to use-after-free bugs, which are subtle and dangerous. Many production systems use garbage-collected languages (Java, Go) partly to avoid manual lock-free memory management. If you must manage memory manually, use well-tested libraries rather than implementing reclamation from scratch.
Lock-free programming is not a universal solution. It excels in specific scenarios and may be counterproductive in others. Understanding when to use (and not use) lock-free techniques is crucial.
1. Real-Time Systems
When worst-case latency matters more than average throughput:
- Lock-free operations have analyzable worst-case step counts; no operation waits on a preempted lock holder.
- No priority inversion: a high-priority task never blocks behind a low-priority one.
2. Signal Handler / Interrupt Context
When code may execute in restricted contexts:
- Taking a lock in a signal handler or interrupt handler can self-deadlock if the interrupted code already holds that lock.
- Lock-free operations never wait, so they are safe to call from these contexts.
3. High-Contention Bottlenecks
When a single resource is accessed by many threads:
- A short CAS-based operation avoids lock convoys and the context-switch overhead of blocking.
- Throughput tends to degrade more gracefully than with a contended mutex.
4. Failure Tolerance Requirements
When thread/process failures must not block others:
- A thread that crashes mid-operation cannot leave a lock held; other threads can complete or bypass its partial work.
- This matters for shared-memory IPC, where one process's death must not hang the others.
| Factor | Favor Lock-Free | Favor Lock-Based |
|---|---|---|
| Critical section length | Very short (few instructions) | Long (I/O, computation) |
| Contention level | Low to medium | Any, especially high |
| Latency requirements | Hard real-time, bounded latency | Average latency matters more |
| Failure tolerance | Must tolerate thread failures | Threads rarely fail |
| Complexity budget | Can invest in complex design | Need simple, maintainable code |
| Memory management | GC available, or reclamation solved | Stack-local or simple lifetime |
1. Long Critical Sections
If the critical section involves significant work, lock-free provides no advantage:
- A failed CAS discards the whole computation, so retries repeat expensive work.
- Copying large state on every attempt is costly; a lock amortizes far better over long sections.
2. High Contention with Low Throughput
When many threads fight for one resource continuously:
- CAS retry storms burn CPU cycles and cache-coherence bandwidth without making extra progress.
- Backoff, combining, or a queue-based lock may serialize access more efficiently.
3. Simple Correctness Requirements
When the team doesn't have lock-free expertise:
- Lock-free bugs (ABA, reordering, reclamation races) are subtle, hard to reproduce, and hard to review.
- A simple, correct lock-based design is better than a clever, broken lock-free one.
4. Composing Multiple Objects
When an operation must atomically update multiple data structures:
- CAS works on a single word; linearizable objects do not compose into atomic multi-object updates.
- Such operations usually need locks, transactions, or far more complex multi-word constructions.
Start with locks. Profile under realistic load. If locks are the bottleneck (contention, latency spikes), consider lock-free alternatives for that specific data structure. Use well-tested libraries (java.util.concurrent, C++ folly, crossbeam-rs) rather than implementing from scratch. Lock-free programming is a precision tool—use it surgically where it provides clear benefits.
Several lock-free data structures have been developed and refined over decades of research. Understanding the landscape helps in choosing the right tool for each job.
1. Lock-Free Stack (Treiber Stack)
The simplest lock-free structure. Push and pop both CAS on the top pointer.
2. Lock-Free Queue (Michael-Scott Queue)
Enqueue and dequeue operate on different ends (tail and head), allowing enqueuers and dequeuers to proceed concurrently. A compact sketch appears after the Treiber stack code below.
3. Lock-Free Linked List
Ordered list with lock-free insertion and deletion.
```cpp
// Treiber Stack: Classic lock-free stack
// Named after R. Kent Treiber (IBM, 1986)

#include <atomic>
#include <memory>
#include <optional>

template<typename T>
class TreiberStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(T d) : data(d), next(nullptr) {}
    };

    std::atomic<Node*> top{nullptr};

public:
    void push(T value) {
        Node* new_node = new Node(value);

        // Load current top and point new node to it
        new_node->next = top.load(std::memory_order_relaxed);

        // CAS loop: try to make new_node the new top
        while (!top.compare_exchange_weak(
                   new_node->next,             // Expected (updated on failure)
                   new_node,                   // Desired
                   std::memory_order_release,  // Success ordering
                   std::memory_order_relaxed   // Failure ordering
               )) {
            // On failure, new_node->next is updated to the current top.
            // The loop automatically retries with the correct value.
        }
    }

    std::optional<T> pop() {
        Node* old_top = top.load(std::memory_order_acquire);

        while (old_top != nullptr) {
            Node* new_top = old_top->next;

            // CAS: try to make the next node the new top
            if (top.compare_exchange_weak(
                    old_top,                    // Expected (updated on failure)
                    new_top,                    // Desired
                    std::memory_order_acquire,  // Success ordering
                    std::memory_order_relaxed   // Failure ordering
                )) {
                T result = old_top->data;
                // Memory reclamation: old_top is now out of the stack.
                // In production, use hazard pointers or epoch-based reclamation.
                delete old_top;  // DANGER: may cause use-after-free!
                return result;
            }
            // On failure, old_top is updated to the current top, retry
        }

        return std::nullopt;  // Stack was empty
    }
};

/*
 * Properties:
 * - Lock-free: Every failed CAS means another thread succeeded
 * - Linearizable: Linearization point is the successful CAS
 * - Memory ordering: release-acquire ensures visibility
 *
 * Caveats:
 * - ABA problem: Use tagged pointers or hazard pointers
 * - Memory reclamation: The delete in pop() is unsafe in general
 * - Starvation possible: One thread may repeatedly lose CAS
 */
```
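For comparison, here is a compact sketch of the Michael-Scott queue from item 2 above, following the published pseudocode. It is illustrative rather than production-ready: memory reclamation is omitted, so dequeued sentinel nodes are leaked where real code would retire them through hazard pointers or an epoch scheme.

```cpp
#include <atomic>
#include <optional>

// Michael-Scott lock-free queue (sketch; reclamation omitted).
template <typename T>
class MSQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head;   // points at a sentinel node
    std::atomic<Node*> tail;   // points at the last or second-to-last node

public:
    MSQueue() {
        Node* dummy = new Node();   // initial sentinel
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(T v) {
        Node* node = new Node();
        node->value = v;
        while (true) {
            Node* last = tail.load();
            Node* next = last->next.load();
            if (last != tail.load()) continue;   // stale snapshot: retry
            if (next == nullptr) {
                // Tail points at the true last node: try to link ours.
                if (last->next.compare_exchange_weak(next, node)) {
                    // The successful link above is the linearization point.
                    tail.compare_exchange_strong(last, node);  // best effort
                    return;
                }
            } else {
                // Tail is lagging: HELP advance it, then retry.
                tail.compare_exchange_strong(last, next);
            }
        }
    }

    std::optional<T> dequeue() {
        while (true) {
            Node* first = head.load();
            Node* last = tail.load();
            Node* next = first->next.load();
            if (first != head.load()) continue;  // stale snapshot: retry
            if (first == last) {
                if (next == nullptr) return std::nullopt;  // queue empty
                tail.compare_exchange_strong(last, next);  // help lagging tail
            } else {
                T v = next->value;   // read before detaching the node
                if (head.compare_exchange_weak(first, next)) {
                    // 'first' (the old sentinel) is leaked here; real code
                    // would retire it for safe reclamation.
                    return v;
                }
            }
        }
    }
};
```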
4. Lock-Free Hash Maps
Hash maps are more complex due to resizing: the classic solution, split-ordered lists (Shalev and Shavit), keeps all items in a single lock-free list ordered by bit-reversed hash, so the bucket array can grow without relocating any nodes.
5. Lock-Free Skip Lists
Alternative to balanced trees: probabilistic balancing eliminates global rebalancing (rotations), so insertions and deletions touch only a few pointers per level. This is the basis of Java's ConcurrentSkipListMap.
6. Lock-Free B-Trees
For database indexes: the Bw-tree (Microsoft Research) avoids in-place updates by prepending delta records to nodes and installing them with a single CAS in a mapping table.
Production implementations include java.util.concurrent — ConcurrentLinkedQueue, ConcurrentSkipListMap, and the Atomic classes.

Lock-free data structures are notoriously difficult to implement correctly. The algorithms are subtle, edge cases are numerous, and testing cannot cover all interleavings. In practice, prefer established, well-tested libraries over custom implementations. Reserve custom lock-free code for situations where libraries don't meet specific requirements—and then subject that code to rigorous verification.
We have explored the principles and practices of lock-free programming—a powerful paradigm that uses CAS to achieve concurrent algorithms with strong progress guarantees. The key concepts to carry forward: the hierarchy of progress guarantees (obstruction-free, lock-free, wait-free), linearizability as the correctness criterion, the design principles of single-CAS state transitions, immutability, and helping, and the central difficulty of safe memory reclamation.
Ironically, one of the most common uses of CAS is to implement locks themselves! The next page explores CAS-based lock implementations—from simple spinlocks to more sophisticated ticket locks and queue locks.
We will examine:
- Simple spinlocks built on test-and-set or CAS
- Ticket locks, which add FIFO fairness
- Queue locks, which scale under contention by giving each waiter its own spin location
Understanding CAS-based locks bridges the gap between lock-free algorithms and traditional locking, showing how the same primitive underlies both approaches.
You now understand the principles of lock-free programming—what it means, why it matters, how to design lock-free algorithms, and when to use this approach. This knowledge is essential for understanding high-performance concurrent systems and for making informed decisions about synchronization strategies.