In ancient Greek philosophy, atomos meant 'indivisible'—the smallest unit of matter that could not be further divided. In computing, we've borrowed this concept for perhaps the most fundamental property in concurrent programming: atomicity.
An atomic operation is one that appears to happen instantaneously and completely—or not at all. There is no intermediate state visible to other threads. Consider incrementing a counter:
counter = counter + 1;
This is not atomic on most hardware. It's actually three operations:
1. Load the current value of counter from memory into a register
2. Increment the value in the register
3. Store the result back to memory

Another thread could read counter between any of these steps, seeing an intermediate state. This is the root cause of lost updates and data corruption.
Atomic operations eliminate this problem entirely. An atomic increment is indivisible—other threads see only the before-state or the after-state, never an in-between.
By the end of this page, you will understand what atomicity truly means, how hardware provides atomic operations, the difference between atomicity and synchronization, how high-level atomic abstractions work, and how to use atomicity correctly in concurrent programs.
Atomicity is the property of an operation appearing to occur instantaneously and indivisibly with respect to all other concurrent operations.
More formally:
An operation is atomic if it appears to all observers (other threads, processors, or systems) as if it:
- Occurs at a single, distinct point in time
- Either completes entirely or does not occur at all
- Cannot be observed in any intermediate state
The 'All-or-Nothing' Principle:
Atomic operations have no partial completion. Either all of the operation's effects become visible, or none of them do. There is no scenario where 'half' of an atomic operation is visible.
Atomicity vs. Instantaneity:
An atomic operation doesn't actually happen instantly—CPUs take time. What matters is that no other operation can observe the 'during' phase. From any external viewpoint, the operation transitions from 'not started' to 'complete' with nothing in between.
Every atomic operation has a 'linearization point'—an instant when it logically 'happens.' All effects of the operation are invisible before this point and fully visible after it. This is a theoretical abstraction, but it's how we reason about correctness: operations can be ordered by their linearization points.
Examples of atomic vs. non-atomic operations:
| Operation | Atomic? | Why |
|---|---|---|
| `int x = 42;` (aligned, word-sized) | Usually* | Single CPU instruction on most hardware |
| `long x = 42;` (32-bit system) | No | 64-bit write requires two 32-bit stores |
| `counter++` | No | Read-modify-write: three operations |
| `ptr->field = value` | Maybe | Depends on alignment and architecture |
| `atomic_fetch_add(&x, 1)` | Yes | Hardware-guaranteed atomic instruction |
| `CAS(&x, old, new)` | Yes | Hardware-guaranteed atomic compare-and-swap |
*Even simple writes may not be atomic if misaligned or if the C standard is strictly followed (where no guarantees exist without explicit atomics).
Modern CPUs provide special instructions that are guaranteed atomic. These are the building blocks on which all higher-level synchronization is built.
1. Atomic Load and Store
Reading or writing a naturally-aligned, word-sized value is typically atomic on modern CPUs. However, the C/C++ standards don't guarantee this, so explicit atomic types should be used for portability.
2. Test-and-Set (TAS)
Atomically sets a memory location to a value and returns the old value:
TAS(addr): old = *addr; *addr = 1; return old; // Atomic
3. Compare-and-Swap (CAS) / Compare-and-Exchange (CMPXCHG)
Atomically compares a memory location with an expected value; if they match, updates to a new value:
CAS(addr, expected, new):
if (*addr == expected) {
*addr = new;
return true; // Success
}
return false; // Failure
This is the most powerful and widely-used atomic primitive.
4. Fetch-and-Add (FAA) / Atomic Increment
Atomically adds a value and returns the previous value:
FAA(addr, val): old = *addr; *addr = old + val; return old; // Atomic
5. Load-Linked/Store-Conditional (LL/SC)
Used on ARM, PowerPC, RISC-V instead of CAS:
- LL (Load-Linked): reads a value and sets a 'reservation' on the location
- SC (Store-Conditional): writes a value only if the reservation is still valid

If anything modified the location since the LL, the SC fails. This is more general than CAS (it avoids the ABA problem) but requires a retry loop.
How hardware implements atomicity:
For single-location operations: on x86, the LOCK prefix forces the CPU to hold the affected cache line exclusively for the duration of the instruction (older processors locked the entire memory bus). On LL/SC architectures, the coherence protocol invalidates the reservation when another core writes the location, failing the store-conditional.
For multi-location operations: no single instruction exists. Atomicity across multiple locations must be composed in software from single-location primitives, using locks or transactional techniques.
```c
// Hardware atomic operations exposed through C11 atomics
#include <stdatomic.h>

atomic_int counter = 0;

// Atomic operations map to hardware instructions

// Atomic store: single instruction, full memory barrier on x86
atomic_store(&counter, 42);

// Atomic load: single instruction
int value = atomic_load(&counter);

// Fetch-and-add: x86 'lock xadd' instruction
int old = atomic_fetch_add(&counter, 1); // Returns old value

// Compare-and-exchange: x86 'lock cmpxchg' instruction
int expected = 10;
int desired = 20;
bool success = atomic_compare_exchange_strong(&counter, &expected, desired);
// If counter was 10, it's now 20 and success=true
// If counter was not 10, expected now holds actual value, success=false

// These operations are TRULY atomic at hardware level
// No lock needed - the CPU ensures indivisibility
```

CAS can fail when another thread modified the value between the read and the attempt. The common pattern is a CAS loop: read the current value, compute the new value, CAS to update; if the CAS fails, retry. This enables lock-free algorithms but requires careful reasoning about the retry loop's correctness.
A crucial distinction that many developers miss: atomicity and synchronization are different concepts that often work together but solve different problems.
Atomicity ensures that an operation is indivisible—no partial states are visible.
Synchronization (memory ordering) ensures that operations occur in a predictable order visible to all threads.
You can have one without the other:
Atomic without ordering (memory_order_relaxed):
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
The increment itself is atomic (no lost updates), but there's no guarantee about when or in what order it becomes visible relative to other operations. Other threads may see this increment before or after surrounding non-atomic operations, in unpredictable order.
Ordering without atomicity:
Conceptually possible but not useful—memory barriers order operations, but if the operations themselves aren't atomic, you still get races.
Combined atomicity + ordering (the common case):
atomic_store_explicit(&ready, 1, memory_order_release);
The store is atomic AND provides release semantics (all prior operations are visible before this store becomes visible to other threads).
| Concept | What It Provides | What It Doesn't Provide |
|---|---|---|
| Atomicity | Indivisibility—no partial states visible | No ordering guarantees with other operations |
| Synchronization | Ordering—predictable visibility of effects | No guarantee individual operations are indivisible |
| Both (full atomic) | Indivisible AND ordered operations | Higher overhead than relaxed atomics |
When to use relaxed atomics: statistics and event counters, sequence numbers, and other values where each update must be atomic but its ordering relative to other memory operations does not matter.
When you need full synchronization: publishing data to other threads, implementing locks, or any case where an atomic variable guards access to non-atomic data.
Using memory_order_relaxed is an expert-level optimization. The lack of ordering guarantees can lead to extremely subtle bugs. Start with sequential consistency (the default). Only relax ordering after proving correctness and profiling to confirm it matters.
Modern programming languages provide atomic types that map to hardware atomic operations:
C11 Atomics:
#include <stdatomic.h>
atomic_int counter; // Atomic integer
atomic_flag lock = ATOMIC_FLAG_INIT; // Atomic boolean (spinlock primitive)
_Atomic(struct Point) point; // Any type can be atomic
C++11 Atomics:
#include <atomic>
std::atomic<int> counter{0};
std::atomic<bool> flag{false};
std::atomic<T> value; // Any trivially copyable type
Java Atomics:
import java.util.concurrent.atomic.*;
AtomicInteger counter = new AtomicInteger(0);
AtomicReference<Node> head = new AtomicReference<>();
AtomicLong timestamp = new AtomicLong();
Go sync/atomic:
import "sync/atomic"
var counter int64
atomic.AddInt64(&counter, 1)
atomic.LoadInt64(&counter)
```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

void increment_many_times(int iterations) {
    for (int i = 0; i < iterations; i++) {
        // Atomic increment - no lost updates!
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    constexpr int THREADS = 10;
    constexpr int ITERS = 100000;

    std::vector<std::thread> threads;
    for (int i = 0; i < THREADS; i++) {
        threads.emplace_back(increment_many_times, ITERS);
    }
    for (auto& t : threads) {
        t.join();
    }

    // Guaranteed to be exactly THREADS * ITERS
    // No lost updates because each increment is atomic
    std::cout << "Counter: " << counter.load() << std::endl;
    // Output: Counter: 1000000 (always!)
    return 0;
}
```

Operations using atomics can be lock-free (at least one thread always makes progress) or wait-free (every thread makes progress in bounded steps). atomic_fetch_add is typically wait-free—it always completes. CAS loops are lock-free but not wait-free—a thread's CAS might fail repeatedly if others keep succeeding.
One of the most subtle bugs in lock-free programming is the ABA problem, which can occur when using compare-and-swap.
The scenario:
1. Thread 1 reads a shared location and sees value A
2. Thread 2 changes the location from A to B, then back to A
3. Thread 1 executes CAS(addr, A, new), which succeeds because the value is A again
Why this is a problem:
Thread 1's CAS succeeded, but the world has changed since it read A. If A was a pointer and Thread 2 freed and reallocated memory, Thread 1 might now be pointing to completely different data. The 'A' value is the same, but its meaning has changed.
Classic example: lock-free stack
```c
// Lock-free stack with ABA problem
struct Node {
    void* data;
    struct Node* next;
};

atomic_ptr top; // Assume this is an atomic pointer

void push(struct Node* node) {
    struct Node* old_top;
    do {
        old_top = atomic_load(&top);
        node->next = old_top;
    } while (!CAS(&top, old_top, node)); // Retry until success
}

struct Node* pop() {
    struct Node* old_top;
    struct Node* new_top;
    do {
        old_top = atomic_load(&top);        // Step 1: Read top
        if (old_top == NULL) return NULL;
        new_top = old_top->next;            // Step 2: Get next
    } while (!CAS(&top, old_top, new_top)); // Step 3: CAS to update
    return old_top;
}

// ABA scenario:
// Thread 1: pop() reads top = A, new_top = B, about to CAS...
// Thread 2: pop() succeeds, gets A
// Thread 2: pop() succeeds, gets B
// Thread 2: push(A) - A is back on stack!
// Thread 1: CAS(&top, A, B) succeeds! But B was already popped!
//
// Result: Stack is corrupted. B is "on" the stack but was already freed.
```

Solutions to the ABA problem:
1. Double-width CAS with version counter:
Instead of CAS on just the pointer, CAS on (pointer, version). Increment version on every modification. Even if the pointer returns to A, the version is different.
struct TaggedPointer {
Node* ptr;
uint64_t version;
};
// Use 128-bit CAS (cmpxchg16b on x86-64)
2. Hazard pointers:
Before accessing a node, publish a 'hazard pointer' to it. Other threads check hazard pointers before freeing, deferring reclamation if the node is protected.
3. Epoch-based reclamation:
Track 'epochs' of execution. Defer freeing until all threads have passed through a grace period, ensuring no thread holds old references.
4. LL/SC instead of CAS:
Load-Linked/Store-Conditional (on ARM, PowerPC) naturally avoids ABA because SC fails if any write occurred, not just if the value differs.
In the stack example, ABA can cause use-after-free, double-free, or memory corruption. These are security vulnerabilities, not just correctness bugs. Lock-free data structures are advanced territory—use proven implementations from libraries rather than rolling your own.
Atomic operations are the foundation on which locks are built. Let's see how a simple spinlock works:
Basic spinlock with TAS:
atomic_flag lock = ATOMIC_FLAG_INIT;
void acquire() {
while (atomic_flag_test_and_set(&lock)) {
// Spin until we acquire the lock
}
}
void release() {
atomic_flag_clear(&lock);
}
How it works:
- atomic_flag_test_and_set atomically sets the flag to true and returns the old value. If the old value was false, we acquired the lock; if true, another thread holds it and we keep spinning.
- atomic_flag_clear atomically sets the flag to false, releasing the lock.

Better spinlocks in practice:
```c
#include <stdatomic.h>

// Simple spinlock using atomic_flag
atomic_flag spinlock = ATOMIC_FLAG_INIT;

void spin_lock() {
    while (atomic_flag_test_and_set_explicit(&spinlock, memory_order_acquire)) {
        // Spin - could add pause instruction for hyperthreading efficiency
        #ifdef __x86_64__
        __builtin_ia32_pause(); // Reduces power and improves SMT performance
        #endif
    }
    // We now hold the lock; acquire barrier ensures we see all prior writes
}

void spin_unlock() {
    atomic_flag_clear_explicit(&spinlock, memory_order_release);
    // Release barrier ensures all our writes are visible before unlock
}

// More sophisticated: Ticket lock (FIFO fairness)
atomic_uint ticket_counter = 0;
atomic_uint serving_counter = 0;

void ticket_lock() {
    unsigned my_ticket = atomic_fetch_add(&ticket_counter, 1);
    while (atomic_load(&serving_counter) != my_ticket) {
        // Wait for our turn - FIFO order guaranteed
    }
}

void ticket_unlock() {
    atomic_fetch_add(&serving_counter, 1); // Next ticket, please!
}
```

A simple optimization, known as test-and-test-and-set: before doing the expensive atomic TAS, check whether the lock appears free with a regular read. If it's locked, spin on the read (which can be satisfied from cache). Only attempt the TAS when the lock appears free. This reduces cache-line bouncing in contended scenarios.
From spinlock to blocking lock:
Production-quality mutexes add blocking: spin briefly in user space, and if the lock is still held after a bounded number of attempts, ask the kernel to put the thread to sleep until the holder releases the lock (on Linux, via the futex system call).
This combines the low-latency of spinning (for short critical sections) with the efficiency of blocking (for longer waits). This is what pthread_mutex and std::mutex typically implement.
Hardware atomics operate on single memory locations. What about operations that must atomically update multiple values?
The challenge:
Consider transfer between accounts:
accounts[from] -= amount; // Must be atomic with next line
accounts[to] += amount; // Together, or neither
No single atomic instruction can update two separate memory locations atomically.
Solution 1: Locks
The straightforward solution—acquire a lock covering both locations:
lock(&accounts_lock);
accounts[from] -= amount;
accounts[to] += amount;
unlock(&accounts_lock);
Solution 2: Two-phase locking
For finer granularity, lock both accounts:
lock(&locks[min(from, to)]); lock(&locks[max(from, to)]); // Consistent lock order prevents deadlock
// ... perform transfer ...
unlock(&locks[to]); unlock(&locks[from]);
Solution 3: Software Transactional Memory (STM)
Log changes, commit atomically at end:
atomic {
accounts[from] -= amount;
accounts[to] += amount;
} // Commits all changes atomically or retries
Solution 4: Lock-free with CAS (complex)
For specific data structures, clever algorithms can achieve multi-location atomicity: multi-word CAS (MCAS) constructed from single-word CAS, and 'helping' protocols in which threads complete each other's partially applied operations.
These are advanced techniques used in lock-free queues, skip lists, and other specialized structures.
Database Transactions:
Databases provide atomic transactions through: write-ahead logging (so partial work can be undone or redone after a crash), multi-version concurrency control (MVCC), and two-phase commit for atomicity across multiple nodes.
For most applications, locking is the right choice for multi-location atomicity. Lock-free algorithms are error-prone, hard to verify, and often slower than well-designed locking for moderate contention levels. Use lock-free only when profiling proves locks are the bottleneck, and prefer battle-tested library implementations.
C11 and C++11 atomics support explicit memory ordering specification. Understanding these options is essential for correct and efficient concurrent code:
memory_order_seq_cst (Sequential Consistency)
The default. Every atomic operation is part of a single total order that all threads agree on. Simplest to reason about; highest overhead.
memory_order_acquire (Acquire)
Used on reads. Prevents subsequent reads/writes from being reordered before this operation. "Everything after stays after."
Use case: Reading a flag that indicates data is ready.
memory_order_release (Release)
Used on writes. Prevents prior reads/writes from being reordered after this operation. "Everything before stays before."
Use case: Writing a flag that indicates data is ready.
memory_order_acq_rel (Acquire-Release)
Used on read-modify-write operations. Both acquire and release semantics.
Use case: Lock operations that both read (to check) and write (to acquire).
memory_order_relaxed (Relaxed)
No ordering guarantees whatsoever. Only atomicity is provided.
Use case: Counters where order doesn't matter.
```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int data = 0; // Non-atomic, but protected by 'ready' synchronization

void producer() {
    data = 42; // Regular write
    // Release: all prior writes (including data=42) are visible
    // to any thread that observes ready==true with acquire
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire: if we see ready==true, we also see all writes
    // that happened before the store that made ready==true
    while (!ready.load(std::memory_order_acquire)) {
        // Wait
    }
    // Guaranteed to see data==42 due to release-acquire synchronization
    assert(data == 42); // Never fails!
}

// Without proper ordering:
// ready.store(true, memory_order_relaxed);
// ready.load(memory_order_relaxed);
// The assertion might fail! data=42 might not be visible yet.

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```

memory_order_seq_cst is the default for good reason—it's easiest to reason about and matches programmer intuition. Only use weaker orderings when profiling shows the atomic operations are a bottleneck AND you've carefully proven the weaker ordering is safe. Memory ordering bugs are among the hardest to diagnose.
Atomic operations are much cheaper than locks, but they're not free. Understanding their cost helps make informed design decisions.
Rough cost hierarchy (cycles on modern x86):
| Operation | Approximate Cost | Notes |
|---|---|---|
| Regular read | 1-4 cycles | Cache hit |
| Atomic read (relaxed) | ~1-4 cycles | Same as regular if aligned |
| Atomic read (seq_cst) | ~1-4 cycles | On x86, loads are naturally ordered |
| Regular write | 1-4 cycles | Cache hit, buffered |
| Atomic write (relaxed) | ~1-4 cycles | Same as regular |
| Atomic write (seq_cst) | ~20-100 cycles | Must lock cache line |
| Atomic RMW (seq_cst) | ~20-100 cycles | LOCK prefix required |
| Mutex lock (uncontended) | ~15-30 cycles | Typically CAS + bookkeeping |
| Mutex lock (contended) | ~1000+ cycles | Context switch |
| Cache miss | ~100-300 cycles | Main memory access |
Atomics and locks are rarely the bottleneck in most applications. Profile first. If atomics are the problem, the solution is often architectural (reduce contention, batch operations) rather than micro-optimization (weaker memory orders).
We've explored atomicity—the fundamental property that enables safe concurrent access to shared data:
- What atomicity means: all-or-nothing operations with no observable intermediate state
- Hardware primitives: atomic load/store, test-and-set, compare-and-swap, fetch-and-add, and LL/SC
- Atomicity vs. synchronization: indivisibility and memory ordering are separate concerns
- The ABA problem and its mitigations: version counters, hazard pointers, epoch-based reclamation
- Building locks from atomics: spinlocks, ticket locks, and spin-then-block mutexes
- Beyond single locations: locks, two-phase locking, and transactional approaches
Module Complete: Concurrency Concepts
With this page, we've completed the foundational module on concurrency concepts.
These concepts form the theoretical foundation for all concurrent programming. Every mutex, semaphore, condition variable, channel, and lock-free algorithm builds on these principles.
Congratulations! You've mastered the fundamental concepts of concurrent programming. In the following modules, we'll see how these concepts are applied in practice—from fork-join parallelism and thread pools to asynchronous programming patterns and real-world concurrency problems.