In ancient Greek philosophy, atomos meant 'indivisible'—the smallest unit of matter that could not be further divided. In computing, we've borrowed this concept for perhaps the most fundamental property in concurrent programming: atomicity.
An atomic operation is one that appears to happen instantaneously and completely—or not at all. There is no intermediate state visible to other threads. Consider incrementing a counter:
counter = counter + 1;
This is not atomic on most hardware. It's actually three operations:
1. Load the current value of counter from memory into a register
2. Increment the value in the register
3. Store the result back to memory

Another thread could read counter between any of these steps, seeing an intermediate state. This is the root cause of lost updates and data corruption.
Atomic operations eliminate this problem entirely. An atomic increment is indivisible—other threads see only the before-state or the after-state, never an in-between.
By the end of this page, you will understand what atomicity truly means, how hardware provides atomic operations, the difference between atomicity and synchronization, how high-level atomic abstractions work, and how to use atomicity correctly in concurrent programs.
Atomicity is the property of an operation appearing to occur instantaneously and indivisibly with respect to all other concurrent operations.
More formally:
An operation is atomic if it appears to all observers (other threads, processors, or systems) as if it:
- Occurs at a single, distinct point in time
- Either completes entirely or does not occur at all
- Cannot be observed in any intermediate state
The 'All-or-Nothing' Principle:
Atomic operations have no partial completion. Either all of the operation's effects become visible, or none of them do. There is no scenario where 'half' of an atomic operation is visible.
Atomicity vs. Instantaneity:
An atomic operation doesn't actually happen instantly—CPUs take time. What matters is that no other operation can observe the 'during' phase. From any external viewpoint, the operation transitions from 'not started' to 'complete' with nothing in between.
Every atomic operation has a 'linearization point'—an instant when it logically 'happens.' All effects of the operation are invisible before this point and fully visible after it. This is a theoretical abstraction, but it's how we reason about correctness: operations can be ordered by their linearization points.
Examples of atomic vs. non-atomic operations:
| Operation | Atomic? | Why |
|---|---|---|
| `int x = 42;` (aligned, word-sized) | Usually* | Single CPU instruction on most hardware |
| `long x = 42;` (32-bit system) | No | 64-bit write requires two 32-bit stores |
| `counter++` | No | Read-modify-write: three operations |
| `ptr->field = value` | Maybe | Depends on alignment and architecture |
| `atomic_fetch_add(&x, 1)` | Yes | Hardware-guaranteed atomic instruction |
| `CAS(&x, old, new)` | Yes | Hardware-guaranteed atomic compare-and-swap |
*Even simple writes may not be atomic if misaligned or if the C standard is strictly followed (where no guarantees exist without explicit atomics).
Modern CPUs provide special instructions that are guaranteed atomic. These are the building blocks on which all higher-level synchronization is built.
1. Atomic Load and Store
Reading or writing a naturally-aligned, word-sized value is typically atomic on modern CPUs. However, the C/C++ standards don't guarantee this, so explicit atomic types should be used for portability.
2. Test-and-Set (TAS)
Atomically sets a memory location to a value and returns the old value:
TAS(addr): old = *addr; *addr = 1; return old; // Atomic
3. Compare-and-Swap (CAS) / Compare-and-Exchange (CMPXCHG)
Atomically compares a memory location with an expected value; if they match, updates to a new value:
CAS(addr, expected, new):
if (*addr == expected) {
*addr = new;
return true; // Success
}
return false; // Failure
This is the most powerful and widely-used atomic primitive.
4. Fetch-and-Add (FAA) / Atomic Increment
Atomically adds a value and returns the previous value:
FAA(addr, val): old = *addr; *addr = old + val; return old; // Atomic
5. Load-Linked/Store-Conditional (LL/SC)
Used on ARM, PowerPC, RISC-V instead of CAS:
- LL (Load-Linked): reads a value and sets a 'reservation' on the location
- SC (Store-Conditional): writes a value only if the reservation is still valid

If anything modified the location since the LL, the SC fails. This is more general than CAS (it avoids the ABA problem) but requires a retry loop.
How hardware implements atomicity:
For single-location operations: on x86, the LOCK prefix forces the CPU to hold the affected cache line exclusively for the duration of the instruction (older processors locked the entire memory bus). On LL/SC architectures, the coherence protocol invalidates the reservation when another core writes the location, failing the store-conditional.
For multi-location operations: no single instruction exists. Atomicity across multiple locations must be composed in software from single-location primitives, using locks or transactional techniques.
```c
// Hardware atomic operations exposed through C11 atomics
#include <stdatomic.h>

atomic_int counter = 0;

// Atomic operations map to hardware instructions

// Atomic store: single instruction, full memory barrier on x86
atomic_store(&counter, 42);

// Atomic load: single instruction
int value = atomic_load(&counter);

// Fetch-and-add: x86 'lock xadd' instruction
int old = atomic_fetch_add(&counter, 1); // Returns old value

// Compare-and-exchange: x86 'lock cmpxchg' instruction
int expected = 10;
int desired = 20;
bool success = atomic_compare_exchange_strong(&counter, &expected, desired);
// If counter was 10, it's now 20 and success=true
// If counter was not 10, expected now holds actual value, success=false

// These operations are TRULY atomic at hardware level
// No lock needed - the CPU ensures indivisibility
```

CAS can fail when another thread modified the value between the read and the attempt. The common pattern is a CAS loop: read the current value, compute the new value, CAS to update; if the CAS fails, retry. This enables lock-free algorithms but requires careful reasoning about the retry loop's correctness.
A crucial distinction that many developers miss: atomicity and synchronization are different concepts that often work together but solve different problems.
Atomicity ensures that an operation is indivisible—no partial states are visible.
Synchronization (memory ordering) ensures that operations occur in a predictable order visible to all threads.
You can have one without the other:
Atomic without ordering (memory_order_relaxed):
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
The increment itself is atomic (no lost updates), but there's no guarantee about when or in what order it becomes visible relative to other operations. Other threads may see this increment before or after surrounding non-atomic operations, in unpredictable order.
Ordering without atomicity:
Conceptually possible but not useful—memory barriers order operations, but if the operations themselves aren't atomic, you still get races.
Combined atomicity + ordering (the common case):
atomic_store_explicit(&ready, 1, memory_order_release);
The store is atomic AND provides release semantics (all prior operations are visible before this store becomes visible to other threads).
| Concept | What It Provides | What It Doesn't Provide |
|---|---|---|
| Atomicity | Indivisibility—no partial states visible | No ordering guarantees with other operations |
| Synchronization | Ordering—predictable visibility of effects | No guarantee individual operations are indivisible |
| Both (full atomic) | Indivisible AND ordered operations | Higher overhead than relaxed atomics |
When to use relaxed atomics: statistics and event counters, sequence numbers, and other values where each update must be atomic but its ordering relative to other memory operations does not matter.
When you need full synchronization: publishing data to other threads, implementing locks, or any case where an atomic variable guards access to non-atomic data.
Using memory_order_relaxed is an expert-level optimization. The lack of ordering guarantees can lead to extremely subtle bugs. Start with sequential consistency (the default). Only relax ordering after proving correctness and profiling to confirm it matters.
Modern programming languages provide atomic types that map to hardware atomic operations:
C11 Atomics:
#include <stdatomic.h>
atomic_int counter; // Atomic integer
atomic_flag lock = ATOMIC_FLAG_INIT; // Atomic boolean (spinlock primitive)
_Atomic(struct Point) point; // Any type can be atomic
C++11 Atomics:
#include <atomic>
std::atomic<int> counter{0};
std::atomic<bool> flag{false};
std::atomic<T> value; // Any trivially copyable type
Java Atomics:
import java.util.concurrent.atomic.*;
AtomicInteger counter = new AtomicInteger(0);
AtomicReference<Node> head = new AtomicReference<>();
AtomicLong timestamp = new AtomicLong();
Go sync/atomic:
import "sync/atomic"
var counter int64
atomic.AddInt64(&counter, 1)
atomic.LoadInt64(&counter)
```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

void increment_many_times(int iterations) {
    for (int i = 0; i < iterations; i++) {
        // Atomic increment - no lost updates!
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    constexpr int THREADS = 10;
    constexpr int ITERS = 100000;

    std::vector<std::thread> threads;
    for (int i = 0; i < THREADS; i++) {
        threads.emplace_back(increment_many_times, ITERS);
    }
    for (auto& t : threads) {
        t.join();
    }

    // Guaranteed to be exactly THREADS * ITERS
    // No lost updates because each increment is atomic
    std::cout << "Counter: " << counter.load() << std::endl;
    // Output: Counter: 1000000 (always!)
    return 0;
}
```

Operations using atomics can be lock-free (at least one thread always makes progress) or wait-free (every thread makes progress in bounded steps). atomic_fetch_add is typically wait-free—it always completes. CAS loops are lock-free but not wait-free—a thread's CAS might fail repeatedly if others keep succeeding.
One of the most subtle bugs in lock-free programming is the ABA problem, which can occur when using compare-and-swap.
The scenario:
1. Thread 1 reads a shared location and sees value A
2. Thread 2 changes the location from A to B, then back to A
3. Thread 1 executes CAS(addr, A, new), which succeeds because the value is A again
Why this is a problem:
Thread 1's CAS succeeded, but the world has changed since it read A. If A was a pointer and Thread 2 freed and reallocated memory, Thread 1 might now be pointing to completely different data. The 'A' value is the same, but its meaning has changed.
Classic example: lock-free stack
```c
// Lock-free stack with ABA problem
struct Node {
    void* data;
    struct Node* next;
};

atomic_ptr top; // Assume this is an atomic pointer

void push(struct Node* node) {
    struct Node* old_top;
    do {
        old_top = atomic_load(&top);
        node->next = old_top;
    } while (!CAS(&top, old_top, node)); // Retry until success
}

struct Node* pop() {
    struct Node* old_top;
    struct Node* new_top;
    do {
        old_top = atomic_load(&top);        // Step 1: Read top
        if (old_top == NULL) return NULL;
        new_top = old_top->next;            // Step 2: Get next
    } while (!CAS(&top, old_top, new_top)); // Step 3: CAS to update
    return old_top;
}

// ABA scenario:
// Thread 1: pop() reads top = A, new_top = B, about to CAS...
// Thread 2: pop() succeeds, gets A
// Thread 2: pop() succeeds, gets B
// Thread 2: push(A) - A is back on stack!
// Thread 1: CAS(&top, A, B) succeeds! But B was already popped!
//
// Result: Stack is corrupted. B is "on" the stack but was already freed.
```

Solutions to the ABA problem:
1. Double-width CAS with version counter:
Instead of CAS on just the pointer, CAS on (pointer, version). Increment version on every modification. Even if the pointer returns to A, the version is different.
struct TaggedPointer {
Node* ptr;
uint64_t version;
};
// Use 128-bit CAS (cmpxchg16b on x86-64)
2. Hazard pointers:
Before accessing a node, publish a 'hazard pointer' to it. Other threads check hazard pointers before freeing, deferring reclamation if the node is protected.
3. Epoch-based reclamation:
Track 'epochs' of execution. Defer freeing until all threads have passed through a grace period, ensuring no thread holds old references.
4. LL/SC instead of CAS:
Load-Linked/Store-Conditional (on ARM, PowerPC) naturally avoids ABA because SC fails if any write occurred, not just if the value differs.
In the stack example, ABA can cause use-after-free, double-free, or memory corruption. These are security vulnerabilities, not just correctness bugs. Lock-free data structures are advanced territory—use proven implementations from libraries rather than rolling your own.
Atomic operations are the foundation on which locks are built. Let's see how a simple spinlock works:
Basic spinlock with TAS:
atomic_flag lock = ATOMIC_FLAG_INIT;
void acquire() {
while (atomic_flag_test_and_set(&lock)) {
// Spin until we acquire the lock
}
}
void release() {
atomic_flag_clear(&lock);
}
How it works:
- atomic_flag_test_and_set atomically sets the flag to true and returns the old value. If the old value was false, we acquired the lock; if true, another thread holds it and we keep spinning.
- atomic_flag_clear atomically sets the flag to false, releasing the lock.

Better spinlocks in practice:
```c
#include <stdatomic.h>

// Simple spinlock using atomic_flag
atomic_flag spinlock = ATOMIC_FLAG_INIT;

void spin_lock() {
    while (atomic_flag_test_and_set_explicit(&spinlock, memory_order_acquire)) {
        // Spin - could add pause instruction for hyperthreading efficiency
        #ifdef __x86_64__
        __builtin_ia32_pause(); // Reduces power and improves SMT performance
        #endif
    }
    // We now hold the lock; acquire barrier ensures we see all prior writes
}

void spin_unlock() {
    atomic_flag_clear_explicit(&spinlock, memory_order_release);
    // Release barrier ensures all our writes are visible before unlock
}

// More sophisticated: Ticket lock (FIFO fairness)
atomic_uint ticket_counter = 0;
atomic_uint serving_counter = 0;

void ticket_lock() {
    unsigned my_ticket = atomic_fetch_add(&ticket_counter, 1);
    while (atomic_load(&serving_counter) != my_ticket) {
        // Wait for our turn - FIFO order guaranteed
    }
}

void ticket_unlock() {
    atomic_fetch_add(&serving_counter, 1); // Next ticket, please!
}
```

A simple optimization, known as test-and-test-and-set: before doing the expensive atomic TAS, check whether the lock appears free with a regular read. If it's locked, spin on the read (which can be satisfied from cache). Only attempt the TAS when the lock appears free. This reduces cache-line bouncing in contended scenarios.
From spinlock to blocking lock:
Production-quality mutexes add blocking: spin briefly in user space, and if the lock is still held after a bounded number of attempts, ask the kernel to put the thread to sleep until the holder releases the lock (on Linux, via the futex system call).
This combines the low-latency of spinning (for short critical sections) with the efficiency of blocking (for longer waits). This is what pthread_mutex and std::mutex typically implement.
Hardware atomics operate on single memory locations. What about operations that must atomically update multiple values?
The challenge:
Consider transfer between accounts:
accounts[from] -= amount; // Must be atomic with next line
accounts[to] += amount; // Together, or neither
No single atomic instruction can update two separate memory locations atomically.
Solution 1: Locks
The straightforward solution—acquire a lock covering both locations:
lock(&accounts_lock);
accounts[from] -= amount;
accounts[to] += amount;
unlock(&accounts_lock);
Solution 2: Two-phase locking
For finer granularity, lock both accounts:
lock(&locks[min(from, to)]); lock(&locks[max(from, to)]); // Consistent lock order prevents deadlock
// ... perform transfer ...
unlock(&locks[to]); unlock(&locks[from]);
Solution 3: Software Transactional Memory (STM)
Log changes, commit atomically at end:
atomic {
accounts[from] -= amount;
accounts[to] += amount;
} // Commits all changes atomically or retries
Solution 4: Lock-free with CAS (complex)
For specific data structures, clever algorithms can achieve multi-location atomicity: multi-word CAS (MCAS) constructed from single-word CAS, and 'helping' protocols in which threads complete each other's partially applied operations.
These are advanced techniques used in lock-free queues, skip lists, and other specialized structures.
Database Transactions:
Databases provide atomic transactions through: write-ahead logging (so partial work can be undone or redone after a crash), multi-version concurrency control (MVCC), and two-phase commit for atomicity across multiple nodes.
For most applications, locking is the right choice for multi-location atomicity. Lock-free algorithms are error-prone, hard to verify, and often slower than well-designed locking for moderate contention levels. Use lock-free only when profiling proves locks are the bottleneck, and prefer battle-tested library implementations.
C11 and C++11 atomics support explicit memory ordering specification. Understanding these options is essential for correct and efficient concurrent code:
memory_order_seq_cst (Sequential Consistency)
The default. Every atomic operation is part of a single total order that all threads agree on. Simplest to reason about; highest overhead.
memory_order_acquire (Acquire)
Used on reads. Prevents subsequent reads/writes from being reordered before this operation. "Everything after stays after."
Use case: Reading a flag that indicates data is ready.
memory_order_release (Release)
Used on writes. Prevents prior reads/writes from being reordered after this operation. "Everything before stays before."
Use case: Writing a flag that indicates data is ready.
memory_order_acq_rel (Acquire-Release)
Used on read-modify-write operations. Both acquire and release semantics.
Use case: Lock operations that both read (to check) and write (to acquire).
memory_order_relaxed (Relaxed)
No ordering guarantees whatsoever. Only atomicity is provided.
Use case: Counters where order doesn't matter.
```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int data = 0; // Non-atomic, but protected by 'ready' synchronization

void producer() {
    data = 42; // Regular write
    // Release: all prior writes (including data=42) are visible
    // to any thread that observes ready==true with acquire
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire: if we see ready==true, we also see all writes
    // that happened before the store that made ready==true
    while (!ready.load(std::memory_order_acquire)) {
        // Wait
    }
    // Guaranteed to see data==42 due to release-acquire synchronization
    assert(data == 42); // Never fails!
}

// Without proper ordering:
// ready.store(true, memory_order_relaxed);
// ready.load(memory_order_relaxed);
// The assertion might fail! data=42 might not be visible yet.

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```

memory_order_seq_cst is the default for good reason—it's easiest to reason about and matches programmer intuition. Only use weaker orderings when profiling shows the atomic operations are a bottleneck AND you've carefully proven the weaker ordering is safe. Memory ordering bugs are among the hardest to diagnose.
Atomic operations are much cheaper than locks, but they're not free. Understanding their cost helps make informed design decisions.
Rough cost hierarchy (cycles on modern x86):
| Operation | Approximate Cost | Notes |
|---|---|---|
| Regular read | 1-4 cycles | Cache hit |
| Atomic read (relaxed) | ~1-4 cycles | Same as regular if aligned |
| Atomic read (seq_cst) | ~1-4 cycles | On x86, loads are naturally ordered |
| Regular write | 1-4 cycles | Cache hit, buffered |
| Atomic write (relaxed) | ~1-4 cycles | Same as regular |
| Atomic write (seq_cst) | ~20-100 cycles | Must lock cache line |
| Atomic RMW (seq_cst) | ~20-100 cycles | LOCK prefix required |
| Mutex lock (uncontended) | ~15-30 cycles | Typically CAS + bookkeeping |
| Mutex lock (contended) | ~1000+ cycles | Context switch |
| Cache miss | ~100-300 cycles | Main memory access |
Atomics and locks are rarely the bottleneck in most applications. Profile first. If atomics are the problem, the solution is often architectural (reduce contention, batch operations) rather than micro-optimization (weaker memory orders).
We've explored atomicity—the fundamental property that enables safe concurrent access to shared data:
- What atomicity means: all-or-nothing operations with no observable intermediate state
- Hardware primitives: atomic load/store, test-and-set, compare-and-swap, fetch-and-add, and LL/SC
- Atomicity vs. synchronization: indivisibility and memory ordering are separate concerns
- The ABA problem and its mitigations: version counters, hazard pointers, epoch-based reclamation
- Building locks from atomics: spinlocks, ticket locks, and spin-then-block mutexes
- Beyond single locations: locks, two-phase locking, and transactional approaches
Module Complete: Concurrency Concepts
With this page, we've completed the foundational module on concurrency concepts.
These concepts form the theoretical foundation for all concurrent programming. Every mutex, semaphore, condition variable, channel, and lock-free algorithm builds on these principles.
Congratulations! You've mastered the fundamental concepts of concurrent programming. In the following modules, we'll see how these concepts are applied in practice—from fork-join parallelism and thread pools to asynchronous programming patterns and real-world concurrency problems.