Consider this deceptively simple code, running on two threads:
Thread 1:
X = 1;
print(Y);
Thread 2:
Y = 1;
print(X);
Both X and Y start at 0. What outputs are possible?
Most developers would immediately answer: (0, 1), (1, 0), or (1, 1).
But surely (0, 0) is impossible? After all, one of the writes must happen before both reads complete.
On modern hardware, (0, 0) can actually occur.
If this surprises you—or if you don't immediately understand why—then you need to understand memory models and, specifically, sequential consistency.
The scenario above is not contrived. Production systems have suffered mysterious bugs because developers assumed intuitive behavior that modern CPUs don't guarantee. Understanding memory models is essential for anyone writing concurrent code that accesses shared memory.
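If you want to see this on real hardware, here is a minimal litmus-test sketch (not the page's original code): it assumes POSIX threads plus C11 atomics, uses relaxed atomics so the racy accesses are well defined, and all names are illustrative. On an x86 or ARM machine it can report a nonzero count of (0, 0) outcomes.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X, Y;            // the two shared variables from the example
static int r1, r2;                 // what each thread observed
static pthread_barrier_t start;    // lines both threads up before each round

static void *thread1(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    atomic_store_explicit(&X, 1, memory_order_relaxed);   // X = 1
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);  // print(Y)
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    atomic_store_explicit(&Y, 1, memory_order_relaxed);   // Y = 1
    r2 = atomic_load_explicit(&X, memory_order_relaxed);  // print(X)
    return NULL;
}

int main(void) {
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_barrier_init(&start, NULL, 2);
        pthread_t a, b;
        pthread_create(&a, NULL, thread1, NULL);
        pthread_create(&b, NULL, thread2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_barrier_destroy(&start);
        if (r1 == 0 && r2 == 0) both_zero++;   // the "impossible" outcome
    }
    printf("(0, 0) observed %d times out of 100000 runs\n", both_zero);
    return 0;
}
```

Compile with something like cc -O2 -pthread; the exact count varies by machine, compiler, and run.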
By the end of this page, you will understand what sequential consistency means, why it's the 'obvious' model most programmers assume, why modern hardware violates it for performance reasons, and what this implies for writing correct concurrent programs.
Sequential consistency is a memory model formalized by Leslie Lamport in 1979. It defines what guarantees a system provides about the ordering of memory operations across multiple threads or processors.
Lamport's Definition:
A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
Let's unpack this carefully—it contains two distinct requirements:
Requirement 1: Global Sequential Order
There must exist some total ordering of all memory operations from all processors such that the observed results are consistent with that ordering. In other words, you could take all the memory operations from every thread and arrange them in a single, global sequence where each operation's results make sense given the operations before it.
Requirement 2: Program Order Preserved
Within this global sequence, the operations from any single processor appear in the order that processor's program specifies. Thread 1's operations maintain their relative order; Thread 2's operations maintain their relative order. The interleaving between threads may vary, but within-thread order is preserved.
Sequential consistency is the memory model you probably assumed without knowing it. It says: 'Operations happen one at a time in some order, and each thread's operations happen in the order they're written.' This is how most programmers naturally think about concurrent execution.
Example: Sequential consistency in action
Returning to our opening example with X=0, Y=0 initially:
Thread 1: X = 1; print(Y);
Thread 2: Y = 1; print(X);
Under sequential consistency, we can form these possible global orderings:
Ordering A: X = 1; print(Y) → 0; Y = 1; print(X) → 1. Output: (0, 1).
Ordering B: Y = 1; print(X) → 0; X = 1; print(Y) → 1. Output: (1, 0).
Ordering C: X = 1; Y = 1; print(Y) → 1; print(X) → 1. Output: (1, 1).
There is no valid sequential ordering where both prints occur before both writes (that would violate program order within each thread). Therefore, (0, 0) is impossible under sequential consistency.
Yet, as stated earlier, modern hardware can produce (0, 0). This means modern hardware does not provide sequential consistency by default.
Sequential consistency provides several properties that make reasoning about concurrent programs tractable:
1. Deterministic reasoning about interleavings
With sequential consistency, you can analyze a concurrent program by considering all possible interleavings of operations from different threads. While the number of interleavings may be large, they're finite and well-defined. Each interleaving produces a deterministic result.
2. Atomicity of individual operations
Under sequential consistency, each memory operation (read or write) appears to happen atomically—instantaneously, without overlap with other operations. An operation is either fully completed and visible to all threads, or it hasn't happened yet.
3. Causal reasoning
If Thread A writes a value and Thread B reads that value, then Thread B can reason that Thread A's write (and everything Thread A did before that write) has already happened. This enables building synchronization on top of shared memory.
4. Simple mental model
'Operations happen one at a time, in some order' is easy to understand and verify. You can trace through code manually, considering different interleavings.
Many concurrent algorithms and synchronization primitives were originally designed assuming sequential consistency. The model is often used as the specification for correctness—even if implementation requires additional work on weaker memory models.
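As an illustration, here is a sketch of Peterson's two-thread lock, a classic algorithm whose correctness argument assumes sequential consistency. This version leans on C11 atomics with their default seq_cst ordering to recover that behavior (an implementation choice of this sketch, not part of the original algorithm); with plain variables, the store/load pair below can be reordered on modern hardware and both threads may enter the critical section.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   // flag[i]: thread i wants to enter
static atomic_int  turn;      // whose turn it is to wait

void lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], true);   // announce intent (seq_cst store)
    atomic_store(&turn, other);        // politely let the other go first
    // Wait while the other thread wants in and it is its turn.
    // Under SC (or seq_cst atomics) at most one thread passes this loop;
    // with weaker ordering the store above may not yet be visible to the
    // other thread, and mutual exclusion is lost.
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;  // spin
}

void unlock(int self) {
    atomic_store(&flag[self], false);  // release the lock
}
```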
If sequential consistency is so intuitive and useful, why don't modern processors provide it? The answer lies in performance optimization.
Modern CPUs achieve their remarkable speed through aggressive optimization techniques that fundamentally conflict with sequential consistency:
1. Store Buffers
When a CPU writes to memory, it doesn't immediately update main memory. Instead, the write goes into a store buffer—a small, fast queue local to the processor core. This allows the CPU to continue executing without waiting for the slow write to complete.
Problem: The write in the store buffer is visible to the local core but not yet visible to other cores. This breaks sequential consistency because other threads may read stale values.
2. Cache Hierarchies
Modern CPUs have multiple levels of cache (L1, L2, L3). Each core typically has private L1 and L2 caches. When core A writes a value, it may update its local cache without immediately propagating the change to core B's cache.
Problem: Different cores may temporarily have different views of the same memory location.
3. Out-of-Order Execution
CPUs reorder instructions for efficiency. If instruction A and instruction B are independent (no data dependency), the CPU may execute B before A even though A appears first in the program.
Problem: This can change the order in which memory operations become visible to other threads.
4. Compiler Reordering
Before code even reaches the CPU, compilers may reorder instructions. A compiler optimizing for a single-threaded view might move a write after a read if it seems safe—but this can break multi-threaded invariants (a short sketch after this list illustrates the hazard).
5. Speculative Execution
CPUs may speculatively execute instructions ahead of time, later discarding results if speculation was wrong. This can affect memory operation ordering.
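As a sketch of how compiler reordering (point 4 above) can break unsynchronized code, consider the following fragment; the names are illustrative, the variables are deliberately plain and non-atomic, and the exact transformations a given compiler performs will vary.

```c
int payload;   // data one thread produces
int ready;     // plain int: no atomicity, ordering, or visibility guarantees

void publish(void) {
    payload = 42;
    ready = 1;      // the compiler may legally move this store before payload = 42,
                    // since the two stores are independent in a single-threaded view
}

void consume(void) {
    while (!ready)  // the compiler may hoist this load out of the loop entirely,
        ;           // turning it into an infinite loop, because nothing in this
                    // thread can change 'ready'
    // Even if the loop exits, 'payload' may still be observed as 0.
    (void)payload;
}
```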
The performance rationale:
These optimizations are essential for modern CPU performance: a main-memory access can cost hundreds of cycles, and store buffers, caches, out-of-order execution, and speculation exist largely to hide that latency.
Enforcing sequential consistency would require making every write globally visible before any later memory operation could proceed: draining the store buffer on each store, stalling loads behind pending stores, and giving up much of the reordering that out-of-order and speculative execution rely on.
This would devastate performance—potentially slowing CPUs by 10× or more for memory-intensive code.
Hardware architects faced a choice: provide intuitive memory semantics at a massive performance cost, or provide weaker guarantees with full performance. They chose performance. The burden of correctness shifted to programmers (and programming languages) to explicitly request stronger ordering when needed.
Real hardware implements weaker (more relaxed) memory models that allow more reordering than sequential consistency. Different processor architectures have different models:
x86/x64: Total Store Order (TSO)
Intel and AMD processors provide TSO, which is fairly close to sequential consistency: stores are not reordered with other stores, loads are not reordered with other loads, and stores become visible to all cores in a single global order. The one relaxation is that a load may be reordered ahead of an earlier store to a different address, because the store can still be sitting in the store buffer when the load completes.
This is why our (0, 0) example can occur on x86: each thread's read can be reordered ahead of its own earlier write, so each thread observes the other variable still at its initial value of 0.
ARM/PowerPC: Relaxed Memory Models
ARM and PowerPC provide much weaker ordering: in the absence of barriers, loads and stores may be reordered with other loads and stores to different addresses, and a write may become visible to different cores at different times.
These architectures require explicit memory barriers (fences) to enforce any ordering at all.
| Architecture | Model | Ordering Strength | Common Reorderings Allowed |
|---|---|---|---|
| Theoretical | Sequential Consistency | Strongest | None—program order is globally visible |
| x86/x64 (Intel, AMD) | Total Store Order (TSO) | Strong | Read can pass earlier write to different address |
| ARM (v7, v8) | Weakly Ordered | Weak | Almost any reordering possible |
| PowerPC | Weakly Ordered | Weak | Almost any reordering possible |
| RISC-V | RVWMO | Weak (configurable) | Most reorderings possible without fences |
What the (0, 0) result means:
Returning to our example:
```c
// Thread 1          // Thread 2
X = 1;               Y = 1;
print(Y);            print(X);
```
Even on x86 (which allows a read to be reordered ahead of an earlier write to a different location), the hardware may:
- hold Thread 1's X = 1 in its store buffer (not yet visible to Thread 2)
- hold Thread 2's Y = 1 in its store buffer (not yet visible to Thread 1)
- complete each thread's read from cache or memory, observing the other variable still at its initial 0

Result: (0, 0) — impossible under sequential consistency, but perfectly valid on real hardware.
Code that 'works' on x86 may fail on ARM because ARM allows far more reordering. If you test only on x86, you may never encounter race conditions that will manifest on ARM-based servers or mobile devices.
Since hardware doesn't provide sequential consistency by default, programmers must explicitly request ordering where needed. This is done through memory barriers (also called fences).
What is a memory barrier?
A memory barrier is a special instruction that constrains the reordering of memory operations around it. It doesn't access memory itself—it's a synchronization point that affects how other memory operations are ordered.
Types of memory barriers:
1. Full Barrier (mfence on x86, dmb on ARM): no memory operation may be reordered across it in either direction; all earlier loads and stores complete before any later ones.
2. Store Barrier (sfence on x86, dmb st on ARM): stores issued before the barrier become visible before stores issued after it.
3. Load Barrier (lfence on x86, dmb ld on ARM): loads issued before the barrier complete before loads issued after it.
4. Acquire Barrier: later operations cannot be reordered before it; typically attached to a load of a flag or pointer.
5. Release Barrier: earlier operations cannot be reordered after it; typically attached to a store of a flag or pointer.
```c
// Example: Using barriers to fix the (0, 0) problem
// Initial: X = 0, Y = 0

// Thread 1                  // Thread 2
X = 1;                       Y = 1;
__sync_synchronize();        __sync_synchronize();   // Full barrier
printf("%d\n", Y);           printf("%d\n", X);

// With barriers, (0, 0) is no longer possible:
// - Thread 1's barrier ensures X = 1 is visible before it reads Y
// - Thread 2's barrier ensures Y = 1 is visible before it reads X
// - At least one thread will see the other's write

// Alternative: acquire/release for fine-grained control
atomic_store_explicit(&X, 1, memory_order_release);      // Release barrier
int y = atomic_load_explicit(&Y, memory_order_acquire);  // Acquire barrier
// Note: for this particular store-then-load pattern, release/acquire alone
// does NOT rule out (0, 0); a full barrier or memory_order_seq_cst is needed.
```

Memory barriers are not free. A full barrier may cost 100+ CPU cycles—as expensive as a cache miss. This is why modern programming languages provide fine-grained memory ordering options (acquire, release, relaxed) rather than always using full barriers. Use the weakest ordering that provides the correctness you need.
Since hardware memory models vary by architecture, programming languages define their own memory models to provide portable semantics. These language models abstract away hardware differences.
C11/C++11 Memory Model:
C and C++ define a memory model with explicit ordering options:
- memory_order_seq_cst — Sequential consistency. Most restrictive, safest.
- memory_order_acquire — Acquire semantics (for reads). Subsequent operations can't move before.
- memory_order_release — Release semantics (for writes). Prior operations can't move after.
- memory_order_acq_rel — Both acquire and release (for read-modify-write).
- memory_order_relaxed — No ordering guarantees. Maximum performance, maximum danger.

Java Memory Model:
Java defines a memory model where:
- synchronized blocks and java.util.concurrent locks establish happens-before relationships
- volatile reads/writes are sequentially consistent

Go Memory Model:
Go defines happens-before relationships through:
- channel operations (a send happens-before the corresponding receive completes)
- sync package primitives (Mutex, WaitGroup, etc.)
- sync/atomic operations

Without these, no ordering is guaranteed between goroutines.
| Ordering | Constraint | Use Case |
|---|---|---|
| seq_cst | Total order on all seq_cst operations | Simple, safe default; highest overhead |
| acquire | Subsequent ops can't be reordered before | Reading a flag or pointer to shared data |
| release | Prior ops can't be reordered after | Writing a flag or pointer to shared data |
| acq_rel | Both acquire and release | Read-modify-write (CAS, fetch_add) |
| relaxed | No ordering, only atomicity | Counters where order doesn't matter |
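To make the relaxed row concrete, here is a minimal sketch (illustrative names) of an event counter where only atomicity matters; no thread uses the counter's value to synchronize access to other data, so relaxed ordering is sufficient.

```c
#include <stdatomic.h>

static atomic_ulong events;   // shared statistics counter

void record_event(void) {
    // Atomic increment; we only need the count to be correct, not to order
    // any other memory operations, so relaxed is the cheapest valid choice.
    atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}

unsigned long read_events(void) {
    // A valid snapshot of the counter, with no ordering implications for
    // any other memory.
    return atomic_load_explicit(&events, memory_order_relaxed);
}
```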
In C/C++, a data race (two threads accessing the same memory location, at least one writing, without synchronization) is undefined behavior. The compiler assumes races don't exist and may make optimizations that break racing code in spectacular ways. Always use atomic types or proper synchronization.
When reasoning about memory visibility in weak memory models, the concept of happens-before is essential. It defines when one operation is guaranteed to see the effects of another.
Definition:
Operation A happens-before operation B if A's effects are guaranteed to be visible to B.
This doesn't necessarily mean A completes before B in real time—it means the system guarantees B will observe A's effects.
Happens-before is established by:
Program order — Within a single thread, earlier operations happen-before later operations
Synchronization — Release operations happen-before acquire operations that observe them (for example, a mutex unlock happens-before a subsequent lock of the same mutex, and a release store happens-before an acquire load that reads the stored value)
Transitivity — If A happens-before B and B happens-before C, then A happens-before C
```c
// Example: Happens-before through synchronization

int data = 0;          // Shared data
atomic_int ready = 0;  // Synchronization flag

// Thread 1 (producer)
void producer() {
    data = 42;                                                // A: Write data
    atomic_store_explicit(&ready, 1, memory_order_release);   // B: Release
}

// Thread 2 (consumer)
void consumer() {
    while (atomic_load_explicit(&ready, memory_order_acquire) != 1)  // C: Acquire
        ;  // spin
    printf("%d\n", data);                                     // D: Read data
}

// Happens-before relationships:
// 1. A happens-before B (program order in Thread 1)
// 2. B happens-before C (release-acquire synchronization when C reads 1)
// 3. C happens-before D (program order in Thread 2)
// 4. By transitivity: A happens-before D
// Therefore: D is guaranteed to read 42
```

Happens-before is about visibility guarantees, not clock time. Operation A might physically execute after B in real time, yet A happens-before B if the memory model guarantees B sees A's effects. Similarly, operations with no happens-before relationship may execute in any order with any visibility—even if they appear ordered in real time.
Understanding sequential consistency and why hardware doesn't provide it has direct practical implications:
1. Never assume ordering without synchronization
If two threads access shared memory without locks, atomics, or barriers, the operations may appear in any order. Code that 'works' may be relying on lucky timing that will fail under load, on different hardware, or with compiler optimizations enabled.
2. Use high-level synchronization when possible
Mutexes, condition variables, channels, and other high-level primitives handle memory ordering correctly. You don't need to think about barriers if you use pthread_mutex_lock/unlock—it handles acquire/release internally.
3. When using atomics, choose the right ordering
- seq_cst for correctness
- acquire/release after careful reasoning
- relaxed when you've proven ordering doesn't matter (rare)

4. Beware of publication patterns
When one thread writes data, then signals another thread to read it, the data write must happen-before the signal: put release ordering (or a lock) on the signal write and acquire ordering on the signal read, exactly as in the producer/consumer example above. Without that, the reader can observe the signal yet still read stale data.
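Here is a minimal sketch of safe publication using a pthread mutex, the high-level route from point 2 (illustrative names, assuming POSIX threads): the unlock in publish() happens-before any later lock in try_consume(), so a reader that observes ready == 1 is guaranteed to observe data == 42 as well.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int data;    // payload
static int ready;   // publication flag, protected by the same mutex

void publish(void) {
    pthread_mutex_lock(&m);
    data = 42;      // write the payload...
    ready = 1;      // ...then signal, both under the lock
    pthread_mutex_unlock(&m);   // release: later lockers see both writes
}

bool try_consume(int *out) {
    pthread_mutex_lock(&m);     // acquire: sees everything done before the unlock
    bool ok = (ready == 1);
    if (ok)
        *out = data;            // guaranteed to be 42 when ok is true
    pthread_mutex_unlock(&m);
    return ok;
}
```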
In C/C++, 'volatile' prevents compiler optimizations but provides NO memory ordering guarantees. A volatile variable is not thread-safe. In Java, 'volatile' does provide acquire/release semantics—but this difference causes frequent bugs when developers assume behavior based on the wrong language.
Modern programming languages offer a compromise between performance and understandability called DRF-SC (Data-Race-Free Sequential Consistency):
If a program is free of data races, it behaves as if sequentially consistent.
What this means: if every access to shared, mutable memory is ordered by synchronization (locks, atomics, channels, or other happens-before-establishing operations), then every possible execution of the program is equivalent to some sequentially consistent interleaving.
The contract: the programmer promises not to write data races; in return, the language and hardware promise behavior indistinguishable from sequential consistency.
This is the model adopted by Java, C11/C++11, and most modern languages. It's the foundation of practical concurrent programming:
For most concurrent programming: use mutexes, semaphores, channels, or high-level concurrency abstractions. These establish happens-before relationships that compose into DRF programs. Under this discipline, you can reason as if SC holds—the easy, intuitive model—while the implementation handles hardware-specific details.
When DRF-SC isn't enough:
For lock-free data structures, high-performance concurrent algorithms, or OS-level code, you may need to work with weaker orderings directly. This requires a deep understanding of the target language's memory model, careful reasoning about every permitted reordering, and testing (ideally with formal verification or model checking) on weakly ordered hardware rather than only on x86.
This is advanced territory. Most developers should stick to DRF-SC.
We've explored sequential consistency, the memory model that defines what concurrent programs can assume about memory operations:
- Sequential consistency requires a single global order of all memory operations that preserves each thread's program order.
- Modern hardware does not provide it by default: store buffers, caches, out-of-order execution, and compiler reordering allow results (like (0, 0) in our opening example) that no sequentially consistent execution could produce.
- Memory barriers and language-level orderings (seq_cst, acquire/release, relaxed) reintroduce ordering exactly where it is needed.
- Happens-before relationships, established by program order and synchronization, are the tool for reasoning about visibility.
- DRF-SC is the practical contract: keep the program free of data races and you may reason as if it were sequentially consistent.
What's next:
Sequential consistency assumes proper synchronization. But what happens when synchronization is missing or incorrect? The next page examines data races—the fundamental hazard of concurrent programming, where unsynchronized access to shared memory leads to undefined, non-deterministic behavior.
You now understand sequential consistency—the intuitive memory model, why hardware violates it for performance, and how happens-before relationships and DRF-SC provide a practical path to correct concurrent programs. This foundation is essential for understanding the bugs that arise when these guarantees are violated.