Consider this deceptively simple code, running on two threads:
Thread 1:
X = 1;
print(Y);
Thread 2:
Y = 1;
print(X);
Both X and Y start at 0. What outputs are possible?
Most developers would immediately answer: (0, 1), (1, 0), or (1, 1).
But surely (0, 0) is impossible? After all, one of the writes must happen before both reads complete.
On modern hardware, (0, 0) can actually occur.
If this surprises you—or if you don't immediately understand why—then you need to understand memory models and, specifically, sequential consistency.
The scenario above is not contrived. Production systems have suffered mysterious bugs because developers assumed intuitive behavior that modern CPUs don't guarantee. Understanding memory models is essential for anyone writing concurrent code that accesses shared memory.
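If you want to see this on real hardware, here is a minimal litmus-test sketch (not the page's original code): it assumes POSIX threads plus C11 atomics, uses relaxed atomics so the racy accesses are well defined, and all names are illustrative. On an x86 or ARM machine it can report a nonzero count of (0, 0) outcomes.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X, Y;            // the two shared variables from the example
static int r1, r2;                 // what each thread observed
static pthread_barrier_t start;    // lines both threads up before each round

static void *thread1(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    atomic_store_explicit(&X, 1, memory_order_relaxed);   // X = 1
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);  // print(Y)
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    pthread_barrier_wait(&start);
    atomic_store_explicit(&Y, 1, memory_order_relaxed);   // Y = 1
    r2 = atomic_load_explicit(&X, memory_order_relaxed);  // print(X)
    return NULL;
}

int main(void) {
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_barrier_init(&start, NULL, 2);
        pthread_t a, b;
        pthread_create(&a, NULL, thread1, NULL);
        pthread_create(&b, NULL, thread2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_barrier_destroy(&start);
        if (r1 == 0 && r2 == 0) both_zero++;   // the "impossible" outcome
    }
    printf("(0, 0) observed %d times out of 100000 runs\n", both_zero);
    return 0;
}
```

Compile with something like cc -O2 -pthread; the exact count varies by machine, compiler, and run.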
By the end of this page, you will understand what sequential consistency means, why it's the 'obvious' model most programmers assume, why modern hardware violates it for performance reasons, and what this implies for writing correct concurrent programs.
Sequential consistency is a memory model formalized by Leslie Lamport in 1979. It defines what guarantees a system provides about the ordering of memory operations across multiple threads or processors.
Lamport's Definition:
A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
Let's unpack this carefully—it contains two distinct requirements:
Requirement 1: Global Sequential Order
There must exist some total ordering of all memory operations from all processors such that the observed results are consistent with that ordering. In other words, you could take all the memory operations from every thread and arrange them in a single, global sequence where each operation's results make sense given the operations before it.
Requirement 2: Program Order Preserved
Within this global sequence, the operations from any single processor appear in the order that processor's program specifies. Thread 1's operations maintain their relative order; Thread 2's operations maintain their relative order. The interleaving between threads may vary, but within-thread order is preserved.
Sequential consistency is the memory model you probably assumed without knowing it. It says: 'Operations happen one at a time in some order, and each thread's operations happen in the order they're written.' This is how most programmers naturally think about concurrent execution.
Example: Sequential consistency in action
Returning to our opening example with X=0, Y=0 initially:
Thread 1: X = 1; print(Y);
Thread 2: Y = 1; print(X);
Under sequential consistency, we can form these possible global orderings:
Ordering A: X = 1; print(Y) → 0; Y = 1; print(X) → 1. Output: (0, 1).
Ordering B: Y = 1; print(X) → 0; X = 1; print(Y) → 1. Output: (1, 0).
Ordering C: X = 1; Y = 1; print(Y) → 1; print(X) → 1. Output: (1, 1).
There is no valid sequential ordering where both prints occur before both writes (that would violate program order within each thread). Therefore, (0, 0) is impossible under sequential consistency.
Yet, as stated earlier, modern hardware can produce (0, 0). This means modern hardware does not provide sequential consistency by default.
Sequential consistency provides several properties that make reasoning about concurrent programs tractable:
1. Deterministic reasoning about interleavings
With sequential consistency, you can analyze a concurrent program by considering all possible interleavings of operations from different threads. While the number of interleavings may be large, they're finite and well-defined. Each interleaving produces a deterministic result.
2. Atomicity of individual operations
Under sequential consistency, each memory operation (read or write) appears to happen atomically—instantaneously, without overlap with other operations. An operation is either fully completed and visible to all threads, or it hasn't happened yet.
3. Causal reasoning
If Thread A writes a value and Thread B reads that value, then Thread B can reason that Thread A's write (and everything Thread A did before that write) has already happened. This enables building synchronization on top of shared memory.
4. Simple mental model
'Operations happen one at a time, in some order' is easy to understand and verify. You can trace through code manually, considering different interleavings.
Many concurrent algorithms and synchronization primitives were originally designed assuming sequential consistency. The model is often used as the specification for correctness—even if implementation requires additional work on weaker memory models.
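As an illustration, here is a sketch of Peterson's two-thread lock, a classic algorithm whose correctness argument assumes sequential consistency. This version leans on C11 atomics with their default seq_cst ordering to recover that behavior (an implementation choice of this sketch, not part of the original algorithm); with plain variables, the store/load pair below can be reordered on modern hardware and both threads may enter the critical section.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   // flag[i]: thread i wants to enter
static atomic_int  turn;      // whose turn it is to wait

void lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], true);   // announce intent (seq_cst store)
    atomic_store(&turn, other);        // politely let the other go first
    // Wait while the other thread wants in and it is its turn.
    // Under SC (or seq_cst atomics) at most one thread passes this loop;
    // with weaker ordering the store above may not yet be visible to the
    // other thread, and mutual exclusion is lost.
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;  // spin
}

void unlock(int self) {
    atomic_store(&flag[self], false);  // release the lock
}
```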
If sequential consistency is so intuitive and useful, why don't modern processors provide it? The answer lies in performance optimization.
Modern CPUs achieve their remarkable speed through aggressive optimization techniques that fundamentally conflict with sequential consistency:
1. Store Buffers
When a CPU writes to memory, it doesn't immediately update main memory. Instead, the write goes into a store buffer—a small, fast queue local to the processor core. This allows the CPU to continue executing without waiting for the slow write to complete.
Problem: The write in the store buffer is visible to the local core but not yet visible to other cores. This breaks sequential consistency because other threads may read stale values.
2. Cache Hierarchies
Modern CPUs have multiple levels of cache (L1, L2, L3). Each core typically has private L1 and L2 caches. When core A writes a value, it may update its local cache without immediately propagating the change to core B's cache.
Problem: Different cores may temporarily have different views of the same memory location.
3. Out-of-Order Execution
CPUs reorder instructions for efficiency. If instruction A and instruction B are independent (no data dependency), the CPU may execute B before A even though A appears first in the program.
Problem: This can change the order in which memory operations become visible to other threads.
4. Compiler Reordering
Before code even reaches the CPU, compilers may reorder instructions. A compiler optimizing for a single-threaded view might move a write after a read if it seems safe—but this can break multi-threaded invariants (a short sketch after this list illustrates the hazard).
5. Speculative Execution
CPUs may speculatively execute instructions ahead of time, later discarding results if speculation was wrong. This can affect memory operation ordering.
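As a sketch of how compiler reordering (point 4 above) can break unsynchronized code, consider the following fragment; the names are illustrative, the variables are deliberately plain and non-atomic, and the exact transformations a given compiler performs will vary.

```c
int payload;   // data one thread produces
int ready;     // plain int: no atomicity, ordering, or visibility guarantees

void publish(void) {
    payload = 42;
    ready = 1;      // the compiler may legally move this store before payload = 42,
                    // since the two stores are independent in a single-threaded view
}

void consume(void) {
    while (!ready)  // the compiler may hoist this load out of the loop entirely,
        ;           // turning it into an infinite loop, because nothing in this
                    // thread can change 'ready'
    // Even if the loop exits, 'payload' may still be observed as 0.
    (void)payload;
}
```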
The performance rationale:
These optimizations are essential for modern CPU performance: a main-memory access can cost hundreds of cycles, and store buffers, caches, out-of-order execution, and speculation exist largely to hide that latency.
Enforcing sequential consistency would require making every write globally visible before any later memory operation could proceed: draining the store buffer on each store, stalling loads behind pending stores, and giving up much of the reordering that out-of-order and speculative execution rely on.
This would devastate performance—potentially slowing CPUs by 10× or more for memory-intensive code.
Hardware architects faced a choice: provide intuitive memory semantics at a massive performance cost, or provide weaker guarantees with full performance. They chose performance. The burden of correctness shifted to programmers (and programming languages) to explicitly request stronger ordering when needed.
Real hardware implements weaker (more relaxed) memory models that allow more reordering than sequential consistency. Different processor architectures have different models:
x86/x64: Total Store Order (TSO)
Intel and AMD processors provide TSO, which is fairly close to sequential consistency: stores are not reordered with other stores, loads are not reordered with other loads, and stores become visible to all cores in a single global order. The one relaxation is that a load may be reordered ahead of an earlier store to a different address, because the store can still be sitting in the store buffer when the load completes.
This is why our (0, 0) example can occur on x86: each thread's read can be reordered ahead of its own earlier write, so each thread observes the other variable still at its initial value of 0.
ARM/PowerPC: Relaxed Memory Models
ARM and PowerPC provide much weaker ordering: in the absence of barriers, loads and stores may be reordered with other loads and stores to different addresses, and a write may become visible to different cores at different times.
These architectures require explicit memory barriers (fences) to enforce any ordering at all.
| Architecture | Model | Ordering Strength | Common Reorderings Allowed |
|---|---|---|---|
| Theoretical | Sequential Consistency | Strongest | None—program order is globally visible |
| x86/x64 (Intel, AMD) | Total Store Order (TSO) | Strong | Read can pass earlier write to different address |
| ARM (v7, v8) | Weakly Ordered | Weak | Almost any reordering possible |
| PowerPC | Weakly Ordered | Weak | Almost any reordering possible |
| RISC-V | RVWMO | Weak (configurable) | Most reorderings possible without fences |
What the (0, 0) result means:
Returning to our example:
```c
// Thread 1          // Thread 2
X = 1;               Y = 1;
print(Y);            print(X);
```
Even on x86 (which allows a read to be reordered ahead of an earlier write to a different location), the hardware may:
- hold Thread 1's X = 1 in its store buffer (not yet visible to Thread 2)
- hold Thread 2's Y = 1 in its store buffer (not yet visible to Thread 1)
- complete each thread's read from cache or memory, observing the other variable still at its initial 0

Result: (0, 0) — impossible under sequential consistency, but perfectly valid on real hardware.
Code that 'works' on x86 may fail on ARM because ARM allows far more reordering. If you test only on x86, you may never encounter race conditions that will manifest on ARM-based servers or mobile devices.
Since hardware doesn't provide sequential consistency by default, programmers must explicitly request ordering where needed. This is done through memory barriers (also called fences).
What is a memory barrier?
A memory barrier is a special instruction that constrains the reordering of memory operations around it. It doesn't access memory itself—it's a synchronization point that affects how other memory operations are ordered.
Types of memory barriers:
1. Full Barrier (mfence on x86, dmb on ARM): no memory operation may be reordered across it in either direction; all earlier loads and stores complete before any later ones.
2. Store Barrier (sfence on x86, dmb st on ARM): stores issued before the barrier become visible before stores issued after it.
3. Load Barrier (lfence on x86, dmb ld on ARM): loads issued before the barrier complete before loads issued after it.
4. Acquire Barrier: later operations cannot be reordered before it; typically attached to a load of a flag or pointer.
5. Release Barrier: earlier operations cannot be reordered after it; typically attached to a store of a flag or pointer.
```c
// Example: Using barriers to fix the (0, 0) problem
// Initial: X = 0, Y = 0

// Thread 1                  // Thread 2
X = 1;                       Y = 1;
__sync_synchronize();        __sync_synchronize();   // Full barrier
printf("%d\n", Y);           printf("%d\n", X);

// With barriers, (0, 0) is no longer possible:
// - Thread 1's barrier ensures X = 1 is visible before it reads Y
// - Thread 2's barrier ensures Y = 1 is visible before it reads X
// - At least one thread will see the other's write

// Alternative: acquire/release for fine-grained control
atomic_store_explicit(&X, 1, memory_order_release);      // Release barrier
int y = atomic_load_explicit(&Y, memory_order_acquire);  // Acquire barrier
// Note: for this particular store-then-load pattern, release/acquire alone
// does NOT rule out (0, 0); a full barrier or memory_order_seq_cst is needed.
```

Memory barriers are not free. A full barrier may cost 100+ CPU cycles—as expensive as a cache miss. This is why modern programming languages provide fine-grained memory ordering options (acquire, release, relaxed) rather than always using full barriers. Use the weakest ordering that provides the correctness you need.
Since hardware memory models vary by architecture, programming languages define their own memory models to provide portable semantics. These language models abstract away hardware differences.
C11/C++11 Memory Model:
C and C++ define a memory model with explicit ordering options:
- memory_order_seq_cst — Sequential consistency. Most restrictive, safest.
- memory_order_acquire — Acquire semantics (for reads). Subsequent operations can't move before.
- memory_order_release — Release semantics (for writes). Prior operations can't move after.
- memory_order_acq_rel — Both acquire and release (for read-modify-write).
- memory_order_relaxed — No ordering guarantees. Maximum performance, maximum danger.

Java Memory Model:
Java defines a memory model where:
- synchronized blocks and java.util.concurrent locks establish happens-before relationships
- volatile reads/writes are sequentially consistent

Go Memory Model:
Go defines happens-before relationships through:
- channel operations (a send happens-before the corresponding receive completes)
- sync package primitives (Mutex, WaitGroup, etc.)
- sync/atomic operations

Without these, no ordering is guaranteed between goroutines.
| Ordering | Constraint | Use Case |
|---|---|---|
| seq_cst | Total order on all seq_cst operations | Simple, safe default; highest overhead |
| acquire | Subsequent ops can't be reordered before | Reading a flag or pointer to shared data |
| release | Prior ops can't be reordered after | Writing a flag or pointer to shared data |
| acq_rel | Both acquire and release | Read-modify-write (CAS, fetch_add) |
| relaxed | No ordering, only atomicity | Counters where order doesn't matter |
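To make the relaxed row concrete, here is a minimal sketch (illustrative names) of an event counter where only atomicity matters; no thread uses the counter's value to synchronize access to other data, so relaxed ordering is sufficient.

```c
#include <stdatomic.h>

static atomic_ulong events;   // shared statistics counter

void record_event(void) {
    // Atomic increment; we only need the count to be correct, not to order
    // any other memory operations, so relaxed is the cheapest valid choice.
    atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}

unsigned long read_events(void) {
    // A valid snapshot of the counter, with no ordering implications for
    // any other memory.
    return atomic_load_explicit(&events, memory_order_relaxed);
}
```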
In C/C++, a data race (two threads accessing the same memory location, at least one writing, without synchronization) is undefined behavior. The compiler assumes races don't exist and may make optimizations that break racing code in spectacular ways. Always use atomic types or proper synchronization.
When reasoning about memory visibility in weak memory models, the concept of happens-before is essential. It defines when one operation is guaranteed to see the effects of another.
Definition:
Operation A happens-before operation B if A's effects are guaranteed to be visible to B.
This doesn't necessarily mean A completes before B in real time—it means the system guarantees B will observe A's effects.
Happens-before is established by:
Program order — Within a single thread, earlier operations happen-before later operations
Synchronization — Release operations happen-before acquire operations that observe them (for example, a mutex unlock happens-before a subsequent lock of the same mutex, and a release store happens-before an acquire load that reads the stored value)
Transitivity — If A happens-before B and B happens-before C, then A happens-before C
```c
// Example: Happens-before through synchronization

int data = 0;          // Shared data
atomic_int ready = 0;  // Synchronization flag

// Thread 1 (producer)
void producer() {
    data = 42;                                                // A: Write data
    atomic_store_explicit(&ready, 1, memory_order_release);   // B: Release
}

// Thread 2 (consumer)
void consumer() {
    while (atomic_load_explicit(&ready, memory_order_acquire) != 1)  // C: Acquire
        ;  // spin
    printf("%d\n", data);                                     // D: Read data
}

// Happens-before relationships:
// 1. A happens-before B (program order in Thread 1)
// 2. B happens-before C (release-acquire synchronization when C reads 1)
// 3. C happens-before D (program order in Thread 2)
// 4. By transitivity: A happens-before D
// Therefore: D is guaranteed to read 42
```

Happens-before is about visibility guarantees, not clock time. Operation A might physically execute after B in real time, yet A happens-before B if the memory model guarantees B sees A's effects. Similarly, operations with no happens-before relationship may execute in any order with any visibility—even if they appear ordered in real time.
Understanding sequential consistency and why hardware doesn't provide it has direct practical implications:
1. Never assume ordering without synchronization
If two threads access shared memory without locks, atomics, or barriers, the operations may appear in any order. Code that 'works' may be relying on lucky timing that will fail under load, on different hardware, or with compiler optimizations enabled.
2. Use high-level synchronization when possible
Mutexes, condition variables, channels, and other high-level primitives handle memory ordering correctly. You don't need to think about barriers if you use pthread_mutex_lock/unlock—it handles acquire/release internally.
3. When using atomics, choose the right ordering
- seq_cst for correctness
- acquire/release after careful reasoning
- relaxed when you've proven ordering doesn't matter (rare)

4. Beware of publication patterns
When one thread writes data, then signals another thread to read it, the data write must happen-before the signal: put release ordering (or a lock) on the signal write and acquire ordering on the signal read, exactly as in the producer/consumer example above. Without that, the reader can observe the signal yet still read stale data.
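Here is a minimal sketch of safe publication using a pthread mutex, the high-level route from point 2 (illustrative names, assuming POSIX threads): the unlock in publish() happens-before any later lock in try_consume(), so a reader that observes ready == 1 is guaranteed to observe data == 42 as well.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int data;    // payload
static int ready;   // publication flag, protected by the same mutex

void publish(void) {
    pthread_mutex_lock(&m);
    data = 42;      // write the payload...
    ready = 1;      // ...then signal, both under the lock
    pthread_mutex_unlock(&m);   // release: later lockers see both writes
}

bool try_consume(int *out) {
    pthread_mutex_lock(&m);     // acquire: sees everything done before the unlock
    bool ok = (ready == 1);
    if (ok)
        *out = data;            // guaranteed to be 42 when ok is true
    pthread_mutex_unlock(&m);
    return ok;
}
```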
In C/C++, 'volatile' prevents compiler optimizations but provides NO memory ordering guarantees. A volatile variable is not thread-safe. In Java, 'volatile' does provide acquire/release semantics—but this difference causes frequent bugs when developers assume behavior based on the wrong language.
Modern programming languages offer a compromise between performance and understandability called DRF-SC (Data-Race-Free Sequential Consistency):
If a program is free of data races, it behaves as if sequentially consistent.
What this means: if every access to shared, mutable memory is ordered by synchronization (locks, atomics, channels, or other happens-before-establishing operations), then every possible execution of the program is equivalent to some sequentially consistent interleaving.
The contract: the programmer promises not to write data races; in return, the language and hardware promise behavior indistinguishable from sequential consistency.
This is the model adopted by Java, C11/C++11, and most modern languages. It's the foundation of practical concurrent programming:
For most concurrent programming: use mutexes, semaphores, channels, or high-level concurrency abstractions. These establish happens-before relationships that compose into DRF programs. Under this discipline, you can reason as if SC holds—the easy, intuitive model—while the implementation handles hardware-specific details.
When DRF-SC isn't enough:
For lock-free data structures, high-performance concurrent algorithms, or OS-level code, you may need to work with weaker orderings directly. This requires a deep understanding of the target language's memory model, careful reasoning about every permitted reordering, and testing (ideally with formal verification or model checking) on weakly ordered hardware rather than only on x86.
This is advanced territory. Most developers should stick to DRF-SC.
We've explored sequential consistency, the memory model that defines what concurrent programs can assume about memory operations:
- Sequential consistency requires a single global order of all memory operations that preserves each thread's program order.
- Modern hardware does not provide it by default: store buffers, caches, out-of-order execution, and compiler reordering allow results (like (0, 0) in our opening example) that no sequentially consistent execution could produce.
- Memory barriers and language-level orderings (seq_cst, acquire/release, relaxed) reintroduce ordering exactly where it is needed.
- Happens-before relationships, established by program order and synchronization, are the tool for reasoning about visibility.
- DRF-SC is the practical contract: keep the program free of data races and you may reason as if it were sequentially consistent.
What's next:
Sequential consistency assumes proper synchronization. But what happens when synchronization is missing or incorrect? The next page examines data races—the fundamental hazard of concurrent programming, where unsynchronized access to shared memory leads to undefined, non-deterministic behavior.
You now understand sequential consistency—the intuitive memory model, why hardware violates it for performance, and how happens-before relationships and DRF-SC provide a practical path to correct concurrent programs. This foundation is essential for understanding the bugs that arise when these guarantees are violated.