When you execute a Test-and-Set instruction, something remarkable happens in the hardware. For a fleeting moment—measured in nanoseconds—your processor gains exclusive control over a specific memory location. No other processor can read it, write it, or even snoop on it. The entire multiprocessor system briefly serializes its memory accesses at that point, transforming a parallel machine into a sequential one for just one memory operation.
This isn't magic—it's careful engineering dating back to the earliest multiprocessor systems. Understanding how hardware provides atomicity gives you profound insight into the true cost of synchronization and helps you make informed decisions about lock design and usage.
In this page, we'll descend through the abstraction layers: from the instruction set interface, through the pipeline, into the cache hierarchy, down to the memory bus and interconnect protocols. By the end, you'll understand exactly what happens when those electrons flow to make an atomic operation atomic.
By the end of this page, you will understand: (1) The historical evolution from bus locking to cache-based atomicity, (2) How cache coherence protocols enable atomic operations, (3) The MESI protocol and its role in synchronization, (4) Different atomic implementation strategies across architectures, and (5) The performance implications of various hardware mechanisms.
The original mechanism for atomic operations was conceptually simple, if brutally expensive: lock the entire memory bus. Early multiprocessor systems connected multiple CPUs to a shared memory through a common bus, and the hardware could electrically prevent any other processor from accessing memory during an atomic operation.
The LOCK Signal:
Processors like the Intel 8086/8088 and their successors through the Pentium Pro provided a physical #LOCK pin. When asserted during a memory operation, this pin signaled the bus controller to give the asserting processor exclusive use of the bus: no other bus master could begin a memory access until the signal was released.
```asm
; x86 Assembly: LOCK prefix for atomicity
; The LOCK prefix asserts the #LOCK signal for the duration
; of the following instruction

; Non-atomic increment (INCORRECT for shared memory)
inc dword [counter]        ; Load, increment, store - NOT atomic!

; Atomic increment (CORRECT)
lock inc dword [counter]   ; Entire operation is atomic

; Note: LOCK is only valid with memory operands and specific
; instructions: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
; CMPXCHG8B/16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, XCHG

; XCHG with memory is implicitly locked (no LOCK prefix needed)
xchg eax, [lock_var]       ; Automatically asserts LOCK
```

The Cost of Bus Locking:
Bus locking was effective but created a severe scalability bottleneck. While one processor held the bus lock, every other processor was blocked from all memory traffic—loads, stores, even instruction fetches that missed the cache—regardless of which addresses they touched.
This worked tolerably when systems had only 2-4 processors, but as systems scaled to 8, 16, or more processors, bus locking became catastrophically expensive.
| Processors | Locks/Second | Avg Time Blocked | Throughput Loss |
|---|---|---|---|
| 2 | 10,000 | 100ns per lock | ~1% |
| 4 | 10,000 | 100ns per lock | ~3% |
| 8 | 10,000 | 100ns per lock | ~7% |
| 16 | 10,000 | 100ns per lock | ~15% |
| 32 | 10,000 | 100ns per lock | ~31% |
Modern processors rarely use true bus locks. Since the Pentium Pro (1995), Intel processors use cache locking for most atomic operations. The LOCK prefix still exists, but its implementation has fundamentally changed. However, understanding bus locking provides essential context for appreciating the elegance of modern cache-based solutions.
When Bus Locks Still Occur:
Even on modern processors, actual bus/memory locks still occur in specific cases: when a locked operand spans two cache lines (a "split lock," since cache locking can only protect a single line), and when the operand lies in uncacheable memory, where there is no cache line to lock.
Split locks are particularly expensive and can be 100x slower than cache-locked atomics. Modern operating systems monitor for split locks and may terminate offending processes.
```c
// Example: Accidentally causing a split lock

// BAD: This structure might cross cache line boundaries!
struct BadAlignment {
    char padding[62];       // Push lock to end of cache line
    int lock;               // 4 bytes might span two cache lines!
};

// GOOD: Ensure atomic variables are properly aligned
struct GoodAlignment {
    _Alignas(64) int lock;  // Aligned to cache line boundary
    char padding[60];       // Other data in same cache line
};

// Modern CPUs can detect and optionally trap on split locks:
// - Intel: AC (Alignment Check) split-lock detection
// - Linux kernel: split_lock_detect boot parameter
```

The revolution that made modern multiprocessor systems practical was cache coherence protocols. Instead of locking the entire memory bus, the processor can lock just the relevant cache line and use coherence mechanisms to guarantee exclusive access.
The Cache Coherence Problem:
Consider a simple scenario: both CPU 0 and CPU 1 have cached the same memory address. What happens when CPU 0 writes to it?
Without coherence: CPU 1 keeps its stale cached copy and continues reading the old value. The two processors now disagree about the contents of the same address, and programs observe impossible states.

With coherence: CPU 0's write invalidates (or updates) CPU 1's cached copy, so CPU 1's next read fetches the new value. Every processor observes a single, consistent view of memory.
Coherence operates on cache lines (typically 64 bytes on modern systems), not individual bytes or words. Even if you only modify one byte, the entire 64-byte cache line is transferred between caches. This has significant performance implications—unrelated data sharing a cache line can cause 'false sharing' performance problems.
Coherence Invariants:
A cache coherence protocol must maintain two fundamental invariants:
1. Single-Writer-Multiple-Reader (SWMR): At any moment, either exactly one cache holds the line with read-write permission, or any number of caches hold read-only copies—never both.
2. Data-Value Invariant: The value returned by a read is always the value written by the most recent write (as defined by the memory ordering model).
These invariants allow atomic operations to be implemented using cache-level mechanisms rather than bus-level locks.
Snoopy Cache Coherence:
The dominant coherence mechanism in multicore processors is snooping. Each cache controller monitors (snoops) the shared bus or interconnect, watching for memory operations from other caches that affect its cached data.
When CPU 0 performs a write, it places the transaction on the shared bus. Every other cache controller snoops it, checks its own tags for the address, and invalidates its copy (writing back first if that copy was dirty). Only once the other copies are gone does CPU 0's write complete.
Directory-Based Coherence:
For large-scale systems with many processors (typically 8+), snooping becomes inefficient—every transaction must be broadcast to all caches. Directory-based coherence maintains a central directory recording which caches hold each memory block, enabling point-to-point messages instead of broadcasts.
NUMA (Non-Uniform Memory Access) systems universally use directory-based coherence because the interconnect topology doesn't support efficient broadcasts. Intel's QPI/UPI and AMD's Infinity Fabric implement hybrid protocols with both snooping and directory characteristics.
The MESI protocol (Modified, Exclusive, Shared, Invalid) is the foundation of cache coherence in virtually all modern multiprocessor systems. Understanding MESI is essential for reasoning about atomic operation behavior and performance.
The Four States:
| State | Exclusive? | Modified? | Can Read? | Can Write? |
|---|---|---|---|---|
| Modified | Yes | Yes | Yes | Yes |
| Exclusive | Yes | No | Yes | Yes (silent→M) |
| Shared | No | No | Yes | No (must upgrade) |
| Invalid | — | — | No | No |
State Transitions:
The power of MESI lies in its state transitions, which are triggered by local processor operations and snooped bus transactions:
```text
MESI State Transition Triggers
================================

Local Processor Operations:
---------------------------
Read Miss  : I → S or E  (depends on other caches' state)
Read Hit   : M, E, S → no change
Write Miss : I → M
Write Hit  : M → M, E → M (silent), S → M (bus transaction required)

Snooped Bus Transactions (from other CPUs):
-------------------------------------------
Snoop Read  : M → S (write back), E → S
Snoop Write : M → I (write back), E → I, S → I

Key Insight:
- Atomic RMW operations require M state during execution
- Other caches must be I state for atomicity to hold
- Cache achieves "mini-lock" on the line in M state
```

How MESI Enables Atomic Operations:
Here's the crucial insight: when a cache line is in Modified state, no other cache has a valid copy (they're all Invalid). This is exactly the exclusivity we need for atomic operations!
For an atomic Test-and-Set, the sequence is: issue a Read-For-Ownership to bring the line into Exclusive state (invalidating every other copy), briefly lock the line against snoops, read the old value, write the new value (moving the line to Modified), then resume snoop handling. At no point can another cache observe the line between the read and the write.
The M state effectively creates a 'micro-lock' on a cache line. During an atomic RMW operation, the processor holds the line in M state without responding to snoop requests until the operation completes. This is much cheaper than bus locking because only the specific cache line is affected—all other memory remains accessible.
MESI Extensions (MOESI, MESIF):
Modern processors use extended protocols:
MOESI (AMD): Adds Owned state—cache has modified copy, but shares with others. Allows cache-to-cache transfer of dirty data without writing back to memory.
MESIF (Intel): Adds Forward state—one designated responder for shared data, reducing duplicate responses on bus.
These extensions improve performance without changing the fundamental atomicity guarantees.
With cache coherence as our foundation, let's examine exactly how a processor implements an atomic Test-and-Set (or any read-modify-write operation) using the cache system.
The Atomic Pipeline:
When an atomic instruction enters the CPU pipeline, it receives special treatment at multiple stages:
```text
Atomic Test-and-Set: Detailed Timeline
=======================================

Assumptions:
- Lock variable is in another CPU's cache (Shared state)
- Modern x86 processor with MESI protocol

Cycle | Event
------+------------------------------------------------------------------
  1   | XCHG instruction decoded, identified as atomic
  2   | Probe L1 cache: MISS (line not present)
  3   | Probe L2 cache: HIT in Shared state
  4   | Cannot write to Shared line; must upgrade
  5   | Issue coherence request: Read-For-Ownership (RFO)
  6   | RFO broadcast on ring/mesh interconnect
  7   | Other caches snoop RFO, begin invalidating their copies
  8   | Wait for invalidation ACKs from all sharers
  9   | (waiting)
 10   | All ACKs received; line promoted to Exclusive
 11   | Line locked against snoops (internal flag set)
 12   | READ: Load current value from cache line
 13   | MODIFY: Write new value (1) to cache line
 14   | Line state: Modified (dirty)
 15   | Line unlocked; snoop handling resumes
 16   | Instruction retires with old value in register

Total latency: ~15 cycles (uncontended, line in L2 Shared)
With contention: 50-500+ cycles (coherence traffic delays)
```

The Critical Section of Hardware:
Note cycles 11-15 in the timeline above. During this window, the processor has exclusive ownership of the cache line, has suspended snoop responses for that line, and performs the read and write halves of the operation back to back.

This brief window is the hardware equivalent of a critical section. No other processor can observe intermediate states because every other cache holds the line in Invalid state, and any snoop request that arrives while the line is locked is simply stalled until the operation completes.
The key improvement over bus locking: only ONE cache line is inaccessible during the atomic operation. All other memory addresses remain accessible. If CPU 0 is atomically updating address A, CPU 1 can simultaneously access addresses B, C, D without delay. This enables true parallelism despite synchronization.
Handling Competing Atomic Operations:
What if two CPUs execute atomic operations on the same address simultaneously? The coherence protocol serializes them:
| Time | CPU 0 | CPU 1 | Line State (CPU0 wins) |
|---|---|---|---|
| T1 | Issue RFO | Issue RFO | I/I |
| T2 | Win arbitration | Lose, wait | E (CPU 0) |
| T3 | Lock line, RMW | Still waiting | Locked at CPU 0 |
| T4 | Complete, unlock | RFO arrives at CPU 0 | M (CPU 0) |
| T5 | Writeback, invalidate | Receive line | I/E (CPU 1) |
| T6 | — | Lock line, RMW | Locked at CPU 1 |
| T7 | — | Complete | I/M (CPU 1) |
The coherence protocol guarantees that even if both atomic operations are issued at exactly the same moment, one completes entirely before the other begins. This is the serialization property that makes mutual exclusion possible.
Different processor architectures implement atomic operations with varying strategies. Understanding these variations is important for writing portable code and reasoning about performance across platforms.
x86/x64: Lock Prefix and Implicit Locking:
The x86 architecture provides the strongest and most straightforward atomic support. The LOCK prefix can be applied to specific read-modify-write instructions, and XCHG with memory is implicitly locked.
```asm
; x86/x64 Atomic Operations

; Atomic increment
lock inc dword [counter]

; Atomic add
lock add dword [value], 10

; Atomic bitwise OR
lock or dword [flags], 0x1

; Atomic exchange (implicit lock)
mov eax, 1
xchg eax, [lock_var]        ; EAX = old, [lock_var] = 1

; Atomic compare-and-swap (CMPXCHG)
mov eax, expected_value
mov ebx, new_value
lock cmpxchg [target], ebx  ; If [target]==EAX: [target]=EBX, ZF=1
                            ; Else: EAX=[target], ZF=0

; Atomic exchange-and-add (XADD)
mov eax, increment
lock xadd [counter], eax    ; EAX = old, [counter] += old_EAX

; 16-byte atomics (x64)
lock cmpxchg16b [qword_addr] ; 16-byte compare-and-swap
```

ARM: Load-Linked/Store-Conditional (LL/SC):
ARM (prior to ARMv8.1) uses a fundamentally different model: Load-Linked (LDREX) and Store-Conditional (STREX). This provides atomic read-modify-write capability without requiring a specialized atomic instruction for each operation.
```asm
; ARM (ARMv7, ARMv8): Load-Linked/Store-Conditional
; Implements atomic test-and-set

test_and_set:
    mov   r1, #1        ; Value to write
retry:
    ldrex r0, [r2]      ; Load-Exclusive: read lock, mark reservation
    strex r3, r1, [r2]  ; Store-Exclusive: try to write 1
    cmp   r3, #0        ; Did store succeed?
    bne   retry         ; No: reservation lost, retry
    dmb                 ; Data memory barrier for ordering
    bx    lr            ; Return old value in r0

; ARMv8.1 adds true atomic instructions:
; LDADD, LDCLR, LDEOR, LDSET (atomic fetch-and-op)
; SWPA, SWPAL (atomic swap with acquire/release)
; CAS, CASA, CASAL (compare-and-swap variants)
```

Store-Conditional can fail even if no other processor touched the reservation—context switches, interrupts, or cache line evictions can clear the reservation. Code must always be prepared to retry. This is fundamentally different from x86 atomics, which never spuriously fail.
LL/SC vs. Direct Atomic: Tradeoffs:
| Aspect | x86 (Direct Atomic) | ARM LL/SC |
|---|---|---|
| Implementation | Single instruction does RMW | Two instructions + retry loop |
| Failure mode | Never fails | SC can fail spuriously |
| Flexibility | Fixed operations (INC, ADD, XCHG...) | Any RMW can be built |
| Hardware cost | More complex instruction | Simpler instructions + reservation |
| ABA problem | Possible (CAS compares values only) | Avoided (reservation trips on any intervening write) |
| Contention behavior | Guaranteed progress | Livelock possible (rare) |
RISC-V Atomic Extension (A):
RISC-V provides both models: LL/SC instructions (LR/SC) and direct atomic operations (AMO instructions). The AMO instructions like AMOSWAP behave similarly to x86 XCHG.
```asm
# RISC-V Atomic Operations

# Atomic swap (like x86 XCHG)
li t0, 1                    # Value to write
amoswap.w.aq t1, t0, (a0)   # t1 = old [a0]; [a0] = t0
                            # .aq = acquire ordering

# Atomic add
li t0, 1
amoadd.w t1, t0, (a0)       # t1 = old [a0]; [a0] += t0

# LL/SC style (LR/SC)
retry:
    lr.w t1, (a0)           # Load-Reserved
    addi t2, t1, 1          # Modify
    sc.w t3, t2, (a0)       # Store-Conditional
    bnez t3, retry          # Retry if failed

# Ordering suffixes:
# .aq   = acquire (no following ops reorder before)
# .rl   = release (no preceding ops reorder after)
# .aqrl = sequentially consistent
```

Atomic operations don't just prevent interleaving—they also impact memory ordering. Modern processors reorder memory operations aggressively, and synchronization code must control this reordering. Memory barriers (also called memory fences) are hardware mechanisms that enforce ordering constraints.
The Need for Barriers:
Consider code protecting shared data with a lock. Without ordering constraints:
```c
// Thread 1: Producer
void produce() {
    data = compute();   // 1: Compute data
    ready = true;       // 2: Signal ready
}

// Thread 2: Consumer
void consume() {
    while (!ready) { }  // 3: Wait for ready
    use(data);          // 4: Use data
}

// PROBLEM: CPU might reorder (2) before (1)!
// Consumer sees ready=true but data is not yet written
//
// PROBLEM: CPU might reorder (4) before (3)!
// Consumer reads data before checking ready

// Solution: Memory barriers enforce ordering
```

Types of Memory Barriers:

- Load barrier (read fence): loads issued before the barrier complete before any load issued after it.
- Store barrier (write fence): stores before the barrier become globally visible before any store after it.
- Full barrier: orders all loads and stores across the barrier.
- Acquire/release: one-directional constraints attached to a particular load (acquire) or store (release); typically cheaper than a full barrier.
Barrier Semantics Across Architectures:
| Type | x86/x64 | ARM | POWER |
|---|---|---|---|
| Load barrier | lfence (rarely needed) | dmb ld | lwsync or isync |
| Store barrier | sfence | dmb st | lwsync |
| Full barrier | mfence | dmb sy | sync |
| Acquire | Implicit in LOCK ops | dmb ish + ldar | lwsync or isync |
| Release | Implicit in LOCK ops | dmb ish + stlr | lwsync |
x86 provides a relatively strong memory model: stores are never reordered with other stores, and loads are never reordered with other loads. Only store-load reordering is allowed (a store can be delayed, allowing a subsequent load to complete first). This simplifies x86 lock implementations but lulls developers into habits that break on weaker architectures.
```c
// C11 atomics with proper memory ordering
#include <stdatomic.h>

atomic_int ready = 0;
int data = 0;

// Producer
void produce() {
    data = 42;  // Regular store
    // Release: ensures data=42 is visible before ready=1
    atomic_store_explicit(&ready, 1, memory_order_release);
}

// Consumer
void consume() {
    // Acquire: ensures we don't read data until after seeing ready=1
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        // spin
    }
    use(data);  // Guaranteed to see 42
}

// The acquire-release pair creates a "synchronizes-with" relationship:
// All writes before the release are visible after the acquire
```

Barriers in Lock Implementations:
A proper Test-and-Set lock needs two ordering guarantees: acquire semantics on the lock operation, so that no memory access inside the critical section is reordered before the lock is taken, and release semantics on the unlock operation, so that no access inside the critical section is reordered after the lock is released.
On x86, the locked XCHG instruction provides an implicit full barrier, automatically satisfying both requirements. On ARM and other weakly-ordered architectures, explicit barriers (or acquire/release variants of atomic instructions) are required.
We have journeyed deep into the hardware mechanisms that make atomic operations possible. This knowledge is essential for understanding the true cost of synchronization and for writing correct, portable concurrent code. Let's consolidate the key insights:

- Early multiprocessors achieved atomicity by locking the entire memory bus—simple, but catastrophically unscalable.
- Modern processors lock a single cache line instead, using the MESI Modified state as a "micro-lock" while all other memory remains accessible.
- Competing atomic operations on the same line are serialized by coherence arbitration; one always completes entirely before the other begins.
- Architectures differ in interface: x86 provides locked read-modify-write instructions, while ARM and RISC-V provide LL/SC (with direct atomics added in ARMv8.1 and RISC-V's AMO extension).
- Atomicity alone does not provide ordering: acquire and release semantics (or explicit barriers) are needed to make locks actually protect their data.
What's Next:
With the hardware foundation established, we'll now focus on implementing locks using Test-and-Set. We'll examine the basic spinlock implementation, explore optimizations like Test-and-Test-and-Set, and analyze the performance characteristics of various locking strategies. This practical knowledge will enable you to implement and debug synchronization primitives with confidence.
You now understand how hardware makes atomicity possible—from cache coherence protocols through memory barriers. This knowledge transforms atomic operations from mysterious primitives into understandable mechanisms with predictable costs and behaviors. You're ready to apply this understanding to practical lock implementation.