When you execute a Test-and-Set instruction, something remarkable happens in the hardware. For a fleeting moment—measured in nanoseconds—your processor gains exclusive control over a specific memory location. No other processor can read it, write it, or even snoop on it. The entire multiprocessor system briefly serializes its memory accesses at that point, transforming a parallel machine into a sequential one for just one memory operation.
This isn't magic—it's careful engineering dating back to the earliest multiprocessor systems. Understanding how hardware provides atomicity gives you profound insight into the true cost of synchronization and helps you make informed decisions about lock design and usage.
In this page, we'll descend through the abstraction layers: from the instruction set interface, through the pipeline, into the cache hierarchy, down to the memory bus and interconnect protocols. By the end, you'll understand exactly what happens when those electrons flow to make an atomic operation atomic.
By the end of this page, you will understand: (1) The historical evolution from bus locking to cache-based atomicity, (2) How cache coherence protocols enable atomic operations, (3) The MESI protocol and its role in synchronization, (4) Different atomic implementation strategies across architectures, and (5) The performance implications of various hardware mechanisms.
The original mechanism for atomic operations was conceptually simple, if brutally expensive: lock the entire memory bus. Early multiprocessor systems connected multiple CPUs to a shared memory through a common bus, and the hardware could electrically prevent any other processor from accessing memory during an atomic operation.
The LOCK Signal:
Processors like the Intel 8086/8088 and their successors through the Pentium Pro provided a physical #LOCK pin. When asserted during a memory operation, this pin signaled the bus controller to give the asserting processor exclusive use of the bus: no other bus master could begin a memory access until the signal was released.
```asm
; x86 Assembly: LOCK prefix for atomicity
; The LOCK prefix asserts the #LOCK signal for the duration
; of the following instruction

; Non-atomic increment (INCORRECT for shared memory)
inc dword [counter]        ; Load, increment, store - NOT atomic!

; Atomic increment (CORRECT)
lock inc dword [counter]   ; Entire operation is atomic

; Note: LOCK is only valid with memory operands and specific
; instructions: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
; CMPXCHG8B/16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, XCHG

; XCHG with memory is implicitly locked (no LOCK prefix needed)
xchg eax, [lock_var]       ; Automatically asserts LOCK
```

The Cost of Bus Locking:
Bus locking was effective but created a severe scalability bottleneck. While one processor held the bus lock, every other processor was blocked from all memory traffic—loads, stores, even instruction fetches that missed the cache—regardless of which addresses they touched.
This worked tolerably when systems had only 2-4 processors, but as systems scaled to 8, 16, or more processors, bus locking became catastrophically expensive.
| Processors | Locks/Second | Avg Time Blocked | Throughput Loss |
|---|---|---|---|
| 2 | 10,000 | 100ns per lock | ~1% |
| 4 | 10,000 | 100ns per lock | ~3% |
| 8 | 10,000 | 100ns per lock | ~7% |
| 16 | 10,000 | 100ns per lock | ~15% |
| 32 | 10,000 | 100ns per lock | ~31% |
Modern processors rarely use true bus locks. Since the Pentium Pro (1995), Intel processors use cache locking for most atomic operations. The LOCK prefix still exists, but its implementation has fundamentally changed. However, understanding bus locking provides essential context for appreciating the elegance of modern cache-based solutions.
When Bus Locks Still Occur:
Even on modern processors, actual bus/memory locks still occur in specific cases: when a locked operand spans two cache lines (a "split lock," since cache locking can only protect a single line), and when the operand lies in uncacheable memory, where there is no cache line to lock.
Split locks are particularly expensive and can be 100x slower than cache-locked atomics. Modern operating systems monitor for split locks and may terminate offending processes.
```c
// Example: Accidentally causing a split lock

// BAD: This structure might cross cache line boundaries!
struct BadAlignment {
    char padding[62];       // Push lock to end of cache line
    int lock;               // 4 bytes might span two cache lines!
};

// GOOD: Ensure atomic variables are properly aligned
struct GoodAlignment {
    _Alignas(64) int lock;  // Aligned to cache line boundary
    char padding[60];       // Other data in same cache line
};

// Modern CPUs can detect and optionally trap on split locks:
// - Intel: AC (Alignment Check) split-lock detection
// - Linux kernel: split_lock_detect boot parameter
```

The revolution that made modern multiprocessor systems practical was cache coherence protocols. Instead of locking the entire memory bus, the processor can lock just the relevant cache line and use coherence mechanisms to guarantee exclusive access.
The Cache Coherence Problem:
Consider a simple scenario: both CPU 0 and CPU 1 have cached the same memory address. What happens when CPU 0 writes to it?
Without coherence: CPU 1 keeps its stale cached copy and continues reading the old value. The two processors now disagree about the contents of the same address, and programs observe impossible states.

With coherence: CPU 0's write invalidates (or updates) CPU 1's cached copy, so CPU 1's next read fetches the new value. Every processor observes a single, consistent view of memory.
Coherence operates on cache lines (typically 64 bytes on modern systems), not individual bytes or words. Even if you only modify one byte, the entire 64-byte cache line is transferred between caches. This has significant performance implications—unrelated data sharing a cache line can cause 'false sharing' performance problems.
Coherence Invariants:
A cache coherence protocol must maintain two fundamental invariants:
1. Single-Writer-Multiple-Reader (SWMR): At any moment, either exactly one cache holds the line with read-write permission, or any number of caches hold read-only copies—never both.
2. Data-Value Invariant: The value returned by a read is always the value written by the most recent write (as defined by the memory ordering model).
These invariants allow atomic operations to be implemented using cache-level mechanisms rather than bus-level locks.
Snoopy Cache Coherence:
The dominant coherence mechanism in multicore processors is snooping. Each cache controller monitors (snoops) the shared bus or interconnect, watching for memory operations from other caches that affect its cached data.
When CPU 0 performs a write, it places the transaction on the shared bus. Every other cache controller snoops it, checks its own tags for the address, and invalidates its copy (writing back first if that copy was dirty). Only once the other copies are gone does CPU 0's write complete.
Directory-Based Coherence:
For large-scale systems with many processors (typically 8+), snooping becomes inefficient—every transaction must be broadcast to all caches. Directory-based coherence maintains a central directory recording which caches hold each memory block, enabling point-to-point messages instead of broadcasts.
NUMA (Non-Uniform Memory Access) systems universally use directory-based coherence because the interconnect topology doesn't support efficient broadcasts. Intel's QPI/UPI and AMD's Infinity Fabric implement hybrid protocols with both snooping and directory characteristics.
The MESI protocol (Modified, Exclusive, Shared, Invalid) is the foundation of cache coherence in virtually all modern multiprocessor systems. Understanding MESI is essential for reasoning about atomic operation behavior and performance.
The Four States:
| State | Exclusive? | Modified? | Can Read? | Can Write? |
|---|---|---|---|---|
| Modified | Yes | Yes | Yes | Yes |
| Exclusive | Yes | No | Yes | Yes (silent→M) |
| Shared | No | No | Yes | No (must upgrade) |
| Invalid | — | — | No | No |
State Transitions:
The power of MESI lies in its state transitions, which are triggered by local processor operations and snooped bus transactions:
```text
MESI State Transition Triggers
================================

Local Processor Operations:
---------------------------
Read Miss  : I → S or E  (depends on other caches' state)
Read Hit   : M, E, S → no change
Write Miss : I → M
Write Hit  : M → M, E → M (silent), S → M (bus transaction required)

Snooped Bus Transactions (from other CPUs):
-------------------------------------------
Snoop Read  : M → S (write back), E → S
Snoop Write : M → I (write back), E → I, S → I

Key Insight:
- Atomic RMW operations require M state during execution
- Other caches must be I state for atomicity to hold
- Cache achieves "mini-lock" on the line in M state
```

How MESI Enables Atomic Operations:
Here's the crucial insight: when a cache line is in Modified state, no other cache has a valid copy (they're all Invalid). This is exactly the exclusivity we need for atomic operations!
For an atomic Test-and-Set, the sequence is: issue a Read-For-Ownership to bring the line into Exclusive state (invalidating every other copy), briefly lock the line against snoops, read the old value, write the new value (moving the line to Modified), then resume snoop handling. At no point can another cache observe the line between the read and the write.
The M state effectively creates a 'micro-lock' on a cache line. During an atomic RMW operation, the processor holds the line in M state without responding to snoop requests until the operation completes. This is much cheaper than bus locking because only the specific cache line is affected—all other memory remains accessible.
MESI Extensions (MOESI, MESIF):
Modern processors use extended protocols:
MOESI (AMD): Adds Owned state—cache has modified copy, but shares with others. Allows cache-to-cache transfer of dirty data without writing back to memory.
MESIF (Intel): Adds Forward state—one designated responder for shared data, reducing duplicate responses on bus.
These extensions improve performance without changing the fundamental atomicity guarantees.
With cache coherence as our foundation, let's examine exactly how a processor implements an atomic Test-and-Set (or any read-modify-write operation) using the cache system.
The Atomic Pipeline:
When an atomic instruction enters the CPU pipeline, it receives special treatment at multiple stages:
```text
Atomic Test-and-Set: Detailed Timeline
=======================================

Assumptions:
- Lock variable is in another CPU's cache (Shared state)
- Modern x86 processor with MESI protocol

Cycle | Event
------+------------------------------------------------------------------
  1   | XCHG instruction decoded, identified as atomic
  2   | Probe L1 cache: MISS (line not present)
  3   | Probe L2 cache: HIT in Shared state
  4   | Cannot write to Shared line; must upgrade
  5   | Issue coherence request: Read-For-Ownership (RFO)
  6   | RFO broadcast on ring/mesh interconnect
  7   | Other caches snoop RFO, begin invalidating their copies
  8   | Wait for invalidation ACKs from all sharers
  9   | (waiting)
 10   | All ACKs received; line promoted to Exclusive
 11   | Line locked against snoops (internal flag set)
 12   | READ: Load current value from cache line
 13   | MODIFY: Write new value (1) to cache line
 14   | Line state: Modified (dirty)
 15   | Line unlocked; snoop handling resumes
 16   | Instruction retires with old value in register

Total latency: ~15 cycles (uncontended, line in L2 Shared)
With contention: 50-500+ cycles (coherence traffic delays)
```

The Critical Section of Hardware:
Note cycles 11-15 in the timeline above. During this window, the processor has exclusive ownership of the cache line, has suspended snoop responses for that line, and performs the read and write halves of the operation back to back.

This brief window is the hardware equivalent of a critical section. No other processor can observe intermediate states because every other cache holds the line in Invalid state, and any snoop request that arrives while the line is locked is simply stalled until the operation completes.
The key improvement over bus locking: only ONE cache line is inaccessible during the atomic operation. All other memory addresses remain accessible. If CPU 0 is atomically updating address A, CPU 1 can simultaneously access addresses B, C, D without delay. This enables true parallelism despite synchronization.
Handling Competing Atomic Operations:
What if two CPUs execute atomic operations on the same address simultaneously? The coherence protocol serializes them:
| Time | CPU 0 | CPU 1 | Line State (CPU0 wins) |
|---|---|---|---|
| T1 | Issue RFO | Issue RFO | I/I |
| T2 | Win arbitration | Lose, wait | E (CPU 0) |
| T3 | Lock line, RMW | Still waiting | Locked at CPU 0 |
| T4 | Complete, unlock | RFO arrives at CPU 0 | M (CPU 0) |
| T5 | Writeback, invalidate | Receive line | I/E (CPU 1) |
| T6 | — | Lock line, RMW | Locked at CPU 1 |
| T7 | — | Complete | I/M (CPU 1) |
The coherence protocol guarantees that even if both atomic operations are issued at exactly the same moment, one completes entirely before the other begins. This is the serialization property that makes mutual exclusion possible.
Different processor architectures implement atomic operations with varying strategies. Understanding these variations is important for writing portable code and reasoning about performance across platforms.
x86/x64: Lock Prefix and Implicit Locking:
The x86 architecture provides the strongest and most straightforward atomic support. The LOCK prefix can be applied to specific read-modify-write instructions, and XCHG with memory is implicitly locked.
```asm
; x86/x64 Atomic Operations

; Atomic increment
lock inc dword [counter]

; Atomic add
lock add dword [value], 10

; Atomic bitwise OR
lock or dword [flags], 0x1

; Atomic exchange (implicit lock)
mov eax, 1
xchg eax, [lock_var]        ; EAX = old, [lock_var] = 1

; Atomic compare-and-swap (CMPXCHG)
mov eax, expected_value
mov ebx, new_value
lock cmpxchg [target], ebx  ; If [target]==EAX: [target]=EBX, ZF=1
                            ; Else: EAX=[target], ZF=0

; Atomic exchange-and-add (XADD)
mov eax, increment
lock xadd [counter], eax    ; EAX = old, [counter] += old_EAX

; 16-byte atomics (x64)
lock cmpxchg16b [qword_addr] ; 16-byte compare-and-swap
```

ARM: Load-Linked/Store-Conditional (LL/SC):
ARM (prior to ARMv8.1) uses a fundamentally different model: Load-Linked (LDREX) and Store-Conditional (STREX). This provides atomic read-modify-write capability without requiring a specialized atomic instruction for each operation.
```asm
; ARM (ARMv7, ARMv8): Load-Linked/Store-Conditional
; Implements atomic test-and-set

test_and_set:
    mov   r1, #1        ; Value to write
retry:
    ldrex r0, [r2]      ; Load-Exclusive: read lock, mark reservation
    strex r3, r1, [r2]  ; Store-Exclusive: try to write 1
    cmp   r3, #0        ; Did store succeed?
    bne   retry         ; No: reservation lost, retry
    dmb                 ; Data memory barrier for ordering
    bx    lr            ; Return old value in r0

; ARMv8.1 adds true atomic instructions:
; LDADD, LDCLR, LDEOR, LDSET (atomic fetch-and-op)
; SWPA, SWPAL (atomic swap with acquire/release)
; CAS, CASA, CASAL (compare-and-swap variants)
```

Store-Conditional can fail even if no other processor touched the reservation—context switches, interrupts, or cache line evictions can clear the reservation. Code must always be prepared to retry. This is fundamentally different from x86 atomics, which never spuriously fail.
LL/SC vs. Direct Atomic: Tradeoffs:
| Aspect | x86 (Direct Atomic) | ARM LL/SC |
|---|---|---|
| Implementation | Single instruction does RMW | Two instructions + retry loop |
| Failure mode | Never fails | SC can fail spuriously |
| Flexibility | Fixed operations (INC, ADD, XCHG...) | Any RMW can be built |
| Hardware cost | More complex instruction | Simpler instructions + reservation |
| ABA problem | Possible (CAS compares values only) | Avoided (reservation trips on any intervening write) |
| Contention behavior | Guaranteed progress | Livelock possible (rare) |
RISC-V Atomic Extension (A):
RISC-V provides both models: LL/SC instructions (LR/SC) and direct atomic operations (AMO instructions). The AMO instructions like AMOSWAP behave similarly to x86 XCHG.
```asm
# RISC-V Atomic Operations

# Atomic swap (like x86 XCHG)
li t0, 1                    # Value to write
amoswap.w.aq t1, t0, (a0)   # t1 = old [a0]; [a0] = t0
                            # .aq = acquire ordering

# Atomic add
li t0, 1
amoadd.w t1, t0, (a0)       # t1 = old [a0]; [a0] += t0

# LL/SC style (LR/SC)
retry:
    lr.w t1, (a0)           # Load-Reserved
    addi t2, t1, 1          # Modify
    sc.w t3, t2, (a0)       # Store-Conditional
    bnez t3, retry          # Retry if failed

# Ordering suffixes:
# .aq   = acquire (no following ops reorder before)
# .rl   = release (no preceding ops reorder after)
# .aqrl = sequentially consistent
```

Atomic operations don't just prevent interleaving—they also impact memory ordering. Modern processors reorder memory operations aggressively, and synchronization code must control this reordering. Memory barriers (also called memory fences) are hardware mechanisms that enforce ordering constraints.
The Need for Barriers:
Consider code protecting shared data with a lock. Without ordering constraints:
```c
// Thread 1: Producer
void produce() {
    data = compute();   // 1: Compute data
    ready = true;       // 2: Signal ready
}

// Thread 2: Consumer
void consume() {
    while (!ready) { }  // 3: Wait for ready
    use(data);          // 4: Use data
}

// PROBLEM: CPU might reorder (2) before (1)!
// Consumer sees ready=true but data is not yet written
//
// PROBLEM: CPU might reorder (4) before (3)!
// Consumer reads data before checking ready

// Solution: Memory barriers enforce ordering
```

Types of Memory Barriers:

- Load barrier (read fence): loads issued before the barrier complete before any load issued after it.
- Store barrier (write fence): stores before the barrier become globally visible before any store after it.
- Full barrier: orders all loads and stores across the barrier.
- Acquire/release: one-directional constraints attached to a particular load (acquire) or store (release); typically cheaper than a full barrier.
Barrier Semantics Across Architectures:
| Type | x86/x64 | ARM | POWER |
|---|---|---|---|
| Load barrier | lfence (rarely needed) | dmb ld | lwsync or isync |
| Store barrier | sfence | dmb st | lwsync |
| Full barrier | mfence | dmb sy | sync |
| Acquire | Implicit in LOCK ops | dmb ish + ldar | lwsync or isync |
| Release | Implicit in LOCK ops | dmb ish + stlr | lwsync |
x86 provides a relatively strong memory model: stores are never reordered with other stores, and loads are never reordered with other loads. Only store-load reordering is allowed (a store can be delayed, allowing a subsequent load to complete first). This simplifies x86 lock implementations but lulls developers into habits that break on weaker architectures.
```c
// C11 atomics with proper memory ordering
#include <stdatomic.h>

atomic_int ready = 0;
int data = 0;

// Producer
void produce() {
    data = 42;  // Regular store
    // Release: ensures data=42 is visible before ready=1
    atomic_store_explicit(&ready, 1, memory_order_release);
}

// Consumer
void consume() {
    // Acquire: ensures we don't read data until after seeing ready=1
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        // spin
    }
    use(data);  // Guaranteed to see 42
}

// The acquire-release pair creates a "synchronizes-with" relationship:
// All writes before the release are visible after the acquire
```

Barriers in Lock Implementations:
A proper Test-and-Set lock needs two ordering guarantees: acquire semantics on the lock operation, so that no memory access inside the critical section is reordered before the lock is taken, and release semantics on the unlock operation, so that no access inside the critical section is reordered after the lock is released.
On x86, the locked XCHG instruction provides an implicit full barrier, automatically satisfying both requirements. On ARM and other weakly-ordered architectures, explicit barriers (or acquire/release variants of atomic instructions) are required.
We have journeyed deep into the hardware mechanisms that make atomic operations possible. This knowledge is essential for understanding the true cost of synchronization and for writing correct, portable concurrent code. Let's consolidate the key insights:

- Early multiprocessors achieved atomicity by locking the entire memory bus—simple, but catastrophically unscalable.
- Modern processors lock a single cache line instead, using the MESI Modified state as a "micro-lock" while all other memory remains accessible.
- Competing atomic operations on the same line are serialized by coherence arbitration; one always completes entirely before the other begins.
- Architectures differ in interface: x86 provides locked read-modify-write instructions, while ARM and RISC-V provide LL/SC (with direct atomics added in ARMv8.1 and RISC-V's AMO extension).
- Atomicity alone does not provide ordering: acquire and release semantics (or explicit barriers) are needed to make locks actually protect their data.
What's Next:
With the hardware foundation established, we'll now focus on implementing locks using Test-and-Set. We'll examine the basic spinlock implementation, explore optimizations like Test-and-Test-and-Set, and analyze the performance characteristics of various locking strategies. This practical knowledge will enable you to implement and debug synchronization primitives with confidence.
You now understand how hardware makes atomicity possible—from cache coherence protocols through memory barriers. This knowledge transforms atomic operations from mysterious primitives into understandable mechanisms with predictable costs and behaviors. You're ready to apply this understanding to practical lock implementation.