Peterson's Solution is an elegant, mathematically proven algorithm. Yet if you implement it directly in C (or any modern language) and run it on a modern multi-core processor, it can fail: mutual exclusion violations and data races will occur.
How can a proven-correct algorithm fail? The answer lies in the gap between the idealized memory model under which Peterson's Solution was proven correct and the relaxed memory models of real hardware. This final page explores the practical limitations that prevent direct use of Peterson's Solution, and why understanding these limitations is crucial for any systems programmer.
By the end of this page, you will understand why modern CPUs reorder memory operations, how store buffers and caches break sequential consistency, why Peterson's Solution requires memory barriers on real hardware, the scalability limitations for more than two processes, and what modern alternatives exist.
Peterson's Solution was designed assuming sequential consistency (SC)—a memory model defined by Leslie Lamport in 1979. Under sequential consistency:
> "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
What Sequential Consistency Guarantees:

- Program order: each processor's memory operations take effect in the order its program specifies
- A single total order: all processors observe the same global interleaving of operations
- Immediate visibility: once a write appears in that total order, every later read sees it
Why Peterson's Solution Requires SC:
The algorithm depends critically on the order of operations:
```c
// Peterson's entry section for Process 0
flag[0] = true;                  // WRITE to flag[0]       ← MUST HAPPEN FIRST
turn = 1;                        // WRITE to turn          ← MUST HAPPEN SECOND
while (flag[1] && turn == 1) {   // READ flag[1] and turn  ← MUST SEE LATEST
    // Wait...
}

// The algorithm DEPENDS on:
// 1. flag[0] = true being visible to Process 1 before Process 0
//    checks flag[1] in the while loop.
//
// 2. turn = 1 being visible to Process 1 before Process 0
//    proceeds past the while loop.
//
// 3. Process 0 seeing the most recent value of flag[1] written
//    by Process 1.
//
// If any of these ordering constraints are violated, the algorithm
// can fail to provide mutual exclusion.
```

Almost no modern CPU provides sequential consistency by default. x86 provides TSO (Total Store Order), which is weaker than SC. ARM and POWER architectures provide even weaker guarantees. This isn't a bug—it's a deliberate design choice for performance.
Modern CPUs use store buffers to improve performance. When a CPU executes a write instruction, the write goes to a local buffer first, then eventually propagates to cache and main memory. This creates a delay between when a write executes and when other processors can see it.
How Store Buffers Break Peterson's Solution:
```c
// CPU 0                               CPU 1
// ─────                               ─────
// Store Buffer: [empty]               Store Buffer: [empty]
//
// Entry section:                      Entry section:
// flag[0] = true; ──┐                 flag[1] = true; ──┐
//                   │                                   │
//                   ▼                                   ▼
// Store Buffer: [flag[0]=true]        Store Buffer: [flag[1]=true]
// (NOT yet in memory!)                (NOT yet in memory!)
//
// turn = 1; ──────────┐               turn = 0; ──────────┐
//                     │                                   │
//                     ▼                                   ▼
// Store Buffer: [flag[0]=true,        Store Buffer: [flag[1]=true,
//                turn=1]                             turn=0]
//
// While check:                        While check:
//   READ flag[1]                        READ flag[0]
//   → From MEMORY (not buffer)          → From MEMORY (not buffer)
//   → flag[1] = false! (stale)          → flag[0] = false! (stale)
//   READ turn                           READ turn
//   → turn might be 0 or 1              → turn might be 0 or 1
//
// If both CPUs see the other's flag as false:
//   Condition: false && anything = false
//   BOTH ENTER CRITICAL SECTION!
//
// 💥 MUTUAL EXCLUSION VIOLATED!

// Timeline showing the race:
// ═══════════════════════════════════════════════════════════════════
// T1: CPU 0 writes flag[0]=true to store buffer
// T2: CPU 1 writes flag[1]=true to store buffer
// T3: CPU 0 writes turn=1 to store buffer
// T4: CPU 1 writes turn=0 to store buffer
// T5: CPU 0 reads flag[1] from memory → still false (buffer not drained!)
// T6: CPU 1 reads flag[0] from memory → still false (buffer not drained!)
// T7: Both CPUs enter critical section
//
// The stores were in the buffer but not yet visible to other CPUs!
```

The Core Problem:
Store buffers create a window where a processor's writes are visible to itself (it can read its own recent writes from the buffer) but not to other processors. During this window:
- The writing processor reads flag[0] = true (from its own buffer)
- Other processors read flag[0] = false (from stale memory)

This asymmetry breaks the algorithm's assumption that writes are immediately globally visible.
Processors can forward their own writes from the store buffer to subsequent reads—a feature called store-to-load forwarding. This is why a CPU sees its own writes immediately. However, other CPUs must wait until the write propagates to cache and is coherent across the system.
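You can observe this effect directly with the classic "store buffering" litmus test. Below is a minimal harness sketch assuming POSIX threads and C11 atomics; the thread structure, iteration count, and use of relaxed ordering are illustrative choices, not part of Peterson's algorithm. Under sequential consistency, at least one thread must see the other's store, so `r0 == 0 && r1 == 0` would be impossible; on real x86 hardware the store buffer makes it appear regularly.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;   // stand-ins for flag[0] and flag[1]
int r0, r1;        // each written by one thread, read after join

void *writer0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);  // "flag[0] = true"
    r0 = atomic_load_explicit(&y, memory_order_relaxed); // "read flag[1]"
    return NULL;
}

void *writer1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);  // "flag[1] = true"
    r1 = atomic_load_explicit(&x, memory_order_relaxed); // "read flag[0]"
    return NULL;
}

int main(void) {
    int violations = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, writer0, NULL);
        pthread_create(&b, NULL, writer1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0) violations++;  // forbidden under SC
    }
    printf("SC violations observed: %d\n", violations);
    return 0;
}
```

Compile with `-pthread`; any nonzero count is the store buffer (or reordering) at work.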
Beyond store buffers, modern CPUs and compilers can reorder instructions for performance optimization. An operation that appears before another in source code may execute after it—as long as the reordering is invisible to the single-threaded execution.
Types of Reordering:
| Reordering Type | Description | Effect on Peterson's |
|---|---|---|
| Load-Load | A later load executes before an earlier load | Can read stale turn/flag values |
| Store-Store | A later store becomes visible before an earlier store | flag and turn writes may appear in wrong order |
| Load-Store | A later store executes before an earlier load | Write to flag may appear before reading other's flag |
| Store-Load | A later load executes before an earlier store | Most common problem: read flag before own write visible |
Store-Load Reordering: The Critical Case:
The most problematic reordering for Peterson's Solution is store-load reordering—where a load (read) is executed before a preceding store (write) has become visible to other processors.
```c
// Source code (Process 0):
flag[0] = true;   // STORE
turn = 1;         // STORE
x = flag[1];      // LOAD (in while condition)

// What the CPU might ACTUALLY execute:
x = flag[1];      // LOAD  - executed FIRST (speculation!)
flag[0] = true;   // STORE - executed later
turn = 1;         // STORE - executed even later

// Why? The CPU doesn't yet know the values for flag[0] and turn,
// but it can speculatively load flag[1] while computing the write values.
// This is safe for SINGLE-THREADED code (same result either way).
// But it breaks multi-threaded algorithms!

// How this breaks mutual exclusion:
// ═══════════════════════════════════════════════════════════════════
//
// CPU 0 reorders to:           CPU 1 reorders to:
//   LOAD  flag[1] → false        LOAD  flag[0] → false
//   STORE flag[0] = true         STORE flag[1] = true
//   STORE turn = 1               STORE turn = 0
//
// Both CPUs saw the other's flag as false because they loaded BEFORE
// the other's store was executed.
//
// Both while conditions are false.
// BOTH ENTER CRITICAL SECTION!
```

Even x86, which has a relatively strong memory model (TSO), allows store-load reordering. This single allowance is enough to break Peterson's Solution. ARM and POWER allow all four types of reordering, making them particularly problematic for algorithms assuming sequential consistency.
Peterson's Solution can be made to work on modern hardware by inserting memory barriers (also called memory fences). A memory barrier is a CPU instruction that enforces ordering constraints on memory operations.
Types of Memory Barriers:

- Store (write) barrier: all stores before the barrier become globally visible before any store after it
- Load (read) barrier: all loads before the barrier complete before any load after it
- Full barrier: orders all loads and stores across the barrier (e.g., `mfence` on x86, `dmb ish` on ARM)
Peterson's Solution with Barriers:
```c
#include <stdatomic.h>
#include <stdbool.h>

// Shared variables (can use volatile + explicit barriers or atomics)
volatile bool flag[2] = {false, false};
volatile int turn = 0;

// Memory barrier macros (platform-specific)
#ifdef __x86_64__
    #define FULL_BARRIER() __asm__ __volatile__("mfence" ::: "memory")
#elif defined(__aarch64__)
    #define FULL_BARRIER() __asm__ __volatile__("dmb ish" ::: "memory")
#else
    #define FULL_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

void process(int i) {
    int j = 1 - i;

    while (1) {
        // Non-critical section

        // ═══════════════════════════════════════════════════════════
        // ENTRY SECTION (with barriers)
        // ═══════════════════════════════════════════════════════════
        flag[i] = true;   // STORE: announce intent

        FULL_BARRIER();   // ← BARRIER 1: Ensure flag write is visible
                          //   before turn write and before reading flag[j]

        turn = j;         // STORE: yield to other process

        FULL_BARRIER();   // ← BARRIER 2: Ensure turn write is visible
                          //   before reading flag[j] and turn in loop

        while (flag[j] && turn == j) {
            // Busy wait
            // No barrier needed inside loop on x86 (loads acquire from cache)
            // On ARM/POWER, may need a load barrier to see updates
        }

        // ACQUIRE semantics - we now "hold" the lock
        FULL_BARRIER();   // ← BARRIER 3: Ensure we see all writes by
                          //   the previous CS holder

        // ═══════════════════════════════════════════════════════════
        // CRITICAL SECTION
        // ═══════════════════════════════════════════════════════════
        access_shared_resource();

        // ═══════════════════════════════════════════════════════════
        // EXIT SECTION (with barrier)
        // ═══════════════════════════════════════════════════════════
        FULL_BARRIER();   // ← BARRIER 4: Ensure all CS writes complete
                          //   before clearing flag (RELEASE semantics)

        flag[i] = false;  // STORE: release interest
    }
}
```

Instead of explicit barriers, modern C/C++ provides atomic types with memory ordering guarantees. Using atomic_store_explicit with memory_order_seq_cst for stores and atomic_load_explicit for loads achieves the same effect with more portable code. The compiler inserts appropriate barriers for the target architecture.
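As a sketch of that atomics-based approach (the function names are illustrative), Peterson's entry and exit sections look like this. Every access uses memory_order_seq_cst, so the compiler emits whatever barriers the target architecture requires:

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];   // zero-initialized: both false
atomic_int turn;

void peterson_lock(int i) {
    int j = 1 - i;
    atomic_store_explicit(&flag[i], true, memory_order_seq_cst);
    atomic_store_explicit(&turn, j, memory_order_seq_cst);
    // seq_cst loads: each iteration re-reads both variables from memory,
    // in order, so neither the compiler nor the CPU can cache or reorder them
    while (atomic_load_explicit(&flag[j], memory_order_seq_cst) &&
           atomic_load_explicit(&turn, memory_order_seq_cst) == j) {
        // busy wait
    }
}

void peterson_unlock(int i) {
    atomic_store_explicit(&flag[i], false, memory_order_seq_cst);
}
```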
Peterson's Solution as presented is for exactly two processes. Extending it to N processes is possible but introduces significant complexity and limitations.
Peterson's N-Process Generalization:
The standard generalization uses a filter lock, also known as Peterson's filter algorithm:
```c
// Peterson's Filter Lock for N processes
#include <stdbool.h>

#define N 4            // Number of processes

int level[N];          // level[i]  = current level of process i (0 to N-1)
int victim[N];         // victim[L] = which process is the victim at level L

bool exists_higher(int i, int L);   // forward declaration

void process(int i) {
    while (1) {
        // ═══════════════════════════════════════════════════════════
        // ENTRY SECTION: Go through N-1 levels
        // ═══════════════════════════════════════════════════════════
        for (int L = 1; L < N; L++) {
            level[i] = L;     // I'm at level L
            victim[L] = i;    // I'm the victim at level L

            // Wait while:
            //   - I'm the victim at this level, AND
            //   - Some other process is at this level or higher
            while (victim[L] == i && exists_higher(i, L)) {
                // Busy wait
            }
        }

        // ═══════════════════════════════════════════════════════════
        // CRITICAL SECTION
        // ═══════════════════════════════════════════════════════════
        access_shared_resource();

        // ═══════════════════════════════════════════════════════════
        // EXIT SECTION
        // ═══════════════════════════════════════════════════════════
        level[i] = 0;     // Back to level 0
    }
}

bool exists_higher(int i, int L) {
    // Check if any process k != i is at level L or higher
    for (int k = 0; k < N; k++) {
        if (k != i && level[k] >= L) {
            return true;
        }
    }
    return false;
}

/*
 * PROBLEMS WITH N-PROCESS PETERSON:
 *
 * 1. Space Complexity: O(N) shared variables
 *
 * 2. Time Complexity: O(N) loop iterations to enter CS
 *    Even if no one else is competing!
 *
 * 3. Contention: All N processes might spin on the same victim array
 *    causing cache line bouncing (false sharing)
 *
 * 4. Starvation: The bounded waiting bound is O(N) - a process might
 *    wait for N-1 other processes to pass first
 *
 * 5. Memory Model: Even more sensitive to reordering with more variables
 */
```

| Factor | Two-Process | N-Process | Impact |
|---|---|---|---|
| Entry Time (no contention) | O(1) | O(N) | Slow entry even without competition |
| Space Overhead | 3 variables | 2N variables | Memory scales linearly |
| Bounded Waiting | 1 CS | O(N) CS | Waiting time grows with N |
| Memory Traffic | Low | High | Cache coherence overhead |
| Implementation Ease | Simple | Complex | More variables to handle correctly |
For N processes, modern systems use scalable locks like MCS locks, CLH locks, or ticket locks. These use hardware atomics and provide O(1) space per waiter, O(1) entry time under no contention, and fair FIFO ordering. They're designed with cache coherence in mind.
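As a taste of these alternatives, here is a minimal ticket-lock sketch using C11 atomics (the type and function names are illustrative, not a standard API). A single fetch-and-add hands out tickets in FIFO order, and each waiter spins only until its number is served:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // ticket dispenser
    atomic_uint now_serving;   // ticket currently allowed in the CS
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    // fetch_add returns the OLD value: our ticket number (FIFO order)
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    // Acquire load: once we see our number, the previous holder's
    // critical-section writes are visible to us
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        // spin
    }
}

void ticket_unlock(ticket_lock_t *l) {
    unsigned next = atomic_load_explicit(&l->now_serving,
                                         memory_order_relaxed) + 1;
    // Release store: publish our critical-section writes to the next holder
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

Unlike Peterson's filter lock, this needs O(1) space regardless of the number of processes, enters in O(1) when uncontended, and guarantees strict FIFO fairness.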
Peterson's Solution uses busy waiting (also called spinning)—a process continuously checks the loop condition, consuming CPU cycles while waiting. This is fundamentally inefficient in several ways:
Problems with Busy Waiting:

- Wasted CPU cycles: a spinning process does no useful work while it waits
- Cache-coherence traffic: repeatedly reading shared variables keeps cache lines bouncing between cores
- Priority inversion: a spinning high-priority process can prevent the lock holder from ever running, as the scenario below illustrates
```
// Priority Inversion Scenario with Peterson's Solution
// ═══════════════════════════════════════════════════════════════════

Process L (Low Priority):
  - Enters critical section
  - Gets preempted by scheduler (time slice expired)
  - Still holds lock (flag[L] = true)

Process H (High Priority):
  - Wants to enter critical section
  - Checks: flag[L] && turn == L
  - flag[L] = true → Must wait!
  - SPINS... consuming CPU...

Scheduler behavior:
  - H has higher priority, so H gets CPU time
  - L cannot run because H is using the CPU
  - L cannot release the lock because it can't run
  - H cannot acquire the lock because L holds it
  - DEADLOCK-LIKE SITUATION (priority inversion)

// On single-CPU systems, this can cause complete starvation.
// Process H effectively prevents Process L from ever running.

// Solutions in real systems:
// 1. Priority Inheritance: Temporarily boost L's priority to H's level
// 2. Priority Ceiling: Lock has a priority; holder gets that priority
// 3. Use blocking synchronization instead of spinning
```

When Spinning Is Acceptable:
Despite its problems, spinning is appropriate in specific contexts:

- Very short critical sections on multiprocessors, where the expected wait is shorter than the cost of a context switch
- Kernel code that cannot sleep (e.g., interrupt handlers)
- Low-contention locks where the fast path almost always succeeds immediately
Modern mutex implementations often use a two-phase approach: spin briefly (hoping for quick lock release), then block (sleeping) if the lock isn't available. This adapts to both short and long critical sections. The Linux kernel's adaptive mutexes use this approach.
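A minimal sketch of that two-phase idea, built on a standard pthread mutex (the spin count of 100 and the sched_yield hint are arbitrary illustrative choices; real adaptive mutexes tune these dynamically):

```c
#include <pthread.h>
#include <sched.h>

// Spin briefly hoping the lock frees quickly, then fall back to blocking.
void two_phase_lock(pthread_mutex_t *m) {
    for (int i = 0; i < 100; i++) {    // phase 1: bounded spin
        if (pthread_mutex_trylock(m) == 0)
            return;                    // acquired without sleeping
        sched_yield();                 // give the holder a chance to run
    }
    pthread_mutex_lock(m);             // phase 2: block in the kernel
}
```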
Even if the CPU behaved correctly, the compiler can break Peterson's Solution through optimizations. Compilers are free to reorder, combine, or eliminate memory operations as long as the single-threaded behavior is preserved.
Dangerous Compiler Optimizations:
```c
// ═══════════════════════════════════════════════════════════════════
// PROBLEM 1: Register Caching
// ═══════════════════════════════════════════════════════════════════

// Source code:
while (flag[j] && turn == j) {
    // Wait
}

// What the compiler might generate (WRONG!):
bool cached_flag = flag[j];   // Load once into register
int cached_turn = turn;
while (cached_flag && cached_turn == j) {
    // Infinite loop! We never re-read from memory.
}

// The compiler "optimizes" by not re-reading values that don't
// change within this function. It doesn't know another thread writes them!

// ═══════════════════════════════════════════════════════════════════
// PROBLEM 2: Store Combining
// ═══════════════════════════════════════════════════════════════════

// Source code:
flag[i] = true;
// ... some computation ...
flag[i] = false;

// What the compiler might do:
// Eliminate both stores if it thinks no one reads flag[i] between them.
// Or just generate: flag[i] = false;  (dead store elimination)

// ═══════════════════════════════════════════════════════════════════
// PROBLEM 3: Instruction Reordering
// ═══════════════════════════════════════════════════════════════════

// Source code:
flag[i] = true;
turn = j;

// Compiler might reorder to:
turn = j;
flag[i] = true;

// This reordering is "safe" for single-threaded code but breaks
// the algorithm's correctness for multi-threaded execution.
```

Solutions for Compiler Issues:
- `volatile` keyword: Tells the compiler the variable may change unexpectedly. Prevents register caching and some reordering. NOT sufficient for multi-threaded correctness alone!
- Compiler barriers: `asm volatile("" ::: "memory")` in GCC prevents the compiler from reordering across the barrier.
- Atomic types: `_Atomic` or `std::atomic` informs the compiler that variables are accessed by multiple threads.
- Memory orderings: `memory_order_seq_cst`, `memory_order_acquire`, and `memory_order_release` specify ordering requirements.

A common misconception is that volatile provides thread-safety. It doesn't! volatile only prevents compiler optimizations. It doesn't prevent CPU reordering or ensure atomic access to multi-byte values. For thread-safety, you need atomics or explicit memory barriers.
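A small sketch of the difference, using hypothetical flag variables for illustration: both versions force the compiler to re-read from memory, but only the atomic version also constrains hardware ordering:

```c
#include <stdatomic.h>
#include <stdbool.h>

volatile bool flag_v;   // volatile: compiler re-reads, CPU may still reorder
atomic_bool flag_a;     // atomic: compiler re-reads AND hardware is ordered

void wait_volatile(void) {
    // Each iteration re-reads flag_v (no register caching), but there is
    // no acquire ordering: reads after the loop may still see stale data
    // written before the other thread set the flag.
    while (!flag_v) { }
}

void wait_atomic(void) {
    // Acquire load: once this sees true, everything the other thread
    // wrote before its release store to flag_a is visible here.
    while (!atomic_load_explicit(&flag_a, memory_order_acquire)) { }
}
```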
Given Peterson's Solution's limitations, what do modern systems use instead? The answer is hardware-supported atomic operations combined with proper memory ordering.
Hardware Atomic Operations:
| Operation | Description | Use Case |
|---|---|---|
| Test-and-Set (TAS) | Atomically read and set a value | Simple spinlocks |
| Compare-and-Swap (CAS) | Atomically compare and conditionally update | Lock-free algorithms, spinlocks |
| Fetch-and-Add | Atomically add and return old value | Ticket locks, counters |
| Load-Linked/Store-Conditional | Pair of operations for atomic update | ARM/POWER lock-free code |
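The first row of the table maps directly onto C11's atomic_flag type. A minimal test-and-set spinlock sketch (function names are illustrative):

```c
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void tas_lock(void) {
    // test_and_set atomically sets the flag and returns its PREVIOUS
    // value; spin until we are the one that flipped it from clear to set
    while (atomic_flag_test_and_set_explicit(&lock_flag,
                                             memory_order_acquire)) {
        // spin
    }
}

void tas_unlock(void) {
    // Release: publish our critical-section writes before freeing the lock
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```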
Modern Lock Implementation Example:
```c
#include <stdatomic.h>
#include <stdbool.h>

// Simple CAS-based spinlock using C11 atomics
typedef struct {
    atomic_bool locked;
} spinlock_t;

void spinlock_init(spinlock_t *lock) {
    atomic_store(&lock->locked, false);
}

void spinlock_lock(spinlock_t *lock) {
    bool expected = false;

    // Keep trying to change locked from false to true
    while (!atomic_compare_exchange_weak(
               &lock->locked,   // pointer to atomic variable
               &expected,       // expected value (false)
               true             // desired value (true)
           )) {
        expected = false;  // Reset expected (CAS updates it on failure)

        // Optional: spin-wait hint for CPU power efficiency
        #ifdef __x86_64__
        __asm__ __volatile__("pause");
        #endif
    }
    // Acquired! The CAS is sequentially consistent by default,
    // which includes acquire semantics.
}

void spinlock_unlock(spinlock_t *lock) {
    // Release semantics: all previous writes are visible before unlock
    atomic_store(&lock->locked, false);
}

/*
 * WHY THIS IS BETTER THAN PETERSON'S:
 *
 * 1. ATOMIC: The CAS operation is a single hardware instruction that
 *    cannot be interleaved. No need for multiple variables and careful
 *    ordering.
 *
 * 2. IMPLICIT BARRIERS: Memory order is specified in the atomic operations.
 *    The compiler and CPU know what ordering is required.
 *
 * 3. PORTABLE: Works correctly on x86, ARM, POWER, etc. The compiler
 *    generates the right instructions for each architecture.
 *
 * 4. SCALABLE: Can be extended to N processes trivially.
 *
 * 5. SIMPLE: One variable, straightforward logic.
 */
```

In practice, use your language's or OS's mutex/lock implementations (pthread_mutex, std::mutex, synchronized blocks). They're optimized for your platform, handle priority inversion, integrate with the scheduler, and have been extensively tested. Write your own locks only when you have a compelling reason and deep platform knowledge.
We've explored the substantial gap between Peterson's theoretical elegance and practical applicability. Let's consolidate our understanding:

- Peterson's Solution is proven correct only under sequential consistency, which no mainstream CPU provides by default
- Store buffers and store-load reordering can let both processes enter the critical section simultaneously
- Compilers can independently break the algorithm through register caching, dead-store elimination, and instruction reordering
- Memory barriers or C11/C++11 atomics restore correctness, at a performance cost
- The N-process generalization scales poorly in time, space, and cache traffic
- Busy waiting wastes CPU cycles and invites priority inversion, so modern systems rely on hardware atomics and blocking locks instead
Why Study Peterson's Solution Then?
Despite its limitations, Peterson's Solution teaches invaluable lessons:

- How to reason rigorously about mutual exclusion, progress, and bounded waiting
- Why memory models matter: "proven correct" always means "correct under stated assumptions"
- What hardware atomic instructions and memory barriers actually provide
- An appreciation for the engineering inside the mutexes you use every day
Understanding Peterson's Solution and its limitations makes you a better developer of concurrent systems—you appreciate what your synchronization primitives do and why they're designed as they are.
Congratulations! You've mastered Peterson's Solution—from its elegant two-variable design to its formal correctness proofs to its practical limitations. This understanding forms a foundation for studying more advanced synchronization primitives, lock-free algorithms, and memory models. Next, you might explore Dekker's Algorithm (the historical predecessor), hardware synchronization primitives, or semaphores and monitors.