Peterson's Solution is an elegant, mathematically proven algorithm. Yet if you implement it directly in C (or any modern language) and run it on a modern multi-core processor, it can fail: mutual exclusion violations and data races will occur.
How can a proven-correct algorithm fail? The answer lies in the gap between the idealized memory model under which Peterson's Solution was proven correct and the relaxed memory models of real hardware. This final page explores the practical limitations that prevent direct use of Peterson's Solution, and why understanding these limitations is crucial for any systems programmer.
By the end of this page, you will understand why modern CPUs reorder memory operations, how store buffers and caches break sequential consistency, why Peterson's Solution requires memory barriers on real hardware, the scalability limitations for more than two processes, and what modern alternatives exist.
Peterson's Solution was designed assuming sequential consistency (SC)—a memory model defined by Leslie Lamport in 1979. Under sequential consistency:
> "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
What Sequential Consistency Guarantees:

- Program order: each processor's memory operations take effect in the order its program specifies
- A single total order: all processors observe the same global interleaving of operations
- Immediate visibility: once a write appears in that total order, every later read sees it
Why Peterson's Solution Requires SC:
The algorithm depends critically on the order of operations:
```c
// Peterson's entry section for Process 0
flag[0] = true;                  // WRITE to flag[0]       ← MUST HAPPEN FIRST
turn = 1;                        // WRITE to turn          ← MUST HAPPEN SECOND
while (flag[1] && turn == 1) {   // READ flag[1] and turn  ← MUST SEE LATEST
    // Wait...
}

// The algorithm DEPENDS on:
// 1. flag[0] = true being visible to Process 1 before Process 0
//    checks flag[1] in the while loop.
//
// 2. turn = 1 being visible to Process 1 before Process 0
//    proceeds past the while loop.
//
// 3. Process 0 seeing the most recent value of flag[1] written
//    by Process 1.
//
// If any of these ordering constraints are violated, the algorithm
// can fail to provide mutual exclusion.
```

Almost no modern CPU provides sequential consistency by default. x86 provides TSO (Total Store Order), which is weaker than SC. ARM and POWER architectures provide even weaker guarantees. This isn't a bug—it's a deliberate design choice for performance.
Modern CPUs use store buffers to improve performance. When a CPU executes a write instruction, the write goes to a local buffer first, then eventually propagates to cache and main memory. This creates a delay between when a write executes and when other processors can see it.
How Store Buffers Break Peterson's Solution:
```c
// CPU 0                               CPU 1
// ─────                               ─────
// Store Buffer: [empty]               Store Buffer: [empty]
//
// Entry section:                      Entry section:
// flag[0] = true; ──┐                 flag[1] = true; ──┐
//                   │                                   │
//                   ▼                                   ▼
// Store Buffer: [flag[0]=true]        Store Buffer: [flag[1]=true]
// (NOT yet in memory!)                (NOT yet in memory!)
//
// turn = 1; ──────────┐               turn = 0; ──────────┐
//                     │                                   │
//                     ▼                                   ▼
// Store Buffer: [flag[0]=true,        Store Buffer: [flag[1]=true,
//                turn=1]                             turn=0]
//
// While check:                        While check:
//   READ flag[1]                        READ flag[0]
//   → From MEMORY (not buffer)          → From MEMORY (not buffer)
//   → flag[1] = false! (stale)          → flag[0] = false! (stale)
//   READ turn                           READ turn
//   → turn might be 0 or 1              → turn might be 0 or 1
//
// If both CPUs see the other's flag as false:
//   Condition: false && anything = false
//   BOTH ENTER CRITICAL SECTION!
//
// 💥 MUTUAL EXCLUSION VIOLATED!

// Timeline showing the race:
// ═══════════════════════════════════════════════════════════════════
// T1: CPU 0 writes flag[0]=true to store buffer
// T2: CPU 1 writes flag[1]=true to store buffer
// T3: CPU 0 writes turn=1 to store buffer
// T4: CPU 1 writes turn=0 to store buffer
// T5: CPU 0 reads flag[1] from memory → still false (buffer not drained!)
// T6: CPU 1 reads flag[0] from memory → still false (buffer not drained!)
// T7: Both CPUs enter critical section
//
// The stores were in the buffer but not yet visible to other CPUs!
```

The Core Problem:
Store buffers create a window where a processor's writes are visible to itself (it can read its own recent writes from the buffer) but not to other processors. During this window:
- The writing processor reads flag[0] = true (from its own buffer)
- Other processors read flag[0] = false (from stale memory)

This asymmetry breaks the algorithm's assumption that writes are immediately globally visible.
Processors can forward their own writes from the store buffer to subsequent reads—a feature called store-to-load forwarding. This is why a CPU sees its own writes immediately. However, other CPUs must wait until the write propagates to cache and is coherent across the system.
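You can observe this effect directly with the classic "store buffering" litmus test. Below is a minimal harness sketch assuming POSIX threads and C11 atomics; the thread structure, iteration count, and use of relaxed ordering are illustrative choices, not part of Peterson's algorithm. Under sequential consistency, at least one thread must see the other's store, so `r0 == 0 && r1 == 0` would be impossible; on real x86 hardware the store buffer makes it appear regularly.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;   // stand-ins for flag[0] and flag[1]
int r0, r1;        // each written by one thread, read after join

void *writer0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);  // "flag[0] = true"
    r0 = atomic_load_explicit(&y, memory_order_relaxed); // "read flag[1]"
    return NULL;
}

void *writer1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);  // "flag[1] = true"
    r1 = atomic_load_explicit(&x, memory_order_relaxed); // "read flag[0]"
    return NULL;
}

int main(void) {
    int violations = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, writer0, NULL);
        pthread_create(&b, NULL, writer1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0) violations++;  // forbidden under SC
    }
    printf("SC violations observed: %d\n", violations);
    return 0;
}
```

Compile with `-pthread`; any nonzero count is the store buffer (or reordering) at work.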
Beyond store buffers, modern CPUs and compilers can reorder instructions for performance optimization. An operation that appears before another in source code may execute after it—as long as the reordering is invisible to the single-threaded execution.
Types of Reordering:
| Reordering Type | Description | Effect on Peterson's |
|---|---|---|
| Load-Load | A later load executes before an earlier load | Can read stale turn/flag values |
| Store-Store | A later store becomes visible before an earlier store | flag and turn writes may appear in wrong order |
| Load-Store | A later store executes before an earlier load | Write to flag may appear before reading other's flag |
| Store-Load | A later load executes before an earlier store | Most common problem: read flag before own write visible |
Store-Load Reordering: The Critical Case:
The most problematic reordering for Peterson's Solution is store-load reordering—where a load (read) is executed before a preceding store (write) has become visible to other processors.
```c
// Source code (Process 0):
flag[0] = true;   // STORE
turn = 1;         // STORE
x = flag[1];      // LOAD (in while condition)

// What the CPU might ACTUALLY execute:
x = flag[1];      // LOAD  - executed FIRST (speculation!)
flag[0] = true;   // STORE - executed later
turn = 1;         // STORE - executed even later

// Why? The CPU doesn't yet know the values for flag[0] and turn,
// but it can speculatively load flag[1] while computing the write values.
// This is safe for SINGLE-THREADED code (same result either way).
// But it breaks multi-threaded algorithms!

// How this breaks mutual exclusion:
// ═══════════════════════════════════════════════════════════════════
//
// CPU 0 reorders to:           CPU 1 reorders to:
//   LOAD  flag[1] → false        LOAD  flag[0] → false
//   STORE flag[0] = true         STORE flag[1] = true
//   STORE turn = 1               STORE turn = 0
//
// Both CPUs saw the other's flag as false because they loaded BEFORE
// the other's store was executed.
//
// Both while conditions are false.
// BOTH ENTER CRITICAL SECTION!
```

Even x86, which has a relatively strong memory model (TSO), allows store-load reordering. This single allowance is enough to break Peterson's Solution. ARM and POWER allow all four types of reordering, making them particularly problematic for algorithms assuming sequential consistency.
Peterson's Solution can be made to work on modern hardware by inserting memory barriers (also called memory fences). A memory barrier is a CPU instruction that enforces ordering constraints on memory operations.
Types of Memory Barriers:

- Store (write) barrier: all stores before the barrier become globally visible before any store after it
- Load (read) barrier: all loads before the barrier complete before any load after it
- Full barrier: orders all loads and stores across the barrier (e.g., `mfence` on x86, `dmb ish` on ARM)
Peterson's Solution with Barriers:
```c
#include <stdatomic.h>
#include <stdbool.h>

// Shared variables (can use volatile + explicit barriers or atomics)
volatile bool flag[2] = {false, false};
volatile int turn = 0;

// Memory barrier macros (platform-specific)
#ifdef __x86_64__
    #define FULL_BARRIER() __asm__ __volatile__("mfence" ::: "memory")
#elif defined(__aarch64__)
    #define FULL_BARRIER() __asm__ __volatile__("dmb ish" ::: "memory")
#else
    #define FULL_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

void process(int i) {
    int j = 1 - i;

    while (1) {
        // Non-critical section

        // ═══════════════════════════════════════════════════════════
        // ENTRY SECTION (with barriers)
        // ═══════════════════════════════════════════════════════════
        flag[i] = true;   // STORE: announce intent

        FULL_BARRIER();   // ← BARRIER 1: Ensure flag write is visible
                          //   before turn write and before reading flag[j]

        turn = j;         // STORE: yield to other process

        FULL_BARRIER();   // ← BARRIER 2: Ensure turn write is visible
                          //   before reading flag[j] and turn in loop

        while (flag[j] && turn == j) {
            // Busy wait
            // No barrier needed inside loop on x86 (loads acquire from cache)
            // On ARM/POWER, may need a load barrier to see updates
        }

        // ACQUIRE semantics - we now "hold" the lock
        FULL_BARRIER();   // ← BARRIER 3: Ensure we see all writes by
                          //   the previous CS holder

        // ═══════════════════════════════════════════════════════════
        // CRITICAL SECTION
        // ═══════════════════════════════════════════════════════════
        access_shared_resource();

        // ═══════════════════════════════════════════════════════════
        // EXIT SECTION (with barrier)
        // ═══════════════════════════════════════════════════════════
        FULL_BARRIER();   // ← BARRIER 4: Ensure all CS writes complete
                          //   before clearing flag (RELEASE semantics)

        flag[i] = false;  // STORE: release interest
    }
}
```

Instead of explicit barriers, modern C/C++ provides atomic types with memory ordering guarantees. Using atomic_store_explicit with memory_order_seq_cst for stores and atomic_load_explicit for loads achieves the same effect with more portable code. The compiler inserts appropriate barriers for the target architecture.
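As a sketch of that atomics-based approach (the function names are illustrative), Peterson's entry and exit sections look like this. Every access uses memory_order_seq_cst, so the compiler emits whatever barriers the target architecture requires:

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];   // zero-initialized: both false
atomic_int turn;

void peterson_lock(int i) {
    int j = 1 - i;
    atomic_store_explicit(&flag[i], true, memory_order_seq_cst);
    atomic_store_explicit(&turn, j, memory_order_seq_cst);
    // seq_cst loads: each iteration re-reads both variables from memory,
    // in order, so neither the compiler nor the CPU can cache or reorder them
    while (atomic_load_explicit(&flag[j], memory_order_seq_cst) &&
           atomic_load_explicit(&turn, memory_order_seq_cst) == j) {
        // busy wait
    }
}

void peterson_unlock(int i) {
    atomic_store_explicit(&flag[i], false, memory_order_seq_cst);
}
```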
Peterson's Solution as presented is for exactly two processes. Extending it to N processes is possible but introduces significant complexity and limitations.
Peterson's N-Process Generalization:
The standard generalization uses a filter lock, also known as Peterson's filter algorithm:
```c
// Peterson's Filter Lock for N processes
#include <stdbool.h>

#define N 4            // Number of processes

int level[N];          // level[i]  = current level of process i (0 to N-1)
int victim[N];         // victim[L] = which process is the victim at level L

bool exists_higher(int i, int L);   // forward declaration

void process(int i) {
    while (1) {
        // ═══════════════════════════════════════════════════════════
        // ENTRY SECTION: Go through N-1 levels
        // ═══════════════════════════════════════════════════════════
        for (int L = 1; L < N; L++) {
            level[i] = L;     // I'm at level L
            victim[L] = i;    // I'm the victim at level L

            // Wait while:
            //   - I'm the victim at this level, AND
            //   - Some other process is at this level or higher
            while (victim[L] == i && exists_higher(i, L)) {
                // Busy wait
            }
        }

        // ═══════════════════════════════════════════════════════════
        // CRITICAL SECTION
        // ═══════════════════════════════════════════════════════════
        access_shared_resource();

        // ═══════════════════════════════════════════════════════════
        // EXIT SECTION
        // ═══════════════════════════════════════════════════════════
        level[i] = 0;     // Back to level 0
    }
}

bool exists_higher(int i, int L) {
    // Check if any process k != i is at level L or higher
    for (int k = 0; k < N; k++) {
        if (k != i && level[k] >= L) {
            return true;
        }
    }
    return false;
}

/*
 * PROBLEMS WITH N-PROCESS PETERSON:
 *
 * 1. Space Complexity: O(N) shared variables
 *
 * 2. Time Complexity: O(N) loop iterations to enter CS
 *    Even if no one else is competing!
 *
 * 3. Contention: All N processes might spin on the same victim array
 *    causing cache line bouncing (false sharing)
 *
 * 4. Starvation: The bounded waiting bound is O(N) - a process might
 *    wait for N-1 other processes to pass first
 *
 * 5. Memory Model: Even more sensitive to reordering with more variables
 */
```

| Factor | Two-Process | N-Process | Impact |
|---|---|---|---|
| Entry Time (no contention) | O(1) | O(N) | Slow entry even without competition |
| Space Overhead | 3 variables | 2N variables | Memory scales linearly |
| Bounded Waiting | 1 CS | O(N) CS | Waiting time grows with N |
| Memory Traffic | Low | High | Cache coherence overhead |
| Implementation Ease | Simple | Complex | More variables to handle correctly |
For N processes, modern systems use scalable locks like MCS locks, CLH locks, or ticket locks. These use hardware atomics and provide O(1) space per waiter, O(1) entry time under no contention, and fair FIFO ordering. They're designed with cache coherence in mind.
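As a taste of these alternatives, here is a minimal ticket-lock sketch using C11 atomics (the type and function names are illustrative, not a standard API). A single fetch-and-add hands out tickets in FIFO order, and each waiter spins only until its number is served:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // ticket dispenser
    atomic_uint now_serving;   // ticket currently allowed in the CS
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    // fetch_add returns the OLD value: our ticket number (FIFO order)
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    // Acquire load: once we see our number, the previous holder's
    // critical-section writes are visible to us
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my) {
        // spin
    }
}

void ticket_unlock(ticket_lock_t *l) {
    unsigned next = atomic_load_explicit(&l->now_serving,
                                         memory_order_relaxed) + 1;
    // Release store: publish our critical-section writes to the next holder
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

Unlike Peterson's filter lock, this needs O(1) space regardless of the number of processes, enters in O(1) when uncontended, and guarantees strict FIFO fairness.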
Peterson's Solution uses busy waiting (also called spinning)—a process continuously checks the loop condition, consuming CPU cycles while waiting. This is fundamentally inefficient in several ways:
Problems with Busy Waiting:

- Wasted CPU cycles: a spinning process does no useful work while it waits
- Cache-coherence traffic: repeatedly reading shared variables keeps cache lines bouncing between cores
- Priority inversion: a spinning high-priority process can prevent the lock holder from ever running, as the scenario below illustrates
```
// Priority Inversion Scenario with Peterson's Solution
// ═══════════════════════════════════════════════════════════════════

Process L (Low Priority):
  - Enters critical section
  - Gets preempted by scheduler (time slice expired)
  - Still holds lock (flag[L] = true)

Process H (High Priority):
  - Wants to enter critical section
  - Checks: flag[L] && turn == L
  - flag[L] = true → Must wait!
  - SPINS... consuming CPU...

Scheduler behavior:
  - H has higher priority, so H gets CPU time
  - L cannot run because H is using the CPU
  - L cannot release the lock because it can't run
  - H cannot acquire the lock because L holds it
  - DEADLOCK-LIKE SITUATION (priority inversion)

// On single-CPU systems, this can cause complete starvation.
// Process H effectively prevents Process L from ever running.

// Solutions in real systems:
// 1. Priority Inheritance: Temporarily boost L's priority to H's level
// 2. Priority Ceiling: Lock has a priority; holder gets that priority
// 3. Use blocking synchronization instead of spinning
```

When Spinning Is Acceptable:
Despite its problems, spinning is appropriate in specific contexts:

- Very short critical sections on multiprocessors, where the expected wait is shorter than the cost of a context switch
- Kernel code that cannot sleep (e.g., interrupt handlers)
- Low-contention locks where the fast path almost always succeeds immediately
Modern mutex implementations often use a two-phase approach: spin briefly (hoping for quick lock release), then block (sleeping) if the lock isn't available. This adapts to both short and long critical sections. The Linux kernel's adaptive mutexes use this approach.
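A minimal sketch of that two-phase idea, built on a standard pthread mutex (the spin count of 100 and the sched_yield hint are arbitrary illustrative choices; real adaptive mutexes tune these dynamically):

```c
#include <pthread.h>
#include <sched.h>

// Spin briefly hoping the lock frees quickly, then fall back to blocking.
void two_phase_lock(pthread_mutex_t *m) {
    for (int i = 0; i < 100; i++) {    // phase 1: bounded spin
        if (pthread_mutex_trylock(m) == 0)
            return;                    // acquired without sleeping
        sched_yield();                 // give the holder a chance to run
    }
    pthread_mutex_lock(m);             // phase 2: block in the kernel
}
```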
Even if the CPU behaved correctly, the compiler can break Peterson's Solution through optimizations. Compilers are free to reorder, combine, or eliminate memory operations as long as the single-threaded behavior is preserved.
Dangerous Compiler Optimizations:
```c
// ═══════════════════════════════════════════════════════════════════
// PROBLEM 1: Register Caching
// ═══════════════════════════════════════════════════════════════════

// Source code:
while (flag[j] && turn == j) {
    // Wait
}

// What the compiler might generate (WRONG!):
bool cached_flag = flag[j];   // Load once into register
int cached_turn = turn;
while (cached_flag && cached_turn == j) {
    // Infinite loop! We never re-read from memory.
}

// The compiler "optimizes" by not re-reading values that don't
// change within this function. It doesn't know another thread writes them!

// ═══════════════════════════════════════════════════════════════════
// PROBLEM 2: Store Combining
// ═══════════════════════════════════════════════════════════════════

// Source code:
flag[i] = true;
// ... some computation ...
flag[i] = false;

// What the compiler might do:
// Eliminate both stores if it thinks no one reads flag[i] between them.
// Or just generate: flag[i] = false;  (dead store elimination)

// ═══════════════════════════════════════════════════════════════════
// PROBLEM 3: Instruction Reordering
// ═══════════════════════════════════════════════════════════════════

// Source code:
flag[i] = true;
turn = j;

// Compiler might reorder to:
turn = j;
flag[i] = true;

// This reordering is "safe" for single-threaded code but breaks
// the algorithm's correctness for multi-threaded execution.
```

Solutions for Compiler Issues:
- `volatile` keyword: Tells the compiler the variable may change unexpectedly. Prevents register caching and some reordering. NOT sufficient for multi-threaded correctness alone!
- Compiler barriers: `asm volatile("" ::: "memory")` in GCC prevents the compiler from reordering across the barrier.
- Atomic types: `_Atomic` or `std::atomic` informs the compiler that variables are accessed by multiple threads.
- Memory orderings: `memory_order_seq_cst`, `memory_order_acquire`, and `memory_order_release` specify ordering requirements.

A common misconception is that volatile provides thread-safety. It doesn't! volatile only prevents compiler optimizations. It doesn't prevent CPU reordering or ensure atomic access to multi-byte values. For thread-safety, you need atomics or explicit memory barriers.
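A small sketch of the difference, using hypothetical flag variables for illustration: both versions force the compiler to re-read from memory, but only the atomic version also constrains hardware ordering:

```c
#include <stdatomic.h>
#include <stdbool.h>

volatile bool flag_v;   // volatile: compiler re-reads, CPU may still reorder
atomic_bool flag_a;     // atomic: compiler re-reads AND hardware is ordered

void wait_volatile(void) {
    // Each iteration re-reads flag_v (no register caching), but there is
    // no acquire ordering: reads after the loop may still see stale data
    // written before the other thread set the flag.
    while (!flag_v) { }
}

void wait_atomic(void) {
    // Acquire load: once this sees true, everything the other thread
    // wrote before its release store to flag_a is visible here.
    while (!atomic_load_explicit(&flag_a, memory_order_acquire)) { }
}
```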
Given Peterson's Solution's limitations, what do modern systems use instead? The answer is hardware-supported atomic operations combined with proper memory ordering.
Hardware Atomic Operations:
| Operation | Description | Use Case |
|---|---|---|
| Test-and-Set (TAS) | Atomically read and set a value | Simple spinlocks |
| Compare-and-Swap (CAS) | Atomically compare and conditionally update | Lock-free algorithms, spinlocks |
| Fetch-and-Add | Atomically add and return old value | Ticket locks, counters |
| Load-Linked/Store-Conditional | Pair of operations for atomic update | ARM/POWER lock-free code |
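The first row of the table maps directly onto C11's atomic_flag type. A minimal test-and-set spinlock sketch (function names are illustrative):

```c
#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void tas_lock(void) {
    // test_and_set atomically sets the flag and returns its PREVIOUS
    // value; spin until we are the one that flipped it from clear to set
    while (atomic_flag_test_and_set_explicit(&lock_flag,
                                             memory_order_acquire)) {
        // spin
    }
}

void tas_unlock(void) {
    // Release: publish our critical-section writes before freeing the lock
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}
```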
Modern Lock Implementation Example:
```c
#include <stdatomic.h>
#include <stdbool.h>

// Simple CAS-based spinlock using C11 atomics
typedef struct {
    atomic_bool locked;
} spinlock_t;

void spinlock_init(spinlock_t *lock) {
    atomic_store(&lock->locked, false);
}

void spinlock_lock(spinlock_t *lock) {
    bool expected = false;

    // Keep trying to change locked from false to true
    while (!atomic_compare_exchange_weak(
               &lock->locked,   // pointer to atomic variable
               &expected,       // expected value (false)
               true             // desired value (true)
           )) {
        expected = false;  // Reset expected (CAS updates it on failure)

        // Optional: spin-wait hint for CPU power efficiency
        #ifdef __x86_64__
        __asm__ __volatile__("pause");
        #endif
    }
    // Acquired! The CAS is sequentially consistent by default,
    // which includes acquire semantics.
}

void spinlock_unlock(spinlock_t *lock) {
    // Release semantics: all previous writes are visible before unlock
    atomic_store(&lock->locked, false);
}

/*
 * WHY THIS IS BETTER THAN PETERSON'S:
 *
 * 1. ATOMIC: The CAS operation is a single hardware instruction that
 *    cannot be interleaved. No need for multiple variables and careful
 *    ordering.
 *
 * 2. IMPLICIT BARRIERS: Memory order is specified in the atomic operations.
 *    The compiler and CPU know what ordering is required.
 *
 * 3. PORTABLE: Works correctly on x86, ARM, POWER, etc. The compiler
 *    generates the right instructions for each architecture.
 *
 * 4. SCALABLE: Can be extended to N processes trivially.
 *
 * 5. SIMPLE: One variable, straightforward logic.
 */
```

In practice, use your language's or OS's mutex/lock implementations (pthread_mutex, std::mutex, synchronized blocks). They're optimized for your platform, handle priority inversion, integrate with the scheduler, and have been extensively tested. Write your own locks only when you have a compelling reason and deep platform knowledge.
We've explored the substantial gap between Peterson's theoretical elegance and practical applicability. Let's consolidate our understanding:

- Peterson's Solution is proven correct only under sequential consistency, which no mainstream CPU provides by default
- Store buffers and store-load reordering can let both processes enter the critical section simultaneously
- Compilers can independently break the algorithm through register caching, dead-store elimination, and instruction reordering
- Memory barriers or C11/C++11 atomics restore correctness, at a performance cost
- The N-process generalization scales poorly in time, space, and cache traffic
- Busy waiting wastes CPU cycles and invites priority inversion, so modern systems rely on hardware atomics and blocking locks instead
Why Study Peterson's Solution Then?
Despite its limitations, Peterson's Solution teaches invaluable lessons:

- How to reason rigorously about mutual exclusion, progress, and bounded waiting
- Why memory models matter: "proven correct" always means "correct under stated assumptions"
- What hardware atomic instructions and memory barriers actually provide
- An appreciation for the engineering inside the mutexes you use every day
Understanding Peterson's Solution and its limitations makes you a better developer of concurrent systems—you appreciate what your synchronization primitives do and why they're designed as they are.
Congratulations! You've mastered Peterson's Solution—from its elegant two-variable design to its formal correctness proofs to its practical limitations. This understanding forms a foundation for studying more advanced synchronization primitives, lock-free algorithms, and memory models. Next, you might explore Dekker's Algorithm (the historical predecessor), hardware synchronization primitives, or semaphores and monitors.