Parallel execution is only useful when we can combine the results. A program that spawns three parallel computations but never collects their results hasn't accomplished meaningful work. The point where parallel paths converge—where concurrent execution reunites into a single flow—is the synchronization point.
In the parbegin/parend model, the most prominent synchronization point is parend itself—the barrier that waits for all parallel branches to complete. But synchronization is a richer concept than this single barrier. Understanding synchronization points in depth reveals how parallel programs coordinate, where performance bottlenecks arise, and how to balance parallelism with correctness.
By the end of this page, you will understand implicit barrier synchronization at parend, explicit synchronization within parallel blocks, partial barriers and phased synchronization, the performance implications of synchronization granularity, and how synchronization points relate to correctness properties.
The most fundamental synchronization point in the parbegin/parend model is parend itself. This is an implicit barrier—a synchronization construct that blocks execution until all participating threads arrive.
Barrier Semantics:
A barrier is a synchronization primitive with the following behavior: each participating thread blocks when it reaches the barrier, no thread proceeds until every participant has arrived, and once the last thread arrives, all of them are released together.
In parbegin/parend, the barrier is implicit:
parbegin
S1; // Branch 1 executes...
S2; // Branch 2 executes...
S3; // Branch 3 executes...
parend // ← IMPLICIT BARRIER: Wait for S1, S2, S3 to complete
S4; // Executes only after all branches complete
Unlike explicit barriers in thread libraries (which can be placed anywhere), the parend barrier is structurally defined—it's always at the exact lexical end of the parallel block.
The Implicit Nature:
The parend barrier is implicit in two senses:
No explicit wait statement: You don't write 'wait for all branches.' The parend keyword automatically implies this.
No barrier omission possible: Unlike explicit synchronization, you cannot 'forget' the barrier. The syntax enforces it.
This implicitness is precisely what makes parbegin/parend structured. The synchronization is part of the construct's definition, not a programmer responsibility.
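As an analogue in a real language, here is a minimal Python sketch (assuming the branches are ordinary functions): starting one thread per branch plays the role of parbegin, and joining them all plays the role of the implicit parend barrier.

import threading

def parbegin_parend(*branches):
    """Run each branch in its own thread; joining all of them
    is the analogue of the implicit parend barrier."""
    threads = [threading.Thread(target=b) for b in branches]
    for t in threads:
        t.start()   # parbegin: fork all branches
    for t in threads:
        t.join()    # parend: wait until every branch completes

def s1(): print("S1 done")
def s2(): print("S2 done")
def s3(): print("S3 done")

parbegin_parend(s1, s2, s3)
print("S4 runs only after all branches complete")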
The barrier releases when all branches complete, but this doesn't mean all branches terminate successfully. If a branch throws an exception, it still 'completes' (in an exceptional state), and the barrier releases. Error handling semantics determine what happens after the barrier—whether exceptions propagate or are aggregated.
While parend provides barrier synchronization at the end of a parallel block, real concurrent programs often need finer-grained synchronization within the block. This is where explicit synchronization primitives come into play.
Why Explicit Synchronization?
Consider a parallel block where branches must coordinate their access to shared data:
shared counter := 0;
parbegin
// Branch 1
counter := counter + 1; // Read-modify-write
// Branch 2
counter := counter + 1; // Read-modify-write
parend
// Expected: counter = 2
// Actual: counter might be 1 (race condition!)
The parend barrier ensures both branches complete, but it does nothing to prevent race conditions during execution. We need mutual exclusion or atomic operations within the branches.
// Explicit synchronization with semaphores
shared counter := 0;
shared mutex := Semaphore(1); // Binary semaphore for mutual exclusion

parbegin
  // Branch 1 (with explicit synchronization)
  begin
    P(mutex);               // Acquire lock
    counter := counter + 1;
    V(mutex);               // Release lock
  end;
  // Branch 2 (with explicit synchronization)
  begin
    P(mutex);               // Acquire lock
    counter := counter + 1;
    V(mutex);               // Release lock
  end;
parend
// Guaranteed: counter = 2

// The synchronization points are:
// 1. P(mutex) - wait if mutex is 0, otherwise acquire
// 2. V(mutex) - release, potentially waking waiting threads
// 3. parend - implicit barrier

Explicit synchronization is constrained by the parbegin/parend structure. A mutex acquired in one branch must be released in the same branch before parend—otherwise, the structure's completion guarantee would be violated (a branch would 'complete' while holding a lock that other branches might need).
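For comparison, a minimal Python sketch of the same discipline, using threading.Semaphore(1) as the binary semaphore; the increment function and thread setup are illustrative rather than part of the original model.

import threading

counter = 0
mutex = threading.Semaphore(1)       # binary semaphore for mutual exclusion

def increment():
    global counter
    mutex.acquire()                  # P(mutex): wait, then acquire
    counter = counter + 1            # read-modify-write inside the critical section
    mutex.release()                  # V(mutex): release, waking any waiter

threads = [threading.Thread(target=increment) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()           # analogue of the implicit parend barrier
print(counter)                       # always 2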
Complex parallel algorithms often require multiple synchronization points during execution—not just at the end. This leads to phased execution patterns where branches synchronize at intermediate barriers.
The Problem with Single Barriers:
Consider a parallel algorithm with dependencies between phases:
// Phase 1: Compute partial results
// Phase 2: Exchange partial results between branches
// Phase 3: Compute final results using exchanged data
With only parend as synchronization, this doesn't work:
parbegin
compute_partial_A(); exchange_data(); compute_final_A();
compute_partial_B(); exchange_data(); compute_final_B();
parend
Problem: Branch A might reach exchange_data() while Branch B is still computing partial results. The data that A tries to read from B doesn't exist yet!
We need a barrier between phases.
// Phased execution with explicit barriers
shared barrier := Barrier(2); // Barrier for 2 threads

parbegin
  // Branch A
  begin
    compute_partial_A();  // Phase 1
    barrier.wait();       // Synchronization point 1
    exchange_data_A();    // Phase 2
    barrier.wait();       // Synchronization point 2
    compute_final_A();    // Phase 3
  end;
  // Branch B
  begin
    compute_partial_B();  // Phase 1
    barrier.wait();       // Synchronization point 1
    exchange_data_B();    // Phase 2
    barrier.wait();       // Synchronization point 2
    compute_final_B();    // Phase 3
  end;
parend // Synchronization point 3 (implicit)

// Synchronization points:
// SP1: After Phase 1, before Phase 2 (explicit barrier)
// SP2: After Phase 2, before Phase 3 (explicit barrier)
// SP3: After Phase 3, at parend (implicit barrier)

Phased Execution Patterns:
Phased execution is common in parallel algorithms:
Iterative Algorithms: Each iteration is a phase. Branches synchronize before the next iteration reads updated values.
Pipeline Stages: Data flows through stages; each stage completes for all branches before the next begins.
Bulk Synchronous Parallel (BSP): A formal model where computation alternates between computation phases and communication phases, with barriers between.
MapReduce Phases: The shuffle/barrier between map and reduce phases ensures all mappers complete before reducers begin.
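As a concrete illustration of the two-phase pattern sketched above, here is a small Python version using threading.Barrier; the phase computations are placeholder arithmetic chosen only to show where the synchronization points fall.

import threading

NUM_WORKERS = 2
barrier = threading.Barrier(NUM_WORKERS)   # reusable barrier shared by all phases
partials = [None] * NUM_WORKERS
finals = [None] * NUM_WORKERS

def worker(i):
    partials[i] = i * 10                        # Phase 1: compute a partial result
    barrier.wait()                              # SP1: all partials exist before anyone reads them
    neighbor = partials[(i + 1) % NUM_WORKERS]  # Phase 2: exchange data
    barrier.wait()                              # SP2: exchange complete before final computation
    finals[i] = partials[i] + neighbor          # Phase 3: final result

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()                      # SP3: the parend-style implicit barrier
print(finals)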
Each barrier is a synchronization point where threads may wait. More barriers mean more potential idle time (waiting for the slowest thread). The art of parallel algorithm design is minimizing barrier frequency while maintaining correctness—coarser granularity often yields better performance.
The parend barrier is global—it waits for ALL branches. But sometimes we need partial synchronization: waiting for a subset of branches or for specific conditions, without blocking all branches.
Scenarios Requiring Partial Synchronization:
Dependent Subsets: Branch C needs results from A and B, but D is independent.
First-to-Complete: Wait for ANY branch to complete, then continue (racing).
Quorum: Wait for at least K of N branches to complete.
Conditional Barriers: Only synchronize if a condition is true.
The example below illustrates the dependent-subsets case using condition variables.
// Partial synchronization with condition variables
shared result_a, result_b: Data;
shared a_done, b_done: Boolean := false;
shared cond: ConditionVariable;
shared mutex: Mutex;

parbegin
  // Branch A
  begin
    result_a := compute_a();
    mutex.lock();
    a_done := true;
    cond.signal_all(); // Wake waiting branches
    mutex.unlock();
  end;
  // Branch B
  begin
    result_b := compute_b();
    mutex.lock();
    b_done := true;
    cond.signal_all(); // Wake waiting branches
    mutex.unlock();
  end;
  // Branch C (waits for A and B, but not D)
  begin
    mutex.lock();
    while not (a_done and b_done) do
      cond.wait(mutex); // Partial sync: wait for A and B only
    end;
    mutex.unlock();
    result_c := combine(result_a, result_b);
  end;
  // Branch D (independent, no waiting)
  begin
    result_d := compute_d(); // Doesn't wait for anyone
  end;
parend

Partial Sync vs. Restructuring:
Partial synchronization can often be avoided by restructuring the parallel block:
// Before: Partial sync within single parbegin
parbegin
A; B; C (waits for A, B); D;
parend
// After: Nested structure eliminates partial sync
parbegin
begin
parbegin A; B; parend
C; // C naturally follows A and B
end;
D; // D runs in parallel with the above
parend
The restructured version uses the natural barrier at inner parend. This is often clearer than explicit partial synchronization.
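A minimal Python sketch of the restructured version, assuming illustrative task functions: the inner join replaces the explicit partial synchronization, and the outer join plays the role of the outer parend.

import threading

def task_a(): print("A done")
def task_b(): print("B done")
def task_c(): print("C runs only after A and B")
def task_d(): print("D runs independently")

def a_b_then_c():
    # Inner "parbegin A; B; parend": C waits only for A and B, via join
    inner = [threading.Thread(target=task_a), threading.Thread(target=task_b)]
    for t in inner: t.start()
    for t in inner: t.join()   # natural barrier replacing explicit partial sync
    task_c()

outer = [threading.Thread(target=a_b_then_c), threading.Thread(target=task_d)]
for t in outer: t.start()
for t in outer: t.join()       # outer parend: waits for both parallel chains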
The parbegin/parend model favors structure over flexibility. Partial synchronization is possible with explicit primitives, but restructuring to use natural barriers is often preferable. When partial sync is unavoidable, it indicates that the problem's dependency graph doesn't fit neatly into structured parallelism.
Synchronization points are where parallelism is constrained. Every barrier, lock acquisition, and wait operation is a potential bottleneck. Understanding the performance implications is crucial for designing efficient parallel programs.
Amdahl's Law and Synchronization:
Amdahl's Law describes the theoretical speedup of parallel execution:
Speedup = 1 / (S + P/N)
Where S is the serial (non-parallelizable) fraction of the work, P is the parallelizable fraction (with S + P = 1), and N is the number of processors.
Critical insight: Synchronization points create sequential sections. When all threads wait at a barrier, that's sequential execution—only the last-arriving thread does useful work; others idle.
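A small worked example of the formula (the 10% serial fraction is an illustrative figure, not a measurement): even unlimited processors cannot push the speedup past 1/S.

def amdahl_speedup(serial_fraction, n_processors):
    """Speedup = 1 / (S + P/N), with P = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# If barrier waits make 10% of the program effectively serial:
for n in (2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.10, n), 2))
# The speedup approaches 1/0.10 = 10x no matter how many processors are added.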
| Synchronization Type | Overhead | Contention Pattern | Mitigation |
|---|---|---|---|
| parend barrier | O(N) threads wait for slowest | Slowest branch determines total time | Load balancing, work stealing |
| Mutex critical section | Only one thread proceeds at a time | Queue builds at hot locks | Lock-free algorithms, finer granularity |
| Reader-writer lock | Writers block all; readers share | Writers starve with many readers | Writer priority, RCU patterns |
| Condition variable wait | Thread sleeps until signaled | Thundering herd on broadcast | Selective signaling, work queues |
| Multiple barriers | Threads idle at each barrier | Cumulative wait time | Reduce barrier count, merge phases |
// Load imbalance at synchronization points
parbegin
  task_a(); // Takes 100ms
  task_b(); // Takes 200ms
  task_c(); // Takes 50ms
parend
// Total time: max(100, 200, 50) = 200ms
// Ideal parallel time: (100+200+50)/3 = 117ms
// Efficiency: 117/200 = 58.5%
// Branches A and C wait 100ms and 150ms respectively, idle at the barrier

// Improved version with load balancing:
work_items = [a1, a2, b1, b2, b3, b4, c]; // Smaller chunks
parbegin
  worker_1: process_items(subset_1); // More balanced workload
  worker_2: process_items(subset_2);
  worker_3: process_items(subset_3);
parend
// Better load balance → less idle time at barrier

A barrier where threads wait is not parallel execution—it's effectively sequential. If your parallel algorithm spends 50% of its time at barriers (waiting for slow branches), you can at best achieve 2x speedup regardless of how many processors you use. Identify and minimize synchronization bottlenecks.
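A minimal Python sketch of the load-balanced variant, using a shared work queue so no branch is stuck with all the long work; the sleep durations stand in for real computation and are illustrative.

import threading, queue, time

work_items = [0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]   # task durations in seconds
tasks = queue.Queue()
for item in work_items:
    tasks.put(item)

def worker():
    # Each worker pulls small chunks until the queue is empty,
    # so the slowest branch is never far behind the others.
    while True:
        try:
            duration = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(duration)        # stand-in for real computation

threads = [threading.Thread(target=worker) for _ in range(3)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()          # parend-style barrier; idle time here is now small
print(f"elapsed: {time.perf_counter() - start:.2f}s")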
Synchronization isn't just about performance—it's essential for correctness. Without proper synchronization, parallel programs exhibit race conditions, data corruption, and non-deterministic behavior.
Correctness Properties and Synchronization:
1. Safety Properties (Nothing bad happens): mutual exclusion in critical sections, freedom from data races, and preservation of data-structure invariants.
2. Liveness Properties (Something good eventually happens): every branch eventually makes progress, no deadlock occurs, and no branch is starved indefinitely.
Synchronization points are where we establish these properties. Proper barrier usage ensures safety; careful synchronization design ensures liveness.
// Correctness through synchronization
shared linked_list: List;
shared mutex: Mutex;

parbegin
  // Branch A: Insert operation
  begin
    node := allocate_node(data_a);
    mutex.lock();
    // CRITICAL SECTION START
    // Invariant expected: list is well-formed
    node.next := linked_list.head;
    linked_list.head := node;
    // Invariant maintained: list is still well-formed
    // CRITICAL SECTION END
    mutex.unlock();
  end;
  // Branch B: Insert operation
  begin
    node := allocate_node(data_b);
    mutex.lock();
    // CRITICAL SECTION START
    // Branch B enters only after A exits (mutual exclusion)
    node.next := linked_list.head;
    linked_list.head := node;
    // CRITICAL SECTION END
    mutex.unlock();
  end;
  // Branch C: Read operation
  begin
    mutex.lock();
    // CRITICAL SECTION START
    // Sees a consistent list (either with both A and B,
    // just A, just B, or neither - but always valid)
    result := traverse(linked_list);
    // CRITICAL SECTION END
    mutex.unlock();
  end;
parend

// Correctness properties:
// - Mutual exclusion: Only one branch in critical section
// - Data race freedom: All accesses to linked_list are protected
// - Invariant preservation: List is always well-formed at section boundaries

parbegin/parend's structured approach helps correctness: the implicit barrier at parend means all branches definitely synchronize there. You can't 'forget' this barrier. Combined with proper internal synchronization, the structure provides a correctness framework that unstructured parallelism lacks.
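The same discipline as a Python sketch: a plain list stands in for the linked list, a threading.Lock provides the mutual exclusion, and the data values are illustrative.

import threading

shared_list = []                  # stand-in for the shared linked list
list_lock = threading.Lock()

def insert(value):
    with list_lock:               # critical section: only one branch mutates at a time
        shared_list.insert(0, value)   # insert at head; invariant preserved

def read_snapshot():
    with list_lock:               # reader sees a consistent state: neither, one, or both inserts
        return list(shared_list)

branches = [
    threading.Thread(target=insert, args=("data_a",)),
    threading.Thread(target=insert, args=("data_b",)),
    threading.Thread(target=lambda: print(read_snapshot())),
]
for t in branches: t.start()
for t in branches: t.join()       # parend-style barrier
print(shared_list)                # after the barrier, both inserts are visible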
At a deeper level, synchronization points interact with memory consistency models—the rules governing when writes by one thread become visible to other threads. This is particularly relevant on modern hardware with caches, out-of-order execution, and memory hierarchies.
The Memory Visibility Problem:
Consider:
shared x := 0;
shared flag := false;
parbegin
// Branch A (writer)
x := 42;
flag := true;
// Branch B (reader)
while not flag do skip; end;
print(x);
parend
Question: Does Branch B necessarily print 42?
Answer: Not necessarily! Compiler and CPU instruction reordering, store buffers, and per-core caches can all delay or reorder the point at which Branch A's write to x becomes visible, so Branch B may observe flag = true while still reading a stale value of x.
Synchronization primitives include memory barriers (fences) that enforce ordering.
| Synchronization Primitive | Memory Ordering Guarantee |
|---|---|
| Mutex lock acquire | All reads/writes after acquire see all writes before prior release (acquire semantics) |
| Mutex lock release | All reads/writes before release are visible to subsequent acquires (release semantics) |
| Barrier | All operations before barrier are visible to all threads after barrier (full fence) |
| Atomic read-modify-write | Ordering depends on memory order specification (acquire, release, seq_cst) |
| parend (implicit barrier) | Typically provides full memory fence; all writes visible after barrier |
// Correct memory ordering through synchronization
shared x := 0;
shared mutex: Mutex;

parbegin
  // Branch A (writer)
  begin
    mutex.lock();   // ACQUIRE barrier
    x := 42;        // Write is ordered after acquire
    mutex.unlock(); // RELEASE barrier - x=42 visible to subsequent lock
  end;
  // Branch B (reader)
  begin
    mutex.lock();   // ACQUIRE - sees writes before prior release
    value := x;     // Guaranteed to see x = 42 if B acquires after A releases
    mutex.unlock();
  end;
parend

// The mutex lock/unlock provide memory ordering:
// - A's write to x happens-before A's unlock
// - A's unlock synchronizes-with B's lock (if B acquires after A releases)
// - B's lock happens-before B's read of x
// Therefore: A's write to x happens-before B's read of x

The implicit barrier at parend typically includes a full memory fence. This guarantees that all writes from all branches are visible to code after parend. You don't need explicit memory barriers at the end of a parbegin/parend block—the semantics include proper memory ordering.
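A Python sketch of the safe publication pattern, using threading.Event rather than a bare boolean flag; in CPython the global interpreter lock masks most reordering effects, but the Event's set/wait pair is what establishes the happens-before edge that portable code relies on.

import threading

x = 0
ready = threading.Event()          # a synchronization primitive, not a plain flag

def writer():
    global x
    x = 42
    ready.set()                    # release-style: publishes the write to x

def reader():
    ready.wait()                   # acquire-style: blocks until set(); no busy-wait on a plain flag
    print(x)                       # guaranteed to observe 42

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads: t.start()
for t in threads: t.join()         # parend-style barrier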
We've explored synchronization points in depth—from the implicit parend barrier to explicit synchronization, performance implications, and memory consistency. Let's consolidate our understanding:
Implicit barrier: parend always waits for every branch; the structure enforces it, so it cannot be forgotten.
Explicit synchronization: semaphores, mutexes, and condition variables coordinate branches within the block.
Phased execution: intermediate barriers separate dependent phases of a parallel algorithm.
Partial synchronization: possible with condition variables, though restructuring with nested blocks is often clearer.
Performance: every synchronization point is potential idle time; load balancing and fewer barriers reduce it.
Correctness and memory ordering: synchronization points establish mutual exclusion, preserve invariants, and make writes visible across branches.
What's Next:
We've now covered the practical aspects of the parbegin/parend model. Our final page examines the academic notation conventions—how this model is presented in textbooks, research papers, and formal specifications, and why understanding these conventions matters for your continued study of concurrent programming.
You now understand synchronization points in the parbegin/parend model—both the implicit parend barrier and explicit synchronization within parallel blocks. This knowledge is essential for designing parallel programs that are both correct and performant.