Parallel execution is only useful when we can combine the results. A program that spawns three parallel computations but never collects their results hasn't accomplished meaningful work. The point where parallel paths converge—where concurrent execution reunites into a single flow—is the synchronization point.
In the parbegin/parend model, the most prominent synchronization point is parend itself—the barrier that waits for all parallel branches to complete. But synchronization is a richer concept than this single barrier. Understanding synchronization points in depth reveals how parallel programs coordinate, where performance bottlenecks arise, and how to balance parallelism with correctness.
By the end of this page, you will understand implicit barrier synchronization at parend, explicit synchronization within parallel blocks, partial barriers and phased synchronization, the performance implications of synchronization granularity, and how synchronization points relate to correctness properties.
The most fundamental synchronization point in the parbegin/parend model is parend itself. This is an implicit barrier—a synchronization construct that blocks execution until all participating threads arrive.
Barrier Semantics:
A barrier is a synchronization primitive with the following behavior: each participating thread blocks when it reaches the barrier, no thread proceeds until every participant has arrived, and once the last thread arrives, all of them are released together.
In parbegin/parend, the barrier is implicit:
parbegin
S1; // Branch 1 executes...
S2; // Branch 2 executes...
S3; // Branch 3 executes...
parend // ← IMPLICIT BARRIER: Wait for S1, S2, S3 to complete
S4; // Executes only after all branches complete
Unlike explicit barriers in thread libraries (which can be placed anywhere), the parend barrier is structurally defined—it's always at the exact lexical end of the parallel block.
The Implicit Nature:
The parend barrier is implicit in two senses:
No explicit wait statement: You don't write 'wait for all branches.' The parend keyword automatically implies this.
No barrier omission possible: Unlike explicit synchronization, you cannot 'forget' the barrier. The syntax enforces it.
This implicitness is precisely what makes parbegin/parend structured. The synchronization is part of the construct's definition, not a programmer responsibility.
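As an analogue in a real language, here is a minimal Python sketch (assuming the branches are ordinary functions): starting one thread per branch plays the role of parbegin, and joining them all plays the role of the implicit parend barrier.

import threading

def parbegin_parend(*branches):
    """Run each branch in its own thread; joining all of them
    is the analogue of the implicit parend barrier."""
    threads = [threading.Thread(target=b) for b in branches]
    for t in threads:
        t.start()   # parbegin: fork all branches
    for t in threads:
        t.join()    # parend: wait until every branch completes

def s1(): print("S1 done")
def s2(): print("S2 done")
def s3(): print("S3 done")

parbegin_parend(s1, s2, s3)
print("S4 runs only after all branches complete")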
The barrier releases when all branches complete, but this doesn't mean all branches terminate successfully. If a branch throws an exception, it still 'completes' (in an exceptional state), and the barrier releases. Error handling semantics determine what happens after the barrier—whether exceptions propagate or are aggregated.
While parend provides barrier synchronization at the end of a parallel block, real concurrent programs often need finer-grained synchronization within the block. This is where explicit synchronization primitives come into play.
Why Explicit Synchronization?
Consider a parallel block where branches must coordinate their access to shared data:
shared counter := 0;
parbegin
// Branch 1
counter := counter + 1; // Read-modify-write
// Branch 2
counter := counter + 1; // Read-modify-write
parend
// Expected: counter = 2
// Actual: counter might be 1 (race condition!)
The parend barrier ensures both branches complete, but it does nothing to prevent race conditions during execution. We need mutual exclusion or atomic operations within the branches.
// Explicit synchronization with semaphores
shared counter := 0;
shared mutex := Semaphore(1); // Binary semaphore for mutual exclusion

parbegin
  // Branch 1 (with explicit synchronization)
  begin
    P(mutex);               // Acquire lock
    counter := counter + 1;
    V(mutex);               // Release lock
  end;
  // Branch 2 (with explicit synchronization)
  begin
    P(mutex);               // Acquire lock
    counter := counter + 1;
    V(mutex);               // Release lock
  end;
parend
// Guaranteed: counter = 2

// The synchronization points are:
// 1. P(mutex) - wait if mutex is 0, otherwise acquire
// 2. V(mutex) - release, potentially waking waiting threads
// 3. parend - implicit barrier

Explicit synchronization is constrained by the parbegin/parend structure. A mutex acquired in one branch must be released in the same branch before parend—otherwise, the structure's completion guarantee would be violated (a branch would 'complete' while holding a lock that other branches might need).
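For comparison, a minimal Python sketch of the same discipline, using threading.Semaphore(1) as the binary semaphore; the increment function and thread setup are illustrative rather than part of the original model.

import threading

counter = 0
mutex = threading.Semaphore(1)       # binary semaphore for mutual exclusion

def increment():
    global counter
    mutex.acquire()                  # P(mutex): wait, then acquire
    counter = counter + 1            # read-modify-write inside the critical section
    mutex.release()                  # V(mutex): release, waking any waiter

threads = [threading.Thread(target=increment) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()           # analogue of the implicit parend barrier
print(counter)                       # always 2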
Complex parallel algorithms often require multiple synchronization points during execution—not just at the end. This leads to phased execution patterns where branches synchronize at intermediate barriers.
The Problem with Single Barriers:
Consider a parallel algorithm with dependencies between phases:
// Phase 1: Compute partial results
// Phase 2: Exchange partial results between branches
// Phase 3: Compute final results using exchanged data
With only parend as synchronization, this doesn't work:
parbegin
compute_partial_A(); exchange_data(); compute_final_A();
compute_partial_B(); exchange_data(); compute_final_B();
parend
Problem: Branch A might reach exchange_data() while Branch B is still computing partial results. The data that A tries to read from B doesn't exist yet!
We need a barrier between phases.
// Phased execution with explicit barriers
shared barrier := Barrier(2); // Barrier for 2 threads

parbegin
  // Branch A
  begin
    compute_partial_A();  // Phase 1
    barrier.wait();       // Synchronization point 1
    exchange_data_A();    // Phase 2
    barrier.wait();       // Synchronization point 2
    compute_final_A();    // Phase 3
  end;
  // Branch B
  begin
    compute_partial_B();  // Phase 1
    barrier.wait();       // Synchronization point 1
    exchange_data_B();    // Phase 2
    barrier.wait();       // Synchronization point 2
    compute_final_B();    // Phase 3
  end;
parend // Synchronization point 3 (implicit)

// Synchronization points:
// SP1: After Phase 1, before Phase 2 (explicit barrier)
// SP2: After Phase 2, before Phase 3 (explicit barrier)
// SP3: After Phase 3, at parend (implicit barrier)

Phased Execution Patterns:
Phased execution is common in parallel algorithms:
Iterative Algorithms: Each iteration is a phase. Branches synchronize before the next iteration reads updated values.
Pipeline Stages: Data flows through stages; each stage completes for all branches before the next begins.
Bulk Synchronous Parallel (BSP): A formal model where computation alternates between computation phases and communication phases, with barriers between.
MapReduce Phases: The shuffle/barrier between map and reduce phases ensures all mappers complete before reducers begin.
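As a concrete illustration of the two-phase pattern sketched above, here is a small Python version using threading.Barrier; the phase computations are placeholder arithmetic chosen only to show where the synchronization points fall.

import threading

NUM_WORKERS = 2
barrier = threading.Barrier(NUM_WORKERS)   # reusable barrier shared by all phases
partials = [None] * NUM_WORKERS
finals = [None] * NUM_WORKERS

def worker(i):
    partials[i] = i * 10                        # Phase 1: compute a partial result
    barrier.wait()                              # SP1: all partials exist before anyone reads them
    neighbor = partials[(i + 1) % NUM_WORKERS]  # Phase 2: exchange data
    barrier.wait()                              # SP2: exchange complete before final computation
    finals[i] = partials[i] + neighbor          # Phase 3: final result

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()                      # SP3: the parend-style implicit barrier
print(finals)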
Each barrier is a synchronization point where threads may wait. More barriers mean more potential idle time (waiting for the slowest thread). The art of parallel algorithm design is minimizing barrier frequency while maintaining correctness—coarser granularity often yields better performance.
The parend barrier is global—it waits for ALL branches. But sometimes we need partial synchronization: waiting for a subset of branches or for specific conditions, without blocking all branches.
Scenarios Requiring Partial Synchronization:
Dependent Subsets: Branch C needs results from A and B, but D is independent.
First-to-Complete: Wait for ANY branch to complete, then continue (racing).
Quorum: Wait for at least K of N branches to complete.
Conditional Barriers: Only synchronize if a condition is true.
The example below illustrates the dependent-subsets case using condition variables.
// Partial synchronization with condition variables
shared result_a, result_b: Data;
shared a_done, b_done: Boolean := false;
shared cond: ConditionVariable;
shared mutex: Mutex;

parbegin
  // Branch A
  begin
    result_a := compute_a();
    mutex.lock();
    a_done := true;
    cond.signal_all(); // Wake waiting branches
    mutex.unlock();
  end;
  // Branch B
  begin
    result_b := compute_b();
    mutex.lock();
    b_done := true;
    cond.signal_all(); // Wake waiting branches
    mutex.unlock();
  end;
  // Branch C (waits for A and B, but not D)
  begin
    mutex.lock();
    while not (a_done and b_done) do
      cond.wait(mutex); // Partial sync: wait for A and B only
    end;
    mutex.unlock();
    result_c := combine(result_a, result_b);
  end;
  // Branch D (independent, no waiting)
  begin
    result_d := compute_d(); // Doesn't wait for anyone
  end;
parend

Partial Sync vs. Restructuring:
Partial synchronization can often be avoided by restructuring the parallel block:
// Before: Partial sync within single parbegin
parbegin
A; B; C (waits for A, B); D;
parend
// After: Nested structure eliminates partial sync
parbegin
begin
parbegin A; B; parend
C; // C naturally follows A and B
end;
D; // D runs in parallel with the above
parend
The restructured version uses the natural barrier at inner parend. This is often clearer than explicit partial synchronization.
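A minimal Python sketch of the restructured version, assuming illustrative task functions: the inner join replaces the explicit partial synchronization, and the outer join plays the role of the outer parend.

import threading

def task_a(): print("A done")
def task_b(): print("B done")
def task_c(): print("C runs only after A and B")
def task_d(): print("D runs independently")

def a_b_then_c():
    # Inner "parbegin A; B; parend": C waits only for A and B, via join
    inner = [threading.Thread(target=task_a), threading.Thread(target=task_b)]
    for t in inner: t.start()
    for t in inner: t.join()   # natural barrier replacing explicit partial sync
    task_c()

outer = [threading.Thread(target=a_b_then_c), threading.Thread(target=task_d)]
for t in outer: t.start()
for t in outer: t.join()       # outer parend: waits for both parallel chains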
The parbegin/parend model favors structure over flexibility. Partial synchronization is possible with explicit primitives, but restructuring to use natural barriers is often preferable. When partial sync is unavoidable, it indicates that the problem's dependency graph doesn't fit neatly into structured parallelism.
Synchronization points are where parallelism is constrained. Every barrier, lock acquisition, and wait operation is a potential bottleneck. Understanding the performance implications is crucial for designing efficient parallel programs.
Amdahl's Law and Synchronization:
Amdahl's Law describes the theoretical speedup of parallel execution:
Speedup = 1 / (S + P/N)
Where S is the serial (non-parallelizable) fraction of the work, P is the parallelizable fraction (with S + P = 1), and N is the number of processors.
Critical insight: Synchronization points create sequential sections. When all threads wait at a barrier, that's sequential execution—only the last-arriving thread does useful work; others idle.
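A small worked example of the formula (the 10% serial fraction is an illustrative figure, not a measurement): even unlimited processors cannot push the speedup past 1/S.

def amdahl_speedup(serial_fraction, n_processors):
    """Speedup = 1 / (S + P/N), with P = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# If barrier waits make 10% of the program effectively serial:
for n in (2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.10, n), 2))
# The speedup approaches 1/0.10 = 10x no matter how many processors are added.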
| Synchronization Type | Overhead | Contention Pattern | Mitigation |
|---|---|---|---|
| parend barrier | O(N) threads wait for slowest | Slowest branch determines total time | Load balancing, work stealing |
| Mutex critical section | Only one thread proceeds at a time | Queue builds at hot locks | Lock-free algorithms, finer granularity |
| Reader-writer lock | Writers block all; readers share | Writers starve with many readers | Writer priority, RCU patterns |
| Condition variable wait | Thread sleeps until signaled | Thundering herd on broadcast | Selective signaling, work queues |
| Multiple barriers | Threads idle at each barrier | Cumulative wait time | Reduce barrier count, merge phases |
// Load imbalance at synchronization points
parbegin
  task_a(); // Takes 100ms
  task_b(); // Takes 200ms
  task_c(); // Takes 50ms
parend
// Total time: max(100, 200, 50) = 200ms
// Ideal parallel time: (100+200+50)/3 = 117ms
// Efficiency: 117/200 = 58.5%
// Branches A and C wait 100ms and 150ms respectively, idle at the barrier

// Improved version with load balancing:
work_items = [a1, a2, b1, b2, b3, b4, c]; // Smaller chunks
parbegin
  worker_1: process_items(subset_1); // More balanced workload
  worker_2: process_items(subset_2);
  worker_3: process_items(subset_3);
parend
// Better load balance → less idle time at barrier

A barrier where threads wait is not parallel execution—it's effectively sequential. If your parallel algorithm spends 50% of its time at barriers (waiting for slow branches), you can at best achieve 2x speedup regardless of how many processors you use. Identify and minimize synchronization bottlenecks.
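A minimal Python sketch of the load-balanced variant, using a shared work queue so no branch is stuck with all the long work; the sleep durations stand in for real computation and are illustrative.

import threading, queue, time

work_items = [0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]   # task durations in seconds
tasks = queue.Queue()
for item in work_items:
    tasks.put(item)

def worker():
    # Each worker pulls small chunks until the queue is empty,
    # so the slowest branch is never far behind the others.
    while True:
        try:
            duration = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(duration)        # stand-in for real computation

threads = [threading.Thread(target=worker) for _ in range(3)]
start = time.perf_counter()
for t in threads: t.start()
for t in threads: t.join()          # parend-style barrier; idle time here is now small
print(f"elapsed: {time.perf_counter() - start:.2f}s")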
Synchronization isn't just about performance—it's essential for correctness. Without proper synchronization, parallel programs exhibit race conditions, data corruption, and non-deterministic behavior.
Correctness Properties and Synchronization:
1. Safety Properties (Nothing bad happens): mutual exclusion in critical sections, freedom from data races, and preservation of data-structure invariants.
2. Liveness Properties (Something good eventually happens): every branch eventually makes progress, no deadlock occurs, and no branch is starved indefinitely.
Synchronization points are where we establish these properties. Proper barrier usage ensures safety; careful synchronization design ensures liveness.
// Correctness through synchronization
shared linked_list: List;
shared mutex: Mutex;

parbegin
  // Branch A: Insert operation
  begin
    node := allocate_node(data_a);
    mutex.lock();
    // CRITICAL SECTION START
    // Invariant expected: list is well-formed
    node.next := linked_list.head;
    linked_list.head := node;
    // Invariant maintained: list is still well-formed
    // CRITICAL SECTION END
    mutex.unlock();
  end;
  // Branch B: Insert operation
  begin
    node := allocate_node(data_b);
    mutex.lock();
    // CRITICAL SECTION START
    // Branch B enters only after A exits (mutual exclusion)
    node.next := linked_list.head;
    linked_list.head := node;
    // CRITICAL SECTION END
    mutex.unlock();
  end;
  // Branch C: Read operation
  begin
    mutex.lock();
    // CRITICAL SECTION START
    // Sees a consistent list (either with both A and B,
    // just A, just B, or neither - but always valid)
    result := traverse(linked_list);
    // CRITICAL SECTION END
    mutex.unlock();
  end;
parend

// Correctness properties:
// - Mutual exclusion: Only one branch in critical section
// - Data race freedom: All accesses to linked_list are protected
// - Invariant preservation: List is always well-formed at section boundaries

parbegin/parend's structured approach helps correctness: the implicit barrier at parend means all branches definitely synchronize there. You can't 'forget' this barrier. Combined with proper internal synchronization, the structure provides a correctness framework that unstructured parallelism lacks.
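The same discipline as a Python sketch: a plain list stands in for the linked list, a threading.Lock provides the mutual exclusion, and the data values are illustrative.

import threading

shared_list = []                  # stand-in for the shared linked list
list_lock = threading.Lock()

def insert(value):
    with list_lock:               # critical section: only one branch mutates at a time
        shared_list.insert(0, value)   # insert at head; invariant preserved

def read_snapshot():
    with list_lock:               # reader sees a consistent state: neither, one, or both inserts
        return list(shared_list)

branches = [
    threading.Thread(target=insert, args=("data_a",)),
    threading.Thread(target=insert, args=("data_b",)),
    threading.Thread(target=lambda: print(read_snapshot())),
]
for t in branches: t.start()
for t in branches: t.join()       # parend-style barrier
print(shared_list)                # after the barrier, both inserts are visible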
At a deeper level, synchronization points interact with memory consistency models—the rules governing when writes by one thread become visible to other threads. This is particularly relevant on modern hardware with caches, out-of-order execution, and memory hierarchies.
The Memory Visibility Problem:
Consider:
shared x := 0;
shared flag := false;
parbegin
// Branch A (writer)
x := 42;
flag := true;
// Branch B (reader)
while not flag do skip; end;
print(x);
parend
Question: Does Branch B necessarily print 42?
Answer: Not necessarily! Compiler and CPU instruction reordering, store buffers, and per-core caches can all delay or reorder the point at which Branch A's write to x becomes visible, so Branch B may observe flag = true while still reading a stale value of x.
Synchronization primitives include memory barriers (fences) that enforce ordering.
| Synchronization Primitive | Memory Ordering Guarantee |
|---|---|
| Mutex lock acquire | All reads/writes after acquire see all writes before prior release (acquire semantics) |
| Mutex lock release | All reads/writes before release are visible to subsequent acquires (release semantics) |
| Barrier | All operations before barrier are visible to all threads after barrier (full fence) |
| Atomic read-modify-write | Ordering depends on memory order specification (acquire, release, seq_cst) |
| parend (implicit barrier) | Typically provides full memory fence; all writes visible after barrier |
// Correct memory ordering through synchronization
shared x := 0;
shared mutex: Mutex;

parbegin
  // Branch A (writer)
  begin
    mutex.lock();   // ACQUIRE barrier
    x := 42;        // Write is ordered after acquire
    mutex.unlock(); // RELEASE barrier - x=42 visible to subsequent lock
  end;
  // Branch B (reader)
  begin
    mutex.lock();   // ACQUIRE - sees writes before prior release
    value := x;     // Guaranteed to see x = 42 if B acquires after A releases
    mutex.unlock();
  end;
parend

// The mutex lock/unlock provide memory ordering:
// - A's write to x happens-before A's unlock
// - A's unlock synchronizes-with B's lock (if B acquires after A releases)
// - B's lock happens-before B's read of x
// Therefore: A's write to x happens-before B's read of x

The implicit barrier at parend typically includes a full memory fence. This guarantees that all writes from all branches are visible to code after parend. You don't need explicit memory barriers at the end of a parbegin/parend block—the semantics include proper memory ordering.
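A Python sketch of the safe publication pattern, using threading.Event rather than a bare boolean flag; in CPython the global interpreter lock masks most reordering effects, but the Event's set/wait pair is what establishes the happens-before edge that portable code relies on.

import threading

x = 0
ready = threading.Event()          # a synchronization primitive, not a plain flag

def writer():
    global x
    x = 42
    ready.set()                    # release-style: publishes the write to x

def reader():
    ready.wait()                   # acquire-style: blocks until set(); no busy-wait on a plain flag
    print(x)                       # guaranteed to observe 42

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads: t.start()
for t in threads: t.join()         # parend-style barrier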
We've explored synchronization points in depth—from the implicit parend barrier to explicit synchronization, performance implications, and memory consistency. Let's consolidate our understanding:
Implicit barrier: parend always waits for every branch; the structure enforces it, so it cannot be forgotten.
Explicit synchronization: semaphores, mutexes, and condition variables coordinate branches within the block.
Phased execution: intermediate barriers separate dependent phases of a parallel algorithm.
Partial synchronization: possible with condition variables, though restructuring with nested blocks is often clearer.
Performance: every synchronization point is potential idle time; load balancing and fewer barriers reduce it.
Correctness and memory ordering: synchronization points establish mutual exclusion, preserve invariants, and make writes visible across branches.
What's Next:
We've now covered the practical aspects of the parbegin/parend model. Our final page examines the academic notation conventions—how this model is presented in textbooks, research papers, and formal specifications, and why understanding these conventions matters for your continued study of concurrent programming.
You now understand synchronization points in the parbegin/parend model—both the implicit parend barrier and explicit synchronization within parallel blocks. This knowledge is essential for designing parallel programs that are both correct and performant.