Task decomposition is the process of breaking a computational problem into smaller, independent pieces that can execute in parallel. It's the critical first step in parallel algorithm design—before you can fork and join, you must identify what to fork.
Effective task decomposition is both an art and a science. The art lies in recognizing parallelism opportunities in seemingly sequential algorithms. The science lies in understanding the quantitative tradeoffs: overhead costs, load balancing, cache effects, and synchronization requirements.
Poor decomposition leads to disappointing parallel speedups—or worse, parallel slowdowns where overhead exceeds sequential execution. Excellent decomposition unlocks near-linear scaling with available cores, turning a 10-minute computation into a 30-second one.
By the end of this page, you will understand: how to identify parallelizable work, the distinction between data parallelism and task parallelism, granularity selection strategies, load balancing techniques, decomposition patterns for common problem types, and the mathematical framework for analyzing decomposition quality.
Amdahl's Law: The Fundamental Limit:
Before exploring decomposition strategies, we must understand Amdahl's Law—the fundamental limit on parallel speedup:
$$\text{Speedup} = \frac{1}{(1-P) + \frac{P}{N}}$$

Where $P$ is the fraction of the computation that can run in parallel and $N$ is the number of processors.
If only 50% of your work is parallelizable (P = 0.5), the maximum speedup is 2x—regardless of how many processors you have. This underscores the importance of decomposition: maximizing P is often more impactful than adding more processors.
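To see the ceiling concretely, here is a minimal sketch of the formula in code (the class and method names are ours, purely for illustration):

```java
public class AmdahlDemo {
    /** speedup = 1 / ((1 - p) + p / n) for parallel fraction p and n processors. */
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.5, 0.9, 0.99}) {
            for (int n : new int[] {2, 8, 64, 65536}) {
                System.out.printf("P=%.2f  N=%5d  speedup=%6.2fx%n", p, n, speedup(p, n));
            }
        }
        // P=0.50 never exceeds 2x; P=0.99 approaches (but never reaches) 100x.
    }
}
```

The limit as N grows is 1/(1-P), which is why decomposition effort concentrates on shrinking the serial fraction rather than buying more cores.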
The first step in task decomposition is identifying which parts of a computation can execute independently. This requires understanding data dependencies—the relationships between operations that constrain execution order.
Types of Dependencies:
True Dependency (Read-After-Write): Operation B reads a value written by A. B cannot execute until A completes.
Anti-Dependency (Write-After-Read): Operation B writes a value read by A. B must wait for A to read the old value.
Output Dependency (Write-After-Write): Both A and B write to the same location. Order must be preserved. Anti- and output dependencies are "false" dependencies: unlike true dependencies, they can often be eliminated by renaming, i.e., giving each operation its own storage location.
No Dependency: A and B access disjoint data, or both only read shared data. These can execute in parallel.
"""Dependency Analysis Examples:Identifying which operations can execute in parallel.""" # Example 1: Loop with NO dependencies - PARALLELIZABLEdef independent_iterations(): """Each iteration operates on independent data.""" results = [0] * 1000 # Each iteration writes to a different index # No iteration reads another's output # All iterations can execute in parallel for i in range(1000): results[i] = expensive_compute(i) # ✓ Parallel return results # Example 2: Loop with TRUE dependency - SEQUENTIALdef dependent_iterations(): """Each iteration depends on the previous result.""" result = 1 # Each iteration reads 'result' written by previous # True dependency: read-after-write for i in range(1, 1000): result = result * i # ✗ Sequential return result # Example 3: Loop with REDUCIBLE dependency - PARTIALLY PARALLELIZABLEdef reducible_pattern(): """Reduction: combine results associatively.""" data = [1, 2, 3, 4, 5, 6, 7, 8] # Naive sequential sum total = 0 for x in data: total += x # Seems sequential due to 'total' dependency # But addition is associative: (a+b)+c = a+(b+c) # Can decompose into parallel partial sums, then combine: # Partition: [1,2,3,4], [5,6,7,8] # Parallel: sum1=10, sum2=26 # Combine: 10 + 26 = 36 # This is the REDUCTION pattern # Example 4: Map pattern - EMBARRASSINGLY PARALLELdef map_pattern(data): """Apply independent function to each element.""" # Each call is independent - perfect parallelism results = [transform(x) for x in data] # ✓ Parallel return results # Example 5: Prefix sum - SEEMINGLY SEQUENTIAL, ACTUALLY PARALLELdef prefix_sum_analysis(): """ Prefix sum: result[i] = sum(data[0:i+1]) Naive approach seems sequential: result[0] = data[0] result[i] = result[i-1] + data[i] # Depends on previous! But there's a parallel algorithm (Blelloch scan): O(n) work, O(log n) span Key insight: operations can be restructured to expose parallelism. """ pass # Example 6: Dependency Graph Analysisdef analyze_dependencies(): """ Consider these operations: A: x = input1 * 2 B: y = input2 + 3 C: z = x + y D: w = input3 - 1 E: result = z * w Dependency graph: A ──┐ ├──► C ──┐ B ──┘ ├──► E D ─────────┘ Parallel schedule: - Level 1: A, B, D (all independent) - Level 2: C (depends on A, B) - Level 3: E (depends on C, D) Speedup: 5 sequential operations → 3 parallel steps """ passThere are two fundamental approaches to decomposing parallel work: data parallelism (same operation on multiple data elements) and task parallelism (different operations executing concurrently). Understanding which approach fits your problem is crucial for effective decomposition.
Data Parallelism:
In data parallelism, we partition data across parallel workers, with each worker applying the same computation to its partition. This is the dominant model for numerical and array-based computations.
Characteristics: the same operation is applied to every element; parallelism scales with the size of the data; the regular structure maps naturally onto SIMD units and GPUs; and load balance is typically good when per-element work is uniform.
```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

/**
 * Comparison of data parallelism and task parallelism approaches.
 * (RawData, DecodedData, decode(), applyConvolution(), etc. are
 * illustrative placeholders, not real library types.)
 */
public class ParallelismComparison {

    // ==========================================
    // DATA PARALLELISM: Same operation on partitioned data
    // ==========================================

    /**
     * Data-parallel array processing.
     * Each element processed identically; data is partitioned.
     */
    static long dataParallelSum(int[] data) {
        // Fork-Join style: divide array, sum partitions, combine
        return IntStream.of(data)
                .parallel()           // Automatic data partitioning
                .mapToLong(x -> x)    // Same operation on each element
                .sum();               // Reduction
    }

    /**
     * Data-parallel image processing.
     * Each pixel processed by the same kernel; image partitioned across rows.
     */
    static void dataParallelImageProcess(int[][] image, int[][] output) {
        int height = image.length;
        int width = image[0].length;

        IntStream.range(0, height).parallel().forEach(y -> {
            for (int x = 0; x < width; x++) {
                // Same convolution operation applied to each pixel
                output[y][x] = applyConvolution(image, x, y);
            }
        });
    }

    // ==========================================
    // TASK PARALLELISM: Different operations concurrently
    // ==========================================

    /**
     * Task-parallel pipeline.
     * Different stages process concurrently on different data.
     * (A production pipeline would use sentinel values to shut down
     * stages 2 and 3; omitted here for brevity.)
     */
    static void taskParallelPipeline(Queue<RawData> input, Queue<Result> output) {
        ExecutorService executor = Executors.newFixedThreadPool(3);

        BlockingQueue<DecodedData> decoded = new LinkedBlockingQueue<>();
        BlockingQueue<ProcessedData> processed = new LinkedBlockingQueue<>();

        // Task 1: Decode (runs concurrently)
        executor.submit(() -> {
            while (true) {
                RawData raw = input.poll();
                if (raw == null) break;
                decoded.put(decode(raw));           // Different function
            }
            return null;  // Makes the lambda a Callable, so put() may throw
        });

        // Task 2: Process (runs concurrently)
        executor.submit(() -> {
            while (true) {
                DecodedData d = decoded.take();
                processed.put(process(d));          // Different function
            }
        });

        // Task 3: Encode (runs concurrently)
        executor.submit(() -> {
            while (true) {
                ProcessedData p = processed.take();
                output.add(encode(p));              // Different function
            }
        });

        executor.shutdown();
    }

    /**
     * Task-parallel independent computations.
     * Different analyses run on the same data concurrently.
     */
    static AnalysisResult taskParallelAnalysis(Data data) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Four different analyses on the same data
        Future<StatisticalResult> statsF =
            executor.submit(() -> computeStatistics(data));
        Future<TrendResult> trendF =
            executor.submit(() -> detectTrends(data));
        Future<AnomalyResult> anomalyF =
            executor.submit(() -> findAnomalies(data));
        Future<ForecastResult> forecastF =
            executor.submit(() -> generateForecast(data));

        // Wait for all (join)
        return new AnalysisResult(
            statsF.get(), trendF.get(), anomalyF.get(), forecastF.get());
    }

    // ==========================================
    // HYBRID: Both data and task parallelism
    // ==========================================

    /**
     * Hybrid parallelism: parallel pipeline with data-parallel stages.
     */
    static void hybridParallelism(int[][][] videoFrames) {
        // Task parallelism: different stages
        // Data parallelism within each stage: parallel over frames/pixels
        ForkJoinPool pool = ForkJoinPool.commonPool();

        pool.submit(() -> {
            Arrays.stream(videoFrames)
                .parallel()  // Data parallel over frames
                .forEach(frame -> {
                    // Each frame processed by many workers
                    IntStream.range(0, frame.length)
                        .parallel()  // Data parallel over rows
                        .forEach(y -> processRow(frame[y]));
                });
        });
    }
}
```

Use data parallelism when: (1) you have homogeneous work over large datasets, (2) each element can be processed independently, (3) you want to leverage SIMD or GPU hardware. Use task parallelism when: (1) you have heterogeneous computations, (2) different resources are needed for different operations, (3) you're building pipelines or actor systems. Hybrid approaches are often optimal for complex applications.
Granularity refers to the size of individual tasks in a parallel decomposition. Choosing the right granularity is critical—too fine-grained and overhead dominates; too coarse-grained and parallelism is limited.
The Granularity Tradeoff:
| Granularity | Overhead | Load Balance | Parallelism | Cache Behavior |
|---|---|---|---|---|
| Too Fine | High (fork/join cost per unit) | Excellent | Maximum | Poor (less reuse) |
| Optimal | Amortized | Good | Sufficient | Good |
| Too Coarse | Low | Poor (some cores idle) | Limited | Excellent |
Golden Rule: A task should perform at least several thousand CPU operations to amortize fork-join overhead. The exact threshold depends on your framework and hardware.
```java
import java.util.concurrent.*;

/**
 * Demonstrates granularity selection and sequential thresholds.
 */
public class GranularityExample extends RecursiveTask<Long> {

    // Sequential threshold - tune based on workload
    // Start with ~1000-10000 elements, benchmark to optimize
    private static final int SEQUENTIAL_THRESHOLD = 5000;

    private final long[] array;
    private final int lo;
    private final int hi;

    public GranularityExample(long[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        int size = hi - lo;

        // COARSE CHECK: Is task small enough to do sequentially?
        if (size <= SEQUENTIAL_THRESHOLD) {
            // Sequential execution - no more forking
            return sequentialSum(array, lo, hi);
        }

        // DECOMPOSITION: Split into two halves
        int mid = (lo + hi) / 2;

        // Fork left half
        GranularityExample left = new GranularityExample(array, lo, mid);
        left.fork();

        // Compute right half directly (fork-once pattern)
        GranularityExample right = new GranularityExample(array, mid, hi);
        Long rightResult = right.compute();

        // Join left and combine results
        Long leftResult = left.join();
        return leftResult + rightResult;
    }

    /**
     * Sequential computation for small chunks.
     * Optimized for cache locality - sequential access pattern.
     */
    private long sequentialSum(long[] arr, int start, int end) {
        long sum = 0;
        for (int i = start; i < end; i++) {
            sum += arr[i];
            // If per-element work is expensive, threshold can be lower
            // If per-element work is trivial, threshold should be higher
        }
        return sum;
    }

    // ==========================================
    // ADAPTIVE GRANULARITY
    // ==========================================

    /**
     * Adaptive approach: consider available parallelism and work remaining.
     */
    static class AdaptiveTask extends RecursiveTask<Long> {
        private static final int MIN_GRANULARITY = 1000;
        private final long[] array;
        private final int lo, hi;
        private final int surplusTasks;

        AdaptiveTask(long[] array, int lo, int hi, int surplusTasks) {
            this.array = array;
            this.lo = lo;
            this.hi = hi;
            this.surplusTasks = surplusTasks;
        }

        @Override
        protected Long compute() {
            int size = hi - lo;

            // Stop splitting when:
            // 1. Task is small enough, OR
            // 2. We have enough tasks queued (surplus)
            if (size <= MIN_GRANULARITY || surplusTasks <= 0) {
                return sequentialSum(array, lo, hi);
            }

            int mid = (lo + hi) / 2;

            // Reduce surplus as we create more tasks
            AdaptiveTask left = new AdaptiveTask(array, lo, mid, surplusTasks - 1);
            AdaptiveTask right = new AdaptiveTask(array, mid, hi, surplusTasks - 1);

            left.fork();
            Long rightResult = right.compute();
            Long leftResult = left.join();
            return leftResult + rightResult;
        }

        private long sequentialSum(long[] arr, int start, int end) {
            long sum = 0;
            for (int i = start; i < end; i++) sum += arr[i];
            return sum;
        }
    }

    // ==========================================
    // BENCHMARKING GRANULARITY
    // ==========================================

    public static void benchmarkThresholds() {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int[] thresholds = {100, 1000, 5000, 10000, 50000, 100000};
        ForkJoinPool pool = ForkJoinPool.commonPool();

        System.out.println("Threshold | Time (ms) | Tasks Created");
        System.out.println("----------+-----------+--------------");

        for (int threshold : thresholds) {
            // Reset task counter...
            long start = System.nanoTime();
            // Would need to modify SEQUENTIAL_THRESHOLD dynamically
            // or create a threshold-specific subclass
            Long result = pool.invoke(new GranularityExample(data, 0, data.length));
            long elapsed = (System.nanoTime() - start) / 1_000_000;

            // Optimal threshold minimizes time
            System.out.printf("%9d | %9d | (measure)%n", threshold, elapsed);
        }
    }
}
```

The optimal threshold varies by workload. Benchmark with different values: start at array.length / (numCores * 4) and adjust. More work per element → lower threshold. Simpler work per element → higher threshold. When in doubt, 5,000-10,000 elements is a reasonable starting point for array operations on modern hardware.
Even with correct decomposition and appropriate granularity, parallel performance can suffer from load imbalance—when some workers finish early and sit idle while others are still working. Effective load balancing ensures all workers stay busy until the computation completes.
Sources of Load Imbalance: work that varies per element, data-dependent control flow, heterogeneous hardware, and interference from other processes on the machine. The strategies below address imbalance at different costs:
| Strategy | Mechanism | Overhead | Balance Quality | Best For |
|---|---|---|---|---|
| Static/Block | Pre-divide into equal chunks | Minimal | Poor if work varies | Uniform work per element |
| Static/Cyclic | Round-robin assignment | Minimal | Good if work varies regularly | Predictable variation patterns |
| Dynamic/Chunked | Workers grab chunks from queue | Medium | Good | Moderate variation |
| Work-Stealing | Idle workers steal from busy | Low | Excellent | Unpredictable variation |
| Guided | Decreasing chunk sizes | Medium | Very good | Unknown variation |
```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

/**
 * Load balancing strategies for parallel decomposition.
 */
public class LoadBalancing {

    // ==========================================
    // PROBLEM: Non-uniform work distribution
    // ==========================================

    /**
     * Example: Processing items where work varies dramatically.
     * Item 0 might take 1ms, item 999 might take 1000ms.
     */
    static void nonUniformWorkProblem() {
        int[] items = new int[1000];
        // Work per item: items[i] takes i milliseconds
        // Total work: 0+1+2+...+999 = ~500k ms = ~500 seconds

        // If we split evenly: Worker 1 handles [0-499], Worker 2 handles [500-999]
        // Worker 1 work: 0+1+...+499     = 124,750 ms
        // Worker 2 work: 500+501+...+999 = 374,750 ms
        // Worker 1 finishes 3x earlier and idles!
    }

    // ==========================================
    // STATIC CYCLIC DISTRIBUTION
    // ==========================================

    /**
     * Cyclic distribution: Round-robin assignment.
     * Worker 0: items 0,4,8,...  Worker 1: items 1,5,9,...  etc.
     * Balances if variation is regular.
     */
    static void cyclicDistribution(int[] items, int numWorkers) {
        ExecutorService executor = Executors.newFixedThreadPool(numWorkers);

        for (int worker = 0; worker < numWorkers; worker++) {
            final int workerId = worker;
            executor.submit(() -> {
                // Process items workerId, workerId+numWorkers, ...
                for (int i = workerId; i < items.length; i += numWorkers) {
                    processItem(items[i]);
                }
            });
        }
        executor.shutdown();
    }

    // ==========================================
    // DYNAMIC CHUNKED DISTRIBUTION
    // ==========================================

    /**
     * Dynamic chunked: Workers grab chunks from a shared counter.
     * Good balance but has synchronization overhead.
     */
    static void dynamicChunked(int[] items, int numWorkers, int chunkSize) {
        AtomicInteger nextChunk = new AtomicInteger(0);
        ExecutorService executor = Executors.newFixedThreadPool(numWorkers);

        for (int worker = 0; worker < numWorkers; worker++) {
            executor.submit(() -> {
                while (true) {
                    // Atomically grab next chunk
                    int start = nextChunk.getAndAdd(chunkSize);
                    if (start >= items.length) break;

                    int end = Math.min(start + chunkSize, items.length);
                    for (int i = start; i < end; i++) {
                        processItem(items[i]);
                    }
                }
            });
        }
        executor.shutdown();
    }

    // ==========================================
    // WORK-STEALING: Fork-Join's Built-in Solution
    // ==========================================

    /**
     * Work-stealing automatically balances load.
     * Each worker has a deque; idle workers steal from busy workers.
     */
    static class WorkStealingSum extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1000;
        private final long[] array;
        private final int lo, hi;

        WorkStealingSum(long[] array, int lo, int hi) {
            this.array = array;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                return sequentialCompute();
            }

            int mid = (lo + hi) / 2;
            WorkStealingSum left = new WorkStealingSum(array, lo, mid);
            WorkStealingSum right = new WorkStealingSum(array, mid, hi);

            // Fork creates task in local deque
            left.fork();
            // While 'left' might sit in the deque, other workers can STEAL it
            // This is the work-stealing magic
            Long rightResult = right.compute();
            Long leftResult = left.join();
            return leftResult + rightResult;
        }

        private long sequentialCompute() {
            // Variable work simulation
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += expensiveOp(array[i]);  // Time varies per value
            }
            return sum;
        }

        private long expensiveOp(long value) {
            // Simulate variable work: more work for larger values
            long result = value;
            for (int i = 0; i < value % 100; i++) {
                result = result * 31 + i;
            }
            return result;
        }
    }

    // ==========================================
    // GUIDED SELF-SCHEDULING
    // ==========================================

    /**
     * Guided scheduling: chunk size decreases as work progresses.
     * Large chunks initially (low overhead), small chunks at end (good balance).
     */
    static void guidedScheduling(int[] items, int numWorkers) {
        // A single shared index keeps the accounting consistent
        AtomicInteger currentIndex = new AtomicInteger(0);
        ExecutorService executor = Executors.newFixedThreadPool(numWorkers);

        for (int worker = 0; worker < numWorkers; worker++) {
            executor.submit(() -> {
                while (true) {
                    int remaining = items.length - currentIndex.get();
                    if (remaining <= 0) break;

                    // Chunk size proportional to remaining work divided by workers
                    // Minimum chunk size of 1
                    int chunkSize = Math.max(1, remaining / (numWorkers * 2));

                    int start = currentIndex.getAndAdd(chunkSize);
                    if (start >= items.length) break;

                    int end = Math.min(start + chunkSize, items.length);
                    for (int i = start; i < end; i++) {
                        processItem(items[i]);
                    }
                }
            });
        }
        executor.shutdown();
    }

    private static void processItem(int item) {
        // Simulate work proportional to item value
        try {
            Thread.sleep(item % 10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

For most fork-join parallel programs, work-stealing (built into ForkJoinPool, Intel TBB, and Cilk) provides excellent load balancing with low overhead. Use static distribution only when work is perfectly uniform AND you need to minimize synchronization overhead (e.g., SIMD/GPU). When in doubt, use work-stealing.
Certain decomposition patterns appear repeatedly in parallel programming. Recognizing these patterns accelerates algorithm design and leverages well-tested implementation strategies.
Core Parallel Patterns: map (independent transformation of each element), reduce (combining elements with an associative operation), divide-and-conquer (recursive subdivision into independent subproblems), and stencil (each output computed from a neighborhood of inputs). The code below implements each in turn.
```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.stream.*;

/**
 * Implementation examples of common decomposition patterns.
 */
public class DecompositionPatterns {

    // ==========================================
    // PATTERN 1: MAP (Embarrassingly Parallel)
    // ==========================================

    /**
     * Map pattern: independent transformation of each element.
     */
    static double[] parallelMap(double[] input, DoubleUnaryOperator f) {
        return Arrays.stream(input)
                .parallel()
                .map(f)
                .toArray();
    }

    // Fork-Join version for illustration
    static class MapTask extends RecursiveAction {
        private static final int THRESHOLD = 10000;
        private final double[] input, output;
        private final int lo, hi;
        private final DoubleUnaryOperator f;

        MapTask(double[] input, double[] output, int lo, int hi, DoubleUnaryOperator f) {
            this.input = input;
            this.output = output;
            this.lo = lo;
            this.hi = hi;
            this.f = f;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                for (int i = lo; i < hi; i++) {
                    output[i] = f.applyAsDouble(input[i]);
                }
            } else {
                int mid = (lo + hi) / 2;
                invokeAll(
                    new MapTask(input, output, lo, mid, f),
                    new MapTask(input, output, mid, hi, f)
                );
            }
        }
    }

    // ==========================================
    // PATTERN 2: REDUCE
    // ==========================================

    /**
     * Reduce pattern: combine elements with an associative operation.
     */
    static class ReduceTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 10000;
        private final long[] array;
        private final int lo, hi;
        private final LongBinaryOperator combine;
        private final long identity;

        ReduceTask(long[] array, int lo, int hi, long identity, LongBinaryOperator combine) {
            this.array = array;
            this.lo = lo;
            this.hi = hi;
            this.identity = identity;
            this.combine = combine;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                long result = identity;
                for (int i = lo; i < hi; i++) {
                    result = combine.applyAsLong(result, array[i]);
                }
                return result;
            }

            int mid = (lo + hi) / 2;
            ReduceTask left = new ReduceTask(array, lo, mid, identity, combine);
            ReduceTask right = new ReduceTask(array, mid, hi, identity, combine);

            left.fork();
            Long rightResult = right.compute();
            Long leftResult = left.join();
            return combine.applyAsLong(leftResult, rightResult);
        }
    }

    static long parallelSum(long[] array) {
        return ForkJoinPool.commonPool().invoke(
            new ReduceTask(array, 0, array.length, 0L, Long::sum));
    }

    static long parallelMax(long[] array) {
        return ForkJoinPool.commonPool().invoke(
            new ReduceTask(array, 0, array.length, Long.MIN_VALUE, Long::max));
    }

    // ==========================================
    // PATTERN 3: DIVIDE AND CONQUER
    // ==========================================

    /**
     * Divide-and-conquer: recursive subdivision into independent subproblems.
     * Classic example: parallel merge sort.
     */
    static class ParallelMergeSort extends RecursiveAction {
        private static final int THRESHOLD = 4096;
        private final int[] array;
        private final int lo, hi;
        private final int[] buffer;

        ParallelMergeSort(int[] array, int lo, int hi, int[] buffer) {
            this.array = array;
            this.lo = lo;
            this.hi = hi;
            this.buffer = buffer;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                Arrays.sort(array, lo, hi);  // Sequential fallback
                return;
            }

            int mid = (lo + hi) / 2;

            // DIVIDE: Create subproblems
            ParallelMergeSort left = new ParallelMergeSort(array, lo, mid, buffer);
            ParallelMergeSort right = new ParallelMergeSort(array, mid, hi, buffer);

            // CONQUER: Solve subproblems in parallel
            invokeAll(left, right);

            // COMBINE: Merge sorted halves
            merge(array, lo, mid, hi, buffer);
        }

        private void merge(int[] arr, int lo, int mid, int hi, int[] buf) {
            System.arraycopy(arr, lo, buf, lo, hi - lo);
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) {
                arr[k++] = (buf[i] <= buf[j]) ? buf[i++] : buf[j++];
            }
            while (i < mid) arr[k++] = buf[i++];
            while (j < hi) arr[k++] = buf[j++];
        }
    }

    // ==========================================
    // PATTERN 4: STENCIL
    // ==========================================

    /**
     * Stencil pattern: each output depends on a neighborhood of inputs.
     * Reading from input and writing to a separate output array avoids
     * data races at tile boundaries.
     */
    static void parallelStencil(double[][] input, double[][] output, double[][] kernel) {
        int height = input.length;
        int width = input[0].length;
        int kRadius = kernel.length / 2;

        // Process rows in parallel (each row is independent when only reading input)
        IntStream.range(kRadius, height - kRadius).parallel().forEach(y -> {
            for (int x = kRadius; x < width - kRadius; x++) {
                double sum = 0;
                for (int ky = -kRadius; ky <= kRadius; ky++) {
                    for (int kx = -kRadius; kx <= kRadius; kx++) {
                        sum += input[y + ky][x + kx]
                             * kernel[ky + kRadius][kx + kRadius];
                    }
                }
                output[y][x] = sum;
            }
        });
    }
}
```

Complex computations with multiple dependencies can be analyzed using Directed Acyclic Graphs (DAGs). Each node represents a task, and edges represent dependencies. This analysis reveals the critical path (longest dependency chain) and the available parallelism.
Key Metrics:
- Work ($T_1$): total time to execute every task on a single processor.
- Span ($T_\infty$): length of the critical path, i.e., execution time with unlimited processors.
- Parallelism ($T_1 / T_\infty$): the maximum speedup any number of processors can deliver.
Brent's Law:
$$T_P \leq \frac{T_1}{P} + T_\infty$$
This bounds the execution time on P processors: a greedy scheduler finishes within the total work divided across processors plus the critical path. Conversely, no schedule can run faster than $\max(T_1/P, T_\infty)$: you are limited both by the total work and by the critical path.
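A quick worked instance (numbers chosen purely for illustration): suppose a computation has $T_1 = 100$ ms of total work and a critical path of $T_\infty = 10$ ms. On $P = 8$ processors, Brent's Law gives

$$T_8 \leq \frac{100}{8} + 10 = 22.5 \text{ ms}, \qquad T_8 \geq \max\left(\frac{100}{8},\ 10\right) = 12.5 \text{ ms}$$

so the achievable speedup lies between roughly 4.4x and 8x, and the parallelism $T_1/T_\infty = 10$ means more than ten processors cannot help.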
Interpreting the DAG:
Consider an example DAG of tasks A through F where the longest dependency chain is:
Critical Path: B → C → E → F (4 + 3 + 2 + 1 = 10ms)
Even with infinite processors, we can't complete faster than 10ms because this chain is entirely sequential.
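One way to turn such a DAG directly into code is with Java's `CompletableFuture`, which starts each task the moment its inputs are ready. This sketch wires up the five-operation graph from the dependency-analysis examples earlier (the inputs 7, 5, and 4 are arbitrary):

```java
import java.util.concurrent.CompletableFuture;

public class DagSchedule {
    public static void main(String[] args) {
        // Level 1: A, B, D have no dependencies and start immediately
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 7 * 2);  // A: x = input1 * 2
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> 5 + 3);  // B: y = input2 + 3
        CompletableFuture<Integer> d = CompletableFuture.supplyAsync(() -> 4 - 1);  // D: w = input3 - 1

        // Level 2: C runs as soon as both A and B complete
        CompletableFuture<Integer> c = a.thenCombine(b, Integer::sum);              // C: z = x + y

        // Level 3: E runs as soon as C and D complete
        CompletableFuture<Integer> e = c.thenCombine(d, (z, w) -> z * w);           // E: result = z * w

        System.out.println(e.join());  // (7*2 + 5+3) * (4-1) = 66
    }
}
```

The runtime executes A, B, and D concurrently, then C, then E, matching the three-level schedule derived earlier.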
Practical Use of DAG Analysis:
If analysis shows parallelism < number of cores, you have two choices: (1) Restructure the algorithm to expose more parallelism (e.g., parallel prefix instead of sequential scan), or (2) Accept the limit and optimize the sequential portions. Adding more cores won't help if they have nothing to do.
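As a concrete example of such restructuring, the standard library already ships a parallel scan: `java.util.Arrays.parallelPrefix` computes the same running totals as the seemingly sequential loop, but using a parallel algorithm under the hood (a minimal sketch; the input values are arbitrary):

```java
import java.util.Arrays;

public class PrefixSumDemo {
    public static void main(String[] args) {
        long[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        // Sequential scan: each step reads the previous result (true dependency)
        long[] sequential = data.clone();
        for (int i = 1; i < sequential.length; i++) {
            sequential[i] += sequential[i - 1];
        }

        // Parallel scan: same result, computed on the common fork-join pool
        long[] parallel = data.clone();
        Arrays.parallelPrefix(parallel, Long::sum);

        System.out.println(Arrays.toString(sequential)); // [1, 3, 6, 10, 15, 21, 28, 36]
        System.out.println(Arrays.toString(parallel));   // identical output
    }
}
```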
Task decomposition is the foundation skill for parallel programming. Before any forking or joining, you must understand how to break work into parallel pieces effectively.
Let's consolidate the key insights:
- Dependencies determine what can run in parallel: only operations on disjoint data (or shared read-only data) are independent.
- Amdahl's Law caps speedup at 1/(1-P); increasing the parallel fraction P usually pays more than adding processors.
- Data parallelism fits homogeneous work over large datasets; task parallelism fits heterogeneous stages; hybrids combine both.
- Granularity is a tradeoff: tasks must be coarse enough to amortize fork-join overhead yet fine enough to keep every core busy.
- Work-stealing delivers excellent load balance at low overhead when per-task work is unpredictable.
- DAG analysis (work, span, parallelism) reveals the speedup ceiling before you write a line of code.
What's Next:
The theoretical foundation is set. The next page explores the Java Fork/Join Framework—a practical implementation of fork-join concepts that you can use to build high-performance parallel applications. We'll see how the concepts of fork, join, task decomposition, work-stealing, and sequential thresholds come together in production-quality code.
You now understand how to identify parallelizable work, choose between data and task parallelism, select appropriate granularity, balance load effectively, and analyze dependency graphs. With these skills, you can decompose complex computations for efficient parallel execution. Next, we'll apply these concepts in the Java Fork/Join Framework.