Divide-and-conquer is one of the most naturally parallelizable algorithm paradigms. The pattern—split a problem into independent subproblems, solve them, and combine results—maps perfectly onto the fork-join model.
This synergy isn't coincidental. The fork-join framework was specifically designed to efficiently execute divide-and-conquer algorithms. When fork-join execution meets divide-and-conquer algorithms, we get a powerful combination: logarithmic-depth parallelism with automatic load balancing.
By the end of this page, you will understand: the divide-and-conquer parallel pattern, work and span analysis, classic parallel algorithms (merge sort, quicksort, matrix multiply), optimization techniques, and when divide-and-conquer parallelism is most effective.
Every divide-and-conquer algorithm follows this structure:
```
solve(problem):
    if problem is small enough:
        return solve_directly(problem)       // Base case

    subproblems = divide(problem)            // DIVIDE

    subresults = []
    for each subproblem:
        subresults.append(solve(subproblem)) // CONQUER (recursive)

    return combine(subresults)               // COMBINE
```
Parallelization opportunity: The recursive calls are independent! Each subproblem can be solved in parallel:
```
parallel_solve(problem):
    if problem is small enough:
        return solve_directly(problem)

    subproblems = divide(problem)            // n subproblems

    // FORK: solve all but the last subproblem asynchronously
    for i = 0 to n-2:
        fork parallel_solve(subproblems[i])
    subresults[n-1] = parallel_solve(subproblems[n-1])  // Compute last directly

    // JOIN: wait for the forked results
    for i = 0 to n-2:
        subresults[i] = join()

    return combine(subresults)
```
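As a concrete, runnable instance of this template, here is a binary (k = 2) Java Fork/Join task that finds the maximum of an array. The class name and threshold are illustrative, not part of any standard API:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative instance of the template: divide = split the range in half,
// base case = scan a small range, combine = Math.max.
class MaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 10_000;
    private final int[] array;
    private final int lo, hi;   // half-open range [lo, hi)

    MaxTask(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {              // Base case: solve directly
            int max = Integer.MIN_VALUE;         // (empty range yields MIN_VALUE)
            for (int i = lo; i < hi; i++) max = Math.max(max, array[i]);
            return max;
        }
        int mid = lo + (hi - lo) / 2;            // DIVIDE
        MaxTask left = new MaxTask(array, lo, mid);
        MaxTask right = new MaxTask(array, mid, hi);
        left.fork();                             // FORK the left half
        int rightMax = right.compute();          // compute the right half directly
        int leftMax = left.join();               // JOIN
        return Math.max(leftMax, rightMax);      // COMBINE
    }

    static int max(int[] array) {
        return ForkJoinPool.commonPool().invoke(new MaxTask(array, 0, array.length));
    }
}
```

Forking one half and computing the other in the current thread keeps the worker busy instead of idling while it waits for both children.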
| Component | Sequential | Parallel Opportunity |
|---|---|---|
| Divide | O(1) to O(n) | Usually sequential (quick, low overhead) |
| Conquer | Recursive calls | Parallel: Each subproblem independent |
| Combine | O(1) to O(n) | Sometimes parallel (e.g., parallel merge) |
| Base case | Direct solution | Sequential (too small to parallelize) |
The parallelism in divide-and-conquer comes from solving subproblems concurrently. If you divide into k subproblems at each level and have d levels, you have up to k^d parallel tasks. The binary divide (k=2) creates 2^d tasks from d levels of recursion.
To analyze parallel divide-and-conquer algorithms, we use work and span:
For divide-and-conquer, dividing into a subproblems, each of size n/b:

```
T₁(n) = a·T₁(n/b) + f(n)    // Work recurrence: count every branch
T∞(n) = T∞(n/b) + f(n)      // Span recurrence: only ONE branch on the critical path!
```
Notice: Span only counts one branch! Parallel execution means subproblems run concurrently, so span includes just the longest path.
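For example, parallel merge sort divides into a = 2 halves (b = 2) with a Θ(n) sequential merge:

```
T₁(n) = 2·T₁(n/2) + Θ(n) = Θ(n log n)   // work: every level does Θ(n) merging
T∞(n) = T∞(n/2) + Θ(n) = Θ(n)           // span: the top-level merge dominates
```

Parallelizing the merge shrinks the f(n) term on the critical path, which is how the O(log² n) span in the table below is reached.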
| Algorithm | Work T₁(n) | Span T∞(n) | Parallelism |
|---|---|---|---|
| Parallel Merge Sort | O(n log n) | O(log² n) | O(n / log n) |
| Parallel Quicksort (expected) | O(n log n) | O(log² n) | O(n / log n) |
| Parallel Reduce (sum) | O(n) | O(log n) | O(n / log n) |
| Parallel Map | O(n) | O(1)* | O(n) |
| Matrix Multiply (naive) | O(n³) | O(log n) | O(n³ / log n) |

*O(1) span assumes all n tasks can be forked at once; with binary fork-join splitting, the span is O(log n).
Example: Parallel Sum
```
sum([a₁, a₂, ..., aₙ]):
    if n == 1: return a₁
    left  = sum([a₁, ..., a_{n/2}])       // Fork
    right = sum([a_{n/2+1}, ..., aₙ])     // Compute
    return left + right                   // Join + Combine
```
Work: T₁(n) = 2·T₁(n/2) + O(1) = O(n) (must touch every element)
Span: T∞(n) = T∞(n/2) + O(1) = O(log n) (tree height)
Parallelism: O(n) / O(log n) = O(n / log n)
With n = 1 million elements, parallelism ≈ 50,000—vastly more than available cores!
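A runnable Java version of this sum, with the same fork/compute/join shape (class name and threshold are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Parallel sum over a half-open range [lo, hi) of the array.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, just loop
    private final int[] array;
    private final int lo, hi;

    SumTask(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // Base case
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(array, lo, mid);
        SumTask right = new SumTask(array, mid, hi);
        left.fork();                         // Fork the left half
        long rightSum = right.compute();     // Compute the right half here
        long leftSum = left.join();          // Join
        return leftSum + rightSum;           // Combine
    }

    static long sum(int[] array) {
        return ForkJoinPool.commonPool().invoke(new SumTask(array, 0, array.length));
    }
}
```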
Merge sort is the canonical parallel divide-and-conquer sorting algorithm. Its regular structure (always splits in half) makes it ideal for parallel execution.
Parallelism sources:
- The recursive sorts of the left and right halves are independent and run in parallel.
- The merge step can itself be parallelized (see the note after the code).

The basic parallel merge sort, which merges sequentially, achieves O(n log n) work but only O(n) span; parallelizing the merge brings the span down to the O(log² n) listed in the table above.
```java
import java.util.concurrent.*;
import java.util.Arrays;

public class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 4096;
    private final int[] array, buffer;
    private final int lo, hi;

    public ParallelMergeSort(int[] array, int[] buffer, int lo, int hi) {
        this.array = array;
        this.buffer = buffer;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        int size = hi - lo;

        // Base case: use sequential sort
        if (size <= THRESHOLD) {
            Arrays.sort(array, lo, hi);
            return;
        }

        int mid = lo + size / 2;

        // DIVIDE & CONQUER: Sort halves in parallel
        ParallelMergeSort left = new ParallelMergeSort(array, buffer, lo, mid);
        ParallelMergeSort right = new ParallelMergeSort(array, buffer, mid, hi);
        invokeAll(left, right);  // Fork both, join both

        // COMBINE: Merge sorted halves
        merge(lo, mid, hi);
    }

    private void merge(int lo, int mid, int hi) {
        // Copy to buffer
        System.arraycopy(array, lo, buffer, lo, hi - lo);

        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) {
            if (buffer[i] <= buffer[j]) {
                array[k++] = buffer[i++];
            } else {
                array[k++] = buffer[j++];
            }
        }
        while (i < mid) array[k++] = buffer[i++];
        while (j < hi) array[k++] = buffer[j++];
    }

    public static void sort(int[] array) {
        int[] buffer = new int[array.length];
        ForkJoinPool.commonPool().invoke(
            new ParallelMergeSort(array, buffer, 0, array.length));
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (int)(Math.random() * 1_000_000);
        }

        long start = System.nanoTime();
        sort(data);
        long elapsed = System.nanoTime() - start;

        System.out.printf("Sorted %d elements in %.2f ms%n",
            data.length, elapsed / 1e6);

        // Verify
        for (int i = 1; i < data.length; i++) {
            if (data[i] < data[i-1]) {
                System.out.println("ERROR: Not sorted!");
                return;
            }
        }
        System.out.println("Verified: Array is sorted");
    }
}
```

The basic parallel merge sort still merges sequentially (O(n) per level). For maximum parallelism, use a parallel merge: split at the median of the larger half and recursively merge the resulting pairs in parallel. This reduces the span from O(n) to O(log² n) but adds complexity.
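A sketch of that parallel merge, following the classic median-split scheme. The class name, threshold, and helper names are illustrative; it merges two sorted ranges of src into dst:

```java
import java.util.concurrent.RecursiveAction;

// Sketch of a parallel merge: merges sorted ranges src[lo1, hi1) and
// src[lo2, hi2) into dst starting at dLo.
class ParallelMerge extends RecursiveAction {
    private static final int MERGE_THRESHOLD = 4096;
    private final int[] src, dst;
    private final int lo1, hi1, lo2, hi2, dLo;

    ParallelMerge(int[] src, int[] dst,
                  int lo1, int hi1, int lo2, int hi2, int dLo) {
        this.src = src; this.dst = dst;
        this.lo1 = lo1; this.hi1 = hi1;
        this.lo2 = lo2; this.hi2 = hi2;
        this.dLo = dLo;
    }

    @Override
    protected void compute() {
        int n1 = hi1 - lo1, n2 = hi2 - lo2;
        if (n1 < n2) {  // Ensure the first range is the larger one
            new ParallelMerge(src, dst, lo2, hi2, lo1, hi1, dLo).compute();
            return;
        }
        if (n1 + n2 <= MERGE_THRESHOLD) {   // Small: merge sequentially
            sequentialMerge();
            return;
        }
        int mid1 = (lo1 + hi1) >>> 1;                     // Median of the larger range
        int mid2 = lowerBound(src, lo2, hi2, src[mid1]);  // Split point in the smaller range
        int dMid = dLo + (mid1 - lo1) + (mid2 - lo2);
        dst[dMid] = src[mid1];                            // Median lands in its final slot
        invokeAll(  // Recursively merge the two halves in parallel
            new ParallelMerge(src, dst, lo1, mid1, lo2, mid2, dLo),
            new ParallelMerge(src, dst, mid1 + 1, hi1, mid2, hi2, dMid + 1));
    }

    private void sequentialMerge() {
        int i = lo1, j = lo2, k = dLo;
        while (i < hi1 && j < hi2)
            dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < hi1) dst[k++] = src[i++];
        while (j < hi2) dst[k++] = src[j++];
    }

    // First index in [lo, hi) whose value is >= key
    private static int lowerBound(int[] a, int lo, int hi, int key) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Wiring this into ParallelMergeSort takes some restructuring: the merge now writes into a separate destination, so the sort typically alternates the roles of array and buffer between recursion levels.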
Quicksort can also be parallelized, but it presents unique challenges:
Challenges:
- The partition step is sequential in the basic algorithm and costs O(n) at the top level, which dominates the span.
- Partition sizes depend on the pivot, so the two subproblems may be badly unbalanced.
- A pathological pivot degrades work to O(n²) and destroys parallelism.

Parallelization strategies:
- Sort the two partitions in parallel; once partitioning is done they are independent.
- Use good pivot selection (e.g., median-of-three, as in the code below) to keep partitions balanced.
- Fall back to a sequential sort below a threshold to amortize task overhead.
- For maximum parallelism, parallelize the partition step itself (significantly more complex).
```java
import java.util.concurrent.*;
import java.util.Arrays;

public class ParallelQuicksort extends RecursiveAction {
    private static final int THRESHOLD = 4096;
    private final int[] array;
    private final int lo, hi;

    public ParallelQuicksort(int[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(array, lo, hi);
            return;
        }

        // DIVIDE: Partition around pivot
        int pivotIndex = partition(lo, hi);

        // CONQUER: Sort partitions in parallel
        ParallelQuicksort left = new ParallelQuicksort(array, lo, pivotIndex);
        ParallelQuicksort right = new ParallelQuicksort(array, pivotIndex + 1, hi);

        // Both partitions are independent - parallelize
        invokeAll(left, right);

        // COMBINE: Nothing to do - array is sorted in place
    }

    private int partition(int lo, int hi) {
        // Median-of-three pivot selection for better balance
        int mid = lo + (hi - lo) / 2;
        if (array[mid] < array[lo]) swap(lo, mid);
        if (array[hi-1] < array[lo]) swap(lo, hi-1);
        if (array[hi-1] < array[mid]) swap(mid, hi-1);

        int pivot = array[mid];
        swap(mid, hi - 1);  // Move pivot to end

        int i = lo, j = hi - 2;
        while (true) {
            while (array[i] < pivot) i++;
            while (j > lo && array[j] > pivot) j--;
            if (i >= j) break;
            swap(i++, j--);
        }
        swap(i, hi - 1);  // Restore pivot
        return i;
    }

    private void swap(int i, int j) {
        int temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }

    public static void sort(int[] array) {
        ForkJoinPool.commonPool().invoke(
            new ParallelQuicksort(array, 0, array.length));
    }
}
```

| Algorithm | Parallelism Quality | Notes |
|---|---|---|
| Merge Sort | Excellent | Balanced division, independent subproblems |
| Quicksort | Good | Partition may be unbalanced; use good pivot selection |
| Binary Search | Poor | Sequential by nature - only one branch matters |
| Matrix Multiply | Excellent | Block decomposition parallelizes well |
| Tree Traversal | Variable | Depends on tree balance |
| Fast Fourier Transform | Excellent | Regular structure, independent butterflies |
Binary search is divide-and-conquer but NOT parallel: each step depends on the previous comparison. The key is independent subproblems—if solving one branch affects another, parallelism is limited.
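By contrast, the matrix multiply entry above ("block decomposition parallelizes well") is easy to make concrete: split the output matrix into independent row blocks. A minimal sketch, with illustrative class name and threshold:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch: parallel matrix multiply by row-block decomposition.
// Each task owns a range of rows of C; halve the range until it is
// small enough, then compute those rows of C = A * B sequentially.
public class ParallelMatrixMultiply extends RecursiveAction {
    private static final int ROW_THRESHOLD = 16; // tune per machine
    private final double[][] a, b, c;
    private final int rowLo, rowHi;

    public ParallelMatrixMultiply(double[][] a, double[][] b, double[][] c,
                                  int rowLo, int rowHi) {
        this.a = a; this.b = b; this.c = c;
        this.rowLo = rowLo; this.rowHi = rowHi;
    }

    @Override
    protected void compute() {
        if (rowHi - rowLo <= ROW_THRESHOLD) {
            int n = b.length, p = b[0].length;
            for (int i = rowLo; i < rowHi; i++) {       // each output row is
                for (int j = 0; j < p; j++) {           // independent of the others
                    double sum = 0;
                    for (int k = 0; k < n; k++) sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
            }
            return;
        }
        int mid = (rowLo + rowHi) >>> 1;    // DIVIDE: split the row range
        invokeAll(new ParallelMatrixMultiply(a, b, c, rowLo, mid),
                  new ParallelMatrixMultiply(a, b, c, mid, rowHi));
        // COMBINE: nothing to do - each task writes disjoint rows of c
    }

    public static void multiply(double[][] a, double[][] b, double[][] c) {
        ForkJoinPool.commonPool().invoke(
            new ParallelMatrixMultiply(a, b, c, 0, a.length));
    }
}
```

Because each task writes a disjoint block of the output, no synchronization beyond join is needed.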
Threshold Tuning Example:
```java
// Benchmark with different thresholds (assumes THRESHOLD has been made
// a non-final static field, or is passed in, so it can be reassigned -
// the classes above declare it static final)
int[] thresholds = {256, 1024, 4096, 16384, 65536};
for (int threshold : thresholds) {
    THRESHOLD = threshold;
    long start = System.nanoTime();
    sort(data.clone());
    long time = System.nanoTime() - start;
    System.out.printf("Threshold %6d: %.2f ms%n", threshold, time / 1e6);
}

// Output might show:
// Threshold    256: 180.5 ms  (too fine - overhead)
// Threshold   1024: 145.2 ms
// Threshold   4096: 135.8 ms  (optimal for this workload)
// Threshold  16384: 140.1 ms
// Threshold  65536: 165.3 ms  (too coarse - less parallelism)
```
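Since the tasks above declare THRESHOLD as static final, one alternative sketch threads the cutoff through the constructor instead, which keeps the field immutable per task while letting benchmarks vary it (class name is illustrative):

```java
import java.util.concurrent.RecursiveTask;

// Variant where the sequential cutoff is a per-task field set by the
// caller, rather than a static final constant.
class TunableSumTask extends RecursiveTask<Long> {
    private final int threshold;   // sequential cutoff, chosen by the caller
    private final int[] array;
    private final int lo, hi;

    TunableSumTask(int[] array, int lo, int hi, int threshold) {
        this.array = array; this.lo = lo; this.hi = hi;
        this.threshold = threshold;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= threshold) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        TunableSumTask left = new TunableSumTask(array, lo, mid, threshold);
        TunableSumTask right = new TunableSumTask(array, mid, hi, threshold);
        left.fork();
        long rightSum = right.compute();
        return left.join() + rightSum;
    }
}
```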
Congratulations! You've completed the Fork-Join Model module. You now understand fork operations, join synchronization, task decomposition strategies, Java's Fork/Join framework, and parallel divide-and-conquer algorithms. These concepts form the foundation for building efficient parallel applications.