Divide-and-conquer is one of the most naturally parallelizable algorithm paradigms. The pattern—split a problem into independent subproblems, solve them, and combine results—maps perfectly onto the fork-join model.
This synergy isn't coincidental. The fork-join framework was specifically designed to efficiently execute divide-and-conquer algorithms. When fork-join execution meets divide-and-conquer algorithms, we get a powerful combination: logarithmic-depth parallelism with automatic load balancing.
By the end of this page, you will understand: the divide-and-conquer parallel pattern, work and span analysis, classic parallel algorithms (merge sort, quicksort, matrix multiply), optimization techniques, and when divide-and-conquer parallelism is most effective.
Every divide-and-conquer algorithm follows this structure:
```
solve(problem):
    if problem is small enough:
        return solve_directly(problem)       // Base case

    subproblems = divide(problem)            // DIVIDE

    subresults = []
    for each subproblem:
        subresults.append(solve(subproblem)) // CONQUER (recursive)

    return combine(subresults)               // COMBINE
```
Parallelization opportunity: The recursive calls are independent! Each subproblem can be solved in parallel:
```
parallel_solve(problem):
    if problem is small enough:
        return solve_directly(problem)

    subproblems = divide(problem)            // n subproblems

    // FORK: solve all but the last subproblem asynchronously
    for i = 0 to n-2:
        fork parallel_solve(subproblems[i])
    subresults[n-1] = parallel_solve(subproblems[n-1])  // Compute last directly

    // JOIN: wait for the forked results
    for i = 0 to n-2:
        subresults[i] = join()

    return combine(subresults)
```
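As a concrete, runnable instance of this template, here is a binary (k = 2) Java Fork/Join task that finds the maximum of an array. The class name and threshold are illustrative, not part of any standard API:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative instance of the template: divide = split the range in half,
// base case = scan a small range, combine = Math.max.
class MaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 10_000;
    private final int[] array;
    private final int lo, hi;   // half-open range [lo, hi)

    MaxTask(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {              // Base case: solve directly
            int max = Integer.MIN_VALUE;         // (empty range yields MIN_VALUE)
            for (int i = lo; i < hi; i++) max = Math.max(max, array[i]);
            return max;
        }
        int mid = lo + (hi - lo) / 2;            // DIVIDE
        MaxTask left = new MaxTask(array, lo, mid);
        MaxTask right = new MaxTask(array, mid, hi);
        left.fork();                             // FORK the left half
        int rightMax = right.compute();          // compute the right half directly
        int leftMax = left.join();               // JOIN
        return Math.max(leftMax, rightMax);      // COMBINE
    }

    static int max(int[] array) {
        return ForkJoinPool.commonPool().invoke(new MaxTask(array, 0, array.length));
    }
}
```

Forking one half and computing the other in the current thread keeps the worker busy instead of idling while it waits for both children.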
| Component | Sequential | Parallel Opportunity |
|---|---|---|
| Divide | O(1) to O(n) | Usually sequential (quick, low overhead) |
| Conquer | Recursive calls | Parallel: Each subproblem independent |
| Combine | O(1) to O(n) | Sometimes parallel (e.g., parallel merge) |
| Base case | Direct solution | Sequential (too small to parallelize) |
The parallelism in divide-and-conquer comes from solving subproblems concurrently. If you divide into k subproblems at each level and have d levels, you have up to k^d parallel tasks. The binary divide (k=2) creates 2^d tasks from d levels of recursion.
To analyze parallel divide-and-conquer algorithms, we use work and span:
For divide-and-conquer, dividing into a subproblems, each of size n/b:

```
T₁(n) = a·T₁(n/b) + f(n)    // Work recurrence: count every branch
T∞(n) = T∞(n/b) + f(n)      // Span recurrence: only ONE branch on the critical path!
```
Notice: Span only counts one branch! Parallel execution means subproblems run concurrently, so span includes just the longest path.
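For example, parallel merge sort divides into a = 2 halves (b = 2) with a Θ(n) sequential merge:

```
T₁(n) = 2·T₁(n/2) + Θ(n) = Θ(n log n)   // work: every level does Θ(n) merging
T∞(n) = T∞(n/2) + Θ(n) = Θ(n)           // span: the top-level merge dominates
```

Parallelizing the merge shrinks the f(n) term on the critical path, which is how the O(log² n) span in the table below is reached.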
| Algorithm | Work T₁(n) | Span T∞(n) | Parallelism |
|---|---|---|---|
| Parallel Merge Sort | O(n log n) | O(log² n) | O(n / log n) |
| Parallel Quicksort (expected) | O(n log n) | O(log² n) | O(n / log n) |
| Parallel Reduce (sum) | O(n) | O(log n) | O(n / log n) |
| Parallel Map | O(n) | O(1)* | O(n) |
| Matrix Multiply (naive) | O(n³) | O(log n) | O(n³ / log n) |

*O(1) span assumes all n tasks can be forked at once; with binary fork-join splitting, the span is O(log n).
Example: Parallel Sum
```
sum([a₁, a₂, ..., aₙ]):
    if n == 1: return a₁
    left  = sum([a₁, ..., a_{n/2}])       // Fork
    right = sum([a_{n/2+1}, ..., aₙ])     // Compute
    return left + right                   // Join + Combine
```
Work: T₁(n) = 2·T₁(n/2) + O(1) = O(n) (must touch every element)
Span: T∞(n) = T∞(n/2) + O(1) = O(log n) (tree height)
Parallelism: O(n) / O(log n) = O(n / log n)
With n = 1 million elements, parallelism ≈ 50,000—vastly more than available cores!
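A runnable Java version of this sum, with the same fork/compute/join shape (class name and threshold are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Parallel sum over a half-open range [lo, hi) of the array.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this, just loop
    private final int[] array;
    private final int lo, hi;

    SumTask(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // Base case
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(array, lo, mid);
        SumTask right = new SumTask(array, mid, hi);
        left.fork();                         // Fork the left half
        long rightSum = right.compute();     // Compute the right half here
        long leftSum = left.join();          // Join
        return leftSum + rightSum;           // Combine
    }

    static long sum(int[] array) {
        return ForkJoinPool.commonPool().invoke(new SumTask(array, 0, array.length));
    }
}
```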
Merge sort is the canonical parallel divide-and-conquer sorting algorithm. Its regular structure (always splits in half) makes it ideal for parallel execution.
Parallelism sources:
- The recursive sorts of the left and right halves are independent and run in parallel.
- The merge step can itself be parallelized (see the note after the code).

The basic parallel merge sort, which merges sequentially, achieves O(n log n) work but only O(n) span; parallelizing the merge brings the span down to the O(log² n) listed in the table above.
```java
import java.util.concurrent.*;
import java.util.Arrays;

public class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 4096;
    private final int[] array, buffer;
    private final int lo, hi;

    public ParallelMergeSort(int[] array, int[] buffer, int lo, int hi) {
        this.array = array;
        this.buffer = buffer;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        int size = hi - lo;

        // Base case: use sequential sort
        if (size <= THRESHOLD) {
            Arrays.sort(array, lo, hi);
            return;
        }

        int mid = lo + size / 2;

        // DIVIDE & CONQUER: Sort halves in parallel
        ParallelMergeSort left = new ParallelMergeSort(array, buffer, lo, mid);
        ParallelMergeSort right = new ParallelMergeSort(array, buffer, mid, hi);
        invokeAll(left, right);  // Fork both, join both

        // COMBINE: Merge sorted halves
        merge(lo, mid, hi);
    }

    private void merge(int lo, int mid, int hi) {
        // Copy to buffer
        System.arraycopy(array, lo, buffer, lo, hi - lo);

        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) {
            if (buffer[i] <= buffer[j]) {
                array[k++] = buffer[i++];
            } else {
                array[k++] = buffer[j++];
            }
        }
        while (i < mid) array[k++] = buffer[i++];
        while (j < hi) array[k++] = buffer[j++];
    }

    public static void sort(int[] array) {
        int[] buffer = new int[array.length];
        ForkJoinPool.commonPool().invoke(
            new ParallelMergeSort(array, buffer, 0, array.length));
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (int)(Math.random() * 1_000_000);
        }

        long start = System.nanoTime();
        sort(data);
        long elapsed = System.nanoTime() - start;

        System.out.printf("Sorted %d elements in %.2f ms%n",
            data.length, elapsed / 1e6);

        // Verify
        for (int i = 1; i < data.length; i++) {
            if (data[i] < data[i-1]) {
                System.out.println("ERROR: Not sorted!");
                return;
            }
        }
        System.out.println("Verified: Array is sorted");
    }
}
```

The basic parallel merge sort still merges sequentially (O(n) per level). For maximum parallelism, use a parallel merge: split at the median of the larger half and recursively merge the resulting pairs in parallel. This reduces the span from O(n) to O(log² n) but adds complexity.
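A sketch of that parallel merge, following the classic median-split scheme. The class name, threshold, and helper names are illustrative; it merges two sorted ranges of src into dst:

```java
import java.util.concurrent.RecursiveAction;

// Sketch of a parallel merge: merges sorted ranges src[lo1, hi1) and
// src[lo2, hi2) into dst starting at dLo.
class ParallelMerge extends RecursiveAction {
    private static final int MERGE_THRESHOLD = 4096;
    private final int[] src, dst;
    private final int lo1, hi1, lo2, hi2, dLo;

    ParallelMerge(int[] src, int[] dst,
                  int lo1, int hi1, int lo2, int hi2, int dLo) {
        this.src = src; this.dst = dst;
        this.lo1 = lo1; this.hi1 = hi1;
        this.lo2 = lo2; this.hi2 = hi2;
        this.dLo = dLo;
    }

    @Override
    protected void compute() {
        int n1 = hi1 - lo1, n2 = hi2 - lo2;
        if (n1 < n2) {  // Ensure the first range is the larger one
            new ParallelMerge(src, dst, lo2, hi2, lo1, hi1, dLo).compute();
            return;
        }
        if (n1 + n2 <= MERGE_THRESHOLD) {   // Small: merge sequentially
            sequentialMerge();
            return;
        }
        int mid1 = (lo1 + hi1) >>> 1;                     // Median of the larger range
        int mid2 = lowerBound(src, lo2, hi2, src[mid1]);  // Split point in the smaller range
        int dMid = dLo + (mid1 - lo1) + (mid2 - lo2);
        dst[dMid] = src[mid1];                            // Median lands in its final slot
        invokeAll(  // Recursively merge the two halves in parallel
            new ParallelMerge(src, dst, lo1, mid1, lo2, mid2, dLo),
            new ParallelMerge(src, dst, mid1 + 1, hi1, mid2, hi2, dMid + 1));
    }

    private void sequentialMerge() {
        int i = lo1, j = lo2, k = dLo;
        while (i < hi1 && j < hi2)
            dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < hi1) dst[k++] = src[i++];
        while (j < hi2) dst[k++] = src[j++];
    }

    // First index in [lo, hi) whose value is >= key
    private static int lowerBound(int[] a, int lo, int hi, int key) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Wiring this into ParallelMergeSort takes some restructuring: the merge now writes into a separate destination, so the sort typically alternates the roles of array and buffer between recursion levels.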
Quicksort can also be parallelized, but it presents unique challenges:
Challenges:
- The partition step is sequential in the basic algorithm and costs O(n) at the top level, which dominates the span.
- Partition sizes depend on the pivot, so the two subproblems may be badly unbalanced.
- A pathological pivot degrades work to O(n²) and destroys parallelism.

Parallelization strategies:
- Sort the two partitions in parallel; once partitioning is done they are independent.
- Use good pivot selection (e.g., median-of-three, as in the code below) to keep partitions balanced.
- Fall back to a sequential sort below a threshold to amortize task overhead.
- For maximum parallelism, parallelize the partition step itself (significantly more complex).
```java
import java.util.concurrent.*;
import java.util.Arrays;

public class ParallelQuicksort extends RecursiveAction {
    private static final int THRESHOLD = 4096;
    private final int[] array;
    private final int lo, hi;

    public ParallelQuicksort(int[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(array, lo, hi);
            return;
        }

        // DIVIDE: Partition around pivot
        int pivotIndex = partition(lo, hi);

        // CONQUER: Sort partitions in parallel
        ParallelQuicksort left = new ParallelQuicksort(array, lo, pivotIndex);
        ParallelQuicksort right = new ParallelQuicksort(array, pivotIndex + 1, hi);

        // Both partitions are independent - parallelize
        invokeAll(left, right);

        // COMBINE: Nothing to do - array is sorted in place
    }

    private int partition(int lo, int hi) {
        // Median-of-three pivot selection for better balance
        int mid = lo + (hi - lo) / 2;
        if (array[mid] < array[lo]) swap(lo, mid);
        if (array[hi-1] < array[lo]) swap(lo, hi-1);
        if (array[hi-1] < array[mid]) swap(mid, hi-1);

        int pivot = array[mid];
        swap(mid, hi - 1);  // Move pivot to end

        int i = lo, j = hi - 2;
        while (true) {
            while (array[i] < pivot) i++;
            while (j > lo && array[j] > pivot) j--;
            if (i >= j) break;
            swap(i++, j--);
        }
        swap(i, hi - 1);  // Restore pivot
        return i;
    }

    private void swap(int i, int j) {
        int temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }

    public static void sort(int[] array) {
        ForkJoinPool.commonPool().invoke(
            new ParallelQuicksort(array, 0, array.length));
    }
}
```

| Algorithm | Parallelism Quality | Notes |
|---|---|---|
| Merge Sort | Excellent | Balanced division, independent subproblems |
| Quicksort | Good | Partition may be unbalanced; use good pivot selection |
| Binary Search | Poor | Sequential by nature - only one branch matters |
| Matrix Multiply | Excellent | Block decomposition parallelizes well |
| Tree Traversal | Variable | Depends on tree balance |
| Fast Fourier Transform | Excellent | Regular structure, independent butterflies |
Binary search is divide-and-conquer but NOT parallel: each step depends on the previous comparison. The key is independent subproblems—if solving one branch affects another, parallelism is limited.
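By contrast, the matrix multiply entry above ("block decomposition parallelizes well") is easy to make concrete: split the output matrix into independent row blocks. A minimal sketch, with illustrative class name and threshold:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Sketch: parallel matrix multiply by row-block decomposition.
// Each task owns a range of rows of C; halve the range until it is
// small enough, then compute those rows of C = A * B sequentially.
public class ParallelMatrixMultiply extends RecursiveAction {
    private static final int ROW_THRESHOLD = 16; // tune per machine
    private final double[][] a, b, c;
    private final int rowLo, rowHi;

    public ParallelMatrixMultiply(double[][] a, double[][] b, double[][] c,
                                  int rowLo, int rowHi) {
        this.a = a; this.b = b; this.c = c;
        this.rowLo = rowLo; this.rowHi = rowHi;
    }

    @Override
    protected void compute() {
        if (rowHi - rowLo <= ROW_THRESHOLD) {
            int n = b.length, p = b[0].length;
            for (int i = rowLo; i < rowHi; i++) {       // each output row is
                for (int j = 0; j < p; j++) {           // independent of the others
                    double sum = 0;
                    for (int k = 0; k < n; k++) sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
            }
            return;
        }
        int mid = (rowLo + rowHi) >>> 1;    // DIVIDE: split the row range
        invokeAll(new ParallelMatrixMultiply(a, b, c, rowLo, mid),
                  new ParallelMatrixMultiply(a, b, c, mid, rowHi));
        // COMBINE: nothing to do - each task writes disjoint rows of c
    }

    public static void multiply(double[][] a, double[][] b, double[][] c) {
        ForkJoinPool.commonPool().invoke(
            new ParallelMatrixMultiply(a, b, c, 0, a.length));
    }
}
```

Because each task writes a disjoint block of the output, no synchronization beyond join is needed.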
Threshold Tuning Example:
```java
// Benchmark with different thresholds (assumes THRESHOLD has been made
// a non-final static field, or is passed in, so it can be reassigned -
// the classes above declare it static final)
int[] thresholds = {256, 1024, 4096, 16384, 65536};
for (int threshold : thresholds) {
    THRESHOLD = threshold;
    long start = System.nanoTime();
    sort(data.clone());
    long time = System.nanoTime() - start;
    System.out.printf("Threshold %6d: %.2f ms%n", threshold, time / 1e6);
}

// Output might show:
// Threshold    256: 180.5 ms  (too fine - overhead)
// Threshold   1024: 145.2 ms
// Threshold   4096: 135.8 ms  (optimal for this workload)
// Threshold  16384: 140.1 ms
// Threshold  65536: 165.3 ms  (too coarse - less parallelism)
```
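Since the tasks above declare THRESHOLD as static final, one alternative sketch threads the cutoff through the constructor instead, which keeps the field immutable per task while letting benchmarks vary it (class name is illustrative):

```java
import java.util.concurrent.RecursiveTask;

// Variant where the sequential cutoff is a per-task field set by the
// caller, rather than a static final constant.
class TunableSumTask extends RecursiveTask<Long> {
    private final int threshold;   // sequential cutoff, chosen by the caller
    private final int[] array;
    private final int lo, hi;

    TunableSumTask(int[] array, int lo, int hi, int threshold) {
        this.array = array; this.lo = lo; this.hi = hi;
        this.threshold = threshold;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= threshold) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += array[i];
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        TunableSumTask left = new TunableSumTask(array, lo, mid, threshold);
        TunableSumTask right = new TunableSumTask(array, mid, hi, threshold);
        left.fork();
        long rightSum = right.compute();
        return left.join() + rightSum;
    }
}
```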
Congratulations! You've completed the Fork-Join Model module. You now understand fork operations, join synchronization, task decomposition strategies, Java's Fork/Join framework, and parallel divide-and-conquer algorithms. These concepts form the foundation for building efficient parallel applications.