When engineers discuss system performance, they often conflate two fundamentally different metrics: latency (how fast a single request completes) and throughput (how many requests the system can handle per unit of time). While latency optimization focuses on making individual operations faster, throughput optimization is about maximizing the total work accomplished—the volume of requests processed, data transformed, or transactions completed.
Consider two database systems: System A is tuned for low per-query latency, while System B has worse per-query latency but 5x higher throughput. For many workloads—batch processing, analytics pipelines, high-traffic APIs—throughput is the primary constraint, not latency.
This page deep-dives into parallelization—the most fundamental throughput optimization technique. You'll understand different parallelization models, their trade-offs, implementation strategies, and when each approach is appropriate. By the end, you'll be able to design systems that extract maximum throughput from available hardware through intelligent parallel execution.
Parallelization is the technique of dividing work into smaller units that can be executed simultaneously across multiple processing resources. Unlike sequential execution where tasks complete one after another, parallel execution exploits the availability of multiple CPUs, cores, threads, or machines to process work concurrently.
The fundamental insight: Modern computing resources are massively parallel. A single server might have 64+ CPU cores, each capable of independent computation. A distributed system might span thousands of machines. Sequential processing utilizes only a tiny fraction of this capacity. Parallelization unlocks the full potential.
Amdahl's Law—The Theoretical Limit:
Amdahl's Law defines the theoretical speedup achievable through parallelization:
$$ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}} $$
Where: P is the fraction of the work that can be parallelized, and N is the number of parallel processing units (cores, threads, or machines).
Key implications:
| Parallelizable (P) | 2 Cores | 8 Cores | 64 Cores | Infinite |
|---|---|---|---|---|
| 50% | 1.33x | 1.78x | 1.97x | 2.00x |
| 75% | 1.60x | 2.91x | 3.82x | 4.00x |
| 90% | 1.82x | 4.71x | 8.77x | 10.00x |
| 95% | 1.90x | 5.93x | 15.42x | 20.00x |
| 99% | 1.98x | 7.48x | 39.26x | 100.00x |
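The table values follow directly from the formula; as a quick sanity check, here is a small TypeScript sketch (the amdahlSpeedup helper is ours, not from any library) that reproduces a few rows:

```typescript
// Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
function amdahlSpeedup(parallelFraction: number, workers: number): number {
  return 1 / ((1 - parallelFraction) + parallelFraction / workers);
}

// Reproduce a few rows of the table above
for (const p of [0.5, 0.9, 0.99]) {
  const row = [2, 8, 64].map(n => amdahlSpeedup(p, n).toFixed(2));
  console.log(`P=${p}: ${row.join('x, ')}x`);
}
// P=0.5: 1.33x, 1.78x, 1.97x
// P=0.9: 1.82x, 4.71x, 8.77x
// P=0.99: 1.98x, 7.48x, 39.26x
```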
Many teams add more servers expecting linear throughput gains, only to discover serialized components (shared locks, single-threaded coordinators, strict ordering requirements) cap their improvement. Before scaling out, identify and eliminate serial bottlenecks—this often yields greater returns than adding hardware.
Parallelism manifests in different forms depending on what is being parallelized and how. Understanding these distinctions is critical for choosing the right approach for your workload.
Granularity of Parallelism:
The size of parallelizable work units significantly impacts efficiency. Fine-grained parallelism splits work into many small units (individual records or requests); coarse-grained parallelism uses fewer, larger units (files, partitions, or batches).
Optimal granularity balances coordination costs against load distribution. Too fine-grained and you spend more time managing tasks than executing them. Too coarse-grained and some workers sit idle while others are overloaded.
At the implementation level, parallelism can be achieved through different mechanisms, each with distinct tradeoffs in isolation, overhead, and programming complexity.
| Mechanism | Memory | Creation Cost | Context Switch | Best For |
|---|---|---|---|---|
| Processes | Isolated (separate address space) | High (fork/exec) | Expensive (~1-10ms) | CPU-bound, crash isolation |
| Threads | Shared (same address space) | Medium (~10-100µs) | Moderate (~1-10µs) | Mixed CPU/IO, shared state |
| Coroutines/Green Threads | Shared (cooperative) | Very Low (~1µs) | Very Low (~100ns) | I/O-bound, massive concurrency |
| Async/Await | Shared (event loop) | Minimal | Minimal | High-concurrency I/O |
Process-Based Parallelism:
Processes provide the strongest isolation—each process has its own memory space, file descriptors, and runtime state. A crash in one process doesn't affect others (hence "crash isolation").
┌─────────────────┐
│ Parent Process │
│ ┌───────────┐ │
│ │ Memory │ │
│ └───────────┘ │
└────────┬────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Memory │ │ │ │ Memory │ │ │ │ Memory │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└─────────────┘ └─────────────┘ └─────────────┘
(Isolated) (Isolated) (Isolated)
When to use processes:
- CPU-bound workloads that need true multi-core execution
- Work that requires crash isolation (one worker failing must not take down the others)
- Runtimes whose threads cannot execute CPU-bound work in parallel (e.g., Python's GIL), where separate processes sidestep the limitation
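As a concrete illustration, here is a minimal Node.js sketch of process-based parallelism, assuming it runs as compiled CommonJS JavaScript (so __filename refers to this file); the countPrimes workload and the chunk sizes are hypothetical stand-ins for real CPU-bound work:

```typescript
import { fork } from 'node:child_process';
import { cpus } from 'node:os';

// Hypothetical CPU-bound work: count primes below a limit.
function countPrimes(limit: number): number {
  let count = 0;
  for (let n = 2; n < limit; n++) {
    let isPrime = true;
    for (let d = 2; d * d <= n; d++) {
      if (n % d === 0) { isPrime = false; break; }
    }
    if (isPrime) count++;
  }
  return count;
}

if (process.send === undefined) {
  // Parent: scatter one chunk of work per CPU core to an isolated worker process.
  const limits = cpus().map((_, i) => 100_000 + i * 10_000);
  let remaining = limits.length;
  let total = 0;
  for (const limit of limits) {
    const worker = fork(__filename); // separate address space, crash-isolated
    worker.on('message', (result) => {
      total += result as number;
      if (--remaining === 0) console.log('total primes found:', total);
    });
    worker.send(limit);
  }
} else {
  // Child: receive a limit over IPC, do the CPU-bound work, report back, exit.
  process.on('message', (limit) => {
    process.send!(countPrimes(limit as number));
    process.exit(0);
  });
}
```

Each worker is a full OS process, so a crash or memory leak in one does not corrupt the parent or its siblings; the cost is per-process memory and IPC serialization of every message.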
Thread-Based Parallelism:
Threads share memory within a process, enabling efficient data sharing but requiring careful synchronization to avoid race conditions.
┌───────────────────────────────────┐
│ Single Process │
│ ┌──────────────────────────────┐ │
│ │ Shared Memory │ │
│ │ (Heap, Global Variables) │ │
│ └──────────────────────────────┘ │
│ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Thread1│ │Thread2│ │ThreadN│ │
│ │ Stack │ │ Stack │ │ Stack │ │
│ └───────┘ └───────┘ └───────┘ │
└───────────────────────────────────┘
Thread safety considerations:
- Race conditions: two threads reading and writing the same data without synchronization produce lost or corrupted updates
- Deadlocks: threads acquiring multiple locks in inconsistent order can block each other forever
- Memory visibility: writes by one thread must be published (via locks, atomics, or memory barriers) before other threads can safely read them
Languages and thread models:
- Java, C++, C#, and Rust map application threads onto OS threads and can use all cores for CPU-bound work
- Python and Ruby offer threads, but a global interpreter lock serializes CPU-bound execution; use processes for CPU parallelism
- Go and Erlang multiplex lightweight tasks (goroutines, actor processes) onto a small pool of OS threads
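To make the race-condition point concrete, here is a minimal sketch using Node's worker_threads (which share memory only through SharedArrayBuffer), assuming the file runs as compiled JavaScript; the thread and increment counts are arbitrary:

```typescript
import { Worker, isMainThread, workerData } from 'node:worker_threads';

const THREADS = 4;
const INCREMENTS = 100_000;

if (isMainThread) {
  // One Int32 counter in memory shared by every thread.
  const shared = new SharedArrayBuffer(4);
  const counter = new Int32Array(shared);

  const workers = Array.from({ length: THREADS }, () =>
    new Worker(__filename, { workerData: shared })
  );

  Promise.all(
    workers.map(w => new Promise(resolve => w.on('exit', resolve)))
  ).then(() => {
    // With Atomics this always prints 400000; a plain counter[0]++ would lose
    // updates nondeterministically under contention (a race condition).
    console.log('counter =', counter[0]);
  });
} else {
  const counter = new Int32Array(workerData as SharedArrayBuffer);
  for (let i = 0; i < INCREMENTS; i++) {
    Atomics.add(counter, 0, 1); // atomic read-modify-write, no lock required
  }
}
```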
Coroutines and Async/Await:
Coroutines provide concurrency without parallelism—multiple tasks share a single thread, yielding cooperatively at I/O boundaries. This enables handling thousands of concurrent connections with minimal overhead.
┌────────────────────────────────────────────┐
│ Single Thread │
│ │
│ Request 1: ●→→→→○ (waiting for I/O) │
│ ↓ yield │
│ Request 2: ●→→→○ (waiting) │
│ ↓ yield │
│ Request 3: ●→→→→○ │
│ ↓ yield │
│ Request 1: (I/O complete) →→→→● │
│ │
│ ● = executing ○ = blocked/waiting │
└────────────────────────────────────────────┘
Key insight: While coroutines don't provide true CPU parallelism, they dramatically increase throughput for I/O-bound workloads by ensuring the CPU is always doing useful work instead of waiting for I/O.
Example concurrency capacity (rough orders of magnitude):
- OS threads: thousands per process; each thread reserves stack memory (often around 1MB) and adds scheduler overhead
- Coroutines/async tasks: hundreds of thousands per process; each task is a small heap object measured in kilobytes
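The sketch below, with simulateIo as a hypothetical stand-in for a real network or disk call, shows why: thousands of in-flight waits overlap on one thread, so total time is roughly one I/O latency rather than the sum of all of them.

```typescript
// Hypothetical stand-in for a network or disk call that takes ~50ms.
const simulateIo = (id: number): Promise<number> =>
  new Promise(resolve => setTimeout(() => resolve(id), 50));

async function main() {
  const start = Date.now();
  // 10,000 concurrent waits share a single thread; the event loop interleaves
  // them, so this completes in roughly 50ms rather than 10,000 × 50ms.
  const results = await Promise.all(
    Array.from({ length: 10_000 }, (_, i) => simulateIo(i))
  );
  console.log(`${results.length} tasks finished in ${Date.now() - start}ms`);
}

main();
```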
Production systems often combine models: multiple processes (for crash isolation and multi-core utilization) each running async I/O (for high concurrency). Example: Gunicorn spawns multiple worker processes, each running an async event loop in frameworks like FastAPI.
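A rough Node.js analogue of that hybrid setup is sketched below, assuming a recent Node version and a hypothetical fetchFromUpstream dependency: the cluster module forks one process per core, and each worker serves many concurrent requests on its own event loop.

```typescript
import cluster from 'node:cluster';
import { createServer } from 'node:http';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  // Fork one worker process per CPU core; the OS load-balances incoming connections.
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on('exit', () => cluster.fork()); // crash isolation: replace dead workers
} else {
  // Each worker handles many concurrent requests on a single event loop.
  createServer(async (_req, res) => {
    const data = await fetchFromUpstream(); // non-blocking I/O
    res.end(JSON.stringify(data));
  }).listen(8080);
}

// Hypothetical async dependency standing in for a database or service call.
async function fetchFromUpstream(): Promise<{ ok: boolean }> {
  return new Promise(resolve => setTimeout(() => resolve({ ok: true }), 10));
}
```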
At the distributed system level, parallelization takes specific architectural forms. These patterns have been refined through decades of experience at scale.
Scatter-Gather Pattern Deep Dive:
This pattern is ubiquitous in microservices architectures where a single request requires data from multiple services.
Request
│
▼
┌─────────────┐
│ Coordinator │
│ (Scatter) │
└──────┬──────┘
│
┌──────┼──────┬──────┬──────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐
│Svc A ││Svc B ││Svc C ││Svc D ││Svc E │
└──┬───┘└──┬───┘└──┬───┘└──┬───┘└──┬───┘
│ │ │ │ │
└───────┴───────┼───────┴───────┘
│
┌───────▼───────┐
│ Coordinator │
│ (Gather) │
└───────┬───────┘
│
▼
Response
Critical considerations:
- Timeout strategy: what happens when one service is slow?
- Failure handling: if one service fails, does the entire request fail, or do we return partial results?
- Result ordering: results arrive out of order; aggregation must handle this
- Connection management: N parallel calls mean N connections, so connection pooling is needed
```typescript
// Scatter-Gather pattern with timeout and partial results
interface ServiceResult<T> {
  service: string;
  success: boolean;
  data?: T;
  error?: string;
  latencyMs: number;
}

async function scatterGather<T>(
  services: string[],
  fetcher: (service: string) => Promise<T>,
  timeoutMs: number = 200
): Promise<ServiceResult<T>[]> {
  const fetchWithTimeout = async (service: string): Promise<ServiceResult<T>> => {
    const start = Date.now();
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), timeoutMs)
    );
    try {
      const data = await Promise.race([fetcher(service), timeoutPromise]);
      return { service, success: true, data, latencyMs: Date.now() - start };
    } catch (error) {
      return {
        service,
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error',
        latencyMs: Date.now() - start
      };
    }
  };

  // Scatter: Launch all fetches in parallel
  const promises = services.map(fetchWithTimeout);

  // Gather: Wait for all to complete (or timeout)
  const results = await Promise.allSettled(promises);

  // Aggregate results
  return results
    .filter((r): r is PromiseFulfilledResult<ServiceResult<T>> =>
      r.status === 'fulfilled'
    )
    .map(r => r.value);
}

// Usage example: Fetch user dashboard data from multiple microservices
const dashboardData = await scatterGather(
  ['user-service', 'recommendations', 'notifications', 'analytics'],
  async (service) => {
    const response = await fetch(`http://${service}/api/data/${userId}`);
    return response.json();
  },
  150 // 150ms timeout
);

const successfulResults = dashboardData.filter(r => r.success);
console.log(`Got ${successfulResults.length} of 4 services within timeout`);
```

Parallelization is not free. Every parallel system pays coordination costs that reduce the net speedup. Understanding and minimizing this overhead is crucial for effective parallelization.
| Operation | Approximate Latency | Impact on Parallelization |
|---|---|---|
| Create new thread (Java) | ~10-100µs | Significant for short tasks |
| Thread context switch | ~1-10µs | Multiplies with oversubscription |
| Acquire uncontended lock | ~10-100ns | Negligible in isolation |
| Acquire contended lock | ~1µs-10ms | Serial bottleneck under contention |
| Cache line invalidation | ~10-100ns | Compounds for shared mutable state |
| Network round-trip (same DC) | ~0.5-2ms | Dominates for small tasks |
| Network round-trip (cross-region) | ~50-200ms | Often prohibitive for fine-grained work |
Strategies to Minimize Overhead:
Batch Small Tasks — Instead of parallelizing 1,000 tasks of 1ms each (overhead-dominated), batch into 10 tasks of 100ms each (see the sketch after this list).
Lock-Free Data Structures — Use atomic operations and compare-and-swap (CAS) instead of locks where possible. Examples: ConcurrentHashMap in Java, atomic types in C++/Rust.
Thread-Local Storage — Avoid sharing state by giving each thread its own copy. Merge results only at the end.
Work Stealing Over Work Pushing — Let idle workers pull work rather than distributing work upfront. This naturally balances load.
Connection Pooling — For distributed parallelism, reuse network connections rather than establishing new ones per task.
Colocate Data and Compute — Move computation to data rather than moving data to computation. This is the core insight of MapReduce.
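A minimal sketch of the batching strategy from the first item above, with chunk and processBatch as hypothetical helpers; each batch becomes one unit of dispatch, so 1,000 items cost roughly 10 dispatches instead of 1,000:

```typescript
// Group many tiny tasks into batches so each dispatch amortizes its overhead.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical batch worker: stands in for one RPC or DB write per batch.
async function processBatch(batch: number[]): Promise<number[]> {
  return batch.map(n => n * 2);
}

async function run() {
  const tasks = Array.from({ length: 1_000 }, (_, i) => i);
  // 10 batches of 100 processed in parallel: 10 dispatches instead of 1,000.
  const results = await Promise.all(chunk(tasks, 100).map(processBatch));
  console.log(`processed ${results.flat().length} items`);
}

run();
```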
Optimal parallelization is about finding the right granularity—fine enough for good load balancing, coarse enough to amortize coordination costs. There's no universal formula; profile your specific workload to find the sweet spot.
Let's examine how production systems leverage parallelization to achieve massive throughput.
Database Query Parallelization:
Modern databases parallelize query execution across multiple dimensions:
Intra-Query Parallelism: A single query uses multiple threads
Inter-Query Parallelism: Multiple queries execute concurrently
PostgreSQL Parallel Query Example:
```sql
-- PostgreSQL will use parallel workers for this query
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT category, SUM(amount)
FROM transactions
WHERE created_at > '2024-01-01'
GROUP BY category;
-- Output shows: Gather Merge -> Parallel Hash Aggregate -> Parallel Seq Scan
```
PostgreSQL's planner automatically parallelizes when:
- The table is large enough (estimated scan size exceeds min_parallel_table_scan_size, 8MB by default)
- max_parallel_workers_per_gather is greater than zero and parallel workers are available
- The query does not write or lock rows and uses only functions marked PARALLEL SAFE
Parallelization can backfire when applied incorrectly. These anti-patterns waste resources or actually reduce throughput.
The most common anti-pattern is oversubscribing the thread pool. As a sizing heuristic: for CPU-bound work, threads = CPU cores; for I/O-bound work, threads = CPU cores × (1 + wait time / compute time).
Example: if tasks spend 80% of their time waiting on I/O (wait/compute = 4), the optimal pool is roughly cores × 5. Going beyond this adds context-switch overhead for diminishing returns.
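A tiny sketch of that heuristic (the optimalPoolSize helper is ours), computed from measured wait and compute times per task:

```typescript
import { cpus } from 'node:os';

// Pool-size heuristic: threads = cores × (1 + waitTime / computeTime).
function optimalPoolSize(waitMs: number, computeMs: number): number {
  const cores = cpus().length;
  return Math.max(1, Math.round(cores * (1 + waitMs / computeMs)));
}

// A task that waits 80ms on I/O for every 20ms of CPU work (wait/compute = 4):
// on an 8-core machine this suggests 8 × 5 = 40 threads.
console.log(optimalPoolSize(80, 20));
```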
We've explored parallelization as the foundational technique for throughput optimization. Let's consolidate the key insights:
- Amdahl's Law caps speedup: the serial fraction of your workload, not core count, determines the ceiling
- Match the mechanism to the workload: processes for isolation and CPU-bound work, threads for shared-state parallelism, coroutines/async for I/O-bound concurrency
- Coordination overhead (locks, context switches, network round-trips) erodes gains; batch small tasks and minimize shared mutable state
- Distributed patterns like scatter-gather require explicit timeout, partial-failure, and aggregation strategies
What's next:
Parallelization is just one dimension of throughput optimization. The next page explores batching—a complementary technique that amortizes per-operation overhead by grouping multiple operations together, often providing greater throughput gains than parallelization alone for I/O-bound workloads.
You now understand parallelization as a throughput optimization technique—from theoretical foundations (Amdahl's Law) through implementation mechanisms (processes, threads, async) to distributed patterns (scatter-gather, MapReduce) and common pitfalls. Next, we'll examine batching strategies.