When engineers discuss system performance, they often conflate two fundamentally different metrics: latency (how fast a single request completes) and throughput (how many requests the system can handle per unit of time). While latency optimization focuses on making individual operations faster, throughput optimization is about maximizing the total work accomplished—the volume of requests processed, data transformed, or transactions completed.
Consider two database systems: System A is tuned for low per-query latency, while System B has worse per-query latency but 5x higher throughput. For many workloads—batch processing, analytics pipelines, high-traffic APIs—throughput is the primary constraint, not latency.
This page deep-dives into parallelization—the most fundamental throughput optimization technique. You'll understand different parallelization models, their trade-offs, implementation strategies, and when each approach is appropriate. By the end, you'll be able to design systems that extract maximum throughput from available hardware through intelligent parallel execution.
Parallelization is the technique of dividing work into smaller units that can be executed simultaneously across multiple processing resources. Unlike sequential execution where tasks complete one after another, parallel execution exploits the availability of multiple CPUs, cores, threads, or machines to process work concurrently.
The fundamental insight: Modern computing resources are massively parallel. A single server might have 64+ CPU cores, each capable of independent computation. A distributed system might span thousands of machines. Sequential processing utilizes only a tiny fraction of this capacity. Parallelization unlocks the full potential.
Amdahl's Law—The Theoretical Limit:
Amdahl's Law defines the theoretical speedup achievable through parallelization:
$$ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}} $$
Where: P is the fraction of the work that can be parallelized, and N is the number of parallel processing units (cores, threads, or machines).
Key implications:
| Parallelizable (P) | 2 Cores | 8 Cores | 64 Cores | Infinite |
|---|---|---|---|---|
| 50% | 1.33x | 1.78x | 1.97x | 2.00x |
| 75% | 1.60x | 2.91x | 3.82x | 4.00x |
| 90% | 1.82x | 4.71x | 8.77x | 10.00x |
| 95% | 1.90x | 5.93x | 15.42x | 20.00x |
| 99% | 1.98x | 7.48x | 39.26x | 100.00x |
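The table values follow directly from the formula; as a quick sanity check, here is a small TypeScript sketch (the amdahlSpeedup helper is ours, not from any library) that reproduces a few rows:

```typescript
// Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
function amdahlSpeedup(parallelFraction: number, workers: number): number {
  return 1 / ((1 - parallelFraction) + parallelFraction / workers);
}

// Reproduce a few rows of the table above
for (const p of [0.5, 0.9, 0.99]) {
  const row = [2, 8, 64].map(n => amdahlSpeedup(p, n).toFixed(2));
  console.log(`P=${p}: ${row.join('x, ')}x`);
}
// P=0.5: 1.33x, 1.78x, 1.97x
// P=0.9: 1.82x, 4.71x, 8.77x
// P=0.99: 1.98x, 7.48x, 39.26x
```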
Many teams add more servers expecting linear throughput gains, only to discover serialized components (shared locks, single-threaded coordinators, strict ordering requirements) cap their improvement. Before scaling out, identify and eliminate serial bottlenecks—this often yields greater returns than adding hardware.
Parallelism manifests in different forms depending on what is being parallelized and how. Understanding these distinctions is critical for choosing the right approach for your workload.
Granularity of Parallelism:
The size of parallelizable work units significantly impacts efficiency. Fine-grained parallelism splits work into many small units (individual records or requests); coarse-grained parallelism uses fewer, larger units (files, partitions, or batches).
Optimal granularity balances coordination costs against load distribution. Too fine-grained and you spend more time managing tasks than executing them. Too coarse-grained and some workers sit idle while others are overloaded.
At the implementation level, parallelism can be achieved through different mechanisms, each with distinct tradeoffs in isolation, overhead, and programming complexity.
| Mechanism | Memory | Creation Cost | Context Switch | Best For |
|---|---|---|---|---|
| Processes | Isolated (separate address space) | High (fork/exec) | Expensive (~1-10ms) | CPU-bound, crash isolation |
| Threads | Shared (same address space) | Medium (~10-100µs) | Moderate (~1-10µs) | Mixed CPU/IO, shared state |
| Coroutines/Green Threads | Shared (cooperative) | Very Low (~1µs) | Very Low (~100ns) | I/O-bound, massive concurrency |
| Async/Await | Shared (event loop) | Minimal | Minimal | High-concurrency I/O |
Process-Based Parallelism:
Processes provide the strongest isolation—each process has its own memory space, file descriptors, and runtime state. A crash in one process doesn't affect others (hence "crash isolation").
┌─────────────────┐
│ Parent Process │
│ ┌───────────┐ │
│ │ Memory │ │
│ └───────────┘ │
└────────┬────────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Memory │ │ │ │ Memory │ │ │ │ Memory │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└─────────────┘ └─────────────┘ └─────────────┘
(Isolated) (Isolated) (Isolated)
When to use processes:
- CPU-bound workloads that need true multi-core execution
- Work that requires crash isolation (one worker failing must not take down the others)
- Runtimes whose threads cannot execute CPU-bound work in parallel (e.g., Python's GIL), where separate processes sidestep the limitation
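As a concrete illustration, here is a minimal Node.js sketch of process-based parallelism, assuming it runs as compiled CommonJS JavaScript (so __filename refers to this file); the countPrimes workload and the chunk sizes are hypothetical stand-ins for real CPU-bound work:

```typescript
import { fork } from 'node:child_process';
import { cpus } from 'node:os';

// Hypothetical CPU-bound work: count primes below a limit.
function countPrimes(limit: number): number {
  let count = 0;
  for (let n = 2; n < limit; n++) {
    let isPrime = true;
    for (let d = 2; d * d <= n; d++) {
      if (n % d === 0) { isPrime = false; break; }
    }
    if (isPrime) count++;
  }
  return count;
}

if (process.send === undefined) {
  // Parent: scatter one chunk of work per CPU core to an isolated worker process.
  const limits = cpus().map((_, i) => 100_000 + i * 10_000);
  let remaining = limits.length;
  let total = 0;
  for (const limit of limits) {
    const worker = fork(__filename); // separate address space, crash-isolated
    worker.on('message', (result) => {
      total += result as number;
      if (--remaining === 0) console.log('total primes found:', total);
    });
    worker.send(limit);
  }
} else {
  // Child: receive a limit over IPC, do the CPU-bound work, report back, exit.
  process.on('message', (limit) => {
    process.send!(countPrimes(limit as number));
    process.exit(0);
  });
}
```

Each worker is a full OS process, so a crash or memory leak in one does not corrupt the parent or its siblings; the cost is per-process memory and IPC serialization of every message.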
Thread-Based Parallelism:
Threads share memory within a process, enabling efficient data sharing but requiring careful synchronization to avoid race conditions.
┌───────────────────────────────────┐
│ Single Process │
│ ┌──────────────────────────────┐ │
│ │ Shared Memory │ │
│ │ (Heap, Global Variables) │ │
│ └──────────────────────────────┘ │
│ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │Thread1│ │Thread2│ │ThreadN│ │
│ │ Stack │ │ Stack │ │ Stack │ │
│ └───────┘ └───────┘ └───────┘ │
└───────────────────────────────────┘
Thread safety considerations:
- Race conditions: two threads reading and writing the same data without synchronization produce lost or corrupted updates
- Deadlocks: threads acquiring multiple locks in inconsistent order can block each other forever
- Memory visibility: writes by one thread must be published (via locks, atomics, or memory barriers) before other threads can safely read them
Languages and thread models:
- Java, C++, C#, and Rust map application threads onto OS threads and can use all cores for CPU-bound work
- Python and Ruby offer threads, but a global interpreter lock serializes CPU-bound execution; use processes for CPU parallelism
- Go and Erlang multiplex lightweight tasks (goroutines, actor processes) onto a small pool of OS threads
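To make the race-condition point concrete, here is a minimal sketch using Node's worker_threads (which share memory only through SharedArrayBuffer), assuming the file runs as compiled JavaScript; the thread and increment counts are arbitrary:

```typescript
import { Worker, isMainThread, workerData } from 'node:worker_threads';

const THREADS = 4;
const INCREMENTS = 100_000;

if (isMainThread) {
  // One Int32 counter in memory shared by every thread.
  const shared = new SharedArrayBuffer(4);
  const counter = new Int32Array(shared);

  const workers = Array.from({ length: THREADS }, () =>
    new Worker(__filename, { workerData: shared })
  );

  Promise.all(
    workers.map(w => new Promise(resolve => w.on('exit', resolve)))
  ).then(() => {
    // With Atomics this always prints 400000; a plain counter[0]++ would lose
    // updates nondeterministically under contention (a race condition).
    console.log('counter =', counter[0]);
  });
} else {
  const counter = new Int32Array(workerData as SharedArrayBuffer);
  for (let i = 0; i < INCREMENTS; i++) {
    Atomics.add(counter, 0, 1); // atomic read-modify-write, no lock required
  }
}
```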
Coroutines and Async/Await:
Coroutines provide concurrency without parallelism—multiple tasks share a single thread, yielding cooperatively at I/O boundaries. This enables handling thousands of concurrent connections with minimal overhead.
┌────────────────────────────────────────────┐
│ Single Thread │
│ │
│ Request 1: ●→→→→○ (waiting for I/O) │
│ ↓ yield │
│ Request 2: ●→→→○ (waiting) │
│ ↓ yield │
│ Request 3: ●→→→→○ │
│ ↓ yield │
│ Request 1: (I/O complete) →→→→● │
│ │
│ ● = executing ○ = blocked/waiting │
└────────────────────────────────────────────┘
Key insight: While coroutines don't provide true CPU parallelism, they dramatically increase throughput for I/O-bound workloads by ensuring the CPU is always doing useful work instead of waiting for I/O.
Example concurrency capacity (rough orders of magnitude):
- OS threads: thousands per process; each thread reserves stack memory (often around 1MB) and adds scheduler overhead
- Coroutines/async tasks: hundreds of thousands per process; each task is a small heap object measured in kilobytes
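The sketch below, with simulateIo as a hypothetical stand-in for a real network or disk call, shows why: thousands of in-flight waits overlap on one thread, so total time is roughly one I/O latency rather than the sum of all of them.

```typescript
// Hypothetical stand-in for a network or disk call that takes ~50ms.
const simulateIo = (id: number): Promise<number> =>
  new Promise(resolve => setTimeout(() => resolve(id), 50));

async function main() {
  const start = Date.now();
  // 10,000 concurrent waits share a single thread; the event loop interleaves
  // them, so this completes in roughly 50ms rather than 10,000 × 50ms.
  const results = await Promise.all(
    Array.from({ length: 10_000 }, (_, i) => simulateIo(i))
  );
  console.log(`${results.length} tasks finished in ${Date.now() - start}ms`);
}

main();
```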
Production systems often combine models: multiple processes (for crash isolation and multi-core utilization) each running async I/O (for high concurrency). Example: Gunicorn spawns multiple worker processes, each running an async event loop in frameworks like FastAPI.
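A rough Node.js analogue of that hybrid setup is sketched below, assuming a recent Node version and a hypothetical fetchFromUpstream dependency: the cluster module forks one process per core, and each worker serves many concurrent requests on its own event loop.

```typescript
import cluster from 'node:cluster';
import { createServer } from 'node:http';
import { cpus } from 'node:os';

if (cluster.isPrimary) {
  // Fork one worker process per CPU core; the OS load-balances incoming connections.
  for (let i = 0; i < cpus().length; i++) cluster.fork();
  cluster.on('exit', () => cluster.fork()); // crash isolation: replace dead workers
} else {
  // Each worker handles many concurrent requests on a single event loop.
  createServer(async (_req, res) => {
    const data = await fetchFromUpstream(); // non-blocking I/O
    res.end(JSON.stringify(data));
  }).listen(8080);
}

// Hypothetical async dependency standing in for a database or service call.
async function fetchFromUpstream(): Promise<{ ok: boolean }> {
  return new Promise(resolve => setTimeout(() => resolve({ ok: true }), 10));
}
```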
At the distributed system level, parallelization takes specific architectural forms. These patterns have been refined through decades of experience at scale.
Scatter-Gather Pattern Deep Dive:
This pattern is ubiquitous in microservices architectures where a single request requires data from multiple services.
Request
│
▼
┌─────────────┐
│ Coordinator │
│ (Scatter) │
└──────┬──────┘
│
┌──────┼──────┬──────┬──────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐
│Svc A ││Svc B ││Svc C ││Svc D ││Svc E │
└──┬───┘└──┬───┘└──┬───┘└──┬───┘└──┬───┘
│ │ │ │ │
└───────┴───────┼───────┴───────┘
│
┌───────▼───────┐
│ Coordinator │
│ (Gather) │
└───────┬───────┘
│
▼
Response
Critical considerations:
- Timeout strategy: what happens when one service is slow?
- Failure handling: if one service fails, does the entire request fail, or do we return partial results?
- Result ordering: results arrive out of order; aggregation must handle this
- Connection management: N parallel calls mean N connections, so connection pooling is needed
```typescript
// Scatter-Gather pattern with timeout and partial results
interface ServiceResult<T> {
  service: string;
  success: boolean;
  data?: T;
  error?: string;
  latencyMs: number;
}

async function scatterGather<T>(
  services: string[],
  fetcher: (service: string) => Promise<T>,
  timeoutMs: number = 200
): Promise<ServiceResult<T>[]> {
  const fetchWithTimeout = async (service: string): Promise<ServiceResult<T>> => {
    const start = Date.now();
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), timeoutMs)
    );
    try {
      const data = await Promise.race([fetcher(service), timeoutPromise]);
      return { service, success: true, data, latencyMs: Date.now() - start };
    } catch (error) {
      return {
        service,
        success: false,
        error: error instanceof Error ? error.message : 'Unknown error',
        latencyMs: Date.now() - start
      };
    }
  };

  // Scatter: Launch all fetches in parallel
  const promises = services.map(fetchWithTimeout);

  // Gather: Wait for all to complete (or timeout)
  const results = await Promise.allSettled(promises);

  // Aggregate results
  return results
    .filter((r): r is PromiseFulfilledResult<ServiceResult<T>> =>
      r.status === 'fulfilled'
    )
    .map(r => r.value);
}

// Usage example: Fetch user dashboard data from multiple microservices
const dashboardData = await scatterGather(
  ['user-service', 'recommendations', 'notifications', 'analytics'],
  async (service) => {
    const response = await fetch(`http://${service}/api/data/${userId}`);
    return response.json();
  },
  150 // 150ms timeout
);

const successfulResults = dashboardData.filter(r => r.success);
console.log(`Got ${successfulResults.length} of 4 services within timeout`);
```

Parallelization is not free. Every parallel system pays coordination costs that reduce the net speedup. Understanding and minimizing this overhead is crucial for effective parallelization.
| Operation | Approximate Latency | Impact on Parallelization |
|---|---|---|
| Create new thread (Java) | ~10-100µs | Significant for short tasks |
| Thread context switch | ~1-10µs | Multiplies with oversubscription |
| Acquire uncontended lock | ~10-100ns | Negligible in isolation |
| Acquire contended lock | ~1µs-10ms | Serial bottleneck under contention |
| Cache line invalidation | ~10-100ns | Compounds for shared mutable state |
| Network round-trip (same DC) | ~0.5-2ms | Dominates for small tasks |
| Network round-trip (cross-region) | ~50-200ms | Often prohibitive for fine-grained work |
Strategies to Minimize Overhead:
Batch Small Tasks — Instead of parallelizing 1,000 tasks of 1ms each (overhead-dominated), batch into 10 tasks of 100ms each (see the sketch after this list).
Lock-Free Data Structures — Use atomic operations and compare-and-swap (CAS) instead of locks where possible. Examples: ConcurrentHashMap in Java, atomic types in C++/Rust.
Thread-Local Storage — Avoid sharing state by giving each thread its own copy. Merge results only at the end.
Work Stealing Over Work Pushing — Let idle workers pull work rather than distributing work upfront. This naturally balances load.
Connection Pooling — For distributed parallelism, reuse network connections rather than establishing new ones per task.
Colocate Data and Compute — Move computation to data rather than moving data to computation. This is the core insight of MapReduce.
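A minimal sketch of the batching strategy from the first item above, with chunk and processBatch as hypothetical helpers; each batch becomes one unit of dispatch, so 1,000 items cost roughly 10 dispatches instead of 1,000:

```typescript
// Group many tiny tasks into batches so each dispatch amortizes its overhead.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical batch worker: stands in for one RPC or DB write per batch.
async function processBatch(batch: number[]): Promise<number[]> {
  return batch.map(n => n * 2);
}

async function run() {
  const tasks = Array.from({ length: 1_000 }, (_, i) => i);
  // 10 batches of 100 processed in parallel: 10 dispatches instead of 1,000.
  const results = await Promise.all(chunk(tasks, 100).map(processBatch));
  console.log(`processed ${results.flat().length} items`);
}

run();
```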
Optimal parallelization is about finding the right granularity—fine enough for good load balancing, coarse enough to amortize coordination costs. There's no universal formula; profile your specific workload to find the sweet spot.
Let's examine how production systems leverage parallelization to achieve massive throughput.
Database Query Parallelization:
Modern databases parallelize query execution across multiple dimensions:
Intra-Query Parallelism: A single query uses multiple threads
Inter-Query Parallelism: Multiple queries execute concurrently
PostgreSQL Parallel Query Example:
```sql
-- PostgreSQL will use parallel workers for this query
SET max_parallel_workers_per_gather = 4;

EXPLAIN ANALYZE
SELECT category, SUM(amount)
FROM transactions
WHERE created_at > '2024-01-01'
GROUP BY category;
-- Output shows: Gather Merge -> Parallel Hash Aggregate -> Parallel Seq Scan
```
PostgreSQL's planner automatically parallelizes when:
- The table is large enough (estimated scan size exceeds min_parallel_table_scan_size, 8MB by default)
- max_parallel_workers_per_gather is greater than zero and parallel workers are available
- The query does not write or lock rows and uses only functions marked PARALLEL SAFE
Parallelization can backfire when applied incorrectly. These anti-patterns waste resources or actually reduce throughput.
The most common anti-pattern is oversubscribing the thread pool. As a sizing heuristic: for CPU-bound work, threads = CPU cores; for I/O-bound work, threads = CPU cores × (1 + wait time / compute time).
Example: if tasks spend 80% of their time waiting on I/O (wait/compute = 4), the optimal pool is roughly cores × 5. Going beyond this adds context-switch overhead for diminishing returns.
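A tiny sketch of that heuristic (the optimalPoolSize helper is ours), computed from measured wait and compute times per task:

```typescript
import { cpus } from 'node:os';

// Pool-size heuristic: threads = cores × (1 + waitTime / computeTime).
function optimalPoolSize(waitMs: number, computeMs: number): number {
  const cores = cpus().length;
  return Math.max(1, Math.round(cores * (1 + waitMs / computeMs)));
}

// A task that waits 80ms on I/O for every 20ms of CPU work (wait/compute = 4):
// on an 8-core machine this suggests 8 × 5 = 40 threads.
console.log(optimalPoolSize(80, 20));
```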
We've explored parallelization as the foundational technique for throughput optimization. Let's consolidate the key insights:
- Amdahl's Law caps speedup: the serial fraction of your workload, not core count, determines the ceiling
- Match the mechanism to the workload: processes for isolation and CPU-bound work, threads for shared-state parallelism, coroutines/async for I/O-bound concurrency
- Coordination overhead (locks, context switches, network round-trips) erodes gains; batch small tasks and minimize shared mutable state
- Distributed patterns like scatter-gather require explicit timeout, partial-failure, and aggregation strategies
What's next:
Parallelization is just one dimension of throughput optimization. The next page explores batching—a complementary technique that amortizes per-operation overhead by grouping multiple operations together, often providing greater throughput gains than parallelization alone for I/O-bound workloads.
You now understand parallelization as a throughput optimization technique—from theoretical foundations (Amdahl's Law) through implementation mechanisms (processes, threads, async) to distributed patterns (scatter-gather, MapReduce) and common pitfalls. Next, we'll examine batching strategies.