Modern servers routinely feature 32, 64, or even 128 CPU cores. Yet a single-threaded query uses just one of these cores, leaving vast computational resources idle. Parallel query execution transforms query processing by distributing work across multiple cores (or even multiple machines), dramatically reducing execution time for suitable workloads.
Parallelism in database systems is not simply about speed—it's about resource utilization. A query that takes 100 seconds on a single core might complete in 10 seconds using 10 cores, delivering a 10× improvement in response time while consuming roughly the same total CPU time. Understanding when, how, and why databases parallelize queries is essential knowledge for database professionals.
By the end of this page, you will understand the types of parallelism in query execution, how data is partitioned for parallel processing, how individual operators are parallelized, the challenges of synchronization and coordination, and how to analyze and tune parallel query performance.
Database systems employ multiple forms of parallelism, each targeting different aspects of query execution. Understanding these types is fundamental to grasping how modern databases achieve high performance.
Inter-Query Parallelism:
The simplest form—multiple independent queries execute simultaneously on different cores. This is handled by the database's connection management and provides natural parallelism for OLTP workloads with many concurrent users.
| Type | What's Parallel | Granularity | Primary Benefit |
|---|---|---|---|
| Inter-Query | Different queries | Query level | Concurrency, OLTP throughput |
| Intra-Query | Parts of same query | Operator/data level | Single query speedup |
| Intra-Operator | Same operator on data partitions | Data partition level | Scan, join, sort speedup |
| Inter-Operator | Different operators simultaneously | Pipeline level | Pipeline throughput |
| Distributed | Different machines | Node level | Scale beyond single machine |
Intra-Query Parallelism:
This is our primary focus—parallelizing a single query to reduce its execution time. Intra-query parallelism can be further divided:
```
INTRA-OPERATOR PARALLELISM (Data Parallelism):
═══════════════════════════════════════════════════════════════
Same operation applied to different data partitions simultaneously

Query: SELECT COUNT(*) FROM large_table WHERE status = 'active'

                    [Gather]
                       │
        ┌──────────────┼──────────────┐
        │              │              │
    [Scan P1]      [Scan P2]      [Scan P3]
    rows 0-1M      rows 1M-2M     rows 2M-3M
        │              │              │
    [Filter]       [Filter]       [Filter]
    [Count]        [Count]        [Count]
        │              │              │
        └──────────────┴──────────────┘
                       │
                  [Final Sum]

3 workers, each scanning 1/3 of data → ~3x speedup


INTER-OPERATOR PARALLELISM (Pipeline Parallelism):
═══════════════════════════════════════════════════════════════
Different operators run simultaneously on different tuples

Time ────────────────────────────────────────────────────►

Worker 1: [Scan]────────────────────────────────────────►
               └─ tuples ─┐
Worker 2:            [Join]────────────────────────────►
                          └─ tuples ─┐
Worker 3:                      [Aggregate]────────────►

Overlapping execution of pipeline stages
```

Intra-query parallelism is most beneficial for OLAP (analytical) workloads with long-running queries over large datasets. OLTP workloads benefit more from inter-query parallelism since individual transactions are short-lived. Many databases automatically choose single-threaded execution for small queries where parallelization overhead exceeds benefit.
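To make the intra-operator picture above concrete, here is a minimal sketch in Python rather than a real engine: the "table" is split into contiguous partitions, each worker filters and counts its own partition, and a final step sums the partial counts. The table contents, partition scheme, and worker count are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical "table": a list of (id, status) rows
rows = [(i, "active" if i % 3 == 0 else "inactive") for i in range(300_000)]

def count_active(partition):
    """One worker: scan its partition, apply the filter, return a partial count."""
    return sum(1 for _, status in partition if status == "active")

def parallel_count(table, workers=3):
    """Split the table into contiguous partitions, count in parallel, sum the partials."""
    chunk = (len(table) + workers - 1) // workers
    partitions = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_active, partitions)  # parallel "Scan + Count"
    return sum(partial_counts)                               # the "Final Sum" step

if __name__ == "__main__":
    print(parallel_count(rows))  # same answer as a single-threaded scan, in ~1/3 the time
```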
For intra-operator parallelism to work, data must be divided among parallel workers. The partitioning strategy determines how data is distributed and significantly impacts parallel efficiency.
Key Partitioning Strategies:
| Strategy | How It Works | Best For | Limitation |
|---|---|---|---|
| Range Partitioning | Divide by key value ranges | Range queries, sorted output | Data skew if key distribution uneven |
| Hash Partitioning | Hash key to determine partition | Point lookups, joins on key | No ordering preserved |
| Round-Robin | Assign rows to partitions cyclically | Even distribution regardless of data | No locality for key-based operations |
| Block-Based | Divide by storage blocks | Sequential scans | No key awareness, so matching keys are not co-located |
```
Original Table: employees (1M rows)
═══════════════════════════════════════════════════════════════

RANGE PARTITIONING (by salary):
─────────────────────────────────────────────────────────────
  Worker 1: salary < 50,000            (~400K rows)
  Worker 2: 50,000 ≤ salary < 80,000   (~350K rows)
  Worker 3: salary ≥ 80,000            (~250K rows)
  ⚠ Skewed: Worker 1 has most data

HASH PARTITIONING (by employee_id % 4):
─────────────────────────────────────────────────────────────
  Worker 1: id % 4 = 0   (~250K rows)
  Worker 2: id % 4 = 1   (~250K rows)
  Worker 3: id % 4 = 2   (~250K rows)
  Worker 4: id % 4 = 3   (~250K rows)
  ✓ Balanced: Even distribution

BLOCK-BASED (by storage):
─────────────────────────────────────────────────────────────
  Worker 1: Pages 0-999         (~333K rows)
  Worker 2: Pages 1000-1999     (~333K rows)
  Worker 3: Pages 2000-2999     (~333K rows)
  ✓ Balanced, I/O friendly
```

Partitioning for Joins:
Join operations require special consideration—matching rows must end up on the same worker. Two approaches handle this: partition (repartition) both inputs on the join key so that matching keys land on the same worker, or broadcast the smaller relation to every worker so that each worker joins it against its own partition of the larger relation.
Data skew is the enemy of parallel efficiency. If 90% of data lands on one worker while others sit idle, you get 10% of potential speedup. Practical systems monitor partition sizes and may dynamically rebalance or adapt strategies when skew is detected.
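The strategies above can be expressed as simple assignment functions. The sketch below is hypothetical (the salary bands and row counts echo the earlier example) and includes a small helper that measures skew by comparing the largest partition to a perfectly even split.

```python
from collections import Counter

def range_partition(salary, boundaries=(50_000, 80_000)):
    """Range partitioning: worker chosen by which salary band the row falls in."""
    for worker, upper in enumerate(boundaries):
        if salary < upper:
            return worker
    return len(boundaries)          # last partition: salary >= 80,000

def hash_partition(employee_id, workers=4):
    """Hash partitioning: worker chosen by hashing the key."""
    return hash(employee_id) % workers

def round_robin_partition(row_number, workers=4):
    """Round-robin: rows dealt out cyclically, ignoring their contents."""
    return row_number % workers

def skew_report(assignments):
    """Compare the largest partition against an even split; high ratios mean skew."""
    sizes = Counter(assignments)
    ideal = len(assignments) / len(sizes)
    return sizes, max(sizes.values()) / ideal

# Example: salaries clustered at the low end produce range-partition skew
salaries = [30_000] * 700 + [60_000] * 200 + [90_000] * 100
sizes, skew = skew_report([range_partition(s) for s in salaries])
print(sizes, f"skew factor {skew:.2f}")   # partition 0 dominates
```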
Different operators have different parallelization strategies. Some are straightforward to parallelize; others require careful coordination.
Parallel Sequential Scan:
The simplest case—divide the table into ranges and have each worker scan its portion:
```
Parallel Sequential Scan: employees (3M pages, 4 workers)
Strategy: Block-range partitioning
═══════════════════════════════════════════════════════════════

  Worker 0: Scans pages 0         - 749,999
  Worker 1: Scans pages 750,000   - 1,499,999
  Worker 2: Scans pages 1,500,000 - 2,249,999
  Worker 3: Scans pages 2,250,000 - 2,999,999

  Each worker:
    1. Opens table for its designated range
    2. Scans pages sequentially (I/O efficient)
    3. Applies filters locally
    4. Sends qualifying tuples to Gather node

  Gather Node:
    • Collects tuples from all workers
    • Preserves partial ordering if needed
    • Passes combined stream to parent

  Speedup: ~4x (minus coordination overhead)
```
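The same block-range split can be sketched in Python. Here `read_page`, `matches`, and the page count are hypothetical stand-ins for the storage layer and the WHERE clause; the point is only that each worker scans a disjoint contiguous range and a final gather step concatenates the qualifying tuples.

```python
from concurrent.futures import ProcessPoolExecutor

TOTAL_PAGES = 3_000   # hypothetical table size (scaled down from the diagram)
WORKERS = 4

def read_page(page_no):
    """Hypothetical storage call: return the tuples stored on one page."""
    return []  # placeholder; a real engine would read from the buffer pool

def matches(tuple_):
    """Placeholder for the query's filter, applied locally by each worker."""
    return True

def scan_range(page_range):
    """One worker: scan its contiguous page range sequentially and filter locally."""
    start, end = page_range
    return [t for p in range(start, end) for t in read_page(p) if matches(t)]

def parallel_seq_scan():
    per_worker = TOTAL_PAGES // WORKERS
    ranges = [(w * per_worker, (w + 1) * per_worker) for w in range(WORKERS)]
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        partials = pool.map(scan_range, ranges)
    return [t for part in partials for t in part]  # the Gather step
```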
Parallel Hash Join:

Hash joins parallelize well but require coordination for the build phase:
```
Parallel Hash Join: large_table ⋈ medium_table

Phase 1: Parallel Hash Build (Shared Hash Table)
═══════════════════════════════════════════════════════════════
  Each worker scans portion of build relation (medium_table)
  All workers insert into SHARED hash table

  Worker 0: Scans rows 0-250K    ──┐
  Worker 1: Scans rows 250K-500K ──┼──► [Shared Hash Table]
  Worker 2: Scans rows 500K-750K ──┤    (concurrent insert
  Worker 3: Scans rows 750K-1M   ──┘     with locking)

  Barrier: All workers wait for build completion

Phase 2: Parallel Probe
═══════════════════════════════════════════════════════════════
  Each worker scans portion of probe relation (large_table)
  All workers probe same shared hash table (read-only - fast!)

  Worker 0: Scans large rows 0-2.5M   ──► Probe hash table
  Worker 1: Scans large rows 2.5M-5M  ──► Probe hash table
  Worker 2: Scans large rows 5M-7.5M  ──► Probe hash table
  Worker 3: Scans large rows 7.5M-10M ──► Probe hash table

  Each worker produces join results independently
  Gather collects all results
```

Parallel Aggregation:
Aggregation requires two phases: partial aggregation (parallel) and final aggregation (single):
```
Parallel Aggregation: SELECT dept, SUM(salary), COUNT(*)
                      FROM employees GROUP BY dept

Phase 1: Parallel Partial Aggregation
═══════════════════════════════════════════════════════════════
  Worker 0: Scans portion → Hash Aggregate → Partial results
            {A: sum=100K, cnt=20}, {B: sum=80K, cnt=15}, ...
  Worker 1: Scans portion → Hash Aggregate → Partial results
            {A: sum=90K, cnt=18}, {B: sum=85K, cnt=17}, ...
  Worker 2: Scans portion → Hash Aggregate → Partial results
            {A: sum=110K, cnt=22}, {B: sum=75K, cnt=14}, ...

Phase 2: Gather + Final Aggregation
═══════════════════════════════════════════════════════════════
  Collect partial results from all workers

  Final Aggregate for group A:
    sum   = 100K + 90K + 110K = 300K
    count = 20 + 18 + 22      = 60

  Final Aggregate for group B:
    sum   = 80K + 85K + 75K = 240K
    count = 15 + 17 + 14    = 46

  Note: Aggregates like SUM, COUNT, MIN, MAX parallelize trivially
        Aggregates like MEDIAN, MODE require different approaches
```

Not all aggregation functions parallelize easily. SUM, COUNT, MIN, MAX can combine partial results. AVG requires separate SUM and COUNT. MEDIAN or percentiles may need complete sorting. Some databases support user-defined aggregate functions with explicit combine semantics for parallelization.
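A minimal sketch of the partial/finalize pattern for SUM, COUNT, and AVG follows; the departments and salaries are made up, and a real engine would run the partial step once per worker over its own partition of the table.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

# Hypothetical rows: (dept, salary)
rows = [("A", 5_000), ("B", 4_000), ("A", 6_000), ("B", 5_500), ("A", 4_500)] * 1_000

def partial_aggregate(partition):
    """Phase 1 (per worker): hash-aggregate only this worker's partition."""
    acc = defaultdict(lambda: [0, 0])          # dept -> [sum, count]
    for dept, salary in partition:
        acc[dept][0] += salary
        acc[dept][1] += 1
    return dict(acc)

def final_aggregate(partials):
    """Phase 2 (single step): combine partial states; SUM and COUNT simply add."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for dept, (s, c) in partial.items():
            total[dept][0] += s
            total[dept][1] += c
    # AVG is derived from the combined SUM and COUNT, never by averaging partial averages
    return {dept: (s, c, s / c) for dept, (s, c) in total.items()}

def parallel_group_by(rows, workers=3):
    chunk = (len(rows) + workers - 1) // workers
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return final_aggregate(pool.map(partial_aggregate, parts))

if __name__ == "__main__":
    print(parallel_group_by(rows))   # {dept: (SUM, COUNT, AVG)}
```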
Sorting is a blocking operation that benefits significantly from parallelism. However, producing a globally sorted result from parallel workers requires careful coordination.
Parallel Sort Strategy:
The most common approach is parallel sort with merge:
```
Parallel Sort: ORDER BY salary for 4M tuples, 4 workers

Phase 1: Parallel Partial Sort
═══════════════════════════════════════════════════════════════
  Each worker sorts its partition independently:

  Worker 0: Sorts 1M tuples → [sorted list 0]
  Worker 1: Sorts 1M tuples → [sorted list 1]
  Worker 2: Sorts 1M tuples → [sorted list 2]
  Worker 3: Sorts 1M tuples → [sorted list 3]

  Each worker completes in ~1/4 the time of sequential sort

Phase 2: Gather Merge
═══════════════════════════════════════════════════════════════
  [Gather Merge] reads from 4 sorted streams

  Worker results (sorted):
    W0: [25K, 30K, 45K, 55K, ...]
    W1: [22K, 35K, 40K, 60K, ...]
    W2: [28K, 32K, 50K, 58K, ...]
    W3: [20K, 38K, 42K, 65K, ...]

  Merge output (globally sorted):
    [20K, 22K, 25K, 28K, 30K, 32K, 35K, ...]

  Uses k-way merge: O(n log k) where k = number of workers

  Total time: O((n/p) log(n/p)) for the parallel partial sorts
              + O(n log p) for the merge,
              much less than a single-threaded O(n log n) sort
```

Parallel Sort Algorithms:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Local Sort + Merge | Workers sort locally, leader merges | Simple, good locality | Merge bottleneck |
| Range Partition Sort | Partition by ranges, sort partitions | Output already partitioned | Requires sampling, skew risk |
| Sample Sort | Sample to determine partitions, distribute, sort | Handles skew well | Sampling overhead |
| Bitonic Sort | Parallel comparison network | Predictable performance | Higher total work (O(n log² n)); mainly suited to GPUs/hardware |
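The first approach in the table (local sort plus a leader-side merge) is easy to sketch; in the snippet below, Python's heap-based `heapq.merge` stands in for the Gather Merge node, and the data is randomly generated for illustration.

```python
import heapq
import random
from concurrent.futures import ProcessPoolExecutor

def local_sort(partition):
    """Phase 1 (per worker): sort only this worker's partition."""
    return sorted(partition)

def parallel_sort(values, workers=4):
    chunk = (len(values) + workers - 1) // workers
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(local_sort, parts))   # k sorted runs
    # Phase 2 (leader): k-way merge of the sorted runs, O(n log k)
    return list(heapq.merge(*runs))

if __name__ == "__main__":
    data = [random.randint(0, 100_000) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)   # globally sorted output
```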
Top-N Optimization:
For ORDER BY with LIMIT, parallel Top-N avoids full sort:
```
Query: SELECT * FROM log_events ORDER BY timestamp DESC LIMIT 100

Parallel Top-N (much faster than full parallel sort!):
═══════════════════════════════════════════════════════════════
  Worker 0: Maintains heap of top 100 from its partition
  Worker 1: Maintains heap of top 100 from its partition
  Worker 2: Maintains heap of top 100 from its partition
  Worker 3: Maintains heap of top 100 from its partition

  Each worker: O(n log 100) = O(n) effectively

  Gather: Merge 400 candidates (4 × 100)
  Final:  Take top 100 from 400 candidates - trivial!

  Memory: Only 100 tuples per worker, not entire partition

  ✓ No full sort required
  ✓ No disk spill for sort buffers
```

The leader/coordinator that performs final merge can become a bottleneck. Advanced systems may parallelize the merge itself or use hierarchical gathering (pairs of workers merge, then pairs of pairs, etc.) to distribute the merge workload.
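A sketch of the bounded-heap approach, with Python's `heapq.nlargest` playing the role of each worker's "top 100" heap; the event rows and timestamps are made up purely for illustration.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

N = 100  # LIMIT 100

def worker_top_n(partition):
    """Each worker keeps only its own top-N rows by timestamp, never a full sort."""
    return heapq.nlargest(N, partition, key=lambda row: row["timestamp"])

def parallel_top_n(rows, workers=4):
    chunk = (len(rows) + workers - 1) // workers
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Gather: at most workers * N candidate rows come back to the leader
        candidates = [row for part in pool.map(worker_top_n, parts) for row in part]
    # Final: pick the global top N from the small candidate set
    return heapq.nlargest(N, candidates, key=lambda row: row["timestamp"])

if __name__ == "__main__":
    events = [{"event_id": i, "timestamp": (i * 7919) % 1_000_003} for i in range(500_000)]
    top = parallel_top_n(events)
    print(len(top), top[0]["timestamp"])   # 100 rows, newest timestamp first
```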
Parallel execution introduces coordination challenges that don't exist in single-threaded execution. Proper synchronization ensures correctness without sacrificing too much performance.
Key Synchronization Points:
```
Parallel Hash Join with Barriers:

Timeline (4 workers):
═══════════════════════════════════════════════════════════════

W0: [─── Build portion ───]▓[─────── Probe portion ────────]
W1: [─── Build ───────────]▓[─────── Probe portion ────────]
W2: [─── Build ─────]      ▓[─────── Probe portion ────────]
W3: [─── Build ───────]    ▓[─────── Probe portion ────────]
                           ↑
                        BARRIER
    (all workers wait here until all have finished building)

Why Barrier is Necessary:
─────────────────────────────────────────────────────────────────
  • If W2 starts probing before W1 finishes building
  • W2 might miss hash entries that W1 is still inserting
  • Results would be incorrect - missing join matches!

  Barrier ensures: Hash table complete before any probes
```
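The barrier maps directly onto a thread barrier: no worker begins probing until every worker has finished its share of the build. The tiny build/probe partitions below are hypothetical, and a real engine would use a lock-free or partitioned hash table rather than one coarse lock.

```python
import threading
from collections import defaultdict

WORKERS = 4
build_done = threading.Barrier(WORKERS)   # the barrier from the timeline above
hash_table = defaultdict(list)            # shared build-side hash table
insert_lock = threading.Lock()            # coarse lock; real engines use finer-grained schemes
results = []
results_lock = threading.Lock()

def hash_join_worker(build_part, probe_part):
    # Phase 1 (build): insert this worker's build rows into the shared table
    for key, payload in build_part:
        with insert_lock:
            hash_table[key].append(payload)
    build_done.wait()   # BARRIER: nobody probes until every worker has finished building
    # Phase 2 (probe): read-only lookups against the now-complete hash table
    matches = [(key, build_payload, probe_payload)
               for key, probe_payload in probe_part
               for build_payload in hash_table.get(key, [])]
    with results_lock:
        results.extend(matches)

# Hypothetical partitions of the build and probe relations (join key = i)
build_parts = [[(i, f"build-{i}") for i in range(w, 100, WORKERS)] for w in range(WORKERS)]
probe_parts = [[(i, f"probe-{i}") for i in range(w, 100, WORKERS)] for w in range(WORKERS)]

threads = [threading.Thread(target=hash_join_worker, args=(b, p))
           for b, p in zip(build_parts, probe_parts)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 100: every probe row found its matching build row
```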
Minimizing Synchronization Overhead:

Excessive synchronization kills parallel performance. Effective parallel systems minimize coordination:
| Technique | Description | Example |
|---|---|---|
| Lock-Free Structures | Atomic operations instead of locks | Lock-free hash table inserts |
| Thread-Local Aggregation | Each worker maintains local state, merge at end | Counting with local counters |
| Partitioned Structures | Divide shared structure into non-overlapping regions | Hash table with per-partition locks |
| Batch Communication | Buffer results, send in batches | Gather batches of tuples |
| Work Stealing | Idle workers steal from busy ones | Dynamic load balancing |
Amdahl's Law reminds us that the serial portion of work limits parallel speedup. If 10% of a query must be serial (e.g., gathering results), speedup is capped at 10× regardless of worker count. Minimizing serial portions is critical for scalable parallelism.
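As a quick worked example, Amdahl's Law gives speedup = 1 / (s + (1 - s) / p) for serial fraction s and p workers; with s = 0.10 the curve flattens long before large worker counts.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

for w in (4, 16, 64, 1_000_000):
    print(w, round(amdahl_speedup(0.10, w), 2))
# 4 -> 3.08, 16 -> 6.4, 64 -> 8.77, 1,000,000 -> ~10.0: capped near 10x
```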
When examining execution plans, understanding parallel execution constructs helps interpret what the database is doing. Modern databases provide clear visualization of parallel execution.
Parallel Plan Anatomy (PostgreSQL):
```sql
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT dept, COUNT(*), AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY dept
ORDER BY COUNT(*) DESC;
```

Key Plan Elements:
| Node | Purpose | What to Look For |
|---|---|---|
| Gather | Collect results from workers | Workers Planned/Launched |
| Gather Merge | Collect and preserve sort order | Indicates parallel sorting |
| Parallel Seq Scan | Parallel table scan | loops = workers + 1 (leader) |
| Parallel Index Scan | Parallel index scan | Range split among workers |
| Partial Aggregate | Worker-local aggregation | Combined by parent Finalize |
| Finalize Aggregate | Combine partial aggregates | Follows Partial Aggregate |
In parallel plans, 'actual time' often shows per-worker time. Multiply by loops or check total time at Gather to assess true wall-clock time. A node with 'actual time=10ms, loops=4' consumed 10ms per worker, but only ~10ms wall-clock if truly parallel.
Effective parallel execution requires balanced work distribution. Imbalanced load means some workers finish early and wait while others remain busy—wasting parallel potential.
Static vs. Dynamic Work Distribution:
```
Morsel-Driven Parallelism (HyPer/Umbra approach):
═══════════════════════════════════════════════════════════════
  Data is divided into "morsels" (small chunks, e.g., 1000 tuples)

  Morsel Queue: [M1][M2][M3][M4][M5][M6][M7][M8]...

Worker Execution:
─────────────────────────────────────────────────────────────────
  Time ─────────────────────────────────────────────────────────►

  W0: [Process M1][ M4 ][M7]                  [done, waiting]
  W1: [ Process M2  ][ M5 ][ M8 ]             [done]
  W2: [ Process M3 ][Process M6][M9]          [still working]
  W3: [slow morsel, stuck on complex data...][finishes][grabs more]

  Benefits:
    ✓ Fast workers naturally do more work
    ✓ Slow morsels don't bottleneck entire execution
    ✓ Worker pool can service multiple queries
    ✓ Cache-friendly: small morsel fits in CPU cache

Morsel Size Choice:
─────────────────────────────────────────────────────────────────
  Too small:  High synchronization overhead (queue contention)
  Too large:  Poor load balancing (approaches static allocation)
  Sweet spot: ~10,000 - 100,000 tuples (depends on workload)
```

Work Stealing:
An advanced technique where idle workers "steal" work from busy workers' queues:
PostgreSQL uses block-range distribution with a shared scan position. Workers atomically claim ranges of table blocks from a shared counter. This provides dynamic balance without a complex work queue—fast workers simply claim more ranges.
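A minimal sketch of that shared-position idea: workers repeatedly claim the next small block range under a lock (a real engine would use an atomic fetch-and-add) until the table is exhausted, so faster workers naturally claim more ranges. The block count and chunk size here are hypothetical.

```python
import threading

TOTAL_BLOCKS = 100_000
CHUNK_BLOCKS = 512                  # blocks handed out per claim (hypothetical)
WORKERS = 4

next_block = 0                      # shared scan position
position_lock = threading.Lock()    # stands in for an atomic fetch-and-add
claimed = [0] * WORKERS             # blocks processed per worker, for illustration

def claim_chunk():
    """Atomically claim the next range of blocks from the shared position."""
    global next_block
    with position_lock:
        start = next_block
        next_block += CHUNK_BLOCKS
    return start, min(start + CHUNK_BLOCKS, TOTAL_BLOCKS)

def worker(worker_id):
    while True:
        start, end = claim_chunk()
        if start >= TOTAL_BLOCKS:
            return                  # nothing left to scan
        # ... scan blocks [start, end) and process their tuples here ...
        claimed[worker_id] += end - start

threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(claimed, sum(claimed) == TOTAL_BLOCKS)   # every block scanned exactly once
```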
Database systems provide numerous configuration options for parallel execution. Understanding these parameters helps optimize for specific workloads.
Key Configuration Parameters (PostgreSQL):
| Parameter | Default | Description |
|---|---|---|
| max_parallel_workers_per_gather | 2 | Max workers per parallel operation |
| max_parallel_workers | 8 | Total parallel workers system-wide |
| max_worker_processes | 8 | Total background workers (includes parallel) |
| parallel_tuple_cost | 0.1 | Cost of passing tuple to leader |
| parallel_setup_cost | 1000 | Cost of starting parallel operation |
| min_parallel_table_scan_size | 8MB | Table must be this large for parallel |
| min_parallel_index_scan_size | 512kB | Index must be this large for parallel |
```sql
-- Increase parallelism for a data warehouse workload
SET max_parallel_workers_per_gather = 8;   -- More workers per query
SET max_parallel_workers = 32;             -- More total workers
SET parallel_tuple_cost = 0.001;           -- Lower to encourage parallelism
SET parallel_setup_cost = 500;             -- Lower to encourage parallelism

-- Verify parallelism is being used
EXPLAIN SELECT COUNT(*) FROM large_table WHERE some_column > 100;
-- Should show "Parallel Seq Scan" with multiple workers

-- For a specific query, force parallelism
SET parallel_leader_participation = off;   -- Leader only gathers
SET force_parallel_mode = on;              -- Force parallel even if marginal

-- Or disable it for a specific query that performs worse in parallel
SET max_parallel_workers_per_gather = 0;   -- No parallelism
SELECT * FROM small_lookup_table;
```

When to Increase Parallelism:
- Analytical or data warehouse workloads with long-running scans, joins, and aggregations over large tables
- Few concurrent queries competing for the same cores, leaving CPU capacity idle
- Storage and memory bandwidth that can keep additional workers busy

When to Limit Parallelism:
- OLTP workloads with many short queries, where inter-query parallelism already keeps the cores busy
- High concurrency, where aggressive per-query parallelism would starve other sessions
- Small tables and cheap queries, where worker startup and tuple-passing overhead exceeds the benefit
In mixed workloads, aggressive parallelism for one query can starve other queries. Use resource manager features (if available) to limit parallel resources by user or query type. Balance per-query speedup against overall system fairness.
Parallel query execution represents the culmination of modern query processing—leveraging multiple cores to dramatically accelerate suitable workloads. Let's consolidate the key concepts:
- Types of parallelism: inter-query, intra-query (intra-operator and inter-operator), and distributed execution
- Data partitioning strategies (range, hash, round-robin, block-based) and the cost of skew
- How scans, hash joins, aggregation, sorting, and Top-N queries are parallelized
- Synchronization: barriers, techniques for minimizing coordination, and Amdahl's Law
- Reading parallel plans (Gather, Gather Merge, Partial and Finalize Aggregate) and tuning the relevant parameters
- Load balancing: static versus dynamic (morsel-driven) work distribution and work stealing
Module Complete:
Congratulations! You've completed the Query Execution Engine module, from the fundamental iterator model through the parallel execution techniques covered on this page.
This knowledge forms the foundation for understanding how databases actually run your queries—essential for performance tuning, system design, and becoming a truly effective database professional.
You've mastered the Query Execution Engine—from the fundamental iterator model through advanced parallel execution. You can now analyze execution plans with confidence, understand why queries perform as they do, and make informed decisions about configuration and optimization. This knowledge is invaluable for database performance tuning and system architecture.