Modern servers routinely feature 32, 64, or even 128 CPU cores. Yet a single-threaded query uses just one of these cores, leaving vast computational resources idle. Parallel query execution transforms query processing by distributing work across multiple cores (or even multiple machines), dramatically reducing execution time for suitable workloads.
Parallelism in database systems is not simply about speed—it's about resource utilization. A query that takes 100 seconds on a single core might complete in 10 seconds using 10 cores, delivering a 10× improvement in response time while consuming roughly the same total CPU time. Understanding when, how, and why databases parallelize queries is essential knowledge for database professionals.
By the end of this page, you will understand the types of parallelism in query execution, how data is partitioned for parallel processing, how individual operators are parallelized, the challenges of synchronization and coordination, and how to analyze and tune parallel query performance.
Database systems employ multiple forms of parallelism, each targeting different aspects of query execution. Understanding these types is fundamental to grasping how modern databases achieve high performance.
Inter-Query Parallelism:
The simplest form—multiple independent queries execute simultaneously on different cores. This is handled by the database's connection management and provides natural parallelism for OLTP workloads with many concurrent users.
| Type | What's Parallel | Granularity | Primary Benefit |
|---|---|---|---|
| Inter-Query | Different queries | Query level | Concurrency, OLTP throughput |
| Intra-Query | Parts of same query | Operator/data level | Single query speedup |
| Intra-Operator | Same operator on data partitions | Data partition level | Scan, join, sort speedup |
| Inter-Operator | Different operators simultaneously | Pipeline level | Pipeline throughput |
| Distributed | Different machines | Node level | Scale beyond single machine |
Intra-Query Parallelism:
This is our primary focus—parallelizing a single query to reduce its execution time. Intra-query parallelism can be further divided:
```
INTRA-OPERATOR PARALLELISM (Data Parallelism):
═══════════════════════════════════════════════════════════════
Same operation applied to different data partitions simultaneously

Query: SELECT COUNT(*) FROM large_table WHERE status = 'active'

                    [Gather]
                       │
        ┌──────────────┼──────────────┐
        │              │              │
    [Scan P1]      [Scan P2]      [Scan P3]
    rows 0-1M      rows 1M-2M     rows 2M-3M
        │              │              │
    [Filter]       [Filter]       [Filter]
    [Count]        [Count]        [Count]
        │              │              │
        └──────────────┴──────────────┘
                       │
                  [Final Sum]

3 workers, each scanning 1/3 of data → ~3x speedup


INTER-OPERATOR PARALLELISM (Pipeline Parallelism):
═══════════════════════════════════════════════════════════════
Different operators run simultaneously on different tuples

Time ────────────────────────────────────────────────────►

Worker 1: [Scan]────────────────────────────────────────►
               └─ tuples ─┐
Worker 2:            [Join]────────────────────────────►
                          └─ tuples ─┐
Worker 3:                      [Aggregate]────────────►

Overlapping execution of pipeline stages
```

Intra-query parallelism is most beneficial for OLAP (analytical) workloads with long-running queries over large datasets. OLTP workloads benefit more from inter-query parallelism since individual transactions are short-lived. Many databases automatically choose single-threaded execution for small queries where parallelization overhead exceeds benefit.
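To make the intra-operator picture above concrete, here is a minimal sketch in Python rather than a real engine: the "table" is split into contiguous partitions, each worker filters and counts its own partition, and a final step sums the partial counts. The table contents, partition scheme, and worker count are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical "table": a list of (id, status) rows
rows = [(i, "active" if i % 3 == 0 else "inactive") for i in range(300_000)]

def count_active(partition):
    """One worker: scan its partition, apply the filter, return a partial count."""
    return sum(1 for _, status in partition if status == "active")

def parallel_count(table, workers=3):
    """Split the table into contiguous partitions, count in parallel, sum the partials."""
    chunk = (len(table) + workers - 1) // workers
    partitions = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_active, partitions)  # parallel "Scan + Count"
    return sum(partial_counts)                               # the "Final Sum" step

if __name__ == "__main__":
    print(parallel_count(rows))  # same answer as a single-threaded scan, in ~1/3 the time
```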
For intra-operator parallelism to work, data must be divided among parallel workers. The partitioning strategy determines how data is distributed and significantly impacts parallel efficiency.
Key Partitioning Strategies:
| Strategy | How It Works | Best For | Limitation |
|---|---|---|---|
| Range Partitioning | Divide by key value ranges | Range queries, sorted output | Data skew if key distribution uneven |
| Hash Partitioning | Hash key to determine partition | Point lookups, joins on key | No ordering preserved |
| Round-Robin | Assign rows to partitions cyclically | Even distribution regardless of data | No locality for key-based operations |
| Block-Based | Divide by storage blocks | Sequential scans | No key awareness, so matching keys are not co-located |
```
Original Table: employees (1M rows)
═══════════════════════════════════════════════════════════════

RANGE PARTITIONING (by salary):
─────────────────────────────────────────────────────────────
  Worker 1: salary < 50,000            (~400K rows)
  Worker 2: 50,000 ≤ salary < 80,000   (~350K rows)
  Worker 3: salary ≥ 80,000            (~250K rows)
  ⚠ Skewed: Worker 1 has most data

HASH PARTITIONING (by employee_id % 4):
─────────────────────────────────────────────────────────────
  Worker 1: id % 4 = 0   (~250K rows)
  Worker 2: id % 4 = 1   (~250K rows)
  Worker 3: id % 4 = 2   (~250K rows)
  Worker 4: id % 4 = 3   (~250K rows)
  ✓ Balanced: Even distribution

BLOCK-BASED (by storage):
─────────────────────────────────────────────────────────────
  Worker 1: Pages 0-999         (~333K rows)
  Worker 2: Pages 1000-1999     (~333K rows)
  Worker 3: Pages 2000-2999     (~333K rows)
  ✓ Balanced, I/O friendly
```

Partitioning for Joins:
Join operations require special consideration—matching rows must end up on the same worker. Two approaches handle this: partition (repartition) both inputs on the join key so that matching keys land on the same worker, or broadcast the smaller relation to every worker so that each worker joins it against its own partition of the larger relation.
Data skew is the enemy of parallel efficiency. If 90% of data lands on one worker while others sit idle, you get 10% of potential speedup. Practical systems monitor partition sizes and may dynamically rebalance or adapt strategies when skew is detected.
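The strategies above can be expressed as simple assignment functions. The sketch below is hypothetical (the salary bands and row counts echo the earlier example) and includes a small helper that measures skew by comparing the largest partition to a perfectly even split.

```python
from collections import Counter

def range_partition(salary, boundaries=(50_000, 80_000)):
    """Range partitioning: worker chosen by which salary band the row falls in."""
    for worker, upper in enumerate(boundaries):
        if salary < upper:
            return worker
    return len(boundaries)          # last partition: salary >= 80,000

def hash_partition(employee_id, workers=4):
    """Hash partitioning: worker chosen by hashing the key."""
    return hash(employee_id) % workers

def round_robin_partition(row_number, workers=4):
    """Round-robin: rows dealt out cyclically, ignoring their contents."""
    return row_number % workers

def skew_report(assignments):
    """Compare the largest partition against an even split; high ratios mean skew."""
    sizes = Counter(assignments)
    ideal = len(assignments) / len(sizes)
    return sizes, max(sizes.values()) / ideal

# Example: salaries clustered at the low end produce range-partition skew
salaries = [30_000] * 700 + [60_000] * 200 + [90_000] * 100
sizes, skew = skew_report([range_partition(s) for s in salaries])
print(sizes, f"skew factor {skew:.2f}")   # partition 0 dominates
```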
Different operators have different parallelization strategies. Some are straightforward to parallelize; others require careful coordination.
Parallel Sequential Scan:
The simplest case—divide the table into ranges and have each worker scan its portion:
```
Parallel Sequential Scan: employees (3M pages, 4 workers)
Strategy: Block-range partitioning
═══════════════════════════════════════════════════════════════

  Worker 0: Scans pages 0         - 749,999
  Worker 1: Scans pages 750,000   - 1,499,999
  Worker 2: Scans pages 1,500,000 - 2,249,999
  Worker 3: Scans pages 2,250,000 - 2,999,999

  Each worker:
    1. Opens table for its designated range
    2. Scans pages sequentially (I/O efficient)
    3. Applies filters locally
    4. Sends qualifying tuples to Gather node

  Gather Node:
    • Collects tuples from all workers
    • Preserves partial ordering if needed
    • Passes combined stream to parent

  Speedup: ~4x (minus coordination overhead)
```
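The same block-range split can be sketched in Python. Here `read_page`, `matches`, and the page count are hypothetical stand-ins for the storage layer and the WHERE clause; the point is only that each worker scans a disjoint contiguous range and a final gather step concatenates the qualifying tuples.

```python
from concurrent.futures import ProcessPoolExecutor

TOTAL_PAGES = 3_000   # hypothetical table size (scaled down from the diagram)
WORKERS = 4

def read_page(page_no):
    """Hypothetical storage call: return the tuples stored on one page."""
    return []  # placeholder; a real engine would read from the buffer pool

def matches(tuple_):
    """Placeholder for the query's filter, applied locally by each worker."""
    return True

def scan_range(page_range):
    """One worker: scan its contiguous page range sequentially and filter locally."""
    start, end = page_range
    return [t for p in range(start, end) for t in read_page(p) if matches(t)]

def parallel_seq_scan():
    per_worker = TOTAL_PAGES // WORKERS
    ranges = [(w * per_worker, (w + 1) * per_worker) for w in range(WORKERS)]
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        partials = pool.map(scan_range, ranges)
    return [t for part in partials for t in part]  # the Gather step
```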
Parallel Hash Join:

Hash joins parallelize well but require coordination for the build phase:
```
Parallel Hash Join: large_table ⋈ medium_table

Phase 1: Parallel Hash Build (Shared Hash Table)
═══════════════════════════════════════════════════════════════
  Each worker scans portion of build relation (medium_table)
  All workers insert into SHARED hash table

  Worker 0: Scans rows 0-250K    ──┐
  Worker 1: Scans rows 250K-500K ──┼──► [Shared Hash Table]
  Worker 2: Scans rows 500K-750K ──┤    (concurrent insert
  Worker 3: Scans rows 750K-1M   ──┘     with locking)

  Barrier: All workers wait for build completion

Phase 2: Parallel Probe
═══════════════════════════════════════════════════════════════
  Each worker scans portion of probe relation (large_table)
  All workers probe same shared hash table (read-only - fast!)

  Worker 0: Scans large rows 0-2.5M   ──► Probe hash table
  Worker 1: Scans large rows 2.5M-5M  ──► Probe hash table
  Worker 2: Scans large rows 5M-7.5M  ──► Probe hash table
  Worker 3: Scans large rows 7.5M-10M ──► Probe hash table

  Each worker produces join results independently
  Gather collects all results
```

Parallel Aggregation:
Aggregation requires two phases: partial aggregation (parallel) and final aggregation (single):
```
Parallel Aggregation: SELECT dept, SUM(salary), COUNT(*)
                      FROM employees GROUP BY dept

Phase 1: Parallel Partial Aggregation
═══════════════════════════════════════════════════════════════
  Worker 0: Scans portion → Hash Aggregate → Partial results
            {A: sum=100K, cnt=20}, {B: sum=80K, cnt=15}, ...
  Worker 1: Scans portion → Hash Aggregate → Partial results
            {A: sum=90K, cnt=18}, {B: sum=85K, cnt=17}, ...
  Worker 2: Scans portion → Hash Aggregate → Partial results
            {A: sum=110K, cnt=22}, {B: sum=75K, cnt=14}, ...

Phase 2: Gather + Final Aggregation
═══════════════════════════════════════════════════════════════
  Collect partial results from all workers

  Final Aggregate for group A:
    sum   = 100K + 90K + 110K = 300K
    count = 20 + 18 + 22      = 60

  Final Aggregate for group B:
    sum   = 80K + 85K + 75K = 240K
    count = 15 + 17 + 14    = 46

  Note: Aggregates like SUM, COUNT, MIN, MAX parallelize trivially
        Aggregates like MEDIAN, MODE require different approaches
```

Not all aggregation functions parallelize easily. SUM, COUNT, MIN, MAX can combine partial results. AVG requires separate SUM and COUNT. MEDIAN or percentiles may need complete sorting. Some databases support user-defined aggregate functions with explicit combine semantics for parallelization.
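A minimal sketch of the partial/finalize pattern for SUM, COUNT, and AVG follows; the departments and salaries are made up, and a real engine would run the partial step once per worker over its own partition of the table.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

# Hypothetical rows: (dept, salary)
rows = [("A", 5_000), ("B", 4_000), ("A", 6_000), ("B", 5_500), ("A", 4_500)] * 1_000

def partial_aggregate(partition):
    """Phase 1 (per worker): hash-aggregate only this worker's partition."""
    acc = defaultdict(lambda: [0, 0])          # dept -> [sum, count]
    for dept, salary in partition:
        acc[dept][0] += salary
        acc[dept][1] += 1
    return dict(acc)

def final_aggregate(partials):
    """Phase 2 (single step): combine partial states; SUM and COUNT simply add."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for dept, (s, c) in partial.items():
            total[dept][0] += s
            total[dept][1] += c
    # AVG is derived from the combined SUM and COUNT, never by averaging partial averages
    return {dept: (s, c, s / c) for dept, (s, c) in total.items()}

def parallel_group_by(rows, workers=3):
    chunk = (len(rows) + workers - 1) // workers
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return final_aggregate(pool.map(partial_aggregate, parts))

if __name__ == "__main__":
    print(parallel_group_by(rows))   # {dept: (SUM, COUNT, AVG)}
```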
Sorting is a blocking operation that benefits significantly from parallelism. However, producing a globally sorted result from parallel workers requires careful coordination.
Parallel Sort Strategy:
The most common approach is parallel sort with merge:
```
Parallel Sort: ORDER BY salary for 4M tuples, 4 workers

Phase 1: Parallel Partial Sort
═══════════════════════════════════════════════════════════════
  Each worker sorts its partition independently:

  Worker 0: Sorts 1M tuples → [sorted list 0]
  Worker 1: Sorts 1M tuples → [sorted list 1]
  Worker 2: Sorts 1M tuples → [sorted list 2]
  Worker 3: Sorts 1M tuples → [sorted list 3]

  Each worker completes in ~1/4 the time of sequential sort

Phase 2: Gather Merge
═══════════════════════════════════════════════════════════════
  [Gather Merge] reads from 4 sorted streams

  Worker results (sorted):
    W0: [25K, 30K, 45K, 55K, ...]
    W1: [22K, 35K, 40K, 60K, ...]
    W2: [28K, 32K, 50K, 58K, ...]
    W3: [20K, 38K, 42K, 65K, ...]

  Merge output (globally sorted):
    [20K, 22K, 25K, 28K, 30K, 32K, 35K, ...]

  Uses k-way merge: O(n log k) where k = number of workers

  Total time: O((n/p) log(n/p)) for the parallel partial sorts
              + O(n log p) for the merge,
              much less than a single-threaded O(n log n) sort
```

Parallel Sort Algorithms:
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Local Sort + Merge | Workers sort locally, leader merges | Simple, good locality | Merge bottleneck |
| Range Partition Sort | Partition by ranges, sort partitions | Output already partitioned | Requires sampling, skew risk |
| Sample Sort | Sample to determine partitions, distribute, sort | Handles skew well | Sampling overhead |
| Bitonic Sort | Parallel comparison network | Predictable performance | Higher total work (O(n log² n)); mainly suited to GPUs/hardware |
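The first approach in the table (local sort plus a leader-side merge) is easy to sketch; in the snippet below, Python's heap-based `heapq.merge` stands in for the Gather Merge node, and the data is randomly generated for illustration.

```python
import heapq
import random
from concurrent.futures import ProcessPoolExecutor

def local_sort(partition):
    """Phase 1 (per worker): sort only this worker's partition."""
    return sorted(partition)

def parallel_sort(values, workers=4):
    chunk = (len(values) + workers - 1) // workers
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(local_sort, parts))   # k sorted runs
    # Phase 2 (leader): k-way merge of the sorted runs, O(n log k)
    return list(heapq.merge(*runs))

if __name__ == "__main__":
    data = [random.randint(0, 100_000) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)   # globally sorted output
```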
Top-N Optimization:
For ORDER BY with LIMIT, parallel Top-N avoids full sort:
```
Query: SELECT * FROM log_events ORDER BY timestamp DESC LIMIT 100

Parallel Top-N (much faster than full parallel sort!):
═══════════════════════════════════════════════════════════════
  Worker 0: Maintains heap of top 100 from its partition
  Worker 1: Maintains heap of top 100 from its partition
  Worker 2: Maintains heap of top 100 from its partition
  Worker 3: Maintains heap of top 100 from its partition

  Each worker: O(n log 100) = O(n) effectively

  Gather: Merge 400 candidates (4 × 100)
  Final:  Take top 100 from 400 candidates - trivial!

  Memory: Only 100 tuples per worker, not entire partition

  ✓ No full sort required
  ✓ No disk spill for sort buffers
```

The leader/coordinator that performs final merge can become a bottleneck. Advanced systems may parallelize the merge itself or use hierarchical gathering (pairs of workers merge, then pairs of pairs, etc.) to distribute the merge workload.
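A sketch of the bounded-heap approach, with Python's `heapq.nlargest` playing the role of each worker's "top 100" heap; the event rows and timestamps are made up purely for illustration.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor

N = 100  # LIMIT 100

def worker_top_n(partition):
    """Each worker keeps only its own top-N rows by timestamp, never a full sort."""
    return heapq.nlargest(N, partition, key=lambda row: row["timestamp"])

def parallel_top_n(rows, workers=4):
    chunk = (len(rows) + workers - 1) // workers
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Gather: at most workers * N candidate rows come back to the leader
        candidates = [row for part in pool.map(worker_top_n, parts) for row in part]
    # Final: pick the global top N from the small candidate set
    return heapq.nlargest(N, candidates, key=lambda row: row["timestamp"])

if __name__ == "__main__":
    events = [{"event_id": i, "timestamp": (i * 7919) % 1_000_003} for i in range(500_000)]
    top = parallel_top_n(events)
    print(len(top), top[0]["timestamp"])   # 100 rows, newest timestamp first
```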
Parallel execution introduces coordination challenges that don't exist in single-threaded execution. Proper synchronization ensures correctness without sacrificing too much performance.
Key Synchronization Points:
```
Parallel Hash Join with Barriers:

Timeline (4 workers):
═══════════════════════════════════════════════════════════════

W0: [─── Build portion ───]▓[─────── Probe portion ────────]
W1: [─── Build ───────────]▓[─────── Probe portion ────────]
W2: [─── Build ─────]      ▓[─────── Probe portion ────────]
W3: [─── Build ───────]    ▓[─────── Probe portion ────────]
                           ↑
                        BARRIER
    (all workers wait here until all have finished building)

Why Barrier is Necessary:
─────────────────────────────────────────────────────────────────
  • If W2 starts probing before W1 finishes building
  • W2 might miss hash entries that W1 is still inserting
  • Results would be incorrect - missing join matches!

  Barrier ensures: Hash table complete before any probes
```
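The barrier maps directly onto a thread barrier: no worker begins probing until every worker has finished its share of the build. The tiny build/probe partitions below are hypothetical, and a real engine would use a lock-free or partitioned hash table rather than one coarse lock.

```python
import threading
from collections import defaultdict

WORKERS = 4
build_done = threading.Barrier(WORKERS)   # the barrier from the timeline above
hash_table = defaultdict(list)            # shared build-side hash table
insert_lock = threading.Lock()            # coarse lock; real engines use finer-grained schemes
results = []
results_lock = threading.Lock()

def hash_join_worker(build_part, probe_part):
    # Phase 1 (build): insert this worker's build rows into the shared table
    for key, payload in build_part:
        with insert_lock:
            hash_table[key].append(payload)
    build_done.wait()   # BARRIER: nobody probes until every worker has finished building
    # Phase 2 (probe): read-only lookups against the now-complete hash table
    matches = [(key, build_payload, probe_payload)
               for key, probe_payload in probe_part
               for build_payload in hash_table.get(key, [])]
    with results_lock:
        results.extend(matches)

# Hypothetical partitions of the build and probe relations (join key = i)
build_parts = [[(i, f"build-{i}") for i in range(w, 100, WORKERS)] for w in range(WORKERS)]
probe_parts = [[(i, f"probe-{i}") for i in range(w, 100, WORKERS)] for w in range(WORKERS)]

threads = [threading.Thread(target=hash_join_worker, args=(b, p))
           for b, p in zip(build_parts, probe_parts)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 100: every probe row found its matching build row
```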
Minimizing Synchronization Overhead:

Excessive synchronization kills parallel performance. Effective parallel systems minimize coordination:
| Technique | Description | Example |
|---|---|---|
| Lock-Free Structures | Atomic operations instead of locks | Lock-free hash table inserts |
| Thread-Local Aggregation | Each worker maintains local state, merge at end | Counting with local counters |
| Partitioned Structures | Divide shared structure into non-overlapping regions | Hash table with per-partition locks |
| Batch Communication | Buffer results, send in batches | Gather batches of tuples |
| Work Stealing | Idle workers steal from busy ones | Dynamic load balancing |
Amdahl's Law reminds us that the serial portion of work limits parallel speedup. If 10% of a query must be serial (e.g., gathering results), speedup is capped at 10× regardless of worker count. Minimizing serial portions is critical for scalable parallelism.
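As a quick worked example, Amdahl's Law gives speedup = 1 / (s + (1 - s) / p) for serial fraction s and p workers; with s = 0.10 the curve flattens long before large worker counts.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

for w in (4, 16, 64, 1_000_000):
    print(w, round(amdahl_speedup(0.10, w), 2))
# 4 -> 3.08, 16 -> 6.4, 64 -> 8.77, 1,000,000 -> ~10.0: capped near 10x
```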
When examining execution plans, understanding parallel execution constructs helps interpret what the database is doing. Modern databases provide clear visualization of parallel execution.
Parallel Plan Anatomy (PostgreSQL):
```sql
EXPLAIN (ANALYZE, FORMAT TEXT)
SELECT dept, COUNT(*), AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY dept
ORDER BY COUNT(*) DESC;
```

Key Plan Elements:
| Node | Purpose | What to Look For |
|---|---|---|
| Gather | Collect results from workers | Workers Planned/Launched |
| Gather Merge | Collect and preserve sort order | Indicates parallel sorting |
| Parallel Seq Scan | Parallel table scan | loops = workers + 1 (leader) |
| Parallel Index Scan | Parallel index scan | Range split among workers |
| Partial Aggregate | Worker-local aggregation | Combined by parent Finalize |
| Finalize Aggregate | Combine partial aggregates | Follows Partial Aggregate |
In parallel plans, 'actual time' often shows per-worker time. Multiply by loops or check total time at Gather to assess true wall-clock time. A node with 'actual time=10ms, loops=4' consumed 10ms per worker, but only ~10ms wall-clock if truly parallel.
Effective parallel execution requires balanced work distribution. Imbalanced load means some workers finish early and wait while others remain busy—wasting parallel potential.
Static vs. Dynamic Work Distribution:
```
Morsel-Driven Parallelism (HyPer/Umbra approach):
═══════════════════════════════════════════════════════════════
  Data is divided into "morsels" (small chunks, e.g., 1000 tuples)

  Morsel Queue: [M1][M2][M3][M4][M5][M6][M7][M8]...

Worker Execution:
─────────────────────────────────────────────────────────────────
  Time ─────────────────────────────────────────────────────────►

  W0: [Process M1][ M4 ][M7]                  [done, waiting]
  W1: [ Process M2  ][ M5 ][ M8 ]             [done]
  W2: [ Process M3 ][Process M6][M9]          [still working]
  W3: [slow morsel, stuck on complex data...][finishes][grabs more]

  Benefits:
    ✓ Fast workers naturally do more work
    ✓ Slow morsels don't bottleneck entire execution
    ✓ Worker pool can service multiple queries
    ✓ Cache-friendly: small morsel fits in CPU cache

Morsel Size Choice:
─────────────────────────────────────────────────────────────────
  Too small:  High synchronization overhead (queue contention)
  Too large:  Poor load balancing (approaches static allocation)
  Sweet spot: ~10,000 - 100,000 tuples (depends on workload)
```

Work Stealing:
An advanced technique where idle workers "steal" work from busy workers' queues:
PostgreSQL uses block-range distribution with a shared scan position. Workers atomically claim ranges of table blocks from a shared counter. This provides dynamic balance without a complex work queue—fast workers simply claim more ranges.
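A minimal sketch of that shared-position idea: workers repeatedly claim the next small block range under a lock (a real engine would use an atomic fetch-and-add) until the table is exhausted, so faster workers naturally claim more ranges. The block count and chunk size here are hypothetical.

```python
import threading

TOTAL_BLOCKS = 100_000
CHUNK_BLOCKS = 512                  # blocks handed out per claim (hypothetical)
WORKERS = 4

next_block = 0                      # shared scan position
position_lock = threading.Lock()    # stands in for an atomic fetch-and-add
claimed = [0] * WORKERS             # blocks processed per worker, for illustration

def claim_chunk():
    """Atomically claim the next range of blocks from the shared position."""
    global next_block
    with position_lock:
        start = next_block
        next_block += CHUNK_BLOCKS
    return start, min(start + CHUNK_BLOCKS, TOTAL_BLOCKS)

def worker(worker_id):
    while True:
        start, end = claim_chunk()
        if start >= TOTAL_BLOCKS:
            return                  # nothing left to scan
        # ... scan blocks [start, end) and process their tuples here ...
        claimed[worker_id] += end - start

threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(claimed, sum(claimed) == TOTAL_BLOCKS)   # every block scanned exactly once
```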
Database systems provide numerous configuration options for parallel execution. Understanding these parameters helps optimize for specific workloads.
Key Configuration Parameters (PostgreSQL):
| Parameter | Default | Description |
|---|---|---|
| max_parallel_workers_per_gather | 2 | Max workers per parallel operation |
| max_parallel_workers | 8 | Total parallel workers system-wide |
| max_worker_processes | 8 | Total background workers (includes parallel) |
| parallel_tuple_cost | 0.1 | Cost of passing tuple to leader |
| parallel_setup_cost | 1000 | Cost of starting parallel operation |
| min_parallel_table_scan_size | 8MB | Table must be this large for parallel |
| min_parallel_index_scan_size | 512kB | Index must be this large for parallel |
```sql
-- Increase parallelism for a data warehouse workload
SET max_parallel_workers_per_gather = 8;   -- More workers per query
SET max_parallel_workers = 32;             -- More total workers
SET parallel_tuple_cost = 0.001;           -- Lower to encourage parallelism
SET parallel_setup_cost = 500;             -- Lower to encourage parallelism

-- Verify parallelism is being used
EXPLAIN SELECT COUNT(*) FROM large_table WHERE some_column > 100;
-- Should show "Parallel Seq Scan" with multiple workers

-- For a specific query, force parallelism
SET parallel_leader_participation = off;   -- Leader only gathers
SET force_parallel_mode = on;              -- Force parallel even if marginal

-- Or disable it for a specific query that performs worse in parallel
SET max_parallel_workers_per_gather = 0;   -- No parallelism
SELECT * FROM small_lookup_table;
```

When to Increase Parallelism:
- Analytical or data warehouse workloads with long-running scans, joins, and aggregations over large tables
- Few concurrent queries competing for the same cores, leaving CPU capacity idle
- Storage and memory bandwidth that can keep additional workers busy

When to Limit Parallelism:
- OLTP workloads with many short queries, where inter-query parallelism already keeps the cores busy
- High concurrency, where aggressive per-query parallelism would starve other sessions
- Small tables and cheap queries, where worker startup and tuple-passing overhead exceeds the benefit
In mixed workloads, aggressive parallelism for one query can starve other queries. Use resource manager features (if available) to limit parallel resources by user or query type. Balance per-query speedup against overall system fairness.
Parallel query execution represents the culmination of modern query processing—leveraging multiple cores to dramatically accelerate suitable workloads. Let's consolidate the key concepts:
- Types of parallelism: inter-query, intra-query (intra-operator and inter-operator), and distributed execution
- Data partitioning strategies (range, hash, round-robin, block-based) and the cost of skew
- How scans, hash joins, aggregation, sorting, and Top-N queries are parallelized
- Synchronization: barriers, techniques for minimizing coordination, and Amdahl's Law
- Reading parallel plans (Gather, Gather Merge, Partial and Finalize Aggregate) and tuning the relevant parameters
- Load balancing: static versus dynamic (morsel-driven) work distribution and work stealing
Module Complete:
Congratulations! You've completed the Query Execution Engine module, from the fundamental iterator model through the parallel execution techniques covered on this page.
This knowledge forms the foundation for understanding how databases actually run your queries—essential for performance tuning, system design, and becoming a truly effective database professional.
You've mastered the Query Execution Engine—from the fundamental iterator model through advanced parallel execution. You can now analyze execution plans with confidence, understand why queries perform as they do, and make informed decisions about configuration and optimization. This knowledge is invaluable for database performance tuning and system architecture.