Imagine a factory assembly line: raw materials enter at one end, pass through successive processing stations, and finished products emerge at the other end. Workers at each station don't wait for all materials to be fully processed by previous stations—they work continuously as materials flow through. This is pipelining, and it's the key to efficient query execution.
Without pipelining, query execution would materialize complete intermediate results at each step—writing millions of temporary tuples to disk, only to read them back for the next operator. With pipelining, tuples flow through the operator tree like water through pipes, with minimal buffering and no unnecessary materialization.
By the end of this page, you will understand how database engines execute operator trees using pipelining. You'll master the classic iterator (Volcano) model, understand blocking operators that break pipelines, explore push-based alternatives, and learn how modern vectorized and compiled execution push pipelining to its limits.
Before diving into pipelining mechanisms, let's understand what we're avoiding: full materialization.
Materialized Execution:
In a materialized execution model, each operator:
Query: SELECT name FROM employees WHERE salary > 100000 ORDER BY name;
Materialized execution:
1. Scan employees → write all 1M rows to temp1
2. Filter(salary > 100K) on temp1 → write 10K rows to temp2
3. Project(name) on temp2 → write 10K names to temp3
4. Sort(name) on temp3 → write sorted names to temp4
5. Return temp4
Total I/O: 1M (scan) + 1M (write temp1) + 1M (read temp1) + 6 × 10K (write/read temp2–temp4) ≈ 3M reads/writes
This is catastrophically inefficient—we're writing and reading data multiple times that could have been processed in a single pass.
Pipelined Execution:
In pipelined execution, operators pass tuples directly to one another without intermediate materialization:
Pipelined execution:
1. Scan next row from employees
2. Apply filter: salary > 100K?
3. If passes, project: extract name
4. Pass to sort buffer (sort is blocking, but buffers only 10K qualifying rows)
5. After scan complete, sort outputs sorted names
Total I/O: 1M (scan) + ~20K (sort, if spills to disk) = ~1M
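The pipelined flow above can be sketched with Python generators, where each operator lazily pulls one row at a time from its child and nothing is materialized until the blocking sort (illustrative data and operator names, not from any specific engine):

```python
# Each stage is a generator: rows flow through one at a time,
# with no intermediate table ever written out.
def table_scan(rows):
    for row in rows:
        yield row

def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, column):
    for row in child:
        yield row[column]

# Hypothetical employees table as (name, salary) dicts.
employees = [
    {"name": "Ada", "salary": 150_000},
    {"name": "Bob", "salary": 80_000},
    {"name": "Cyd", "salary": 120_000},
]

pipeline = project(
    filter_op(table_scan(employees), lambda r: r["salary"] > 100_000),
    "name",
)

# sorted() is the blocking step: it drains the (already filtered)
# pipeline into a buffer before emitting ordered output.
result = sorted(pipeline)
print(result)  # ['Ada', 'Cyd']
```

Note that the sort buffer holds only the qualifying rows, exactly as in step 4 above.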
Benefits:
| Aspect | Materialized | Pipelined |
|---|---|---|
| Intermediate storage | Required for every operator | Minimal (in-memory buffers) |
| Memory usage | Proportional to largest intermediate | Proportional to tuple size |
| First tuple latency | After all operators complete | After first tuple passes all operators |
| I/O operations | O(sum of intermediate sizes) | O(input + output + blocking) |
| Cache efficiency | Poor (data evicted between operators) | Excellent (data in cache) |
| Implementation | Simple | Requires coordination protocol |
Despite its inefficiency, materialization is sometimes necessary: blocking operators (sort, hash build) must buffer their entire input, memory pressure can force intermediate results to spill to disk, and a result that is consumed multiple times (a rescanned subquery, for instance) is cheaper to compute once and store.
Modern systems use selective materialization—materialize only when necessary, pipeline everywhere else.
The Iterator Model, pioneered in Goetz Graefe's Volcano system, is the classic approach to pipelined query execution. It's elegant, simple, and remains influential in modern systems.
Core Concept:
Each operator implements a simple interface:
Interface Iterator:
open() → Initialize operator, open child iterators
next() → Return next tuple, or end-of-stream marker
close() → Clean up resources, close child iterators
Operators are composed into a tree. The root operator's next() is called repeatedly, which recursively pulls tuples up through the tree.
Execution Flow:
1. The client calls root.next() repeatedly.
2. Each operator calls child.next() to get its input tuple, so control flows down the tree.
3. Children call their own children's next() recursively, and tuples flow back up through the returns.
4. Execution ends when the root's next() returns the end-of-stream marker.
```
// Base iterator interface
interface Iterator:
    open(): void
    next(): Tuple | END_OF_STREAM
    close(): void

// Table scan operator
class TableScan implements Iterator:
    table: Table
    cursor: Cursor

    open():
        cursor = table.openScan()

    next() -> Tuple | END_OF_STREAM:
        if cursor.hasNext():
            return cursor.next()
        else:
            return END_OF_STREAM

    close():
        cursor.close()

// Filter (selection) operator
class Filter implements Iterator:
    child: Iterator
    predicate: Expression

    open():
        child.open()

    next() -> Tuple | END_OF_STREAM:
        while true:
            tuple = child.next()
            if tuple == END_OF_STREAM:
                return END_OF_STREAM
            if predicate.evaluate(tuple):
                return tuple
            // else: tuple filtered out, get next

    close():
        child.close()

// Project operator
class Project implements Iterator:
    child: Iterator
    columns: List<Column>

    open():
        child.open()

    next() -> Tuple | END_OF_STREAM:
        tuple = child.next()
        if tuple == END_OF_STREAM:
            return END_OF_STREAM
        return extractColumns(tuple, columns)

    close():
        child.close()
```

Composability:
The beauty of the iterator model is composability. Any operator can be connected to any other, as long as schemas match:
Query: SELECT name FROM employees WHERE salary > 100000
Plan tree:
Project(name)
Filter(salary > 100K)
TableScan(employees)
Execution:
project.next() calls filter.next() calls scan.next()
Tuple flows up: scan → filter → project → client
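A minimal runnable version of this pull loop in Python, with END_OF_STREAM as a sentinel object; the class names mirror the pseudocode above, and the sample rows are illustrative:

```python
END_OF_STREAM = object()  # sentinel marking end of input

class TableScan:
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            return row
        return END_OF_STREAM
    def close(self):
        pass

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        # Keep pulling until a tuple passes or input is exhausted.
        while True:
            t = self.child.next()
            if t is END_OF_STREAM or self.predicate(t):
                return t
    def close(self):
        self.child.close()

class Project:
    def __init__(self, child, column):
        self.child, self.column = child, column
    def open(self):
        self.child.open()
    def next(self):
        t = self.child.next()
        return t if t is END_OF_STREAM else t[self.column]
    def close(self):
        self.child.close()

# Compose: Project(name) <- Filter(salary > 100K) <- TableScan(employees)
plan = Project(
    Filter(TableScan([("Ada", 150_000), ("Bob", 80_000)]),
           lambda t: t[1] > 100_000),
    0)

plan.open()
out = []
while (t := plan.next()) is not END_OF_STREAM:
    out.append(t)
plan.close()
print(out)  # ['Ada']
```

The driver loop at the bottom is exactly the "call root.next() until end-of-stream" protocol.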
Advantages: a single uniform interface for every operator, arbitrary composability, a memory footprint of one tuple per operator, and demand-driven execution — a LIMIT simply stops calling next().
The model is named after the Volcano query processing system developed by Goetz Graefe in the early 1990s. It's also called the 'pull-based' model because parent operators pull tuples from children, or the 'iterator model' because of its next() interface. Most classic RDBMS implementations (PostgreSQL, MySQL) use this model or close variants.
Not all operators can produce output as soon as they receive input. Blocking operators must consume all (or a significant portion) of their input before producing any output. These operators break the pipeline, requiring materialization of their input.
Common Blocking Operators:
1. Sort: Must see all tuples to determine ordering
2. Hash Aggregation: Must build complete hash table before output
3. Hash Join (build side): Must complete hash table before probing
4. Set Operations (INTERSECT, EXCEPT): Must buffer one side
| Operator | Blocking Behavior | Pipeline Status |
|---|---|---|
| Table Scan | Non-blocking | Fully pipelined ✓ |
| Index Scan | Non-blocking | Fully pipelined ✓ |
| Filter (Selection) | Non-blocking | Fully pipelined ✓ |
| Projection | Non-blocking | Fully pipelined ✓ |
| Nested Loop Join | Non-blocking (inner side rescanned per outer tuple) | Pipelined (outer side) ✓ |
| Sort-Merge Join | Blocking | Pipeline breaker ✗ |
| Hash Join | Partially blocking | Build blocks, probe streams |
| Sort | Fully blocking | Pipeline breaker ✗ |
| Hash Aggregation | Fully blocking | Pipeline breaker ✗ |
| DISTINCT (hash) | Non-blocking | Streams first occurrences ✓ |
| LIMIT | Non-blocking | Fully pipelined ✓ |
| UNION ALL | Non-blocking | Fully pipelined ✓ |
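The streaming/blocking distinction in the table is easy to observe with Python generators: a LIMIT stops pulling from its child after a handful of rows, while a sort must drain the entire input before it can emit anything (a small counting sketch):

```python
pulled = 0  # counts how many rows the scan actually produced

def scan(rows):
    global pulled
    for row in rows:
        pulled += 1
        yield row

rows = list(range(1_000))

# LIMIT 5 is fully pipelined: it pulls exactly 5 rows, then stops.
pulled = 0
first_five = [r for _, r in zip(range(5), scan(rows))]
assert pulled == 5            # only 5 rows ever scanned

# Sort is a pipeline breaker: sorted() drains the whole scan
# before emitting even its first output row.
pulled = 0
ordered = sorted(scan(rows), reverse=True)
assert pulled == 1_000        # every row scanned before any output
```

The same early-termination property is why LIMIT over a pipelined plan can be dramatically cheaper than LIMIT over a sorted one.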
Pipeline Breakers:
When a blocking operator appears in a plan, it creates a pipeline breaker—the point where tuples must be materialized (at least in memory). The query plan naturally divides into pipeline segments separated by breakers:
Query: SELECT dept, SUM(salary) FROM employees
WHERE status='active' GROUP BY dept ORDER BY dept;
Plan:
Sort(dept) ← Pipeline breaker #2
HashAggregate(dept) ← Pipeline breaker #1
Filter(status='active')
TableScan(employees)
Pipeline segments:
Segment 1: Scan → Filter → HashAggregate (build)
[Materialization: hash table]
Segment 2: HashAggregate (output) → Sort (input)
[Materialization: sort buffer]
Segment 3: Sort (output) → Client
Pipeline breakers are where memory management becomes critical. A hash aggregation with millions of groups may not fit in memory, requiring spilling strategies. Query optimizers estimate intermediate sizes to allocate memory budgets and choose between in-memory and external algorithms. Memory pressure at one breaker can force spilling, dramatically increasing I/O cost.
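The segment structure above can be mirrored directly in Python (illustrative data; the hash table and the sort buffer are the two materialization points):

```python
from collections import defaultdict

# Hypothetical employees rows: (dept, salary, status)
employees = [("eng", 100, "active"), ("eng", 120, "active"),
             ("hr", 90, "inactive"), ("hr", 95, "active")]

# Segment 1: Scan -> Filter -> HashAggregate (build).
# Tuples stream straight into the hash table; nothing else is buffered.
sums = defaultdict(int)
for dept, salary, status in employees:
    if status == "active":      # pipelined filter
        sums[dept] += salary    # breaker #1: hash table materializes here

# Segment 2: HashAggregate (output) -> Sort (input).
# Breaker #2: the sort buffer holds one row per group, not per input row.
result = sorted(sums.items())

# Segment 3: sorted rows stream to the client.
print(result)  # [('eng', 220), ('hr', 95)]
```

Note how the buffers shrink as data moves up the plan: millions of input rows, but only one hash-table entry and one sort-buffer row per department.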
The iterator model is a pull-based approach: parents pull tuples from children. An alternative is the push-based model where children push tuples to parents.
Pull-Based (Iterator/Volcano):
Control flow: Top → Down (next() calls)
Data flow: Bottom → Up (tuples returned)
parent.next():
tuple = child.next() // pull from child
process(tuple)
return result
Push-Based:
Control flow: Bottom → Up (produce() calls)
Data flow: Bottom → Up (tuples passed)
child.produce():
while hasNext():
tuple = getNextTuple()
parent.consume(tuple) // push to parent
```
// Push-based operator interface
interface PushOperator:
    produce(): void          // Start producing tuples
    consume(tuple): void     // Receive tuple from child

// Table scan in push model
class TableScanPush implements PushOperator:
    table: Table
    parent: PushOperator

    produce():
        for page in table.pages:
            for tuple in page.tuples:
                parent.consume(tuple)    // Push to parent
        parent.consume(END_OF_STREAM)

// Filter in push model
class FilterPush implements PushOperator:
    predicate: Expression
    parent: PushOperator
    child: PushOperator

    produce():
        child.produce()    // Tell child to start pushing

    consume(tuple):
        if tuple == END_OF_STREAM:
            parent.consume(END_OF_STREAM)
        elif predicate.evaluate(tuple):
            parent.consume(tuple)    // Push passing tuples
        // else: silently drop filtered tuples

// Execution starts from the bottom
scan.produce()    // Initiates data flow

// Push model advantages:
// - Fewer function calls (no return, just forward call)
// - Better inlining opportunities for JIT
// - Natural for producer-consumer parallelism
```

Modern systems often use hybrid models:
For example, within a scan-filter-project pipeline, push gives tight loops. But pull is used to coordinate with a blocking sort downstream.
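A minimal push-based sketch in Python, mirroring the pseudocode above: control flow starts at the leaf scan, tuples are pushed upward, and END is a sentinel (all names illustrative):

```python
END = object()  # end-of-stream sentinel

class Output:
    """Root sink: collects whatever is pushed into it."""
    def __init__(self):
        self.rows = []
    def consume(self, t):
        if t is not END:
            self.rows.append(t)

class FilterPush:
    def __init__(self, predicate, parent):
        self.predicate, self.parent = predicate, parent
    def consume(self, t):
        # Forward end-of-stream; push only passing tuples upward.
        if t is END or self.predicate(t):
            self.parent.consume(t)

class ScanPush:
    def __init__(self, rows, parent):
        self.rows, self.parent = rows, parent
    def produce(self):
        for t in self.rows:
            self.parent.consume(t)    # push each tuple to parent
        self.parent.consume(END)

sink = Output()
scan = ScanPush(range(10), FilterPush(lambda x: x % 2 == 0, sink))
scan.produce()      # execution starts at the leaf, not the root
print(sink.rows)    # [0, 2, 4, 6, 8]
```

Contrast with the pull model: here the scan's loop drives the whole pipeline, and the filter is just a forwarded call inside that loop.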
The traditional iterator model processes one tuple at a time. This is elegant but incurs significant overhead: a virtual function call per tuple per operator, poor instruction- and data-cache locality as control bounces between operators, and no opportunity for SIMD, since each call handles a single value.
Vectorized execution addresses this by processing batches of tuples (vectors) at a time, typically 1,000-10,000 tuples per batch.
Key Changes:
```
// Vectorized iterator interface
interface VectorizedIterator:
    next() -> Batch | END_OF_STREAM    // Returns a batch of tuples

class Batch:
    size: int               // Number of tuples in batch
    columns: Column[]       // Column-oriented data
    selection: BitVector    // Which rows are active (after filter)

// Vectorized filter
class VectorizedFilter implements VectorizedIterator:
    child: VectorizedIterator
    predicate: CompiledExpression

    next() -> Batch:
        batch = child.next()
        if batch == END_OF_STREAM:
            return END_OF_STREAM
        // Evaluate predicate on an entire column at once using SIMD
        // Example: salary > 100000
        // Input:  salary    = [50K, 120K, 80K, 150K, 90K, 200K, 70K, 110K]
        // Output: selection = [0,   1,    0,   1,    0,   1,    0,   1]
        batch.selection = simd_compare_gt(batch.columns['salary'], 100000)
        return batch

// SIMD comparison (conceptual)
simd_compare_gt(column, threshold) -> BitVector:
    result = new BitVector(column.length)
    // Process 8 values at a time with AVX-512
    for i = 0 to column.length step 8:
        vec = simd_load_8_int64(column, i)
        mask = simd_cmp_gt_8_int64(vec, threshold_vec)
        simd_store_8_bits(result, i, mask)
    return result
```

Selection Vectors:
Instead of physically removing filtered tuples (expensive copying), vectorized engines use selection vectors—bitmaps indicating which tuples are "active". Downstream operators only process selected tuples.
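In plain Python the idea looks like this — a list of booleans stands in for the hardware bitmask, and the tight per-column loop is the pattern a real engine would evaluate with SIMD, 8 or more values per instruction:

```python
# A batch holds columns plus a selection bitmap; the filter never
# copies rows, it only flips bits.
salary = [50_000, 120_000, 80_000, 150_000, 90_000, 200_000]
name   = ["a", "b", "c", "d", "e", "f"]
selection = [True] * len(salary)

# Vectorized filter: a tight loop over a single column.
for i in range(len(salary)):
    selection[i] = selection[i] and salary[i] > 100_000

# Downstream projection touches only selected rows.
out = [name[i] for i in range(len(name)) if selection[i]]
print(out)  # ['b', 'd', 'f']
```

A second filter would AND its result into the same bitmap, so tuples are "removed" without ever moving any data.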
Benefits of Vectorized Execution: interface-call overhead is amortized over thousands of tuples, tight per-column loops enable SIMD and compiler auto-vectorization, and batches sized to fit the CPU cache keep data hot between operators.
Real-World Impact:
Systems like DuckDB, ClickHouse, and Velox achieve 10-100x speedup over tuple-at-a-time processing for analytical workloads through vectorization.
Batch size is a critical tuning parameter: too small, and per-batch dispatch overhead dominates; too large, and the batch no longer fits in the CPU cache, losing the locality benefit.
Optimal batch size is typically 1,000-4,000 tuples, depending on tuple width and cache hierarchy. Some systems adaptively tune batch size based on query characteristics.
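The amortization itself is easy to quantify: with N tuples and batch size B, the engine crosses the operator boundary N times in the tuple-at-a-time model but only about N/B times when batched. A counting sketch (the function names are illustrative):

```python
calls = 0  # counts operator-boundary crossings

def next_tuple(data, pos):
    # One call per tuple: the Volcano-style interface.
    global calls
    calls += 1
    return data[pos]

def next_batch(data, pos, batch_size):
    # One call per batch amortizes overhead across batch_size tuples.
    global calls
    calls += 1
    return data[pos:pos + batch_size]

data = list(range(100_000))

calls = 0
total_tuple = sum(next_tuple(data, i) for i in range(len(data)))
tuple_calls = calls    # 100,000 boundary crossings

calls = 0
total_batch = 0
for i in range(0, len(data), 1_000):
    total_batch += sum(next_batch(data, i, 1_000))
batch_calls = calls    # 100 boundary crossings

assert total_tuple == total_batch   # same answer, 1000x fewer calls
```

In a real engine each crossing is a virtual call plus cache disruption, so the 1000x reduction in crossings translates into a large constant-factor speedup.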
The ultimate form of query execution optimization is query compilation: generating specialized machine code for each query, eliminating interpretation overhead entirely.
The Problem with Interpretation:
Even vectorized execution involves interpretation: the engine still dispatches between operators for every batch, expressions are evaluated by walking an expression tree, and intermediate result vectors must be materialized between primitives.
Compiled Execution:
Instead of interpreting operations, compile the entire query into native code:
Query: SELECT name FROM employees WHERE salary > 100000
Compiled to machine code equivalent of:
for (page : employees.pages) {
for (i = 0; i < page.count; i++) {
if (page.salary[i] > 100000) {
output(page.name[i]);
}
}
}
No virtual function calls, no operator abstraction—just tight, optimized loops.
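The same idea can be demonstrated in miniature with Python's own compile/exec: generate query-specific source with the predicate constant inlined, compile it once, then run the resulting tight loop. This is a toy stand-in for LLVM-based code generation, and all names here are illustrative:

```python
def compile_query(column, threshold, out_column):
    # "Code generation": emit source specialized to this exact query.
    src = f"""
def run(pages):
    out = []
    for page in pages:
        col = page[{column!r}]
        names = page[{out_column!r}]
        for i in range(len(col)):
            if col[i] > {threshold}:   # predicate inlined as a constant
                out.append(names[i])
    return out
"""
    namespace = {}
    # "JIT": compile the generated source once, reuse the function many times.
    exec(compile(src, "<query>", "exec"), namespace)
    return namespace["run"]

# Hypothetical column-oriented pages.
pages = [{"salary": [50_000, 120_000], "name": ["a", "b"]},
         {"salary": [150_000, 90_000], "name": ["c", "d"]}]

run = compile_query("salary", 100_000, "name")
print(run(pages))  # ['b', 'c']
```

The generated `run` has no operator objects and no per-tuple dispatch — the whole scan-filter-project pipeline collapsed into one loop, which is exactly what a compiling engine does at the machine-code level.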
```
// Query compilation pipeline
COMPILE_QUERY(query_plan):
    // Step 1: Generate intermediate representation (IR)
    ir = generate_LLVM_IR(query_plan)

    // Step 2: Optimize IR
    optimized_ir = llvm_optimize(ir)

    // Step 3: Compile to machine code
    machine_code = llvm_compile(optimized_ir)

    // Step 4: Return callable function
    return machine_code

// Example: Compiling a scan-filter-project pipeline
generate_for_pipeline(scan, filter, project):
    emit_code("""
        // Outer loop: iterate pages
        for (page_idx = 0; page_idx < num_pages; page_idx++) {
            page = load_page(table, page_idx);
            // Inner loop: iterate tuples in page
            for (i = 0; i < page.tuple_count; i++) {
                // Inlined filter evaluation
                salary = page.salary_column[i];
                if (salary > 100000) {    // Constant folded
                    // Inlined projection
                    result = page.name_column[i];
                    output_buffer.append(result);
                }
            }
        }
    """)

// The generated code is as fast as hand-written C
// with full optimization: inlining, constant folding,
// register allocation, vectorization by the C compiler
```

Compilation Strategies:
1. Full Query Compilation: compile each pipeline of the plan into a single specialized function (the HyPer approach). Fastest execution, but the highest compilation latency.
2. Operator-at-a-Time Compilation: compile each operator separately and link them together. Cheaper to compile and easier to debug, but operator boundaries and their call overhead remain.
3. Hybrid Vectorized + Compiled: run a vectorized interpreter and JIT-compile only hot expressions or pipelines, trading a little peak throughput for low startup cost.
Compilation Overhead:
Query compilation takes time (10ms-100ms). For short-running queries, compilation overhead may exceed execution time. Solutions: cache compiled code for repeated query shapes, begin executing in an interpreter and switch to compiled code once it's ready (adaptive execution), and compile only queries whose estimated cost justifies the upfront expense.
LLVM (Low-Level Virtual Machine) made query compilation practical by providing: a well-documented intermediate representation to target, a mature library of optimization passes (inlining, constant folding, loop vectorization), and JIT backends that emit native code for many CPU architectures.
Before LLVM, database teams would have to implement their own code generators—a massive undertaking. Now, databases like PostgreSQL can add JIT compilation with moderate engineering effort.
Modern multi-core CPUs offer massive parallelism, but exploiting it in query execution is challenging. Pipeline parallelism and morsel-driven execution are techniques that scale query execution across many cores.
Traditional Parallelism Approaches:
1. Inter-Query Parallelism: run different queries on different cores. Improves throughput, but does nothing for the latency of a single query.
2. Inter-Operator Parallelism: run different operators of one plan on different threads (e.g., a scan feeding a join). Limited by the plan's shape and prone to load imbalance.
3. Intra-Operator Parallelism: partition the data so many threads run the same operator on different chunks. This is the approach that scales, and the one morsel-driven execution generalizes.
Morsel-Driven Parallelism:
The HyPer database introduced morsel-driven execution—a sophisticated approach that achieves excellent parallel scalability:
```
// Morsel-driven parallel execution
MORSEL_DRIVEN_EXECUTE(query_plan, num_workers):
    // Divide input into small chunks called "morsels" (e.g., 100K tuples each)
    morsels = partition_input(query_plan.input, MORSEL_SIZE)
    morsel_queue = create_work_queue(morsels)

    // Create shared pipeline state (e.g., hash table for join)
    shared_state = create_shared_state(query_plan)

    // Launch worker threads
    results = []
    parallel_for worker_id in range(num_workers):
        while true:
            morsel = morsel_queue.try_dequeue()
            if morsel == null:
                break    // No more work

            // Process morsel through entire pipeline
            local_result = process_pipeline(morsel, query_plan, shared_state)

            // Merge local results
            synchronized:
                results.append(local_result)

    return merge(results)

// Key insight: Each worker processes the complete pipeline on its morsel
// No inter-thread communication during morsel processing
// Thread-local state + lock-free data structures minimize contention

// Hash join with morsel-driven execution:
BUILD_PHASE(build_morsels, shared_hash_table):
    parallel_for morsel in build_morsels:
        for tuple in morsel:
            bucket = hash(tuple.key) % num_buckets
            lock(bucket)    // Fine-grained locking
            bucket.append(tuple)
            unlock(bucket)

PROBE_PHASE(probe_morsels, shared_hash_table):
    parallel_for morsel in probe_morsels:
        local_results = []
        for tuple in morsel:
            bucket = hash(tuple.key) % num_buckets
            // No lock needed during probe (read-only)
            for match in bucket:
                if match.key == tuple.key:
                    local_results.append(join(tuple, match))
        output(local_results)
```

Key Properties of Morsel-Driven Execution:
Full Pipeline Processing: Each morsel is processed through the entire pipeline, maximizing cache locality.
Elasticity: Workers can be dynamically added/removed; morsel queue provides natural load balancing.
NUMA-Awareness: Morsels can be assigned to workers on the same NUMA node as the data.
Minimal Synchronization: Most work is thread-local; shared state uses fine-grained locking.
Tail Latency Control: Small morsel size prevents single slow morsel from blocking completion.
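These properties can be seen in a compact Python sketch of the morsel loop: threads pull fixed-size chunks from a shared work queue, run the whole pipeline with thread-local state, and synchronize only briefly to merge (morsel size and worker count are illustrative):

```python
import queue
import threading

MORSEL_SIZE = 1_000
data = list(range(10_000))

# Work queue of morsels: each worker pulls a chunk and runs the entire
# scan -> filter -> aggregate pipeline on it.
morsels = queue.Queue()
for i in range(0, len(data), MORSEL_SIZE):
    morsels.put(data[i:i + MORSEL_SIZE])

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            morsel = morsels.get_nowait()
        except queue.Empty:
            return                      # no more work: worker exits
        # Thread-local pipeline: no shared state touched here.
        local = sum(x for x in morsel if x % 2 == 0)
        with lock:                      # brief sync only to merge
            results.append(local)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)   # same answer regardless of worker count
```

Because workers grab morsels on demand, a fast thread simply processes more chunks than a slow one — the queue itself is the load balancer.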
Scaling Results:
Well-implemented morsel-driven engines achieve near-linear scaling up to core counts in the hundreds, processing billions of rows per second.
On multi-socket servers, Non-Uniform Memory Access (NUMA) means memory access time depends on which CPU accesses which memory region. Cross-socket memory access can be 2-3x slower. NUMA-aware query execution allocates table partitions on specific NUMA nodes, schedules workers on the socket that holds their data, and keeps morsel processing node-local whenever possible.
Ignoring NUMA can cost 30-50% performance on large servers.
Pipelining is the technique that transforms database query execution from sequential materialization into efficient, streaming data flow. To consolidate: the iterator (Volcano) model pulls tuples one at a time through a composable operator tree; blocking operators such as sort and hash aggregation break the plan into pipeline segments with materialization at the breakers; push-based execution inverts control flow for tighter loops; vectorization amortizes per-tuple overhead across batches and unlocks SIMD; query compilation removes interpretation entirely; and morsel-driven parallelism scales pipelines across cores with minimal synchronization.
Module Complete:
This concludes our exploration of physical operators. From basic access methods through join algorithms to aggregation, set operations, sorting, and pipelining, you now understand how database systems transform abstract query plans into efficient physical execution. These techniques represent decades of database research and engineering, enabling the remarkable performance of modern database systems.
Congratulations! You've completed the module on Other Physical Operators. You now understand the full executor toolkit: aggregation, duplicate elimination, set operations, sorting, and the pipelining techniques that make it all work together efficiently. This knowledge is essential for understanding query performance, reading execution plans, and designing efficient database applications.