We've journeyed through query input, parsing, translation, and optimization. Now comes the moment of truth: execution. The carefully crafted execution plan must actually retrieve, process, and return data to the user.
The query executor (or query engine) is the runtime component that takes the physical execution plan and orchestrates all the machinery needed to produce results: reading from storage, applying operators, managing memory, and streaming results back to the client.
This is where abstract planning meets concrete computation—where I/O happens, CPU cycles burn, and data finally flows from disk to user.
By the end of this page, you will understand the iterator (Volcano) execution model, the difference between pipelined and materialized execution, how specific operators work at runtime, memory management during execution, and how results flow back to clients.
The execution engine is responsible for:
- Reading data pages from storage
- Applying operators (scans, filters, joins, aggregates) to transform data
- Managing memory for intermediate results
- Streaming final results back to the client
Execution Plan as a Tree:
The execution plan is a tree of operators. Data flows from leaf nodes (table scans) up through intermediate nodes (joins, filters) to the root (result delivery).
Execution Engine Components:
| Component | Responsibility |
|---|---|
| Plan Executor | Traverses plan tree, coordinates operator execution |
| Operator Implementations | Specific code for each operator type (scan, join, sort, etc.) |
| Expression Evaluator | Computes expressions (predicates, projections, function calls) |
| Buffer Manager | Manages memory buffers, page replacement |
| Storage Manager | Reads/writes data pages from disk |
| Transaction Manager | Ensures ACID properties, manages locks |
Most relational databases use the iterator model (also called the Volcano model after the Volcano system that popularized it). Each operator implements a simple interface with three methods:
- `open()` — Initialize the operator, allocate resources
- `next()` — Return the next tuple (or null if exhausted)
- `close()` — Release resources, cleanup
How It Works:
The root operator calls next() on its child, which calls next() on its child, recursively down to the leaves. Tuples flow up one at a time through the tree.
This is a pull-based model: the consumer pulls data from the producer. The root "pulls" results by repeatedly asking for the next tuple.
```
// Abstract iterator interface (all operators implement this)
interface Iterator:
    function open()                   // Initialize operator
    function next() → Tuple | null    // Get next tuple
    function close()                  // Cleanup

// Sequential scan implementation
class SeqScan implements Iterator:
    table: Table
    currentPage: int
    currentSlot: int

    function open():
        currentPage = 0
        currentSlot = 0

    function next() → Tuple | null:
        while currentPage < table.pageCount:
            page = bufferManager.fetchPage(table, currentPage)
            while currentSlot < page.tupleCount:
                tuple = page.getTuple(currentSlot)
                currentSlot++
                if evaluatePredicate(tuple):
                    return tuple
            currentPage++
            currentSlot = 0
        return null  // No more tuples

    function close():
        // Release any pinned pages

// Hash join implementation
class HashJoin implements Iterator:
    leftChild: Iterator    // Build side
    rightChild: Iterator   // Probe side
    hashTable: HashMap
    rightTuple: Tuple      // Current probe tuple
    matches: List          // Unemitted build-side matches for rightTuple

    function open():
        leftChild.open()
        rightChild.open()
        // Build phase: hash all tuples from left child
        while (tuple = leftChild.next()) != null:
            key = extractJoinKey(tuple)
            hashTable.insert(key, tuple)

    function next() → Tuple | null:
        // Probe phase: emit one match per call, resuming where we left off
        // so that a probe tuple with several build-side matches yields all of them
        loop:
            if matches is not empty:
                leftTuple = matches.removeFirst()
                return combineTuples(leftTuple, rightTuple)
            rightTuple = rightChild.next()
            if rightTuple == null:
                return null  // Probe side exhausted
            matches = hashTable.get(extractJoinKey(rightTuple))

    function close():
        leftChild.close()
        rightChild.close()
        hashTable.destroy()
```

The tuple-at-a-time model has overhead: every tuple triggers a chain of next() calls down the operator tree. Modern systems use vectorized execution (batch-at-a-time) or compiled execution to reduce this overhead while preserving the iterator abstraction.
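The note above mentions vectorized execution; to make "batch-at-a-time" concrete, here is a minimal C sketch of a vectorized iterator interface. TupleBatch, the fixed column count, and the batch size are illustrative assumptions, not any particular engine's types (systems like Vectorwise and DuckDB popularized this shape).

```c
#include <stddef.h>

#define BATCH_SIZE 1024   /* Typical vector size: small enough for CPU cache */

/* Hypothetical batch: column-oriented storage of up to BATCH_SIZE rows */
typedef struct TupleBatch {
    size_t count;          /* Rows actually present (<= BATCH_SIZE) */
    void  *columns[8];     /* One value array per column (illustrative) */
} TupleBatch;

/* Same open/next/close shape as the Volcano model, but next() fills a
 * whole batch, amortizing per-call overhead across ~1000 tuples */
typedef struct BatchIterator {
    void (*open)(struct BatchIterator *self);
    int  (*next)(struct BatchIterator *self, TupleBatch *out);  /* 0 = done */
    void (*close)(struct BatchIterator *self);
    void *state;           /* Operator-specific state */
} BatchIterator;
```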
Operators fall into two categories based on how they process data:
Pipeline Operators (Non-Blocking):
Can produce output tuples while still receiving input. They process one tuple at a time without needing to see all input first.
Examples: Selection (filter), Projection, Nested Loop Join (for each outer tuple)
Blocking Operators (Materializing):
Must consume all input before producing any output. They must "materialize" (store) intermediate results.
Examples: Sort, Hash Join (build phase), Aggregation (non-sorted), Set Difference
| Operator | Type | Reason |
|---|---|---|
| Selection (σ) | Pipeline | Can filter tuples one at a time |
| Projection (π) | Pipeline | Can transform tuples one at a time |
| Nested Loop Join | Pipeline | Outer side pipelines, inner may rescan |
| Hash Join - Build | Blocking | Must build complete hash table first |
| Hash Join - Probe | Pipeline | Emits matches as probe tuples arrive |
| Sort | Blocking | Must see all tuples to sort |
| Hash Aggregate | Blocking | Must see all groups to aggregate |
| Sort Aggregate | Pipeline* | Pipeline if input pre-sorted |
| DISTINCT (Hash) | Blocking | Must track all seen values |
| DISTINCT (Sort) | Pipeline* | Pipeline if input pre-sorted |
| LIMIT | Pipeline | Can stop after N tuples |
| Union All | Pipeline | Just concatenates streams |
Pipeline Breakers:
Blocking operators are called pipeline breakers because they interrupt the smooth flow of data. An ideal execution plan minimizes pipeline breakers to:
- Reduce the memory (and potential disk spill) needed for materialized intermediate results
- Return the first rows to the client sooner
- Keep downstream operators busy instead of stalled waiting for input
For interactive queries, 'time to first row' matters as much as total time. A pipelined plan may start returning rows immediately, while a blocking plan makes users wait. EXPLAIN output shows 'startup cost' (time before first row) vs. 'total cost' (time for all rows).
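To see what "blocking" means in iterator terms, here is a minimal C sketch of a Sort operator, assuming zero-initialized state and simplified `Tuple`/`Iterator` types (illustrative stand-ins, not any engine's real API). All of the work happens in `open()`: no caller can get a row until the child is fully drained.

```c
#include <stdlib.h>

typedef struct Tuple Tuple;                  /* Opaque row type */

typedef struct Iterator {
    void   (*open)(struct Iterator *self);
    Tuple *(*next)(struct Iterator *self);   /* NULL when exhausted */
    void   (*close)(struct Iterator *self);
} Iterator;

typedef struct SortOp {
    Iterator  base;                          /* Exposes open/next/close */
    Iterator *child;
    Tuple   **buffer;                        /* Materialized input */
    size_t    count, pos, capacity;
    int     (*cmp)(const void *, const void *);
} SortOp;

static void sort_open(Iterator *self) {
    SortOp *op = (SortOp *)self;
    Tuple  *t;
    op->child->open(op->child);
    /* The pipeline breaker: drain the ENTIRE child before emitting anything */
    while ((t = op->child->next(op->child)) != NULL) {
        if (op->count == op->capacity) {
            op->capacity = op->capacity ? op->capacity * 2 : 1024;
            op->buffer = realloc(op->buffer, op->capacity * sizeof(Tuple *));
        }
        op->buffer[op->count++] = t;
    }
    qsort(op->buffer, op->count, sizeof(Tuple *), op->cmp);
}

static Tuple *sort_next(Iterator *self) {
    SortOp *op = (SortOp *)self;             /* Trivial once sorted */
    return (op->pos < op->count) ? op->buffer[op->pos++] : NULL;
}
```

A Selection operator, by contrast, does all its work inside next() and can return its first qualifying row immediately.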
Scan operators are the leaves of the execution tree—they read data from storage and deliver tuples to their parent operators.
Sequential Scan:
Reads every page of a table in physical order. Simple but thorough.
```
for each page in table:
    for each tuple in page:
        if tuple satisfies predicate:
            yield tuple
```
Index Scan:
Uses an index to find qualifying tuples, then fetches heap pages.
```
for each key matching predicate in index:
    fetch tuple from heap using TID
    yield tuple
```
```c
/* Sequential Scan Execution */
typedef struct SeqScanState {
    PlanState    ps;         /* Base plan state */
    HeapScanDesc scanDesc;   /* Scan descriptor */
    Relation     relation;   /* Table being scanned */
    List        *qual;       /* Filter predicates */
} SeqScanState;

TupleTableSlot *SeqNext(SeqScanState *node) {
    HeapTuple tuple;

    /* Loop until we find a qualifying tuple */
    while ((tuple = heap_getnext(node->scanDesc)) != NULL) {
        /* Store tuple in slot for predicate evaluation */
        ExecStoreTuple(tuple, node->ss_ScanTupleSlot);

        /* Evaluate WHERE predicates */
        if (ExecQual(node->qual, econtext)) {
            return node->ss_ScanTupleSlot;  /* Tuple qualifies */
        }
        /* Predicate failed, try next tuple */
    }
    return NULL;  /* No more tuples */
}

/* Index Scan Execution */
typedef struct IndexScanState {
    PlanState     ps;
    IndexScanDesc scanDesc;
    Relation      indexRelation;
    Relation      heapRelation;
    List         *indexQual;   /* Index-checkable predicates */
    List         *qual;        /* Additional heap predicates */
} IndexScanState;

TupleTableSlot *IndexNext(IndexScanState *node) {
    ItemPointer tid;
    HeapTuple   tuple;

    while ((tid = index_getnext_tid(node->scanDesc)) != NULL) {
        /* Fetch heap tuple using TID from index */
        tuple = index_fetch_heap(node->scanDesc, tid);
        if (tuple == NULL)
            continue;  /* TID was dead/deleted */

        ExecStoreTuple(tuple, node->ss_ScanTupleSlot);

        /* Check additional predicates not in index */
        if (ExecQual(node->qual, econtext)) {
            return node->ss_ScanTupleSlot;
        }
    }
    return NULL;
}
```

Index-Only Scan:
If all required columns are in the index, skip heap fetches entirely. Much faster for covered queries.
Bitmap Scan:
Hybrid approach for medium-selectivity queries:
1. Scan the index and mark each matching tuple's heap page in an in-memory bitmap
2. Once the index scan completes, visit the marked heap pages in ascending physical order
3. Recheck the predicate against the tuples on each fetched page
Bitmap scans convert random I/O from index lookups into sequential I/O, which is faster for larger result sets.
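The sketch below shows that two-phase shape in C. The storage helpers (`index_next_tid`, `read_page`, and friends) are hypothetical stand-ins, not a real engine's API; a page-level bitmap is sketched here because it is the simpler variant (PostgreSQL falls back to page-level "lossy" bits when the bitmap would exceed memory).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opaque types and helpers -- illustrative stand-ins */
typedef struct IndexCursor IndexCursor;
typedef struct Table       Table;
typedef struct Predicate   Predicate;
typedef struct Page        Page;
typedef struct { uint32_t pageNo; uint16_t slot; } TID;

extern int      index_next_tid(IndexCursor *idx, TID *out);  /* 0 = done */
extern size_t   table_page_count(Table *t);
extern Page    *read_page(Table *t, uint32_t pageNo);
extern uint16_t page_tuple_count(Page *p);
extern void    *page_get_tuple(Page *p, uint16_t slot);
extern int      eval_predicate(Predicate *pred, void *tuple);
extern void     emit(void *tuple);

void bitmap_heap_scan(IndexCursor *idx, Table *table, Predicate *pred) {
    size_t   npages = table_page_count(table);
    uint8_t *bitmap = calloc((npages + 7) / 8, 1);  /* One bit per heap page */
    TID      tid;

    /* Phase 1: mark every heap page containing at least one index match */
    while (index_next_tid(idx, &tid))
        bitmap[tid.pageNo / 8] |= (uint8_t)(1 << (tid.pageNo % 8));

    /* Phase 2: fetch marked pages in ascending physical order */
    for (uint32_t p = 0; p < npages; p++) {
        if (!(bitmap[p / 8] & (1 << (p % 8))))
            continue;
        Page *page = read_page(table, p);
        /* Page-level bits lose slot info, so recheck the predicate */
        for (uint16_t s = 0; s < page_tuple_count(page); s++) {
            void *tuple = page_get_tuple(page, s);
            if (eval_predicate(pred, tuple))
                emit(tuple);
        }
    }
    free(bitmap);
}
```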
| Scan Type | Fraction of Rows Matched | Best When... |
|---|---|---|
| Sequential Scan | High (>20% of rows) | Reading most of the table anyway |
| Index Scan | Very low (<10%) | Few matching rows, random access acceptable |
| Index-Only Scan | Any | All columns in index (covering index) |
| Bitmap Scan | Medium (5-20%) | Moderate row counts that benefit from physical-order fetches |
Join operators combine rows from two relations based on join conditions. Three main algorithm families exist: nested loop, hash, and merge joins, each with different performance characteristics.
Nested Loop Join:
The simplest algorithm—for each outer row, scan all inner rows.
```
for each tuple r in outer relation R:
    for each tuple s in inner relation S:
        if join_condition(r, s):
            yield combined(r, s)
```
Complexity: O(|R| × |S|) — but variants using indexes make this much faster.
```
// Basic Nested Loop (expensive)
function NestedLoopJoin(R, S, condition):
    for each tuple r in R:
        for each tuple s in S:
            if condition(r, s):
                emit (r, s)
// Cost: |R| × |S| comparisons

// Index Nested Loop (efficient when inner has index)
function IndexNestedLoopJoin(R, S, R.key = S.key):
    for each tuple r in R:
        for each tuple s in S.index_lookup(r.key):
            emit (r, s)
// Cost: |R| × log(|S|) if index on S.key

// Block Nested Loop (reduces I/O)
function BlockNestedLoopJoin(R, S, condition):
    for each block Br of R that fits in memory:
        for each block Bs of S:
            for each r in Br:
                for each s in Bs:
                    if condition(r, s):
                        emit (r, s)
// Cost: |R|/block_size × |S| page reads
```

| Algorithm | Complexity | Best When... |
|---|---|---|
| Nested Loop | O(\|R\| × \|S\|) | Very small tables, or inner indexed |
| Index Nested Loop | O(\|R\| × log\|S\|) | Index on inner join column |
| Block Nested Loop | O(\|R\|/M × \|S\|) | No index, limited memory |
| Hash Join | O(\|R\| + \|S\|) | Equi-join, one side fits in memory |
| Grace Hash Join | O(\|R\| + \|S\|) | Equi-join, inputs larger than memory |
| Merge Join | O(\|R\| + \|S\|) | Inputs pre-sorted, or sort is beneficial |
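The merge join's O(\|R\| + \|S\|) cost in the table comes from a single coordinated pass over both sorted inputs. Here is a simplified C sketch assuming both inputs arrive sorted on the join key and the inner side has no duplicate keys; real implementations add mark/restore logic to rewind within duplicate groups. `outer_next`, `inner_next`, and `emit_joined` are hypothetical stand-ins for the child iterators and output.

```c
typedef struct Row { long key; /* ...payload... */ } Row;

extern Row *outer_next(void);   /* Sorted outer input; NULL when exhausted */
extern Row *inner_next(void);   /* Sorted inner input; NULL when exhausted */
extern void emit_joined(Row *r, Row *s);

void merge_join(void) {
    Row *r = outer_next();
    Row *s = inner_next();
    while (r != NULL && s != NULL) {
        if (r->key < s->key) {
            r = outer_next();          /* Outer is behind: advance it */
        } else if (r->key > s->key) {
            s = inner_next();          /* Inner is behind: advance it */
        } else {
            emit_joined(r, s);         /* Match: emit, advance outer only */
            r = outer_next();          /* (keeps s for duplicate outer keys) */
        }
    }
}
```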
Aggregation Execution:
Aggregation operators compute summary values (SUM, COUNT, AVG, etc.) over groups of rows.
Hash Aggregation:
Build a hash table keyed by GROUP BY columns. For each input row, find or create the group entry and update accumulators.
```
for each input tuple:
    key = extract_group_key(tuple)
    if key not in hashTable:
        hashTable[key] = init_accumulators()
    update_accumulators(hashTable[key], tuple)

for each entry in hashTable:
    yield compute_final_values(entry)
```
Sort Aggregation:
Sort input by GROUP BY columns, then scan sequentially. All rows of each group are contiguous.
```
sort input by group_key
current_group = null
for each tuple in sorted_input:
    if tuple.group_key != current_group:
        if current_group != null:
            yield compute_final_values(accumulators)
        current_group = tuple.group_key
        accumulators = init_accumulators()
    update_accumulators(accumulators, tuple)
if current_group != null:
    yield compute_final_values(accumulators)  # Last group
```
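The split between update_accumulators and compute_final_values in the pseudocode above exists because many aggregates cannot emit their answer incrementally. AVG, for instance, must carry a (sum, count) pair and divide only at finalization. A tiny C sketch of that transition-state pattern (names are illustrative):

```c
typedef struct AvgState {
    double sum;     /* Running total */
    long   count;   /* Rows seen so far */
} AvgState;

void avg_init(AvgState *st) { st->sum = 0.0; st->count = 0; }

void avg_update(AvgState *st, double value) {
    st->sum += value;
    st->count++;
}

/* Finalization: only now can the average actually be computed */
double avg_finalize(const AvgState *st) {
    /* SQL's AVG over zero rows yields NULL; 0.0 stands in here */
    return st->count ? st->sum / (double)st->count : 0.0;
}
```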
Sorting Execution:
Sorting is fundamental—used for ORDER BY, merge joins, sort aggregation, and duplicate elimination.
In-Memory Sort:
When data fits in work memory, use quicksort or a similar O(n log n) in-memory algorithm.
External Merge Sort:
When data exceeds memory, use external sorting:
```
// Phase 1: Create sorted runs
while more_data:
    chunk = read_chunk_into_memory()
    sort(chunk)
    write_run(chunk)

// Phase 2: Merge runs
while run_count > 1:
    runs_to_merge = select_runs(merge_factor)
    merged = merge(runs_to_merge)
    write_run(merged)
```
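The merge(runs_to_merge) step in Phase 2 is typically a k-way merge driven by a min-heap of size k: the heap always holds the current head of each run, so each output element costs O(log k). A C sketch with long keys and hypothetical `run_next`/`write_output` helpers standing in for run files:

```c
#include <stdlib.h>

typedef struct Run Run;                       /* Opaque sorted run on disk */
extern int  run_next(Run *run, long *out);    /* 0 when run is exhausted */
extern void write_output(long key);

typedef struct { long key; Run *run; } HeapEntry;

static void sift_down(HeapEntry *h, size_t n, size_t i) {
    for (;;) {
        size_t m = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        HeapEntry tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

void k_way_merge(Run **runs, size_t k) {
    HeapEntry *heap = malloc(k * sizeof(HeapEntry));
    size_t n = 0;
    long key;

    /* Prime the heap with the first key of each non-empty run */
    for (size_t i = 0; i < k; i++)
        if (run_next(runs[i], &key))
            heap[n++] = (HeapEntry){ key, runs[i] };
    for (size_t i = n; i-- > 0; )
        sift_down(heap, n, i);

    while (n > 0) {
        write_output(heap[0].key);            /* Globally smallest key */
        if (run_next(heap[0].run, &key))
            heap[0].key = key;                /* Refill from the same run */
        else
            heap[0] = heap[--n];              /* Run exhausted: shrink heap */
        sift_down(heap, n, 0);
    }
    free(heap);
}
```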
```sql
-- Work memory controls in-memory sort/hash size
SHOW work_mem;  -- Default: 4MB

-- View sort method in EXPLAIN (with the default work_mem)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY order_date;

/*
Sort  (cost=12345.67..12456.78 rows=100000 width=84)
      (actual time=234.567..345.678 rows=100000 loops=1)
  Sort Key: order_date
  Sort Method: external merge  Disk: 15624kB   <-- Spilled to disk!
  Buffers: shared hit=1234, temp read=1953 written=1953
  ...
*/

-- Increase work memory for large sorts (session-level)
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY order_date;

/*
Sort  (cost=12345.67..12456.78 rows=100000 width=84)
      (actual time=123.456..189.012 rows=100000 loops=1)
  Sort Key: order_date
  Sort Method: quicksort  Memory: 25600kB   <-- In-memory!
  ...
*/
```

work_mem (PostgreSQL) or sort_buffer_size (MySQL) controls memory for sorts and hash operations. Too low causes disk spills; too high risks memory exhaustion under concurrent queries. The sweet spot depends on query concurrency and available RAM.
At the core of query execution is expression evaluation—computing values from column references, constants, operators, and function calls.
Expression Types:
```c
/* Expression tree nodes */
typedef enum ExprType {
    E_CONST,     /* Constant value */
    E_VAR,       /* Column reference */
    E_OP,        /* Operator (binary or unary) */
    E_FUNC,      /* Function call */
    E_CASE,      /* CASE/WHEN expression */
    E_SUBQUERY,  /* Scalar subquery */
} ExprType;

typedef struct Expr {
    ExprType type;
    Oid      resultType;           /* Output data type */
    union {
        Datum constValue;          /* For E_CONST */
        int   varAttNo;            /* For E_VAR: attribute number */
        struct {
            OpCode       op;
            struct Expr *left;
            struct Expr *right;
        } opExpr;                  /* For E_OP */
        struct {
            FuncPtr function;
            List   *args;
        } funcExpr;                /* For E_FUNC */
        struct {
            List        *whenClauses;
            struct Expr *elseResult;
        } caseExpr;                /* For E_CASE */
    };
} Expr;

/* Recursive expression evaluator */
Datum ExecEvalExpr(Expr *expr, TupleTableSlot *slot) {
    switch (expr->type) {
        case E_CONST:
            return expr->constValue;

        case E_VAR:
            return slot_getattr(slot, expr->varAttNo);

        case E_OP: {
            Datum left  = ExecEvalExpr(expr->opExpr.left, slot);
            Datum right = ExecEvalExpr(expr->opExpr.right, slot);
            return ApplyOperator(expr->opExpr.op, left, right);
        }

        case E_FUNC: {
            Datum *args = palloc(list_length(expr->funcExpr.args) * sizeof(Datum));
            int i = 0;
            foreach(arg, expr->funcExpr.args) {
                args[i++] = ExecEvalExpr(arg, slot);
            }
            return expr->funcExpr.function(args);
        }

        case E_CASE:
            /* Evaluate WHEN clauses until one matches */
            foreach(when, expr->caseExpr.whenClauses) {
                if (DatumGetBool(ExecEvalExpr(when->condition, slot)))
                    return ExecEvalExpr(when->result, slot);
            }
            return ExecEvalExpr(expr->caseExpr.elseResult, slot);

        default:
            return (Datum) 0;  /* Unreachable for well-formed trees */
    }
}
```

Expression Compilation:
Modern databases optimize expression evaluation through:
- Constant folding: evaluating constant subexpressions once at plan time (sketched below)
- Caching: reusing compiled expression state across rows
- Vectorization: evaluating an expression over a batch of rows per call
- JIT compilation: translating expression trees to native code (PostgreSQL can do this via LLVM)
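As a concrete taste of the first technique, here is a constant-folding sketch over the Expr tree defined in the evaluator above. It shows the core idea only, not PostgreSQL's actual eval_const_expressions(): if both operands of an operator are constants, evaluate once at plan time and rewrite the node into an E_CONST.

```c
/* Fold constant subtrees bottom-up; assumes the Expr/Datum/ApplyOperator
 * definitions from the evaluator sketch above */
Expr *fold_constants(Expr *expr) {
    if (expr->type == E_OP) {
        expr->opExpr.left  = fold_constants(expr->opExpr.left);
        expr->opExpr.right = fold_constants(expr->opExpr.right);

        if (expr->opExpr.left->type == E_CONST &&
            expr->opExpr.right->type == E_CONST) {
            /* Evaluate once now so execution never revisits this subtree */
            Datum result = ApplyOperator(expr->opExpr.op,
                                         expr->opExpr.left->constValue,
                                         expr->opExpr.right->constValue);
            expr->type = E_CONST;
            expr->constValue = result;
        }
    }
    return expr;
}
```

A production folder must also skip volatile functions (e.g., random()), which cannot be evaluated ahead of execution.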
The final step is delivering results to the client. This involves formatting tuples and writing them to the network connection.
Result Delivery Modes:
- Materialize-then-send: compute the entire result set, then transmit it (simple, but memory-hungry)
- Streaming: send each row (or small batch) to the client as the executor produces it
- Cursor-based: the client explicitly fetches batches at its own pace, as shown below
```sql
-- Using cursors for large result streaming
BEGIN;

-- Declare cursor (doesn't execute yet)
DECLARE order_cursor CURSOR FOR
    SELECT order_id, customer_name, total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE order_date >= '2024-01-01';

-- Fetch results in batches
FETCH 100 FROM order_cursor;  -- Get first 100 rows
FETCH 100 FROM order_cursor;  -- Get next 100 rows
-- ... continue until no more rows

CLOSE order_cursor;
COMMIT;

-- Benefits:
-- - Memory: Only batch size in memory at once
-- - Control: Client processes at own pace
-- - Interruptible: Can stop partway through

-- Scrollable cursor (can move backward)
DECLARE scroll_cursor SCROLL CURSOR FOR
    SELECT * FROM products ORDER BY price;

FETCH LAST FROM scroll_cursor;        -- Go to last row
FETCH BACKWARD 10 FROM scroll_cursor; -- Previous 10 rows
FETCH FIRST FROM scroll_cursor;       -- Back to start
```

Wire Protocol Encoding:
Results are encoded according to the database's wire protocol:
| Database | Protocol | Format Options |
|---|---|---|
| PostgreSQL | Frontend/Backend protocol (v3) | Text or Binary |
| MySQL | MySQL Protocol | Text (default), Binary (prepared) |
| SQL Server | TDS | Various encodings |
Binary format is more compact and faster to parse, but text format is human-readable for debugging. Prepared statements typically use binary encoding.
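To make the encoding step concrete, here is a sketch of building one PostgreSQL DataRow ('D') message, following the documented v3 message layout: a type byte, an Int32 length that includes itself, an Int16 column count, then per column an Int32 value length (-1 for NULL) followed by the raw bytes. The function itself and its fixed-buffer handling are illustrative, not server code.

```c
#include <arpa/inet.h>   /* htonl, htons (network byte order) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

size_t encode_data_row(uint8_t *buf, int ncols,
                       const char **values, const int *lengths) {
    size_t pos = 0;
    buf[pos++] = 'D';                          /* Message type: DataRow */
    pos += 4;                                  /* Reserve the length field */

    uint16_t nc = htons((uint16_t)ncols);
    memcpy(buf + pos, &nc, 2); pos += 2;       /* Int16 column count */

    for (int i = 0; i < ncols; i++) {
        int32_t len = (int32_t)htonl(values[i] ? (uint32_t)lengths[i]
                                               : (uint32_t)-1);
        memcpy(buf + pos, &len, 4); pos += 4;  /* Int32 length, -1 = NULL */
        if (values[i]) {
            memcpy(buf + pos, values[i], (size_t)lengths[i]);
            pos += (size_t)lengths[i];
        }
    }

    /* Length counts everything after the type byte, itself included */
    uint32_t total = htonl((uint32_t)(pos - 1));
    memcpy(buf + 1, &total, 4);
    return pos;
}
```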
Fetching millions of rows without cursors can exhaust client memory. Applications should use cursors, LIMIT/OFFSET pagination, or keyset pagination for large datasets. Never SELECT * on a large table without limits in production code.
We've explored query execution comprehensively. Let's consolidate the key concepts:
- The iterator (Volcano) model: operators expose open()/next()/close(), and tuples are pulled one at a time from the root
- Pipelined vs. blocking operators, and why pipeline breakers delay the first row
- How scan, join, aggregation, and sort operators work at runtime, and when each is chosen
- Memory management: work memory limits, disk spills, and external sorting
- Result delivery: streaming, cursors, and wire-protocol encoding
Module Complete:
You've now mastered the complete query processing pipeline:
1. Query input and parsing
2. Translation into a logical plan
3. Optimization into a physical execution plan
4. Execution and result delivery
This knowledge forms the foundation for understanding database performance, tuning queries, and designing efficient database-backed applications.
Congratulations! You now understand the complete query processing pipeline—from the moment a SQL statement enters the database until results stream back to the client. This end-to-end understanding enables you to debug performance issues, write efficient queries, and appreciate the sophisticated engineering that powers every database system.