We've journeyed through query input, parsing, translation, and optimization. Now comes the moment of truth: execution. The carefully crafted execution plan must actually retrieve, process, and return data to the user.
The query executor (or query engine) is the runtime component that takes the physical execution plan and orchestrates all the machinery needed to produce results: reading from storage, applying operators, managing memory, and streaming results back to the client.
This is where abstract planning meets concrete computation—where I/O happens, CPU cycles burn, and data finally flows from disk to user.
By the end of this page, you will understand the iterator (Volcano) execution model, the difference between pipelined and materialized execution, how specific operators work at runtime, memory management during execution, and how results flow back to clients.
The execution engine is responsible for:
- Reading data pages from storage
- Applying operators (scans, filters, joins, aggregates) to transform data
- Managing memory for intermediate results
- Streaming final results back to the client
Execution Plan as a Tree:
The execution plan is a tree of operators. Data flows from leaf nodes (table scans) up through intermediate nodes (joins, filters) to the root (result delivery).
Execution Engine Components:
| Component | Responsibility |
|---|---|
| Plan Executor | Traverses plan tree, coordinates operator execution |
| Operator Implementations | Specific code for each operator type (scan, join, sort, etc.) |
| Expression Evaluator | Computes expressions (predicates, projections, function calls) |
| Buffer Manager | Manages memory buffers, page replacement |
| Storage Manager | Reads/writes data pages from disk |
| Transaction Manager | Ensures ACID properties, manages locks |
Most relational databases use the iterator model (also called the Volcano model after the Volcano system that popularized it). Each operator implements a simple interface with three methods:
- `open()` — Initialize the operator, allocate resources
- `next()` — Return the next tuple (or null if exhausted)
- `close()` — Release resources, cleanup
How It Works:
The root operator calls next() on its child, which calls next() on its child, recursively down to the leaves. Tuples flow up one at a time through the tree.
This is a pull-based model: the consumer pulls data from the producer. The root "pulls" results by repeatedly asking for the next tuple.
```
// Abstract iterator interface (all operators implement this)
interface Iterator:
    function open()                   // Initialize operator
    function next() → Tuple | null    // Get next tuple
    function close()                  // Cleanup

// Sequential scan implementation
class SeqScan implements Iterator:
    table: Table
    currentPage: int
    currentSlot: int

    function open():
        currentPage = 0
        currentSlot = 0

    function next() → Tuple | null:
        while currentPage < table.pageCount:
            page = bufferManager.fetchPage(table, currentPage)
            while currentSlot < page.tupleCount:
                tuple = page.getTuple(currentSlot)
                currentSlot++
                if evaluatePredicate(tuple):
                    return tuple
            currentPage++
            currentSlot = 0
        return null  // No more tuples

    function close():
        // Release any pinned pages

// Hash join implementation
class HashJoin implements Iterator:
    leftChild: Iterator    // Build side
    rightChild: Iterator   // Probe side
    hashTable: HashMap
    rightTuple: Tuple      // Current probe tuple
    matches: List          // Unemitted build-side matches for rightTuple

    function open():
        leftChild.open()
        rightChild.open()
        // Build phase: hash all tuples from left child
        while (tuple = leftChild.next()) != null:
            key = extractJoinKey(tuple)
            hashTable.insert(key, tuple)

    function next() → Tuple | null:
        // Probe phase: emit one match per call, resuming where we left off
        // so that a probe tuple with several build-side matches yields all of them
        loop:
            if matches is not empty:
                leftTuple = matches.removeFirst()
                return combineTuples(leftTuple, rightTuple)
            rightTuple = rightChild.next()
            if rightTuple == null:
                return null  // Probe side exhausted
            matches = hashTable.get(extractJoinKey(rightTuple))

    function close():
        leftChild.close()
        rightChild.close()
        hashTable.destroy()
```

The tuple-at-a-time model has overhead: every tuple triggers a chain of next() calls down the operator tree. Modern systems use vectorized execution (batch-at-a-time) or compiled execution to reduce this overhead while preserving the iterator abstraction.
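The note above mentions vectorized execution; to make "batch-at-a-time" concrete, here is a minimal C sketch of a vectorized iterator interface. TupleBatch, the fixed column count, and the batch size are illustrative assumptions, not any particular engine's types (systems like Vectorwise and DuckDB popularized this shape).

```c
#include <stddef.h>

#define BATCH_SIZE 1024   /* Typical vector size: small enough for CPU cache */

/* Hypothetical batch: column-oriented storage of up to BATCH_SIZE rows */
typedef struct TupleBatch {
    size_t count;          /* Rows actually present (<= BATCH_SIZE) */
    void  *columns[8];     /* One value array per column (illustrative) */
} TupleBatch;

/* Same open/next/close shape as the Volcano model, but next() fills a
 * whole batch, amortizing per-call overhead across ~1000 tuples */
typedef struct BatchIterator {
    void (*open)(struct BatchIterator *self);
    int  (*next)(struct BatchIterator *self, TupleBatch *out);  /* 0 = done */
    void (*close)(struct BatchIterator *self);
    void *state;           /* Operator-specific state */
} BatchIterator;
```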
Operators fall into two categories based on how they process data:
Pipeline Operators (Non-Blocking):
Can produce output tuples while still receiving input. They process one tuple at a time without needing to see all input first.
Examples: Selection (filter), Projection, Nested Loop Join (for each outer tuple)
Blocking Operators (Materializing):
Must consume all input before producing any output. They must "materialize" (store) intermediate results.
Examples: Sort, Hash Join (build phase), Aggregation (non-sorted), Set Difference
| Operator | Type | Reason |
|---|---|---|
| Selection (σ) | Pipeline | Can filter tuples one at a time |
| Projection (π) | Pipeline | Can transform tuples one at a time |
| Nested Loop Join | Pipeline | Outer side pipelines, inner may rescan |
| Hash Join - Build | Blocking | Must build complete hash table first |
| Hash Join - Probe | Pipeline | Emits matches as probe tuples arrive |
| Sort | Blocking | Must see all tuples to sort |
| Hash Aggregate | Blocking | Must see all groups to aggregate |
| Sort Aggregate | Pipeline* | Pipeline if input pre-sorted |
| DISTINCT (Hash) | Blocking | Must track all seen values |
| DISTINCT (Sort) | Pipeline* | Pipeline if input pre-sorted |
| LIMIT | Pipeline | Can stop after N tuples |
| Union All | Pipeline | Just concatenates streams |
Pipeline Breakers:
Blocking operators are called pipeline breakers because they interrupt the smooth flow of data. An ideal execution plan minimizes pipeline breakers to:
- Reduce the memory (and potential disk spill) needed for materialized intermediate results
- Return the first rows to the client sooner
- Keep downstream operators busy instead of stalled waiting for input
For interactive queries, 'time to first row' matters as much as total time. A pipelined plan may start returning rows immediately, while a blocking plan makes users wait. EXPLAIN output shows 'startup cost' (time before first row) vs. 'total cost' (time for all rows).
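To see what "blocking" means in iterator terms, here is a minimal C sketch of a Sort operator, assuming zero-initialized state and simplified `Tuple`/`Iterator` types (illustrative stand-ins, not any engine's real API). All of the work happens in `open()`: no caller can get a row until the child is fully drained.

```c
#include <stdlib.h>

typedef struct Tuple Tuple;                  /* Opaque row type */

typedef struct Iterator {
    void   (*open)(struct Iterator *self);
    Tuple *(*next)(struct Iterator *self);   /* NULL when exhausted */
    void   (*close)(struct Iterator *self);
} Iterator;

typedef struct SortOp {
    Iterator  base;                          /* Exposes open/next/close */
    Iterator *child;
    Tuple   **buffer;                        /* Materialized input */
    size_t    count, pos, capacity;
    int     (*cmp)(const void *, const void *);
} SortOp;

static void sort_open(Iterator *self) {
    SortOp *op = (SortOp *)self;
    Tuple  *t;
    op->child->open(op->child);
    /* The pipeline breaker: drain the ENTIRE child before emitting anything */
    while ((t = op->child->next(op->child)) != NULL) {
        if (op->count == op->capacity) {
            op->capacity = op->capacity ? op->capacity * 2 : 1024;
            op->buffer = realloc(op->buffer, op->capacity * sizeof(Tuple *));
        }
        op->buffer[op->count++] = t;
    }
    qsort(op->buffer, op->count, sizeof(Tuple *), op->cmp);
}

static Tuple *sort_next(Iterator *self) {
    SortOp *op = (SortOp *)self;             /* Trivial once sorted */
    return (op->pos < op->count) ? op->buffer[op->pos++] : NULL;
}
```

A Selection operator, by contrast, does all its work inside next() and can return its first qualifying row immediately.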
Scan operators are the leaves of the execution tree—they read data from storage and deliver tuples to their parent operators.
Sequential Scan:
Reads every page of a table in physical order. Simple but thorough.
```
for each page in table:
    for each tuple in page:
        if tuple satisfies predicate:
            yield tuple
```
Index Scan:
Uses an index to find qualifying tuples, then fetches heap pages.
```
for each key matching predicate in index:
    fetch tuple from heap using TID
    yield tuple
```
```c
/* Sequential Scan Execution */
typedef struct SeqScanState {
    PlanState    ps;         /* Base plan state */
    HeapScanDesc scanDesc;   /* Scan descriptor */
    Relation     relation;   /* Table being scanned */
    List        *qual;       /* Filter predicates */
} SeqScanState;

TupleTableSlot *SeqNext(SeqScanState *node) {
    HeapTuple tuple;

    /* Loop until we find a qualifying tuple */
    while ((tuple = heap_getnext(node->scanDesc)) != NULL) {
        /* Store tuple in slot for predicate evaluation */
        ExecStoreTuple(tuple, node->ss_ScanTupleSlot);

        /* Evaluate WHERE predicates */
        if (ExecQual(node->qual, econtext)) {
            return node->ss_ScanTupleSlot;  /* Tuple qualifies */
        }
        /* Predicate failed, try next tuple */
    }
    return NULL;  /* No more tuples */
}

/* Index Scan Execution */
typedef struct IndexScanState {
    PlanState     ps;
    IndexScanDesc scanDesc;
    Relation      indexRelation;
    Relation      heapRelation;
    List         *indexQual;   /* Index-checkable predicates */
    List         *qual;        /* Additional heap predicates */
} IndexScanState;

TupleTableSlot *IndexNext(IndexScanState *node) {
    ItemPointer tid;
    HeapTuple   tuple;

    while ((tid = index_getnext_tid(node->scanDesc)) != NULL) {
        /* Fetch heap tuple using TID from index */
        tuple = index_fetch_heap(node->scanDesc, tid);
        if (tuple == NULL)
            continue;  /* TID was dead/deleted */

        ExecStoreTuple(tuple, node->ss_ScanTupleSlot);

        /* Check additional predicates not in index */
        if (ExecQual(node->qual, econtext)) {
            return node->ss_ScanTupleSlot;
        }
    }
    return NULL;
}
```

Index-Only Scan:
If all required columns are in the index, skip heap fetches entirely. Much faster for covered queries.
Bitmap Scan:
Hybrid approach for medium-selectivity queries:
1. Scan the index and mark each matching tuple's heap page in an in-memory bitmap
2. Once the index scan completes, visit the marked heap pages in ascending physical order
3. Recheck the predicate against the tuples on each fetched page
Bitmap scans convert random I/O from index lookups into sequential I/O, which is faster for larger result sets.
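The sketch below shows that two-phase shape in C. The storage helpers (`index_next_tid`, `read_page`, and friends) are hypothetical stand-ins, not a real engine's API; a page-level bitmap is sketched here because it is the simpler variant (PostgreSQL falls back to page-level "lossy" bits when the bitmap would exceed memory).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opaque types and helpers -- illustrative stand-ins */
typedef struct IndexCursor IndexCursor;
typedef struct Table       Table;
typedef struct Predicate   Predicate;
typedef struct Page        Page;
typedef struct { uint32_t pageNo; uint16_t slot; } TID;

extern int      index_next_tid(IndexCursor *idx, TID *out);  /* 0 = done */
extern size_t   table_page_count(Table *t);
extern Page    *read_page(Table *t, uint32_t pageNo);
extern uint16_t page_tuple_count(Page *p);
extern void    *page_get_tuple(Page *p, uint16_t slot);
extern int      eval_predicate(Predicate *pred, void *tuple);
extern void     emit(void *tuple);

void bitmap_heap_scan(IndexCursor *idx, Table *table, Predicate *pred) {
    size_t   npages = table_page_count(table);
    uint8_t *bitmap = calloc((npages + 7) / 8, 1);  /* One bit per heap page */
    TID      tid;

    /* Phase 1: mark every heap page containing at least one index match */
    while (index_next_tid(idx, &tid))
        bitmap[tid.pageNo / 8] |= (uint8_t)(1 << (tid.pageNo % 8));

    /* Phase 2: fetch marked pages in ascending physical order */
    for (uint32_t p = 0; p < npages; p++) {
        if (!(bitmap[p / 8] & (1 << (p % 8))))
            continue;
        Page *page = read_page(table, p);
        /* Page-level bits lose slot info, so recheck the predicate */
        for (uint16_t s = 0; s < page_tuple_count(page); s++) {
            void *tuple = page_get_tuple(page, s);
            if (eval_predicate(pred, tuple))
                emit(tuple);
        }
    }
    free(bitmap);
}
```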
| Scan Type | Fraction of Rows Matched | Best When... |
|---|---|---|
| Sequential Scan | High (>20% of rows) | Reading most of the table anyway |
| Index Scan | Very low (<10%) | Few matching rows, random access acceptable |
| Index-Only Scan | Any | All columns in index (covering index) |
| Bitmap Scan | Medium (5-20%) | Moderate row counts that benefit from physical-order fetches |
Join operators combine rows from two relations based on join conditions. Three main algorithm families exist: nested loop, hash, and merge joins, each with different performance characteristics.
Nested Loop Join:
The simplest algorithm—for each outer row, scan all inner rows.
```
for each tuple r in outer relation R:
    for each tuple s in inner relation S:
        if join_condition(r, s):
            yield combined(r, s)
```
Complexity: O(|R| × |S|) — but variants using indexes make this much faster.
```
// Basic Nested Loop (expensive)
function NestedLoopJoin(R, S, condition):
    for each tuple r in R:
        for each tuple s in S:
            if condition(r, s):
                emit (r, s)
// Cost: |R| × |S| comparisons

// Index Nested Loop (efficient when inner has index)
function IndexNestedLoopJoin(R, S, R.key = S.key):
    for each tuple r in R:
        for each tuple s in S.index_lookup(r.key):
            emit (r, s)
// Cost: |R| × log(|S|) if index on S.key

// Block Nested Loop (reduces I/O)
function BlockNestedLoopJoin(R, S, condition):
    for each block Br of R that fits in memory:
        for each block Bs of S:
            for each r in Br:
                for each s in Bs:
                    if condition(r, s):
                        emit (r, s)
// Cost: |R|/block_size × |S| page reads
```

| Algorithm | Complexity | Best When... |
|---|---|---|
| Nested Loop | O(\|R\| × \|S\|) | Very small tables, or inner indexed |
| Index Nested Loop | O(\|R\| × log\|S\|) | Index on inner join column |
| Block Nested Loop | O(\|R\|/M × \|S\|) | No index, limited memory |
| Hash Join | O(\|R\| + \|S\|) | Equi-join, one side fits in memory |
| Grace Hash Join | O(\|R\| + \|S\|) | Equi-join, inputs larger than memory |
| Merge Join | O(\|R\| + \|S\|) | Inputs pre-sorted, or sort is beneficial |
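The merge join's O(\|R\| + \|S\|) cost in the table comes from a single coordinated pass over both sorted inputs. Here is a simplified C sketch assuming both inputs arrive sorted on the join key and the inner side has no duplicate keys; real implementations add mark/restore logic to rewind within duplicate groups. `outer_next`, `inner_next`, and `emit_joined` are hypothetical stand-ins for the child iterators and output.

```c
typedef struct Row { long key; /* ...payload... */ } Row;

extern Row *outer_next(void);   /* Sorted outer input; NULL when exhausted */
extern Row *inner_next(void);   /* Sorted inner input; NULL when exhausted */
extern void emit_joined(Row *r, Row *s);

void merge_join(void) {
    Row *r = outer_next();
    Row *s = inner_next();
    while (r != NULL && s != NULL) {
        if (r->key < s->key) {
            r = outer_next();          /* Outer is behind: advance it */
        } else if (r->key > s->key) {
            s = inner_next();          /* Inner is behind: advance it */
        } else {
            emit_joined(r, s);         /* Match: emit, advance outer only */
            r = outer_next();          /* (keeps s for duplicate outer keys) */
        }
    }
}
```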
Aggregation Execution:
Aggregation operators compute summary values (SUM, COUNT, AVG, etc.) over groups of rows.
Hash Aggregation:
Build a hash table keyed by GROUP BY columns. For each input row, find or create the group entry and update accumulators.
```
for each input tuple:
    key = extract_group_key(tuple)
    if key not in hashTable:
        hashTable[key] = init_accumulators()
    update_accumulators(hashTable[key], tuple)

for each entry in hashTable:
    yield compute_final_values(entry)
```
Sort Aggregation:
Sort input by GROUP BY columns, then scan sequentially. All rows of each group are contiguous.
```
sort input by group_key
current_group = null
for each tuple in sorted_input:
    if tuple.group_key != current_group:
        if current_group != null:
            yield compute_final_values(accumulators)
        current_group = tuple.group_key
        accumulators = init_accumulators()
    update_accumulators(accumulators, tuple)
if current_group != null:
    yield compute_final_values(accumulators)  # Last group
```
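The split between update_accumulators and compute_final_values in the pseudocode above exists because many aggregates cannot emit their answer incrementally. AVG, for instance, must carry a (sum, count) pair and divide only at finalization. A tiny C sketch of that transition-state pattern (names are illustrative):

```c
typedef struct AvgState {
    double sum;     /* Running total */
    long   count;   /* Rows seen so far */
} AvgState;

void avg_init(AvgState *st) { st->sum = 0.0; st->count = 0; }

void avg_update(AvgState *st, double value) {
    st->sum += value;
    st->count++;
}

/* Finalization: only now can the average actually be computed */
double avg_finalize(const AvgState *st) {
    /* SQL's AVG over zero rows yields NULL; 0.0 stands in here */
    return st->count ? st->sum / (double)st->count : 0.0;
}
```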
Sorting Execution:
Sorting is fundamental—used for ORDER BY, merge joins, sort aggregation, and duplicate elimination.
In-Memory Sort:
When data fits in work memory, use quicksort or a similar O(n log n) in-memory algorithm.
External Merge Sort:
When data exceeds memory, use external sorting:
```
// Phase 1: Create sorted runs
while more_data:
    chunk = read_chunk_into_memory()
    sort(chunk)
    write_run(chunk)

// Phase 2: Merge runs
while run_count > 1:
    runs_to_merge = select_runs(merge_factor)
    merged = merge(runs_to_merge)
    write_run(merged)
```
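The merge(runs_to_merge) step in Phase 2 is typically a k-way merge driven by a min-heap of size k: the heap always holds the current head of each run, so each output element costs O(log k). A C sketch with long keys and hypothetical `run_next`/`write_output` helpers standing in for run files:

```c
#include <stdlib.h>

typedef struct Run Run;                       /* Opaque sorted run on disk */
extern int  run_next(Run *run, long *out);    /* 0 when run is exhausted */
extern void write_output(long key);

typedef struct { long key; Run *run; } HeapEntry;

static void sift_down(HeapEntry *h, size_t n, size_t i) {
    for (;;) {
        size_t m = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        HeapEntry tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

void k_way_merge(Run **runs, size_t k) {
    HeapEntry *heap = malloc(k * sizeof(HeapEntry));
    size_t n = 0;
    long key;

    /* Prime the heap with the first key of each non-empty run */
    for (size_t i = 0; i < k; i++)
        if (run_next(runs[i], &key))
            heap[n++] = (HeapEntry){ key, runs[i] };
    for (size_t i = n; i-- > 0; )
        sift_down(heap, n, i);

    while (n > 0) {
        write_output(heap[0].key);            /* Globally smallest key */
        if (run_next(heap[0].run, &key))
            heap[0].key = key;                /* Refill from the same run */
        else
            heap[0] = heap[--n];              /* Run exhausted: shrink heap */
        sift_down(heap, n, 0);
    }
    free(heap);
}
```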
```sql
-- Work memory controls in-memory sort/hash size
SHOW work_mem;  -- Default: 4MB

-- View sort method in EXPLAIN (with the default work_mem)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY order_date;

/*
Sort  (cost=12345.67..12456.78 rows=100000 width=84)
      (actual time=234.567..345.678 rows=100000 loops=1)
  Sort Key: order_date
  Sort Method: external merge  Disk: 15624kB   <-- Spilled to disk!
  Buffers: shared hit=1234, temp read=1953 written=1953
  ...
*/

-- Increase work memory for large sorts (session-level)
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY order_date;

/*
Sort  (cost=12345.67..12456.78 rows=100000 width=84)
      (actual time=123.456..189.012 rows=100000 loops=1)
  Sort Key: order_date
  Sort Method: quicksort  Memory: 25600kB   <-- In-memory!
  ...
*/
```

work_mem (PostgreSQL) or sort_buffer_size (MySQL) controls memory for sorts and hash operations. Too low causes disk spills; too high risks memory exhaustion under concurrent queries. The sweet spot depends on query concurrency and available RAM.
At the core of query execution is expression evaluation—computing values from column references, constants, operators, and function calls.
Expression Types:
```c
/* Expression tree nodes */
typedef enum ExprType {
    E_CONST,     /* Constant value */
    E_VAR,       /* Column reference */
    E_OP,        /* Operator (binary or unary) */
    E_FUNC,      /* Function call */
    E_CASE,      /* CASE/WHEN expression */
    E_SUBQUERY,  /* Scalar subquery */
} ExprType;

typedef struct Expr {
    ExprType type;
    Oid      resultType;           /* Output data type */
    union {
        Datum constValue;          /* For E_CONST */
        int   varAttNo;            /* For E_VAR: attribute number */
        struct {
            OpCode       op;
            struct Expr *left;
            struct Expr *right;
        } opExpr;                  /* For E_OP */
        struct {
            FuncPtr function;
            List   *args;
        } funcExpr;                /* For E_FUNC */
        struct {
            List        *whenClauses;
            struct Expr *elseResult;
        } caseExpr;                /* For E_CASE */
    };
} Expr;

/* Recursive expression evaluator */
Datum ExecEvalExpr(Expr *expr, TupleTableSlot *slot) {
    switch (expr->type) {
        case E_CONST:
            return expr->constValue;

        case E_VAR:
            return slot_getattr(slot, expr->varAttNo);

        case E_OP: {
            Datum left  = ExecEvalExpr(expr->opExpr.left, slot);
            Datum right = ExecEvalExpr(expr->opExpr.right, slot);
            return ApplyOperator(expr->opExpr.op, left, right);
        }

        case E_FUNC: {
            Datum *args = palloc(list_length(expr->funcExpr.args) * sizeof(Datum));
            int i = 0;
            foreach(arg, expr->funcExpr.args) {
                args[i++] = ExecEvalExpr(arg, slot);
            }
            return expr->funcExpr.function(args);
        }

        case E_CASE:
            /* Evaluate WHEN clauses until one matches */
            foreach(when, expr->caseExpr.whenClauses) {
                if (DatumGetBool(ExecEvalExpr(when->condition, slot)))
                    return ExecEvalExpr(when->result, slot);
            }
            return ExecEvalExpr(expr->caseExpr.elseResult, slot);

        default:
            return (Datum) 0;  /* Unreachable for well-formed trees */
    }
}
```

Expression Compilation:
Modern databases optimize expression evaluation through:
- Constant folding: evaluating constant subexpressions once at plan time (sketched below)
- Caching: reusing compiled expression state across rows
- Vectorization: evaluating an expression over a batch of rows per call
- JIT compilation: translating expression trees to native code (PostgreSQL can do this via LLVM)
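As a concrete taste of the first technique, here is a constant-folding sketch over the Expr tree defined in the evaluator above. It shows the core idea only, not PostgreSQL's actual eval_const_expressions(): if both operands of an operator are constants, evaluate once at plan time and rewrite the node into an E_CONST.

```c
/* Fold constant subtrees bottom-up; assumes the Expr/Datum/ApplyOperator
 * definitions from the evaluator sketch above */
Expr *fold_constants(Expr *expr) {
    if (expr->type == E_OP) {
        expr->opExpr.left  = fold_constants(expr->opExpr.left);
        expr->opExpr.right = fold_constants(expr->opExpr.right);

        if (expr->opExpr.left->type == E_CONST &&
            expr->opExpr.right->type == E_CONST) {
            /* Evaluate once now so execution never revisits this subtree */
            Datum result = ApplyOperator(expr->opExpr.op,
                                         expr->opExpr.left->constValue,
                                         expr->opExpr.right->constValue);
            expr->type = E_CONST;
            expr->constValue = result;
        }
    }
    return expr;
}
```

A production folder must also skip volatile functions (e.g., random()), which cannot be evaluated ahead of execution.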
The final step is delivering results to the client. This involves formatting tuples and writing them to the network connection.
Result Delivery Modes:
- Materialize-then-send: compute the entire result set, then transmit it (simple, but memory-hungry)
- Streaming: send each row (or small batch) to the client as the executor produces it
- Cursor-based: the client explicitly fetches batches at its own pace, as shown below
```sql
-- Using cursors for large result streaming
BEGIN;

-- Declare cursor (doesn't execute yet)
DECLARE order_cursor CURSOR FOR
    SELECT order_id, customer_name, total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE order_date >= '2024-01-01';

-- Fetch results in batches
FETCH 100 FROM order_cursor;  -- Get first 100 rows
FETCH 100 FROM order_cursor;  -- Get next 100 rows
-- ... continue until no more rows

CLOSE order_cursor;
COMMIT;

-- Benefits:
-- - Memory: Only batch size in memory at once
-- - Control: Client processes at own pace
-- - Interruptible: Can stop partway through

-- Scrollable cursor (can move backward)
DECLARE scroll_cursor SCROLL CURSOR FOR
    SELECT * FROM products ORDER BY price;

FETCH LAST FROM scroll_cursor;        -- Go to last row
FETCH BACKWARD 10 FROM scroll_cursor; -- Previous 10 rows
FETCH FIRST FROM scroll_cursor;       -- Back to start
```

Wire Protocol Encoding:
Results are encoded according to the database's wire protocol:
| Database | Protocol | Format Options |
|---|---|---|
| PostgreSQL | Frontend/Backend protocol (v3) | Text or Binary |
| MySQL | MySQL Protocol | Text (default), Binary (prepared) |
| SQL Server | TDS | Various encodings |
Binary format is more compact and faster to parse, but text format is human-readable for debugging. Prepared statements typically use binary encoding.
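To make the encoding step concrete, here is a sketch of building one PostgreSQL DataRow ('D') message, following the documented v3 message layout: a type byte, an Int32 length that includes itself, an Int16 column count, then per column an Int32 value length (-1 for NULL) followed by the raw bytes. The function itself and its fixed-buffer handling are illustrative, not server code.

```c
#include <arpa/inet.h>   /* htonl, htons (network byte order) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

size_t encode_data_row(uint8_t *buf, int ncols,
                       const char **values, const int *lengths) {
    size_t pos = 0;
    buf[pos++] = 'D';                          /* Message type: DataRow */
    pos += 4;                                  /* Reserve the length field */

    uint16_t nc = htons((uint16_t)ncols);
    memcpy(buf + pos, &nc, 2); pos += 2;       /* Int16 column count */

    for (int i = 0; i < ncols; i++) {
        int32_t len = (int32_t)htonl(values[i] ? (uint32_t)lengths[i]
                                               : (uint32_t)-1);
        memcpy(buf + pos, &len, 4); pos += 4;  /* Int32 length, -1 = NULL */
        if (values[i]) {
            memcpy(buf + pos, values[i], (size_t)lengths[i]);
            pos += (size_t)lengths[i];
        }
    }

    /* Length counts everything after the type byte, itself included */
    uint32_t total = htonl((uint32_t)(pos - 1));
    memcpy(buf + 1, &total, 4);
    return pos;
}
```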
Fetching millions of rows without cursors can exhaust client memory. Applications should use cursors, LIMIT/OFFSET pagination, or keyset pagination for large datasets. Never SELECT * on a large table without limits in production code.
We've explored query execution comprehensively. Let's consolidate the key concepts:
- The iterator (Volcano) model: operators expose open()/next()/close(), and tuples are pulled one at a time from the root
- Pipelined vs. blocking operators, and why pipeline breakers delay the first row
- How scan, join, aggregation, and sort operators work at runtime, and when each is chosen
- Memory management: work memory limits, disk spills, and external sorting
- Result delivery: streaming, cursors, and wire-protocol encoding
Module Complete:
You've now mastered the complete query processing pipeline:
1. Query input and parsing
2. Translation into a logical plan
3. Optimization into a physical execution plan
4. Execution and result delivery
This knowledge forms the foundation for understanding database performance, tuning queries, and designing efficient database-backed applications.
Congratulations! You now understand the complete query processing pipeline—from the moment a SQL statement enters the database until results stream back to the client. This end-to-end understanding enables you to debug performance issues, write efficient queries, and appreciate the sophisticated engineering that powers every database system.