Every query plan—whether a simple table lookup or a complex multi-table analytical query—is built from a small set of fundamental operations. These operator nodes are the atoms of query processing: indivisible, well-defined, and composable.
Understanding operator nodes at a deep level unlocks multiple capabilities: reading and interpreting EXPLAIN output, predicting where a plan's cost will concentrate, and reasoning about why the optimizer chose one strategy over another.
This page examines each major operator category: its semantics, its properties, and its role in query execution.
By the end of this page, you will have a deep understanding of each major operator type: access operators, filter and project operators, join operators, aggregation operators, and set/ordering operators. You'll understand their inputs, outputs, properties, costs, and common implementations.
Before examining individual operators, let's establish the common model that all operators share.
Operator Interface:
All operators conform to a common interface, typically called the iterator model or Volcano model:
open() → Initialize the operator
next() → Return the next tuple (or null if done)
close() → Release resources
This simple interface enables pipelining: operators can produce output as they receive input, without materializing complete intermediate results.
Operator Properties:
| Category | Operators | Typical Arity | Blocking? | Example SQL |
|---|---|---|---|---|
| Access | Table Scan, Index Scan | Nullary (0) | No (streams data) | FROM table |
| Filter | Select/Filter | Unary (1) | No | WHERE condition |
| Project | Project, Extend | Unary (1) | No | SELECT columns |
| Join | All join types | Binary (2) | Depends on algorithm | JOIN ... ON |
| Aggregate | Group By, Window | Unary (1) | Usually yes | GROUP BY / OVER |
| Set | Union, Intersect, Except | Binary (2) | Depends | UNION / INTERSECT |
| Order | Sort, Limit | Unary (1) | Sort: yes, Limit: no | ORDER BY / LIMIT |
The iterator model enables operators to be composed arbitrarily: a Filter's next() calls its child's next(), processes the tuple, and either returns it or calls again. This uniform interface makes query plans trees of composable operators with no special coordination code.
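To make this concrete, here is a minimal Python sketch of pull-based execution in the Volcano style. The class names (Scan, Filter) and the dict-per-tuple representation are illustrative, not any particular engine's API; real implementations add batching, expression compilation, and careful resource management.

class Scan:
    """Leaf operator: streams tuples from an in-memory table."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.it = iter(self.rows)
    def next(self): return next(self.it, None)   # None signals end of stream
    def close(self): self.it = None

class Filter:
    """Unary operator: passes through tuples satisfying a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self): self.child.open()
    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None
    def close(self): self.child.close()

# Compose a plan tree and pull results tuple by tuple:
plan = Filter(Scan([{"id": 1, "x": 5}, {"id": 2, "x": 50}]),
              lambda t: t["x"] > 10)
plan.open()
while (row := plan.next()) is not None:
    print(row)            # {'id': 2, 'x': 50}
plan.close()

Because Filter only ever holds one tuple at a time, no intermediate result is materialized; a Project on top would follow exactly the same pattern.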
Access operators are the leaf nodes of query plans—they interface with stored data and produce the initial tuple streams that flow through the plan.
Sequential (Table) Scan
The most basic access operator. Reads all rows from a table in physical storage order.
Behavior:
- Reads every page of the table in physical storage order; no index is required.
- Streams each tuple to its parent, optionally evaluating a pushed-down predicate as it reads.
- Reads every page regardless of selectivity, so cost does not depend on how many rows qualify.
Cost ≈ (pages in table) × (page read cost). Sequential reads are ~10x faster than random reads, making table scans efficient for non-selective queries even on large tables.
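As a rough worked example of that cost model (the page count, row width, and 10x ratio below are illustrative assumptions, ignoring caching and index traversal costs):

# When does a full scan beat an index scan under this simple model?
PAGES = 100_000            # pages in the table
ROWS_PER_PAGE = 50
SEQ_PAGE_COST = 1.0        # sequential page read
RAND_PAGE_COST = 10.0      # random read assumed ~10x more expensive

def table_scan_cost():
    return PAGES * SEQ_PAGE_COST                 # every page, read once, sequentially

def index_scan_cost(selectivity):
    matching = PAGES * ROWS_PER_PAGE * selectivity
    return matching * RAND_PAGE_COST             # ~one random page read per matching row

for sel in (0.001, 0.01, 0.1):
    print(f"{sel:>6}: scan={table_scan_cost():>10,.0f}  index={index_scan_cost(sel):>10,.0f}")
# At 0.001 selectivity the index wins (50,000 vs 100,000); at 0.1 the
# full scan is 50x cheaper, which is why table scans are efficient for
# non-selective queries even on large tables.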
Filter and project are the most fundamental tuple-manipulating operators—they don't change structure (no joins) but refine the data flowing through the plan.
Filter (Select) Operator
Purpose: Pass through only tuples satisfying a predicate.
Input: Stream of tuples
Output: Subset of input tuples where predicate = true
Schema Effect: Unchanged (same columns)
Implementation:
next():
while true:
tuple = child.next()
if tuple is null: return null
if predicate(tuple): return tuple
Properties:
- Non-blocking: tuples stream through one at a time.
- Output cardinality = input cardinality × predicate selectivity.
- Preserves the order of its input.
- Leaves the schema unchanged.
Project Operator
Purpose: Transform tuples to new schema with computed expressions.
Input: Stream of tuples
Output: Tuples with selected/computed columns
Schema Effect: Changes to output column list
Implementation:
next():
tuple = child.next()
if tuple is null: return null
return [expr.eval(tuple) for expr in outputList]
Properties:
- Non-blocking: exactly one output tuple per input tuple.
- Cardinality is unchanged (duplicate elimination is a separate Distinct operator).
- Changes the schema to the output expression list.
- Usually cheap, unless expressions involve costly functions or UDFs.
-- Example query demonstrating filter and project
SELECT name,                           -- Simple column reference
       salary * 1.1 AS new_salary,     -- Computed expression
       UPPER(department) AS dept       -- Function application
FROM employees
WHERE status = 'active'                -- Simple equality
  AND salary > 50000                   -- Range comparison
  AND department IN ('Eng', 'PM');     -- Set membership

-- Plan structure (simplified):
--
-- Project [name, salary*1.1, UPPER(department)]
-- |
-- Filter [status='active' AND salary>50000 AND department IN ('Eng','PM')]
-- |
-- TableScan [employees]

-- Pushed variant (optimized):
--
-- Project [name, salary*1.1, UPPER(department)]
-- |
-- TableScan [employees]
--   with pushed predicate: status='active' AND salary>50000 AND department IN ('Eng','PM')

-- Index-utilizing variant:
--
-- Project [...]
-- |
-- IndexScan [employees, idx_status_salary]
--   with remaining filter: department IN ('Eng','PM')

In optimized plans, filter predicates often disappear as separate nodes—they're 'pushed into' scan operators. The scan evaluates the predicate while reading, avoiding the cost of materializing and discarding tuples. Look for 'Filter:' annotations on Scan nodes in EXPLAIN output.
Joins are the most performance-critical operators in most queries. They combine tuples from two inputs based on a condition, and their cost can vary by orders of magnitude based on algorithm choice and input characteristics.
| Join Type | Behavior | Null Handling | SQL Syntax |
|---|---|---|---|
| Inner Join | Only matching rows from both sides | No unmatched rows | JOIN or INNER JOIN |
| Left Outer | All left rows; matched right or NULLs | Left preserved | LEFT JOIN |
| Right Outer | All right rows; matched left or NULLs | Right preserved | RIGHT JOIN |
| Full Outer | All rows from both; NULLs where no match | Both preserved | FULL JOIN |
| Cross Join | All combinations (Cartesian product) | N/A | CROSS JOIN |
| Semi Join | Left rows with at least one right match | Left only in output | WHERE EXISTS |
| Anti Join | Left rows with NO right match | Left only in output | WHERE NOT EXISTS |
Nested Loop Join
The conceptually simplest join: for each outer row, scan the inner input for matches.
for each row r in outer:
for each row s in inner:
if join_condition(r, s):
emit (r, s)
Variants:
- Naive: rescan the entire inner input for every outer row.
- Block Nested Loop: buffer a block of outer rows and scan the inner input once per block, cutting rescans.
- Index Nested Loop: probe an index on the inner side instead of scanning it (see the sketch below).
Nested loop's outer side is iterated fully; the inner side is scanned repeatedly. Put the smaller or more selective input on the outer side. With an index on the inner side, cost is O(outer × log(inner)) rather than O(outer × inner).
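The naive and index variants as runnable Python sketches. A dictionary stands in for the inner-side index here; the row format and function names are illustrative:

def nested_loop_join(outer, inner, on):
    # Naive variant: O(|outer| * |inner|) comparisons.
    for r in outer:
        for s in inner:
            if r[on] == s[on]:
                yield {**r, **s}

def index_nested_loop_join(outer, inner, on):
    # Index variant: probe a lookup structure on the inner side,
    # O(|outer| * lookup) instead of rescanning the inner input per row.
    index = {}
    for s in inner:
        index.setdefault(s[on], []).append(s)
    for r in outer:
        for s in index.get(r[on], []):
            yield {**r, **s}

employees = [{"name": "Ada", "dept_id": 1}, {"name": "Bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "Eng"}]
print(list(index_nested_loop_join(employees, departments, "dept_id")))
# [{'name': 'Ada', 'dept_id': 1, 'dept': 'Eng'}] -- inner join drops Bob (no match)

Building the dictionary up front is essentially what a hash join does; a true index nested loop instead assumes a persistent index already exists on the inner table.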
Aggregation operators partition input rows into groups and compute summary values over each group. They're essential for analytical queries and reporting.
Hash Aggregate
Uses hash table to maintain per-group state.
hash_table = {}
for each row:
key = (grouping_columns)
if key not in hash_table:
hash_table[key] = new_aggregate_state()
update_aggregate(hash_table[key], row)
for each (key, state) in hash_table:
emit (key, finalize(state))
Properties:
- Blocking: produces no output until all input has been consumed.
- Requires no particular input order.
- Memory grows with the number of distinct groups; very large group counts may force spilling to disk.
- Output order is unspecified.
Stream (Sorted) Aggregate
Exploits sorted input to process groups sequentially.
current_key = null
current_state = null
for each row:
key = (grouping_columns)
if key != current_key:
if current_key != null:
emit (current_key, finalize(current_state))
current_key = key
current_state = new_aggregate_state()
update_aggregate(current_state, row)
emit (current_key, finalize(current_state))
Properties:
- Requires input sorted (or at least clustered) on the grouping columns.
- Emits each group as soon as it completes, so results begin streaming early.
- Constant memory: only one group's state is held at a time.
- Output is ordered by the grouping key.
-- Query with aggregation
SELECT department,
       COUNT(*) AS emp_count,
       AVG(salary) AS avg_salary,
       MAX(hire_date) AS latest_hire
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5;

-- Plan options:

-- Option 1: Hash Aggregate (no useful index)
--
-- Filter [COUNT(*) > 5]
-- |
-- HashAggregate [department; COUNT(*), AVG(salary), MAX(hire_date)]
-- |
-- Filter [status='active']
-- |
-- TableScan [employees]

-- Option 2: Sorted Aggregate (with index on department)
--
-- Filter [COUNT(*) > 5]
-- |
-- StreamAggregate [department; COUNT(*), AVG(salary), MAX(hire_date)]
-- |
-- Filter [status='active']
-- |
-- IndexScan [employees, idx_department]   -- produces rows sorted by department

-- The HAVING clause becomes a post-aggregation filter

For parallel queries, aggregation often splits into partial (local) and final (global) phases. Each worker computes partial aggregates; the results merge in the final step. This works for decomposable aggregates (SUM, COUNT) but not all of them (MEDIAN requires all the data).
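A sketch of that partial/final split for the query above, using (count, sum) as the shippable per-group state so AVG can be finalized correctly (merging averages of averages would be wrong). The worker split shown is illustrative:

def partial_agg(rows):
    # Each worker keeps per-group partial state: (count, sum).
    state = {}
    for r in rows:
        c, s = state.get(r["department"], (0, 0))
        state[r["department"]] = (c + 1, s + r["salary"])
    return state

def final_agg(partials):
    # The final phase merges partial states, then finalizes the aggregates.
    merged = {}
    for p in partials:
        for dept, (c, s) in p.items():
            mc, ms = merged.get(dept, (0, 0))
            merged[dept] = (mc + c, ms + s)
    return {d: {"count": c, "avg": s / c} for d, (c, s) in merged.items()}

worker1 = partial_agg([{"department": "Eng", "salary": 100},
                       {"department": "PM", "salary": 80}])
worker2 = partial_agg([{"department": "Eng", "salary": 120}])
print(final_agg([worker1, worker2]))
# {'Eng': {'count': 2, 'avg': 110.0}, 'PM': {'count': 1, 'avg': 80.0}}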
Set operators combine results from multiple queries; order operators control tuple ordering and quantity.
| Operator | SQL | Semantics | Duplicate Handling |
|---|---|---|---|
| Union | UNION / UNION ALL | Combine rows from both inputs | UNION removes dups, UNION ALL keeps all |
| Intersect | INTERSECT | Rows present in both inputs | Removes duplicates by default |
| Except | EXCEPT / MINUS | Rows in first but not second | Removes duplicates by default |
| Distinct | SELECT DISTINCT | Remove duplicate rows | Keeps one copy of each unique row |
Set Operation Implementations: UNION ALL is a simple concatenation of its inputs. UNION, INTERSECT, and EXCEPT must compare rows, so they are typically implemented by hashing one input and probing with the other, or by sorting both inputs and merging.
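Minimal hash-based sketches of the comparing operations, with SQL's default duplicate-eliminating semantics (Python tuples stand in for rows; set output order may vary):

def union(left, right):
    return set(left) | set(right)                      # combine, removing duplicates

def intersect(left, right):
    build = set(right)                                 # hash the (ideally smaller) input...
    return {row for row in left if row in build}       # ...probe with the other

def except_(left, right):
    build = set(right)
    return {row for row in left if row not in build}   # keep rows with no match

a = [("Ada",), ("Bob",), ("Bob",)]
b = [("Bob",), ("Cy",)]
print(union(a, b))      # {('Ada',), ('Bob',), ('Cy',)}
print(intersect(a, b))  # {('Bob',)}
print(except_(a, b))    # {('Ada',)}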
Order Operators: Sort is blocking and costs O(N log N); when data exceeds memory it falls back to an external merge sort. Limit is non-blocking: it simply stops pulling from its child after N rows. Combined, they enable the Top-N optimization shown below.
-- Top-N query
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;

-- Naive plan: sort all rows, take the first 10
--
-- Limit 10                   -- returns first 10
-- |
-- Sort [salary DESC]         -- sorts all N employees

-- Optimized plan: heap-based Top-10
--
-- TopN [salary DESC, N=10]   -- maintains heap of 10 largest
-- |
-- TableScan [employees]
--
-- Cost: O(N log 10) instead of O(N log N)

-- Pagination query
SELECT * FROM products
ORDER BY created_at DESC
LIMIT 20 OFFSET 1000;

-- This is inefficient! Must identify and skip 1000 rows.
-- Better: use keyset pagination with WHERE created_at < @last_seen_value

OFFSET requires computing and discarding rows. For large offsets, this is expensive. Prefer keyset/cursor pagination: remember the last value seen and use WHERE to start from there. Much more efficient for deep pagination.
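Returning to the heap-based TopN plan above, a sketch with Python's heapq: a min-heap of size N holds the N largest rows seen so far, giving O(rows × log N) work and O(N) memory (the row format is illustrative):

import heapq

def top_n(rows, key, n):
    heap = []                                  # min-heap: smallest of the top-n at heap[0]
    for i, row in enumerate(rows):
        item = (key(row), i, row)              # i breaks ties so dict rows never compare
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)      # evict the current smallest
    return [row for _, _, row in sorted(heap, reverse=True)]

employees = [{"name": f"e{i}", "salary": s} for i, s in enumerate([50, 90, 70, 120, 60])]
print(top_n(employees, key=lambda r: r["salary"], n=2))
# [{'name': 'e3', 'salary': 120}, {'name': 'e1', 'salary': 90}]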
Window functions compute values over partitions of rows while preserving individual row identity—unlike aggregates that collapse groups into single rows.
Window Function Anatomy:
FUNCTION(args) OVER (
PARTITION BY partition_columns -- Groups for separate calculation
ORDER BY order_columns -- Order within partition
frame_clause -- Which rows in partition to consider
)
-- Complex window function query
SELECT employee_id,
       department,
       salary,
       -- Row number within department
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
       -- Running total of salary within department
       SUM(salary) OVER (PARTITION BY department ORDER BY salary
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
       -- Compare to department average
       salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg,
       -- Previous salary in department (by salary order)
       LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS prev_salary
FROM employees;

-- Execution plan typically:
-- 1. Sort by (department, salary)
-- 2. WindowAgg operator computes all window functions in a single pass
--    - Maintains state per partition
--    - Tracks frame boundaries
--    - Computes all compatible window functions together
--
-- Multiple OVER clauses with the same PARTITION BY/ORDER BY share one sort
-- Different PARTITION BY clauses may require multiple sorts

Group window functions with identical PARTITION BY and ORDER BY—they share the same sort and compute in one pass. Different ordering requirements cause separate sort operations. Index order can eliminate some window sorts.
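A sketch of the single-pass idea behind a WindowAgg-style operator: with input pre-sorted by (department, salary), ROW_NUMBER and a running SUM are computed together, resetting state at each partition boundary. Both functions use ascending salary order here for simplicity; this is illustrative, not a real engine's operator:

from itertools import groupby

def window_pass(rows):
    # Assumes rows are already sorted by (department, salary).
    out = []
    for dept, partition in groupby(rows, key=lambda r: r["department"]):
        rank, running = 0, 0                   # per-partition state, reset at boundary
        for r in partition:
            rank += 1                          # ROW_NUMBER() over this ordering
            running += r["salary"]             # frame: UNBOUNDED PRECEDING..CURRENT ROW
            out.append({**r, "rank": rank, "running_total": running})
    return out

rows = sorted([{"department": "Eng", "salary": 120},
               {"department": "Eng", "salary": 100},
               {"department": "PM", "salary": 80}],
              key=lambda r: (r["department"], r["salary"]))
for r in window_pass(rows):
    print(r)
# Eng: ranks 1-2 with running totals 100 then 220; PM restarts at rank 1.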
Operator nodes are the building blocks of all query plans. Understanding each operator's behavior, cost characteristics, and implementation options enables you to read execution plans and reason about query performance.
What's Next:
With individual operators understood, we'll examine plan transformations—the systematic rules optimizers use to rewrite plans into more efficient forms. Understanding transformations reveals how optimizers explore the space of equivalent plans to find efficient executions.
You now have a deep understanding of the operator nodes that compose query plans. This knowledge enables you to read and interpret EXPLAIN output, understand cost drivers, and reason about query performance. Next: Plan Transformations.