Every query plan—whether a simple table lookup or a complex multi-table analytical query—is built from a small set of fundamental operations. These operator nodes are the atoms of query processing: indivisible, well-defined, and composable.
Understanding operator nodes at a deep level unlocks multiple capabilities: reading and interpreting EXPLAIN output, predicting where a plan's cost will concentrate, and reasoning about why the optimizer chose one strategy over another.
This page examines each major operator category: its semantics, its properties, and its role in query execution.
By the end of this page, you will have a deep understanding of each major operator type: access operators, filter and project operators, join operators, aggregation operators, and set/ordering operators. You'll understand their inputs, outputs, properties, costs, and common implementations.
Before examining individual operators, let's establish the common model that all operators share.
Operator Interface:
All operators conform to a common interface, typically called the iterator model or Volcano model:
open() → Initialize the operator
next() → Return the next tuple (or null if done)
close() → Release resources
This simple interface enables pipelining: operators can produce output as they receive input, without materializing complete intermediate results.
Operator Properties:
| Category | Operators | Typical Arity | Blocking? | Example SQL |
|---|---|---|---|---|
| Access | Table Scan, Index Scan | Nullary (0) | No (streams data) | FROM table |
| Filter | Select/Filter | Unary (1) | No | WHERE condition |
| Project | Project, Extend | Unary (1) | No | SELECT columns |
| Join | All join types | Binary (2) | Depends on algorithm | JOIN ... ON |
| Aggregate | Group By, Window | Unary (1) | Usually yes | GROUP BY / OVER |
| Set | Union, Intersect, Except | Binary (2) | Depends | UNION / INTERSECT |
| Order | Sort, Limit | Unary (1) | Sort: yes, Limit: no | ORDER BY / LIMIT |
The iterator model enables operators to be composed arbitrarily: a Filter's next() calls its child's next(), processes the tuple, and either returns it or calls again. This uniform interface makes query plans trees of composable operators with no special coordination code.
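To make this concrete, here is a minimal Python sketch of pull-based execution in the Volcano style. The class names (Scan, Filter) and the dict-per-tuple representation are illustrative, not any particular engine's API; real implementations add batching, expression compilation, and careful resource management.

class Scan:
    """Leaf operator: streams tuples from an in-memory table."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.it = iter(self.rows)
    def next(self): return next(self.it, None)   # None signals end of stream
    def close(self): self.it = None

class Filter:
    """Unary operator: passes through tuples satisfying a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self): self.child.open()
    def next(self):
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None
    def close(self): self.child.close()

# Compose a plan tree and pull results tuple by tuple:
plan = Filter(Scan([{"id": 1, "x": 5}, {"id": 2, "x": 50}]),
              lambda t: t["x"] > 10)
plan.open()
while (row := plan.next()) is not None:
    print(row)            # {'id': 2, 'x': 50}
plan.close()

Because Filter only ever holds one tuple at a time, no intermediate result is materialized; a Project on top would follow exactly the same pattern.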
Access operators are the leaf nodes of query plans—they interface with stored data and produce the initial tuple streams that flow through the plan.
Sequential (Table) Scan
The most basic access operator. Reads all rows from a table in physical storage order.
Behavior:
- Reads every page of the table in physical storage order; no index is required.
- Streams each tuple to its parent, optionally evaluating a pushed-down predicate as it reads.
- Reads every page regardless of selectivity, so cost does not depend on how many rows qualify.
Cost ≈ (pages in table) × (page read cost). Sequential reads are ~10x faster than random reads, making table scans efficient for non-selective queries even on large tables.
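As a rough worked example of that cost model (the page count, row width, and 10x ratio below are illustrative assumptions, ignoring caching and index traversal costs):

# When does a full scan beat an index scan under this simple model?
PAGES = 100_000            # pages in the table
ROWS_PER_PAGE = 50
SEQ_PAGE_COST = 1.0        # sequential page read
RAND_PAGE_COST = 10.0      # random read assumed ~10x more expensive

def table_scan_cost():
    return PAGES * SEQ_PAGE_COST                 # every page, read once, sequentially

def index_scan_cost(selectivity):
    matching = PAGES * ROWS_PER_PAGE * selectivity
    return matching * RAND_PAGE_COST             # ~one random page read per matching row

for sel in (0.001, 0.01, 0.1):
    print(f"{sel:>6}: scan={table_scan_cost():>10,.0f}  index={index_scan_cost(sel):>10,.0f}")
# At 0.001 selectivity the index wins (50,000 vs 100,000); at 0.1 the
# full scan is 50x cheaper, which is why table scans are efficient for
# non-selective queries even on large tables.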
Filter and project are the most fundamental tuple-manipulating operators—they don't change structure (no joins) but refine the data flowing through the plan.
Filter (Select) Operator
Purpose: Pass through only tuples satisfying a predicate.
Input: Stream of tuples
Output: Subset of input tuples where predicate = true
Schema Effect: Unchanged (same columns)
Implementation:
next():
while true:
tuple = child.next()
if tuple is null: return null
if predicate(tuple): return tuple
Properties:
- Non-blocking: tuples stream through one at a time.
- Output cardinality = input cardinality × predicate selectivity.
- Preserves the order of its input.
- Leaves the schema unchanged.
Project Operator
Purpose: Transform tuples to new schema with computed expressions.
Input: Stream of tuples
Output: Tuples with selected/computed columns
Schema Effect: Changes to output column list
Implementation:
next():
tuple = child.next()
if tuple is null: return null
return [expr.eval(tuple) for expr in outputList]
Properties:
- Non-blocking: exactly one output tuple per input tuple.
- Cardinality is unchanged (duplicate elimination is a separate Distinct operator).
- Changes the schema to the output expression list.
- Usually cheap, unless expressions involve costly functions or UDFs.
-- Example query demonstrating filter and project
SELECT name,                           -- Simple column reference
       salary * 1.1 AS new_salary,     -- Computed expression
       UPPER(department) AS dept       -- Function application
FROM employees
WHERE status = 'active'                -- Simple equality
  AND salary > 50000                   -- Range comparison
  AND department IN ('Eng', 'PM');     -- Set membership

-- Plan structure (simplified):
--
-- Project [name, salary*1.1, UPPER(department)]
-- |
-- Filter [status='active' AND salary>50000 AND department IN ('Eng','PM')]
-- |
-- TableScan [employees]

-- Pushed variant (optimized):
--
-- Project [name, salary*1.1, UPPER(department)]
-- |
-- TableScan [employees]
--   with pushed predicate: status='active' AND salary>50000 AND department IN ('Eng','PM')

-- Index-utilizing variant:
--
-- Project [...]
-- |
-- IndexScan [employees, idx_status_salary]
--   with remaining filter: department IN ('Eng','PM')

In optimized plans, filter predicates often disappear as separate nodes—they're 'pushed into' scan operators. The scan evaluates the predicate while reading, avoiding the cost of materializing and discarding tuples. Look for 'Filter:' annotations on Scan nodes in EXPLAIN output.
Joins are the most performance-critical operators in most queries. They combine tuples from two inputs based on a condition, and their cost can vary by orders of magnitude based on algorithm choice and input characteristics.
| Join Type | Behavior | Null Handling | SQL Syntax |
|---|---|---|---|
| Inner Join | Only matching rows from both sides | No unmatched rows | JOIN or INNER JOIN |
| Left Outer | All left rows; matched right or NULLs | Left preserved | LEFT JOIN |
| Right Outer | All right rows; matched left or NULLs | Right preserved | RIGHT JOIN |
| Full Outer | All rows from both; NULLs where no match | Both preserved | FULL JOIN |
| Cross Join | All combinations (Cartesian product) | N/A | CROSS JOIN |
| Semi Join | Left rows with at least one right match | Left only in output | WHERE EXISTS |
| Anti Join | Left rows with NO right match | Left only in output | WHERE NOT EXISTS |
Nested Loop Join
The conceptually simplest join: for each outer row, scan the inner input for matches.
for each row r in outer:
for each row s in inner:
if join_condition(r, s):
emit (r, s)
Variants:
- Naive: rescan the entire inner input for every outer row.
- Block Nested Loop: buffer a block of outer rows and scan the inner input once per block, cutting rescans.
- Index Nested Loop: probe an index on the inner side instead of scanning it (see the sketch below).
Nested loop's outer side is iterated fully; the inner side is scanned repeatedly. Put the smaller or more selective input on the outer side. With an index on the inner side, cost is O(outer × log(inner)) rather than O(outer × inner).
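The naive and index variants as runnable Python sketches. A dictionary stands in for the inner-side index here; the row format and function names are illustrative:

def nested_loop_join(outer, inner, on):
    # Naive variant: O(|outer| * |inner|) comparisons.
    for r in outer:
        for s in inner:
            if r[on] == s[on]:
                yield {**r, **s}

def index_nested_loop_join(outer, inner, on):
    # Index variant: probe a lookup structure on the inner side,
    # O(|outer| * lookup) instead of rescanning the inner input per row.
    index = {}
    for s in inner:
        index.setdefault(s[on], []).append(s)
    for r in outer:
        for s in index.get(r[on], []):
            yield {**r, **s}

employees = [{"name": "Ada", "dept_id": 1}, {"name": "Bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "Eng"}]
print(list(index_nested_loop_join(employees, departments, "dept_id")))
# [{'name': 'Ada', 'dept_id': 1, 'dept': 'Eng'}] -- inner join drops Bob (no match)

Building the dictionary up front is essentially what a hash join does; a true index nested loop instead assumes a persistent index already exists on the inner table.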
Aggregation operators partition input rows into groups and compute summary values over each group. They're essential for analytical queries and reporting.
Hash Aggregate
Uses hash table to maintain per-group state.
hash_table = {}
for each row:
key = (grouping_columns)
if key not in hash_table:
hash_table[key] = new_aggregate_state()
update_aggregate(hash_table[key], row)
for each (key, state) in hash_table:
emit (key, finalize(state))
Properties:
- Blocking: produces no output until all input has been consumed.
- Requires no particular input order.
- Memory grows with the number of distinct groups; very large group counts may force spilling to disk.
- Output order is unspecified.
Stream (Sorted) Aggregate
Exploits sorted input to process groups sequentially.
current_key = null
current_state = null
for each row:
key = (grouping_columns)
if key != current_key:
if current_key != null:
emit (current_key, finalize(current_state))
current_key = key
current_state = new_aggregate_state()
update_aggregate(current_state, row)
emit (current_key, finalize(current_state))
Properties:
- Requires input sorted (or at least clustered) on the grouping columns.
- Emits each group as soon as it completes, so results begin streaming early.
- Constant memory: only one group's state is held at a time.
- Output is ordered by the grouping key.
-- Query with aggregation
SELECT department,
       COUNT(*) AS emp_count,
       AVG(salary) AS avg_salary,
       MAX(hire_date) AS latest_hire
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5;

-- Plan options:

-- Option 1: Hash Aggregate (no useful index)
--
-- Filter [COUNT(*) > 5]
-- |
-- HashAggregate [department; COUNT(*), AVG(salary), MAX(hire_date)]
-- |
-- Filter [status='active']
-- |
-- TableScan [employees]

-- Option 2: Sorted Aggregate (with index on department)
--
-- Filter [COUNT(*) > 5]
-- |
-- StreamAggregate [department; COUNT(*), AVG(salary), MAX(hire_date)]
-- |
-- Filter [status='active']
-- |
-- IndexScan [employees, idx_department]   -- produces rows sorted by department

-- The HAVING clause becomes a post-aggregation filter

For parallel queries, aggregation often splits into partial (local) and final (global) phases. Each worker computes partial aggregates; the results merge in the final step. This works for decomposable aggregates (SUM, COUNT) but not all of them (MEDIAN requires all the data).
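A sketch of that partial/final split for the query above, using (count, sum) as the shippable per-group state so AVG can be finalized correctly (merging averages of averages would be wrong). The worker split shown is illustrative:

def partial_agg(rows):
    # Each worker keeps per-group partial state: (count, sum).
    state = {}
    for r in rows:
        c, s = state.get(r["department"], (0, 0))
        state[r["department"]] = (c + 1, s + r["salary"])
    return state

def final_agg(partials):
    # The final phase merges partial states, then finalizes the aggregates.
    merged = {}
    for p in partials:
        for dept, (c, s) in p.items():
            mc, ms = merged.get(dept, (0, 0))
            merged[dept] = (mc + c, ms + s)
    return {d: {"count": c, "avg": s / c} for d, (c, s) in merged.items()}

worker1 = partial_agg([{"department": "Eng", "salary": 100},
                       {"department": "PM", "salary": 80}])
worker2 = partial_agg([{"department": "Eng", "salary": 120}])
print(final_agg([worker1, worker2]))
# {'Eng': {'count': 2, 'avg': 110.0}, 'PM': {'count': 1, 'avg': 80.0}}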
Set operators combine results from multiple queries; order operators control tuple ordering and quantity.
| Operator | SQL | Semantics | Duplicate Handling |
|---|---|---|---|
| Union | UNION / UNION ALL | Combine rows from both inputs | UNION removes dups, UNION ALL keeps all |
| Intersect | INTERSECT | Rows present in both inputs | Removes duplicates by default |
| Except | EXCEPT / MINUS | Rows in first but not second | Removes duplicates by default |
| Distinct | SELECT DISTINCT | Remove duplicate rows | Keeps one copy of each unique row |
Set Operation Implementations: UNION ALL is a simple concatenation of its inputs. UNION, INTERSECT, and EXCEPT must compare rows, so they are typically implemented by hashing one input and probing with the other, or by sorting both inputs and merging.
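Minimal hash-based sketches of the comparing operations, with SQL's default duplicate-eliminating semantics (Python tuples stand in for rows; set output order may vary):

def union(left, right):
    return set(left) | set(right)                      # combine, removing duplicates

def intersect(left, right):
    build = set(right)                                 # hash the (ideally smaller) input...
    return {row for row in left if row in build}       # ...probe with the other

def except_(left, right):
    build = set(right)
    return {row for row in left if row not in build}   # keep rows with no match

a = [("Ada",), ("Bob",), ("Bob",)]
b = [("Bob",), ("Cy",)]
print(union(a, b))      # {('Ada',), ('Bob',), ('Cy',)}
print(intersect(a, b))  # {('Bob',)}
print(except_(a, b))    # {('Ada',)}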
Order Operators: Sort is blocking and costs O(N log N); when data exceeds memory it falls back to an external merge sort. Limit is non-blocking: it simply stops pulling from its child after N rows. Combined, they enable the Top-N optimization shown below.
-- Top-N query
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;

-- Naive plan: sort all rows, take the first 10
--
-- Limit 10                   -- returns first 10
-- |
-- Sort [salary DESC]         -- sorts all N employees

-- Optimized plan: heap-based Top-10
--
-- TopN [salary DESC, N=10]   -- maintains heap of 10 largest
-- |
-- TableScan [employees]
--
-- Cost: O(N log 10) instead of O(N log N)

-- Pagination query
SELECT * FROM products
ORDER BY created_at DESC
LIMIT 20 OFFSET 1000;

-- This is inefficient! Must identify and skip 1000 rows.
-- Better: use keyset pagination with WHERE created_at < @last_seen_value

OFFSET requires computing and discarding rows. For large offsets, this is expensive. Prefer keyset/cursor pagination: remember the last value seen and use WHERE to start from there. Much more efficient for deep pagination.
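Returning to the heap-based TopN plan above, a sketch with Python's heapq: a min-heap of size N holds the N largest rows seen so far, giving O(rows × log N) work and O(N) memory (the row format is illustrative):

import heapq

def top_n(rows, key, n):
    heap = []                                  # min-heap: smallest of the top-n at heap[0]
    for i, row in enumerate(rows):
        item = (key(row), i, row)              # i breaks ties so dict rows never compare
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)      # evict the current smallest
    return [row for _, _, row in sorted(heap, reverse=True)]

employees = [{"name": f"e{i}", "salary": s} for i, s in enumerate([50, 90, 70, 120, 60])]
print(top_n(employees, key=lambda r: r["salary"], n=2))
# [{'name': 'e3', 'salary': 120}, {'name': 'e1', 'salary': 90}]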
Window functions compute values over partitions of rows while preserving individual row identity—unlike aggregates that collapse groups into single rows.
Window Function Anatomy:
FUNCTION(args) OVER (
PARTITION BY partition_columns -- Groups for separate calculation
ORDER BY order_columns -- Order within partition
frame_clause -- Which rows in partition to consider
)
-- Complex window function query
SELECT employee_id,
       department,
       salary,
       -- Row number within department
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
       -- Running total of salary within department
       SUM(salary) OVER (PARTITION BY department ORDER BY salary
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
       -- Compare to department average
       salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg,
       -- Previous salary in department (by salary order)
       LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS prev_salary
FROM employees;

-- Execution plan typically:
-- 1. Sort by (department, salary)
-- 2. WindowAgg operator computes all window functions in a single pass
--    - Maintains state per partition
--    - Tracks frame boundaries
--    - Computes all compatible window functions together
--
-- Multiple OVER clauses with the same PARTITION BY/ORDER BY share one sort
-- Different PARTITION BY clauses may require multiple sorts

Group window functions with identical PARTITION BY and ORDER BY—they share the same sort and compute in one pass. Different ordering requirements cause separate sort operations. Index order can eliminate some window sorts.
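A sketch of the single-pass idea behind a WindowAgg-style operator: with input pre-sorted by (department, salary), ROW_NUMBER and a running SUM are computed together, resetting state at each partition boundary. Both functions use ascending salary order here for simplicity; this is illustrative, not a real engine's operator:

from itertools import groupby

def window_pass(rows):
    # Assumes rows are already sorted by (department, salary).
    out = []
    for dept, partition in groupby(rows, key=lambda r: r["department"]):
        rank, running = 0, 0                   # per-partition state, reset at boundary
        for r in partition:
            rank += 1                          # ROW_NUMBER() over this ordering
            running += r["salary"]             # frame: UNBOUNDED PRECEDING..CURRENT ROW
            out.append({**r, "rank": rank, "running_total": running})
    return out

rows = sorted([{"department": "Eng", "salary": 120},
               {"department": "Eng", "salary": 100},
               {"department": "PM", "salary": 80}],
              key=lambda r: (r["department"], r["salary"]))
for r in window_pass(rows):
    print(r)
# Eng: ranks 1-2 with running totals 100 then 220; PM restarts at rank 1.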
Operator nodes are the building blocks of all query plans. Understanding each operator's behavior, cost characteristics, and implementation options enables you to read execution plans and reason about query performance.
What's Next:
With individual operators understood, we'll examine plan transformations—the systematic rules optimizers use to rewrite plans into more efficient forms. Understanding transformations reveals how optimizers explore the space of equivalent plans to find efficient executions.
You now have a deep understanding of the operator nodes that compose query plans. This knowledge enables you to read and interpret EXPLAIN output, understand cost drivers, and reason about query performance. Next: Plan Transformations.