Query ProcessingQuery Representation

Query Representation

LevelIntermediate

Duration60 mins

TopicQuery Representation

1 / 5

Relational Algebra Trees

The Invisible Translator

When you write a SQL query, you're expressing what data you want—not how to retrieve it. This declarative nature is SQL's greatest strength, but it creates a fundamental challenge: the database engine must translate your high-level request into a concrete sequence of operations that can actually execute against stored data.

This translation is not trivial. A simple query like SELECT name FROM employees WHERE salary > 50000 AND department = 'Engineering' could be executed in dozens of ways, each with vastly different performance characteristics. Should we first filter by salary, then by department? Or vice versa? Should we use an index? Which one? The answers depend on data distribution, indexes available, memory constraints, and countless other factors.

Relational algebra trees are the database's answer to this challenge. They provide a structured, mathematically grounded representation that captures the logical intent of a query while enabling systematic optimization and execution planning.

What You Will Learn

By the end of this page, you will understand how databases transform SQL queries into relational algebra trees, the structure and semantics of these trees, the role of each node type, and why this representation is fundamental to query optimization. You'll gain insight into one of the most critical internal mechanisms of every relational database.

The Bridge Between SQL and Execution

To understand relational algebra trees, we must first appreciate the gap they bridge. SQL is a declarative query language—you specify the desired result, not the steps to achieve it. This is fundamentally different from procedural programming where you write explicit instructions.

Consider the conceptual gulf:

SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 75000
ORDER BY e.name;

This SQL query makes no mention of:

Which table to read first
Whether to use indexes or sequential scans
How to perform the join (nested loops? hash join? sort-merge?)
When to apply the filter condition
How to sort the results

Yet the database must make all these decisions—and must make them optimally. Relational algebra trees provide the intermediate representation that makes this possible.

Why Trees?

Trees are natural for representing queries because SQL operations compose hierarchically. A SELECT operates on the result of a JOIN, which operates on the results of table scans. This parent-child relationship maps directly to a tree structure, where each node produces output that flows upward to its parent.

The Query Processing Pipeline:

Relational algebra trees occupy a critical position in the query processing pipeline:

Parsing: SQL text → Parse Tree (syntactic structure)
Semantic Analysis: Parse Tree → Validated Query (name resolution, type checking)
Translation: Validated Query → Relational Algebra Tree (logical representation)
Optimization: Relational Algebra Tree → Optimized Logical Plan
Physical Planning: Logical Plan → Execution Plan (with concrete algorithms)
Execution: Execution Plan → Query Results

The relational algebra tree is where the query transitions from being a syntactic artifact to a semantic representation that captures exactly what operations must be performed, independent of how they'll be implemented.

SQL Constructs → Relational Algebra Operations
SQL Construct	Relational Algebra	Operation Symbol	Description
SELECT columns	Projection (π)	π	Extract specified attributes from tuples
WHERE condition	Selection (σ)	σ	Filter tuples based on predicate
FROM table	Relation	R	Base table or derived relation
JOIN ... ON	Join (⋈)	⋈	Combine tuples from two relations
UNION	Union (∪)	∪	Combine tuples from two relations
INTERSECT	Intersection (∩)	∩	Tuples present in both relations
EXCEPT	Difference (−)	−	Tuples in first but not second relation
DISTINCT	Duplicate Elimination (δ)	δ	Remove duplicate tuples
GROUP BY	Grouping (γ)	γ	Partition tuples and aggregate
ORDER BY	Sorting (τ)	τ	Order tuples by attribute(s)

Anatomy of a Relational Algebra Tree

A relational algebra tree is a directed acyclic graph (typically a tree, though some representations allow sharing) where:

Leaf nodes represent base relations (tables) or constants
Internal nodes represent relational algebra operations
Edges represent the flow of data from child to parent
The root produces the final query result

Data flows upward: Each node receives input from its child nodes, applies its operation, and passes the result to its parent. The root node's output is the query result.

Key Structural Properties:

Tree Structure Properties

•Arity varies by operation: Unary operations (σ, π, δ, τ) have one child. Binary operations (⋈, ∪, ∩, −) have two children. N-ary operations exist in some extensions.
•Order independence at conceptual level: Relational algebra is set-based; the order of tuples in intermediate results is undefined unless explicitly sorted.
•Schema propagation: Each node has an output schema determined by its operation and input schemas. Projection changes attributes; selection preserves them.
•Nullability and type tracking: Modern systems track null constraints and data types through the tree for optimization and error detection.

tree-structure.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Example Query:
SELECT e.name, e.salary
FROM employees e
WHERE e.department = 'Engineering' AND e.salary > 80000;
 
-- Conceptual Relational Algebra Expression:
-- π_name,salary(σ_department='Engineering' ∧ salary>80000(employees))
 
-- Tree Representation:
--
--           π (name, salary)           ← Root: Projection
--                 │
--           σ (department='Engineering'  ← Selection predicate
--              AND salary > 80000)
--                 │
--           employees                  ← Leaf: Base relation
--
-- Data flows from bottom (employees table) upward through
-- the selection filter, then through projection to produce
-- the final result with only name and salary columns.

Understanding the Example:

In this tree:

Leaf Node (employees): Represents scanning the employees table. Its output schema is all columns of the employees table.
Selection Node (σ): Receives all employee tuples, evaluates the compound predicate department='Engineering' AND salary > 80000, and outputs only tuples where this predicate is true. Output schema identical to input.
Projection Node (π): Receives filtered tuples, extracts only name and salary attributes, discards all others. Output schema contains only these two columns.

The final result flows out of the projection node—containing names and salaries of engineering employees earning over $80,000.

Why This Representation Matters

The tree structure makes transformation rules easy to apply. Optimizers manipulate trees using algebraic identities—pushing selections down, reordering joins, combining projections. The tree is both a representation and a workspace for optimization.

Constructing Trees from SQL

The transformation from SQL to relational algebra trees follows systematic rules. Understanding these rules illuminates how the database "understands" your queries and why certain SQL patterns produce certain tree shapes.

The Canonical Translation Algorithm:

For a SELECT-FROM-WHERE query, the basic translation produces:

π_<select list>(
  σ_<where condition>(
    <from clause translation>
  )
)

For the FROM clause:

A single table becomes a leaf node
Multiple tables become a chain of Cartesian products (×) or joins (⋈)
Explicit JOINs with ON conditions become join operators with those conditions

Let's trace a complex example:

complex-query-translation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- Complex SQL Query
SELECT c.customer_name, 
       SUM(o.order_total) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA' 
  AND o.order_date >= '2024-01-01'
GROUP BY c.customer_name
HAVING SUM(o.order_total) > 1000
ORDER BY total_spent DESC;
 
-- Step-by-Step Tree Construction:
 
-- 1. Base relations (leaves)
--    customers (c)
--    orders (o)
 
-- 2. Join operation
--    c ⋈_{c.customer_id = o.customer_id} o
 
-- 3. Selection for WHERE conditions
--    σ_{c.country='USA' ∧ o.order_date>='2024-01-01'}(join result)
 
-- 4. Grouping with aggregation
--    γ_{c.customer_name; SUM(o.order_total)→total_spent}(selection result)
 
-- 5. Having filter (selection on grouped result)
--    σ_{total_spent > 1000}(grouped result)
 
-- 6. Sort
--    τ_{total_spent DESC}(having result)
 
-- 7. Final projection (often implicit when all needed columns present)
--    π_{customer_name, total_spent}(sorted result)

Converting Mermaid diagram...

Critical Observations:

WHERE and HAVING differ fundamentally: WHERE filters rows before grouping (appears below γ), while HAVING filters groups after aggregation (appears above γ). This isn't just syntax—it's semantically different operations at different tree levels.
Join conditions vs. WHERE conditions: The join condition (c.customer_id = o.customer_id) is part of the join operator, while filter conditions appear in selection nodes. Some optimizers later push WHERE conditions into joins when beneficial.
ORDER BY comes last: Sorting produces the final ordered result, so it appears at or near the root.
The tree is canonical, not optimal: This is the initial translation. The optimizer will transform it extensively before execution.

Initial Translation ≠ Execution Order

The canonical tree represents logical operations, not execution order. During optimization, nodes may move, merge, or transform. A selection might push down past a join; projections might combine. The initial tree is a starting point, not a final plan.

Node Types in Depth

Each node type in a relational algebra tree has specific semantics, input requirements, and output characteristics. Understanding these deeply is essential for reasoning about query behavior and optimization opportunities.

Selection (σ) filters tuples based on a predicate. It's one of the most common and heavily optimized operations.

Formal Definition:

σ_p(R) = { t | t ∈ R ∧ p(t) = true }

Where p is a predicate (boolean expression) over attributes of R.

Selection Properties

•Arity: Unary (single child)
•Schema: Output schema = Input schema (no attribute changes)
•Cardinality: Output ≤ Input (filters never add tuples)
•Selectivity: Fraction of tuples passing the predicate (key to cost estimation)
•Commutative: Multiple selections can apply in any order
•Cascadable: σ_p∧q(R) = σ_p(σ_q(R)) = σ_q(σ_p(R))

Optimization Implication

Selections are prime candidates for 'pushing down'—moving them closer to leaf nodes. Filtering early reduces the size of intermediate results, speeding up all subsequent operations.

Schema and Type Propagation

Each node in a relational algebra tree has input and output schemas. The output schema of a child becomes the input schema of its parent. Understanding how schemas propagate is crucial for:

Query validation: Detecting references to non-existent columns
Type checking: Ensuring operations are type-compatible
Optimization: Knowing what columns are available at each point
Physical planning: Determining tuple layouts for each operation

Schema Transformation Rules:

Output Schema by Operation
Operation	Input Schema(s)	Output Schema
Base Relation R	—	Schema(R) from catalog
Selection σ_p(R)	Schema(R)	Schema(R) unchanged
Projection π_A(R)	Schema(R)	A (subset of attributes)
Join R ⋈ S	Schema(R), Schema(S)	Schema(R) ∪ Schema(S)*
Natural Join R ⋈ S	Schema(R), Schema(S)	Schema(R) ∪ Schema(S) minus duplicates
Union R ∪ S	Schema(R), Schema(S)	Schema(R) (must equal Schema(S))
Grouping γ_G;aggs(R)	Schema(R)	G ∪ {aggregate output columns}
Rename ρ_X(R)	Schema(R)	Renamed attributes

Type Propagation Example:

Consider the expression π_name, salary*1.1(employees):

employees has schema: {id: INT, name: VARCHAR(100), salary: DECIMAL(10,2), dept: INT}
Projection extracts name and computes salary * 1.1
Output schema: {name: VARCHAR(100), salary_adjusted: DECIMAL(12,3)}

Note how the type of the computed column (salary * 1.1) must be inferred. Multiplying DECIMAL(10,2) by a literal may widen precision. The query processor must:

Infer that 1.1 is numeric (DECIMAL or FLOAT)
Apply type promotion rules
Determine the result type
Optionally cast if the target column has a specific type

Null Handling in Schema

Modern systems also track nullability through the tree. An inner join might make previously nullable columns non-null (if they were join keys), while outer joins introduce new nullability. This information aids in optimization (some optimizations are invalid with nulls) and code generation.

Properties and Annotations

Production query processors attach extensive metadata to tree nodes beyond just the operation and schema. This metadata enables optimization and planning decisions.

Key Properties Tracked:

Node Properties and Annotations

•Cardinality estimates: Expected number of output tuples. Critical for cost estimation. Derived from statistics (histograms, NDV) and selectivity estimates.
•Unique keys: Columns or column combinations guaranteed unique. Enables optimization like redundant DISTINCT elimination.
•Functional dependencies: If A determines B (A → B), knowing A tells you B. Enables projection pushdown and simplification.
•Sort order: If output is sorted on certain columns. Enables avoiding redundant sorts; impacts merge joins.
•Partitioning: For parallel/distributed execution—which columns partition the data. Enables join colocation decisions.
•Nullability: Which columns may contain nulls. Affects join semantics and optimization validity.
•Correlation: Statistical correlation between columns. Improves cardinality estimation accuracy.

annotated-node-concept.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// Conceptual representation of an annotated tree node
interface RelationalNode {
    operation: OperationType;       // e.g., SELECTION, JOIN, PROJECTION
    children: RelationalNode[];      // Child nodes (0 for leaves, 1 for unary, 2 for binary)
    
    // Operation-specific parameters
    predicate?: Expression;          // For selection
    columns?: Column[];              // For projection
    joinCondition?: Expression;      // For join
    groupingColumns?: Column[];      // For aggregation
    aggregates?: Aggregate[];        // For aggregation
    
    // Schema information
    outputSchema: Schema;            // Column names and types produced
    
    // Statistical annotations
    estimatedCardinality: number;    // Rows expected
    estimatedCost: Cost;             // I/O, CPU costs
    
    // Logical properties
    uniqueKeys: ColumnSet[];         // Unique key constraints
    sortOrder: SortSpec[];           // If sorted, on what columns
    nullableColumns: Set<Column>;    // Which columns may be NULL
    
    // Derived properties
    accessedTables: Set<Table>;      // Tables this subtree reads
    correlations: ColumnPair[];      // Known column correlations
}

Why Annotations Matter:

These annotations drive optimizer decisions:

Cardinality tells the optimizer whether an operation is cheap (few tuples) or expensive (many tuples)
Sort order determines if a merge join is viable without a prior sort
Unique keys let the optimizer eliminate redundant DISTINCT operations
Partitioning in distributed systems determines if a shuffle is needed before a join

Without accurate annotations, the optimizer cannot make good decisions. This is why statistics maintenance is crucial for database performance.

Garbage In, Garbage Out

The optimizer is only as good as its cardinality estimates. Stale statistics, missing histograms, or correlated columns that appear independent can lead to catastrophically bad plans. Run ANALYZE/UPDATE STATISTICS regularly.

Real Database Implementations

Different database systems implement relational algebra trees with varying terminology and structures, but the fundamental concepts remain consistent. Understanding these implementations helps when reading EXPLAIN output and debugging query plans.

Terminology Across Database Systems
Concept	PostgreSQL	MySQL	SQL Server	Oracle
Relational Algebra Tree	Plan Tree / LogExpr	JOIN Tree	Logical Tree	Query Block / Expr
Selection	Filter / Quals	WHERE conditions	Filter	Filter
Projection	Target List	SELECT expressions	Output List	Projection
Table Access	SeqScan / IndexScan	Table access	Table Scan / Index Seek	Full Table Scan
Join	Join (Nested Loop/Hash/Merge)	JOIN (type)	Join (type)	Join (type)
View Plan	EXPLAIN	EXPLAIN	SET SHOWPLAN	EXPLAIN PLAN

explain-output-postgres.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- PostgreSQL EXPLAIN output reveals the tree structure
EXPLAIN (VERBOSE, COSTS)
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 75000;
 
/*
 Hash Join  (cost=1.09..2.23 rows=3 width=64)
   Output: e.name, d.dept_name
   Hash Cond: (e.dept_id = d.id)
   ->  Seq Scan on employees e  (cost=0.00..1.10 rows=3 width=68)
         Output: e.id, e.name, e.salary, e.dept_id
         Filter: (e.salary > 75000)
   ->  Hash  (cost=1.05..1.05 rows=5 width=36)
         Output: d.dept_name, d.id
         ->  Seq Scan on departments d  (cost=0.00..1.05 rows=5 width=36)
               Output: d.dept_name, d.id
*/
 
-- Note the tree structure:
-- - Root: Hash Join (with projection: e.name, d.dept_name)
--   - Left child: Seq Scan on employees (with Filter for salary > 75000)
--   - Right child: Hash (wrapping Seq Scan on departments)
-- 
-- Selection (salary > 75000) has been pushed to the scan.
-- Projection (Output columns) appears at each level.
-- Join condition is at the Hash Join node.

Key Observations from EXPLAIN:

Operations nest: The tree is visible in the indentation. Each indented block is a child of the parent above it.
Costs propagate upward: Parent costs include child costs. The root's cost is the total query cost.
Selection pushed to scan: The Filter: (e.salary > 75000) appears directly at the Seq Scan, not as a separate node. This is an optimization—the canonical tree had selection above the scan.
Physical decisions: "Hash Join" and "Seq Scan" are physical operators, not just logical algebra. This is the execution plan derived from the optimized logical tree.
Output columns listed: Each node shows what columns it produces, corresponding to projection.

Learning to Read EXPLAIN

EXPLAIN output is the window into the optimizer's decisions. Practice reading it regularly. Observe how different WHERE clauses affect the plan, how indexes change scan types, and how join orders shift based on table sizes. This skill is invaluable for performance optimization.

Summary and Next Steps

We've established a deep understanding of relational algebra trees—the fundamental representation that bridges declarative SQL and executable query plans. Let's consolidate the key insights:

Key Takeaways

•Relational algebra trees translate declarative SQL into a structured representation where nodes are operations and edges represent data flow.
•Leaf nodes represent base tables; internal nodes represent operations like selection, projection, join, and aggregation.
•Data flows upward: Children produce output consumed by parents; the root produces the final query result.
•Schema propagates through the tree, with each operation transforming the schema according to its semantics.
•Annotations (cardinality, sort order, keys) enable the optimizer to estimate costs and choose efficient plans.
•The canonical translation is just the starting point; optimizers extensively transform trees using algebraic rules.
•EXPLAIN output reveals the final tree structure, showing how your SQL became an executable plan.

What's Next:

Relational algebra trees focus on what operations to perform. The next page explores query graphs—an alternative representation that emphasizes the relationships between tables and conditions. Query graphs excel at join ordering and expose optimization opportunities that tree structures can obscure.

Page Complete

You now understand how databases internally represent queries as relational algebra trees. This knowledge is foundational for understanding query optimization, reading execution plans, and diagnosing performance issues. Next: Query Graphs and their role in join optimization.

1 / 5

Loading learning content...

Query ProcessingQuery Representation

Query Representation

LevelIntermediate

Duration60 mins

TopicQuery Representation

1 / 5

Relational Algebra Trees

The Invisible Translator

What You Will Learn

The Bridge Between SQL and Execution

Consider the conceptual gulf:

SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 75000
ORDER BY e.name;

This SQL query makes no mention of:

Which table to read first
Whether to use indexes or sequential scans
How to perform the join (nested loops? hash join? sort-merge?)
When to apply the filter condition
How to sort the results

Yet the database must make all these decisions—and must make them optimally. Relational algebra trees provide the intermediate representation that makes this possible.

Why Trees?

The Query Processing Pipeline:

Relational algebra trees occupy a critical position in the query processing pipeline:

Parsing: SQL text → Parse Tree (syntactic structure)
Semantic Analysis: Parse Tree → Validated Query (name resolution, type checking)
Translation: Validated Query → Relational Algebra Tree (logical representation)
Optimization: Relational Algebra Tree → Optimized Logical Plan
Physical Planning: Logical Plan → Execution Plan (with concrete algorithms)
Execution: Execution Plan → Query Results

SQL Constructs → Relational Algebra Operations
SQL Construct	Relational Algebra	Operation Symbol	Description
SELECT columns	Projection (π)	π	Extract specified attributes from tuples
WHERE condition	Selection (σ)	σ	Filter tuples based on predicate
FROM table	Relation	R	Base table or derived relation
JOIN ... ON	Join (⋈)	⋈	Combine tuples from two relations
UNION	Union (∪)	∪	Combine tuples from two relations
INTERSECT	Intersection (∩)	∩	Tuples present in both relations
EXCEPT	Difference (−)	−	Tuples in first but not second relation
DISTINCT	Duplicate Elimination (δ)	δ	Remove duplicate tuples
GROUP BY	Grouping (γ)	γ	Partition tuples and aggregate
ORDER BY	Sorting (τ)	τ	Order tuples by attribute(s)

Anatomy of a Relational Algebra Tree

A relational algebra tree is a directed acyclic graph (typically a tree, though some representations allow sharing) where:

Leaf nodes represent base relations (tables) or constants
Internal nodes represent relational algebra operations
Edges represent the flow of data from child to parent
The root produces the final query result

Data flows upward: Each node receives input from its child nodes, applies its operation, and passes the result to its parent. The root node's output is the query result.

Key Structural Properties:

Tree Structure Properties

•Arity varies by operation: Unary operations (σ, π, δ, τ) have one child. Binary operations (⋈, ∪, ∩, −) have two children. N-ary operations exist in some extensions.
•Order independence at conceptual level: Relational algebra is set-based; the order of tuples in intermediate results is undefined unless explicitly sorted.
•Schema propagation: Each node has an output schema determined by its operation and input schemas. Projection changes attributes; selection preserves them.
•Nullability and type tracking: Modern systems track null constraints and data types through the tree for optimization and error detection.

tree-structure.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Example Query:
SELECT e.name, e.salary
FROM employees e
WHERE e.department = 'Engineering' AND e.salary > 80000;
 
-- Conceptual Relational Algebra Expression:
-- π_name,salary(σ_department='Engineering' ∧ salary>80000(employees))
 
-- Tree Representation:
--
--           π (name, salary)           ← Root: Projection
--                 │
--           σ (department='Engineering'  ← Selection predicate
--              AND salary > 80000)
--                 │
--           employees                  ← Leaf: Base relation
--
-- Data flows from bottom (employees table) upward through
-- the selection filter, then through projection to produce
-- the final result with only name and salary columns.

Understanding the Example:

In this tree:

Leaf Node (employees): Represents scanning the employees table. Its output schema is all columns of the employees table.
Selection Node (σ): Receives all employee tuples, evaluates the compound predicate department='Engineering' AND salary > 80000, and outputs only tuples where this predicate is true. Output schema identical to input.
Projection Node (π): Receives filtered tuples, extracts only name and salary attributes, discards all others. Output schema contains only these two columns.

The final result flows out of the projection node—containing names and salaries of engineering employees earning over $80,000.

Why This Representation Matters

Constructing Trees from SQL

The Canonical Translation Algorithm:

For a SELECT-FROM-WHERE query, the basic translation produces:

π_<select list>(
  σ_<where condition>(
    <from clause translation>
  )
)

For the FROM clause:

A single table becomes a leaf node
Multiple tables become a chain of Cartesian products (×) or joins (⋈)
Explicit JOINs with ON conditions become join operators with those conditions

Let's trace a complex example:

complex-query-translation.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- Complex SQL Query
SELECT c.customer_name, 
       SUM(o.order_total) as total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA' 
  AND o.order_date >= '2024-01-01'
GROUP BY c.customer_name
HAVING SUM(o.order_total) > 1000
ORDER BY total_spent DESC;
 
-- Step-by-Step Tree Construction:
 
-- 1. Base relations (leaves)
--    customers (c)
--    orders (o)
 
-- 2. Join operation
--    c ⋈_{c.customer_id = o.customer_id} o
 
-- 3. Selection for WHERE conditions
--    σ_{c.country='USA' ∧ o.order_date>='2024-01-01'}(join result)
 
-- 4. Grouping with aggregation
--    γ_{c.customer_name; SUM(o.order_total)→total_spent}(selection result)
 
-- 5. Having filter (selection on grouped result)
--    σ_{total_spent > 1000}(grouped result)
 
-- 6. Sort
--    τ_{total_spent DESC}(having result)
 
-- 7. Final projection (often implicit when all needed columns present)
--    π_{customer_name, total_spent}(sorted result)

Converting Mermaid diagram...

Critical Observations:

WHERE and HAVING differ fundamentally: WHERE filters rows before grouping (appears below γ), while HAVING filters groups after aggregation (appears above γ). This isn't just syntax—it's semantically different operations at different tree levels.
Join conditions vs. WHERE conditions: The join condition (c.customer_id = o.customer_id) is part of the join operator, while filter conditions appear in selection nodes. Some optimizers later push WHERE conditions into joins when beneficial.
ORDER BY comes last: Sorting produces the final ordered result, so it appears at or near the root.
The tree is canonical, not optimal: This is the initial translation. The optimizer will transform it extensively before execution.

Initial Translation ≠ Execution Order

Node Types in Depth

Selection (σ) filters tuples based on a predicate. It's one of the most common and heavily optimized operations.

Formal Definition:

σ_p(R) = { t | t ∈ R ∧ p(t) = true }

Where p is a predicate (boolean expression) over attributes of R.

Selection Properties

•Arity: Unary (single child)
•Schema: Output schema = Input schema (no attribute changes)
•Cardinality: Output ≤ Input (filters never add tuples)
•Selectivity: Fraction of tuples passing the predicate (key to cost estimation)
•Commutative: Multiple selections can apply in any order
•Cascadable: σ_p∧q(R) = σ_p(σ_q(R)) = σ_q(σ_p(R))

Optimization Implication

Selections are prime candidates for 'pushing down'—moving them closer to leaf nodes. Filtering early reduces the size of intermediate results, speeding up all subsequent operations.

Schema and Type Propagation

Each node in a relational algebra tree has input and output schemas. The output schema of a child becomes the input schema of its parent. Understanding how schemas propagate is crucial for:

Query validation: Detecting references to non-existent columns
Type checking: Ensuring operations are type-compatible
Optimization: Knowing what columns are available at each point
Physical planning: Determining tuple layouts for each operation

Schema Transformation Rules:

Output Schema by Operation
Operation	Input Schema(s)	Output Schema
Base Relation R	—	Schema(R) from catalog
Selection σ_p(R)	Schema(R)	Schema(R) unchanged
Projection π_A(R)	Schema(R)	A (subset of attributes)
Join R ⋈ S	Schema(R), Schema(S)	Schema(R) ∪ Schema(S)*
Natural Join R ⋈ S	Schema(R), Schema(S)	Schema(R) ∪ Schema(S) minus duplicates
Union R ∪ S	Schema(R), Schema(S)	Schema(R) (must equal Schema(S))
Grouping γ_G;aggs(R)	Schema(R)	G ∪ {aggregate output columns}
Rename ρ_X(R)	Schema(R)	Renamed attributes

Type Propagation Example:

Consider the expression π_name, salary*1.1(employees):

employees has schema: {id: INT, name: VARCHAR(100), salary: DECIMAL(10,2), dept: INT}
Projection extracts name and computes salary * 1.1
Output schema: {name: VARCHAR(100), salary_adjusted: DECIMAL(12,3)}

Note how the type of the computed column (salary * 1.1) must be inferred. Multiplying DECIMAL(10,2) by a literal may widen precision. The query processor must:

Infer that 1.1 is numeric (DECIMAL or FLOAT)
Apply type promotion rules
Determine the result type
Optionally cast if the target column has a specific type

Null Handling in Schema

Properties and Annotations

Production query processors attach extensive metadata to tree nodes beyond just the operation and schema. This metadata enables optimization and planning decisions.

Key Properties Tracked:

Node Properties and Annotations

•Cardinality estimates: Expected number of output tuples. Critical for cost estimation. Derived from statistics (histograms, NDV) and selectivity estimates.
•Unique keys: Columns or column combinations guaranteed unique. Enables optimization like redundant DISTINCT elimination.
•Functional dependencies: If A determines B (A → B), knowing A tells you B. Enables projection pushdown and simplification.
•Sort order: If output is sorted on certain columns. Enables avoiding redundant sorts; impacts merge joins.
•Partitioning: For parallel/distributed execution—which columns partition the data. Enables join colocation decisions.
•Nullability: Which columns may contain nulls. Affects join semantics and optimization validity.
•Correlation: Statistical correlation between columns. Improves cardinality estimation accuracy.

annotated-node-concept.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// Conceptual representation of an annotated tree node
interface RelationalNode {
    operation: OperationType;       // e.g., SELECTION, JOIN, PROJECTION
    children: RelationalNode[];      // Child nodes (0 for leaves, 1 for unary, 2 for binary)
    
    // Operation-specific parameters
    predicate?: Expression;          // For selection
    columns?: Column[];              // For projection
    joinCondition?: Expression;      // For join
    groupingColumns?: Column[];      // For aggregation
    aggregates?: Aggregate[];        // For aggregation
    
    // Schema information
    outputSchema: Schema;            // Column names and types produced
    
    // Statistical annotations
    estimatedCardinality: number;    // Rows expected
    estimatedCost: Cost;             // I/O, CPU costs
    
    // Logical properties
    uniqueKeys: ColumnSet[];         // Unique key constraints
    sortOrder: SortSpec[];           // If sorted, on what columns
    nullableColumns: Set<Column>;    // Which columns may be NULL
    
    // Derived properties
    accessedTables: Set<Table>;      // Tables this subtree reads
    correlations: ColumnPair[];      // Known column correlations
}

Why Annotations Matter:

These annotations drive optimizer decisions:

Cardinality tells the optimizer whether an operation is cheap (few tuples) or expensive (many tuples)
Sort order determines if a merge join is viable without a prior sort
Unique keys let the optimizer eliminate redundant DISTINCT operations
Partitioning in distributed systems determines if a shuffle is needed before a join

Without accurate annotations, the optimizer cannot make good decisions. This is why statistics maintenance is crucial for database performance.

Garbage In, Garbage Out

Real Database Implementations

Terminology Across Database Systems
Concept	PostgreSQL	MySQL	SQL Server	Oracle
Relational Algebra Tree	Plan Tree / LogExpr	JOIN Tree	Logical Tree	Query Block / Expr
Selection	Filter / Quals	WHERE conditions	Filter	Filter
Projection	Target List	SELECT expressions	Output List	Projection
Table Access	SeqScan / IndexScan	Table access	Table Scan / Index Seek	Full Table Scan
Join	Join (Nested Loop/Hash/Merge)	JOIN (type)	Join (type)	Join (type)
View Plan	EXPLAIN	EXPLAIN	SET SHOWPLAN	EXPLAIN PLAN

explain-output-postgres.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- PostgreSQL EXPLAIN output reveals the tree structure
EXPLAIN (VERBOSE, COSTS)
SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 75000;
 
/*
 Hash Join  (cost=1.09..2.23 rows=3 width=64)
   Output: e.name, d.dept_name
   Hash Cond: (e.dept_id = d.id)
   ->  Seq Scan on employees e  (cost=0.00..1.10 rows=3 width=68)
         Output: e.id, e.name, e.salary, e.dept_id
         Filter: (e.salary > 75000)
   ->  Hash  (cost=1.05..1.05 rows=5 width=36)
         Output: d.dept_name, d.id
         ->  Seq Scan on departments d  (cost=0.00..1.05 rows=5 width=36)
               Output: d.dept_name, d.id
*/
 
-- Note the tree structure:
-- - Root: Hash Join (with projection: e.name, d.dept_name)
--   - Left child: Seq Scan on employees (with Filter for salary > 75000)
--   - Right child: Hash (wrapping Seq Scan on departments)
-- 
-- Selection (salary > 75000) has been pushed to the scan.
-- Projection (Output columns) appears at each level.
-- Join condition is at the Hash Join node.

Key Observations from EXPLAIN:

Operations nest: The tree is visible in the indentation. Each indented block is a child of the parent above it.
Costs propagate upward: Parent costs include child costs. The root's cost is the total query cost.
Selection pushed to scan: The Filter: (e.salary > 75000) appears directly at the Seq Scan, not as a separate node. This is an optimization—the canonical tree had selection above the scan.
Physical decisions: "Hash Join" and "Seq Scan" are physical operators, not just logical algebra. This is the execution plan derived from the optimized logical tree.
Output columns listed: Each node shows what columns it produces, corresponding to projection.

Learning to Read EXPLAIN

Summary and Next Steps

We've established a deep understanding of relational algebra trees—the fundamental representation that bridges declarative SQL and executable query plans. Let's consolidate the key insights:

Key Takeaways

•Relational algebra trees translate declarative SQL into a structured representation where nodes are operations and edges represent data flow.
•Leaf nodes represent base tables; internal nodes represent operations like selection, projection, join, and aggregation.
•Data flows upward: Children produce output consumed by parents; the root produces the final query result.
•Schema propagates through the tree, with each operation transforming the schema according to its semantics.
•Annotations (cardinality, sort order, keys) enable the optimizer to estimate costs and choose efficient plans.
•The canonical translation is just the starting point; optimizers extensively transform trees using algebraic rules.
•EXPLAIN output reveals the final tree structure, showing how your SQL became an executable plan.

What's Next:

Page Complete

1 / 5