Loading learning content...
When you write a SQL query, you're expressing what data you want—not how to retrieve it. This declarative nature is SQL's greatest strength, but it creates a fundamental challenge: the database engine must translate your high-level request into a concrete sequence of operations that can actually execute against stored data.
This translation is not trivial. A simple query like SELECT name FROM employees WHERE salary > 50000 AND department = 'Engineering' could be executed in dozens of ways, each with vastly different performance characteristics. Should we first filter by salary, then by department? Or vice versa? Should we use an index? Which one? The answers depend on data distribution, indexes available, memory constraints, and countless other factors.
Relational algebra trees are the database's answer to this challenge. They provide a structured, mathematically grounded representation that captures the logical intent of a query while enabling systematic optimization and execution planning.
By the end of this page, you will understand how databases transform SQL queries into relational algebra trees, the structure and semantics of these trees, the role of each node type, and why this representation is fundamental to query optimization. You'll gain insight into one of the most critical internal mechanisms of every relational database.
To understand relational algebra trees, we must first appreciate the gap they bridge. SQL is a declarative query language—you specify the desired result, not the steps to achieve it. This is fundamentally different from procedural programming where you write explicit instructions.
Consider the conceptual gulf:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 75000
ORDER BY e.name;
This SQL query makes no mention of:
Yet the database must make all these decisions—and must make them optimally. Relational algebra trees provide the intermediate representation that makes this possible.
Trees are natural for representing queries because SQL operations compose hierarchically. A SELECT operates on the result of a JOIN, which operates on the results of table scans. This parent-child relationship maps directly to a tree structure, where each node produces output that flows upward to its parent.
The Query Processing Pipeline:
Relational algebra trees occupy a critical position in the query processing pipeline:
The relational algebra tree is where the query transitions from being a syntactic artifact to a semantic representation that captures exactly what operations must be performed, independent of how they'll be implemented.
| SQL Construct | Relational Algebra | Operation Symbol | Description |
|---|---|---|---|
| SELECT columns | Projection (π) | π | Extract specified attributes from tuples |
| WHERE condition | Selection (σ) | σ | Filter tuples based on predicate |
| FROM table | Relation | R | Base table or derived relation |
| JOIN ... ON | Join (⋈) | ⋈ | Combine tuples from two relations |
| UNION | Union (∪) | ∪ | Combine tuples from two relations |
| INTERSECT | Intersection (∩) | ∩ | Tuples present in both relations |
| EXCEPT | Difference (−) | − | Tuples in first but not second relation |
| DISTINCT | Duplicate Elimination (δ) | δ | Remove duplicate tuples |
| GROUP BY | Grouping (γ) | γ | Partition tuples and aggregate |
| ORDER BY | Sorting (τ) | τ | Order tuples by attribute(s) |
A relational algebra tree is a directed acyclic graph (typically a tree, though some representations allow sharing) where:
Data flows upward: Each node receives input from its child nodes, applies its operation, and passes the result to its parent. The root node's output is the query result.
Key Structural Properties:
1234567891011121314151617181920
-- Example Query:SELECT e.name, e.salaryFROM employees eWHERE e.department = 'Engineering' AND e.salary > 80000; -- Conceptual Relational Algebra Expression:-- π_name,salary(σ_department='Engineering' ∧ salary>80000(employees)) -- Tree Representation:---- π (name, salary) ← Root: Projection-- │-- σ (department='Engineering' ← Selection predicate-- AND salary > 80000)-- │-- employees ← Leaf: Base relation---- Data flows from bottom (employees table) upward through-- the selection filter, then through projection to produce-- the final result with only name and salary columns.Understanding the Example:
In this tree:
Leaf Node (employees): Represents scanning the employees table. Its output schema is all columns of the employees table.
Selection Node (σ): Receives all employee tuples, evaluates the compound predicate department='Engineering' AND salary > 80000, and outputs only tuples where this predicate is true. Output schema identical to input.
Projection Node (π): Receives filtered tuples, extracts only name and salary attributes, discards all others. Output schema contains only these two columns.
The final result flows out of the projection node—containing names and salaries of engineering employees earning over $80,000.
The tree structure makes transformation rules easy to apply. Optimizers manipulate trees using algebraic identities—pushing selections down, reordering joins, combining projections. The tree is both a representation and a workspace for optimization.
The transformation from SQL to relational algebra trees follows systematic rules. Understanding these rules illuminates how the database "understands" your queries and why certain SQL patterns produce certain tree shapes.
The Canonical Translation Algorithm:
For a SELECT-FROM-WHERE query, the basic translation produces:
π_<select list>(
σ_<where condition>(
<from clause translation>
)
)
For the FROM clause:
Let's trace a complex example:
12345678910111213141516171819202122232425262728293031323334
-- Complex SQL QuerySELECT c.customer_name, SUM(o.order_total) as total_spentFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE c.country = 'USA' AND o.order_date >= '2024-01-01'GROUP BY c.customer_nameHAVING SUM(o.order_total) > 1000ORDER BY total_spent DESC; -- Step-by-Step Tree Construction: -- 1. Base relations (leaves)-- customers (c)-- orders (o) -- 2. Join operation-- c ⋈_{c.customer_id = o.customer_id} o -- 3. Selection for WHERE conditions-- σ_{c.country='USA' ∧ o.order_date>='2024-01-01'}(join result) -- 4. Grouping with aggregation-- γ_{c.customer_name; SUM(o.order_total)→total_spent}(selection result) -- 5. Having filter (selection on grouped result)-- σ_{total_spent > 1000}(grouped result) -- 6. Sort-- τ_{total_spent DESC}(having result) -- 7. Final projection (often implicit when all needed columns present)-- π_{customer_name, total_spent}(sorted result)Critical Observations:
WHERE and HAVING differ fundamentally: WHERE filters rows before grouping (appears below γ), while HAVING filters groups after aggregation (appears above γ). This isn't just syntax—it's semantically different operations at different tree levels.
Join conditions vs. WHERE conditions: The join condition (c.customer_id = o.customer_id) is part of the join operator, while filter conditions appear in selection nodes. Some optimizers later push WHERE conditions into joins when beneficial.
ORDER BY comes last: Sorting produces the final ordered result, so it appears at or near the root.
The tree is canonical, not optimal: This is the initial translation. The optimizer will transform it extensively before execution.
The canonical tree represents logical operations, not execution order. During optimization, nodes may move, merge, or transform. A selection might push down past a join; projections might combine. The initial tree is a starting point, not a final plan.
Each node type in a relational algebra tree has specific semantics, input requirements, and output characteristics. Understanding these deeply is essential for reasoning about query behavior and optimization opportunities.
Selection (σ) filters tuples based on a predicate. It's one of the most common and heavily optimized operations.
Formal Definition:
σ_p(R) = { t | t ∈ R ∧ p(t) = true }
Where p is a predicate (boolean expression) over attributes of R.
Selections are prime candidates for 'pushing down'—moving them closer to leaf nodes. Filtering early reduces the size of intermediate results, speeding up all subsequent operations.
Each node in a relational algebra tree has input and output schemas. The output schema of a child becomes the input schema of its parent. Understanding how schemas propagate is crucial for:
Schema Transformation Rules:
| Operation | Input Schema(s) | Output Schema |
|---|---|---|
| Base Relation R | — | Schema(R) from catalog |
| Selection σ_p(R) | Schema(R) | Schema(R) unchanged |
| Projection π_A(R) | Schema(R) | A (subset of attributes) |
| Join R ⋈ S | Schema(R), Schema(S) | Schema(R) ∪ Schema(S)* |
| Natural Join R ⋈ S | Schema(R), Schema(S) | Schema(R) ∪ Schema(S) minus duplicates |
| Union R ∪ S | Schema(R), Schema(S) | Schema(R) (must equal Schema(S)) |
| Grouping γ_G;aggs(R) | Schema(R) | G ∪ {aggregate output columns} |
| Rename ρ_X(R) | Schema(R) | Renamed attributes |
Type Propagation Example:
Consider the expression π_name, salary*1.1(employees):
employees has schema: {id: INT, name: VARCHAR(100), salary: DECIMAL(10,2), dept: INT}
Projection extracts name and computes salary * 1.1
Output schema: {name: VARCHAR(100), salary_adjusted: DECIMAL(12,3)}
Note how the type of the computed column (salary * 1.1) must be inferred. Multiplying DECIMAL(10,2) by a literal may widen precision. The query processor must:
Modern systems also track nullability through the tree. An inner join might make previously nullable columns non-null (if they were join keys), while outer joins introduce new nullability. This information aids in optimization (some optimizations are invalid with nulls) and code generation.
Production query processors attach extensive metadata to tree nodes beyond just the operation and schema. This metadata enables optimization and planning decisions.
Key Properties Tracked:
12345678910111213141516171819202122232425262728
// Conceptual representation of an annotated tree nodeinterface RelationalNode { operation: OperationType; // e.g., SELECTION, JOIN, PROJECTION children: RelationalNode[]; // Child nodes (0 for leaves, 1 for unary, 2 for binary) // Operation-specific parameters predicate?: Expression; // For selection columns?: Column[]; // For projection joinCondition?: Expression; // For join groupingColumns?: Column[]; // For aggregation aggregates?: Aggregate[]; // For aggregation // Schema information outputSchema: Schema; // Column names and types produced // Statistical annotations estimatedCardinality: number; // Rows expected estimatedCost: Cost; // I/O, CPU costs // Logical properties uniqueKeys: ColumnSet[]; // Unique key constraints sortOrder: SortSpec[]; // If sorted, on what columns nullableColumns: Set<Column>; // Which columns may be NULL // Derived properties accessedTables: Set<Table>; // Tables this subtree reads correlations: ColumnPair[]; // Known column correlations}Why Annotations Matter:
These annotations drive optimizer decisions:
Without accurate annotations, the optimizer cannot make good decisions. This is why statistics maintenance is crucial for database performance.
The optimizer is only as good as its cardinality estimates. Stale statistics, missing histograms, or correlated columns that appear independent can lead to catastrophically bad plans. Run ANALYZE/UPDATE STATISTICS regularly.
Different database systems implement relational algebra trees with varying terminology and structures, but the fundamental concepts remain consistent. Understanding these implementations helps when reading EXPLAIN output and debugging query plans.
| Concept | PostgreSQL | MySQL | SQL Server | Oracle |
|---|---|---|---|---|
| Relational Algebra Tree | Plan Tree / LogExpr | JOIN Tree | Logical Tree | Query Block / Expr |
| Selection | Filter / Quals | WHERE conditions | Filter | Filter |
| Projection | Target List | SELECT expressions | Output List | Projection |
| Table Access | SeqScan / IndexScan | Table access | Table Scan / Index Seek | Full Table Scan |
| Join | Join (Nested Loop/Hash/Merge) | JOIN (type) | Join (type) | Join (type) |
| View Plan | EXPLAIN | EXPLAIN | SET SHOWPLAN | EXPLAIN PLAN |
12345678910111213141516171819202122232425262728
-- PostgreSQL EXPLAIN output reveals the tree structureEXPLAIN (VERBOSE, COSTS)SELECT e.name, d.dept_nameFROM employees eJOIN departments d ON e.dept_id = d.idWHERE e.salary > 75000; /* Hash Join (cost=1.09..2.23 rows=3 width=64) Output: e.name, d.dept_name Hash Cond: (e.dept_id = d.id) -> Seq Scan on employees e (cost=0.00..1.10 rows=3 width=68) Output: e.id, e.name, e.salary, e.dept_id Filter: (e.salary > 75000) -> Hash (cost=1.05..1.05 rows=5 width=36) Output: d.dept_name, d.id -> Seq Scan on departments d (cost=0.00..1.05 rows=5 width=36) Output: d.dept_name, d.id*/ -- Note the tree structure:-- - Root: Hash Join (with projection: e.name, d.dept_name)-- - Left child: Seq Scan on employees (with Filter for salary > 75000)-- - Right child: Hash (wrapping Seq Scan on departments)-- -- Selection (salary > 75000) has been pushed to the scan.-- Projection (Output columns) appears at each level.-- Join condition is at the Hash Join node.Key Observations from EXPLAIN:
Operations nest: The tree is visible in the indentation. Each indented block is a child of the parent above it.
Costs propagate upward: Parent costs include child costs. The root's cost is the total query cost.
Selection pushed to scan: The Filter: (e.salary > 75000) appears directly at the Seq Scan, not as a separate node. This is an optimization—the canonical tree had selection above the scan.
Physical decisions: "Hash Join" and "Seq Scan" are physical operators, not just logical algebra. This is the execution plan derived from the optimized logical tree.
Output columns listed: Each node shows what columns it produces, corresponding to projection.
EXPLAIN output is the window into the optimizer's decisions. Practice reading it regularly. Observe how different WHERE clauses affect the plan, how indexes change scan types, and how join orders shift based on table sizes. This skill is invaluable for performance optimization.
We've established a deep understanding of relational algebra trees—the fundamental representation that bridges declarative SQL and executable query plans. Let's consolidate the key insights:
What's Next:
Relational algebra trees focus on what operations to perform. The next page explores query graphs—an alternative representation that emphasizes the relationships between tables and conditions. Query graphs excel at join ordering and expose optimization opportunities that tree structures can obscure.
You now understand how databases internally represent queries as relational algebra trees. This knowledge is foundational for understanding query optimization, reading execution plans, and diagnosing performance issues. Next: Query Graphs and their role in join optimization.