Relational Algebra Overview

Level: Intermediate

Duration: 60 mins

Topic: Relational Algebra Overview

Expression Trees

The Shape of a Query

Every query you write has a hidden structure—a shape that determines how it will be executed, optimized, and understood by the database engine. This structure is the expression tree, a hierarchical representation that captures the complete logic of a relational algebra expression.

When you write SQL, the database parses your text, validates the semantics, and translates everything into an expression tree. This tree becomes the central representation that the query optimizer manipulates, the execution engine interprets, and debuggers display. Understanding expression trees means understanding how databases actually process queries.

On this page, we'll explore expression trees in depth: their structure, how they're built, how they're evaluated, and how they're transformed during optimization. By the end, you'll be able to visualize any query as a tree and understand why certain tree shapes lead to better performance.

What You Will Learn

By the end of this page, you will understand: the formal structure of relational algebra expression trees; how expressions map to tree representations; bottom-up and top-down tree evaluation strategies; how tree transformations enable query optimization; and practical skills for analyzing query plans.

Tree Structure Fundamentals

An expression tree is a hierarchical data structure where:

Nodes represent operations or data sources
Edges represent data flow from producer to consumer
The root represents the final query result
Leaves represent base relations (tables)

Node Types:

Leaf Nodes (Operands):

Represent base relations: Employees, Departments, Orders
Have no children
Produce data by reading from storage
Schema comes from table definition

Internal Nodes (Operators):

Represent relational algebra operations: σ, π, ⋈, ∪, etc.
Have one child (unary) or two children (binary)
Consume input relations from children
Produce output relation for parent
Schema determined by operation and input schemas

The Root Node:

Special internal node at the top
Produces the final query result
Output goes to the client, not another operator

Edge Direction:

By convention, edges point FROM child TO parent, representing data flow upward. A join node has two incoming edges (from its two input relations) and one outgoing edge (to its parent or the result).

Tree Properties:

Height: Number of edges from root to deepest leaf; indicates query complexity and nesting depth
Width: Maximum number of nodes at any level; indicates parallelism potential
Size: Total number of nodes; correlates with query complexity
Balance: Whether subtrees have similar heights; affects execution strategies
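
The four properties above can be computed with short traversals. The sketch below is illustrative only: the `ExprNode` shape and the function names are assumptions made for this example, and "balanced" is taken to mean that sibling subtree heights differ by at most one.

```typescript
// Hypothetical minimal node shape: a label plus an ordered list of children.
interface ExprNode {
  label: string;
  children: ExprNode[];
}

// Height: number of edges on the longest path from this node down to a leaf.
function height(node: ExprNode): number {
  let tallestChild = -1;
  for (const child of node.children) {
    tallestChild = Math.max(tallestChild, height(child));
  }
  return tallestChild + 1; // a leaf has no children, so its height is 0
}

// Size: total number of nodes in the tree.
function size(node: ExprNode): number {
  let total = 1;
  for (const child of node.children) total += size(child);
  return total;
}

// Width: the largest number of nodes on any single level (breadth-first scan).
function width(root: ExprNode): number {
  let level: ExprNode[] = [root];
  let widest = 0;
  while (level.length > 0) {
    widest = Math.max(widest, level.length);
    const nextLevel: ExprNode[] = [];
    for (const node of level) nextLevel.push(...node.children);
    level = nextLevel;
  }
  return widest;
}

// Balance (one possible definition): sibling subtree heights differ by at most 1.
function isBalanced(node: ExprNode): boolean {
  const heights = node.children.map(child => height(child));
  const spread = heights.length > 0 ? Math.max(...heights) - Math.min(...heights) : 0;
  return spread <= 1 && node.children.every(child => isBalanced(child));
}

// π(σ(Employees ⋈ Departments)): two unary operators stacked on one join.
const example: ExprNode = {
  label: 'π name',
  children: [{
    label: 'σ salary > 50000',
    children: [{
      label: '⋈ dept_id = id',
      children: [
        { label: 'Employees', children: [] },
        { label: 'Departments', children: [] },
      ],
    }],
  }],
};

console.log(height(example), size(example), width(example), isBalanced(example));
// prints: 3 5 2 true
```

The chain of unary operators contributes height but no width, which is why this tree is tall and narrow.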

Formal Definition:

An expression tree T is a tuple (N, E, root, label) where:

N is a finite set of nodes
E ⊆ N × N is a set of directed edges
root ∈ N is the designated root node
label: N → (RelName ∪ OpName) assigns each node a relation name or operator
The graph (N, E) forms a tree rooted at root
Every leaf is labeled with a relation name
Every internal node is labeled with an operator
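
The conditions in this definition can be checked mechanically. The following is a minimal sketch, assuming nodes are identified by strings and each edge is stored as a `[child, parent]` pair (matching the child-to-parent edge convention above). The `ExprTree` type and `validate` function are names invented for this example.

```typescript
type NodeId = string;
type Label = { kind: 'relation' | 'operator'; name: string };

// Hypothetical encoding of the tuple (N, E, root, label).
interface ExprTree {
  nodes: Set<NodeId>;             // N
  edges: Array<[NodeId, NodeId]>; // E, each pair is [child, parent]
  root: NodeId;
  label: Map<NodeId, Label>;
}

// Returns a list of violated conditions; an empty list means the tuple is a valid tree.
function validate(tree: ExprTree): string[] {
  const errors: string[] = [];
  const parents = new Map<NodeId, number>();
  const children = new Map<NodeId, NodeId[]>();
  tree.nodes.forEach(n => { parents.set(n, 0); children.set(n, []); });

  if (!tree.nodes.has(tree.root)) errors.push('root is not a member of N');

  for (const [child, parent] of tree.edges) {
    const siblings = children.get(parent);
    if (siblings === undefined || !tree.nodes.has(child)) {
      errors.push(`edge ${child} -> ${parent} uses a node outside N`);
      continue;
    }
    siblings.push(child);
    parents.set(child, (parents.get(child) ?? 0) + 1);
  }

  // Tree shape: the root has no parent, every other node has exactly one.
  tree.nodes.forEach(n => {
    const expected = n === tree.root ? 0 : 1;
    const actual = parents.get(n) ?? 0;
    if (actual !== expected) errors.push(`${n} has ${actual} parent(s), expected ${expected}`);
  });

  // Every node must be reachable from the root (this also rules out detached cycles).
  const reached = new Set<NodeId>();
  const stack: NodeId[] = tree.nodes.has(tree.root) ? [tree.root] : [];
  while (stack.length > 0) {
    const n = stack.pop();
    if (n === undefined || reached.has(n)) continue;
    reached.add(n);
    for (const c of children.get(n) ?? []) stack.push(c);
  }
  if (reached.size !== tree.nodes.size) errors.push('not every node is reachable from root');

  // Labeling: leaves carry relation names, internal nodes carry operators.
  tree.nodes.forEach(n => {
    const isLeaf = (children.get(n) ?? []).length === 0;
    const wanted = isLeaf ? 'relation' : 'operator';
    if (tree.label.get(n)?.kind !== wanted) errors.push(`${n} should be labeled with a ${wanted} name`);
  });
  return errors;
}

const valid: ExprTree = {
  nodes: new Set<NodeId>(['p', 's', 'j', 'e', 'd']),
  edges: [['e', 'j'], ['d', 'j'], ['j', 's'], ['s', 'p']],
  root: 'p',
  label: new Map<NodeId, Label>([
    ['p', { kind: 'operator', name: 'π' }],
    ['s', { kind: 'operator', name: 'σ' }],
    ['j', { kind: 'operator', name: '⋈' }],
    ['e', { kind: 'relation', name: 'Employees' }],
    ['d', { kind: 'relation', name: 'Departments' }],
  ]),
};

// Giving Employees a second parent turns the tree into a DAG, which validate rejects.
const shared: ExprTree = { ...valid, edges: [...valid.edges, ['e', 'p']] };

console.log(validate(valid).length, validate(shared).length > 0);
```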

Trees vs DAGs

Pure expression trees allow each subexpression to have exactly one parent. In practice, common subexpressions might be shared (e.g., the same subquery used twice), creating a DAG (Directed Acyclic Graph). Query optimizers often handle both, with DAGs enabling common subexpression elimination.

Building Expression Trees from Algebra

Translating a relational algebra expression into a tree is straightforward—the nested structure of the expression directly maps to the tree structure.

The Recursive Construction:

Tree(R) where R is a base relation:
  Create leaf node labeled R
  
Tree(Op(E)) where Op is a unary operator:
  Create internal node labeled Op
  Add Tree(E) as its child
  
Tree(E1 Op E2) where Op is a binary operator:
  Create internal node labeled Op
  Add Tree(E1) as left child
  Add Tree(E2) as right child

Example Construction:

For the expression:

π_name(σ_{salary>50000}(Employees ⋈_{dept_id=id} Departments))

Step-by-step:

Start from the outermost operator: π_name
Its child is σ_{salary>50000}(...)
Selection's child is Employees ⋈ Departments
Join's children are Employees and Departments (leaves)

Building bottom-up:

Step 1: Create leaf nodes
  node_E = Leaf(Employees)
  node_D = Leaf(Departments)
  
Step 2: Create join node
  node_J = Internal(⋈, condition=dept_id=id)
  node_J.left = node_E
  node_J.right = node_D
  
Step 3: Create selection node
  node_S = Internal(σ, predicate=salary>50000)
  node_S.child = node_J
  
Step 4: Create projection node (root)
  node_P = Internal(π, columns=[name])
  node_P.child = node_S
  
Result: Tree with root = node_P

Operator Arity and Tree Structure
Operator	Symbol	Arity	Tree Structure
Selection	σ	Unary	One child
Projection	π	Unary	One child
Rename	ρ	Unary	One child
Aggregation	𝒢	Unary	One child
Union	∪	Binary	Two children
Intersection	∩	Binary	Two children
Difference	−	Binary	Two children
Cartesian Product	×	Binary	Two children
Join	⋈	Binary	Two children
Division	÷	Binary	Two children

expression_tree.ts
TypeScript
// Simplified expression tree representation
 
type OperatorType = 'SELECT' | 'PROJECT' | 'JOIN' | 'UNION' | 'PRODUCT' | 'RENAME';
 
interface TreeNode {
  type: 'leaf' | 'unary' | 'binary';
  operator?: OperatorType;
  relation?: string;  // For leaf nodes
  condition?: string; // For selection, join
  columns?: string[]; // For projection
  children: TreeNode[];
}
 
// Constructing the tree for:
// π_name(σ_salary>50000(Employees ⋈ Departments))
 
const tree: TreeNode = {
  type: 'unary',
  operator: 'PROJECT',
  columns: ['name'],
  children: [{
    type: 'unary',
    operator: 'SELECT',
    condition: 'salary > 50000',
    children: [{
      type: 'binary',
      operator: 'JOIN',
      condition: 'dept_id = id',
      children: [
        { type: 'leaf', relation: 'Employees', children: [] },
        { type: 'leaf', relation: 'Departments', children: [] }
      ]
    }]
  }]
};
 
// Tree traversal for display: prints one node per line, indented by depth
function printTree(node: TreeNode, indent: number = 0): void {
  const prefix = '  '.repeat(indent);
  if (node.type === 'leaf') {
    console.log(`${prefix}📋 ${node.relation}`);
  } else {
    // Show whichever parameter the operator carries: a condition or a column list
    const detail = node.condition ?? node.columns?.join(', ');
    console.log(`${prefix}🔧 ${node.operator}${detail ? '(' + detail + ')' : ''}`);
    node.children.forEach(child => printTree(child, indent + 1));
  }
}

Parentheses Encode Structure

In relational algebra notation, parentheses indicate nesting—and nesting directly translates to tree depth. Every pair of parentheses creates a new subtree. Reading expressions inside-out matches traversing the tree from leaves to root.
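
To make the link between parentheses and subtrees concrete, here is a small recursive-descent parser. It is a sketch under a stated assumption: it reads an ASCII prefix notation invented for this example, `OP[param](child, child)`, rather than real relational algebra symbols. Each opening parenthesis starts a new level of children, mirroring the recursive construction described earlier.

```typescript
interface ParsedNode {
  label: string;   // operator or relation name
  param?: string;  // text between [ and ], if any
  children: ParsedNode[];
}

// Parses the hypothetical notation NAME, NAME(child, ...) or NAME[param](child, ...).
function parse(text: string): ParsedNode {
  let pos = 0;
  const skipSpaces = (): void => { while (text.charAt(pos) === ' ') pos++; };

  function parseExpr(): ParsedNode {
    skipSpaces();
    const start = pos;
    while (/[A-Za-z_]/.test(text.charAt(pos))) pos++;
    if (pos === start) throw new Error(`expected a name at position ${pos}`);
    const node: ParsedNode = { label: text.slice(start, pos), children: [] };

    skipSpaces();
    if (text.charAt(pos) === '[') {          // optional operator parameter
      const end = text.indexOf(']', pos);
      if (end < 0) throw new Error('unterminated [');
      node.param = text.slice(pos + 1, end);
      pos = end + 1;
      skipSpaces();
    }
    if (text.charAt(pos) === '(') {          // parentheses open a new subtree level
      pos++;
      node.children.push(parseExpr());
      skipSpaces();
      while (text.charAt(pos) === ',') {
        pos++;
        node.children.push(parseExpr());
        skipSpaces();
      }
      if (text.charAt(pos) !== ')') throw new Error(`expected ) at position ${pos}`);
      pos++;
    }
    return node;
  }

  const root = parseExpr();
  skipSpaces();
  if (pos !== text.length) throw new Error(`unexpected input at position ${pos}`);
  return root;
}

// Nesting depth of the text equals the height of the resulting tree.
function treeHeight(node: ParsedNode): number {
  let tallest = -1;
  for (const child of node.children) tallest = Math.max(tallest, treeHeight(child));
  return tallest + 1;
}

const parsed = parse('PROJECT[name](SELECT[salary > 50000](JOIN[dept_id = id](Employees, Departments)))');
console.log(parsed.label, treeHeight(parsed)); // prints: PROJECT 3
```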

Evaluating Expression Trees

Expression trees aren't just representations—they're executable specifications. The database evaluator traverses the tree to compute the query result. Two primary evaluation strategies exist.

Bottom-Up (Materialized) Evaluation:

Start at the leaves and work upward:

Read base relations from storage (leaves produce data)
For each internal node, wait for all children to complete
Apply the operator to child results, producing a new relation
Pass the result upward to the parent
The root's output is the query result

Characteristics:

Simple to implement and reason about
Each intermediate result is fully materialized (stored)
Memory usage can be high for large intermediates
Natural for operators that need complete input (e.g., sort, aggregation)
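
A bottom-up evaluator can be sketched as a recursive function in which every call returns a complete relation before its parent does any work. The `Plan` type, the in-memory `db` object, and the sample rows below are assumptions for illustration; a real engine would read from storage and use far better join algorithms than this nested loop.

```typescript
type Row = Record<string, string | number>;
type Relation = Row[];

// Hypothetical plan shape for this sketch; predicates are plain functions.
type Plan =
  | { kind: 'scan'; table: string }
  | { kind: 'select'; predicate: (row: Row) => boolean; input: Plan }
  | { kind: 'project'; columns: string[]; input: Plan }
  | { kind: 'join'; on: (left: Row, right: Row) => boolean; left: Plan; right: Plan };

// Each recursive call fully materializes its result before returning it.
function evaluate(plan: Plan, db: Record<string, Relation>): Relation {
  switch (plan.kind) {
    case 'scan': {
      const relation = db[plan.table];
      if (relation === undefined) throw new Error(`unknown table: ${plan.table}`);
      return relation;
    }
    case 'select':
      return evaluate(plan.input, db).filter(plan.predicate);
    case 'project': {
      const columns = plan.columns;
      // Keeps duplicates (SQL-style bags); pure relational algebra would remove them.
      return evaluate(plan.input, db).map(row => {
        const out: Row = {};
        for (const column of columns) {
          const value = row[column];
          if (value !== undefined) out[column] = value;
        }
        return out;
      });
    }
    case 'join': {
      const left = evaluate(plan.left, db);   // first input computed completely
      const right = evaluate(plan.right, db); // then the second input
      const out: Relation = [];
      for (const l of left) {
        for (const r of right) {
          if (plan.on(l, r)) out.push({ ...l, ...r });
        }
      }
      return out;
    }
  }
}

const db: Record<string, Relation> = {
  Employees: [
    { name: 'Ada', salary: 90000, dept_id: 1 },
    { name: 'Bob', salary: 40000, dept_id: 1 },
    { name: 'Grace', salary: 70000, dept_id: 2 },
  ],
  Departments: [
    { id: 1, dept_name: 'Eng' },
    { id: 2, dept_name: 'Ops' },
  ],
};

// π_name(σ_{salary>50000}(Employees ⋈_{dept_id=id} Departments))
const query: Plan = {
  kind: 'project',
  columns: ['name'],
  input: {
    kind: 'select',
    predicate: (row: Row) => Number(row['salary']) > 50000,
    input: {
      kind: 'join',
      on: (l: Row, r: Row) => l['dept_id'] === r['id'],
      left: { kind: 'scan', table: 'Employees' },
      right: { kind: 'scan', table: 'Departments' },
    },
  },
};

const materialized = evaluate(query, db);
console.log(JSON.stringify(materialized)); // prints: [{"name":"Ada"},{"name":"Grace"}]
```

The join result here holds all three employee rows in memory before the selection discards one of them, which is the memory cost the list above refers to.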

Top-Down (Demand-Driven/Pipelined) Evaluation:

Start at the root and propagate requests downward:

Root requests its first result tuple
Root asks its child for input
Child propagates request to its children
Leaves fetch data on demand
Results flow back up tuple-by-tuple

Characteristics:

Also called "iterator model" or "Volcano model"
Tuples stream through operators without full materialization
Memory efficient for large results
Natural for producers connected directly to consumers

Bottom-Up Evaluation

• Materialize each intermediate result
• Wait for complete inputs before computing
• Higher memory usage
• Better for blocking operators
• Simpler to implement
• Each node runs once

Top-Down Evaluation

• Stream tuples through operators
• Produce results incrementally
• Lower memory footprint
• Better for filter-heavy queries
• More complex control flow
• Supports early termination (LIMIT)

The Iterator Interface:

Many database engines use the iterator model, where each node implements three operations:

Open()   - Initialize the operator, recursively open children
Next()   - Return the next output tuple (or null if exhausted)
Close()  - Clean up resources, recursively close children

Evaluation Example:

For π_name(σ_{salary>50000}(Employees)):

1. Query executor calls Root.Open()
2. π.Open() calls σ.Open()
3. σ.Open() calls Employees.Open() (init table scan)

4. Executor calls Root.Next()
5. π.Next() calls σ.Next()
6. σ.Next() calls Employees.Next() repeatedly:
   - Returns each tuple, σ checks predicate
   - If salary > 50000, σ returns tuple to π
   - If not, σ calls Next() again, filtering internally
7. π extracts 'name' column, returns to executor
8. Executor receives row, sends to client
9. Repeat 4-8 until Next() returns null
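
The call sequence above can be sketched with three small classes that share one interface. This is a minimal illustration, not any engine's actual API: the `Operator` interface, the class names, and the in-memory rows are assumptions, and method names are lowercased to follow TypeScript convention.

```typescript
type Row = Record<string, string | number>;

// The iterator interface: open, next (null when exhausted), close.
interface Operator {
  open(): void;
  next(): Row | null;
  close(): void;
}

// Leaf: scans an in-memory array standing in for a stored table.
class Scan implements Operator {
  private pos = 0;
  private readonly rows: Row[];
  constructor(rows: Row[]) { this.rows = rows; }
  open(): void { this.pos = 0; }
  next(): Row | null {
    const row = this.rows[this.pos];
    if (row === undefined) return null;
    this.pos++;
    return row;
  }
  close(): void {}
}

// σ: keeps pulling from its child until a row passes the predicate.
class Select implements Operator {
  private readonly child: Operator;
  private readonly predicate: (row: Row) => boolean;
  constructor(child: Operator, predicate: (row: Row) => boolean) {
    this.child = child;
    this.predicate = predicate;
  }
  open(): void { this.child.open(); }
  next(): Row | null {
    let row = this.child.next();
    while (row !== null && !this.predicate(row)) row = this.child.next();
    return row;
  }
  close(): void { this.child.close(); }
}

// π: narrows each row to the requested columns, one row at a time.
class Project implements Operator {
  private readonly child: Operator;
  private readonly columns: string[];
  constructor(child: Operator, columns: string[]) {
    this.child = child;
    this.columns = columns;
  }
  open(): void { this.child.open(); }
  next(): Row | null {
    const row = this.child.next();
    if (row === null) return null;
    const out: Row = {};
    for (const column of this.columns) {
      const value = row[column];
      if (value !== undefined) out[column] = value;
    }
    return out;
  }
  close(): void { this.child.close(); }
}

// The executor: opens the root, pulls until null, then closes.
function run(root: Operator): Row[] {
  const out: Row[] = [];
  root.open();
  for (let row = root.next(); row !== null; row = root.next()) out.push(row);
  root.close();
  return out;
}

const employees: Row[] = [
  { name: 'Ada', salary: 90000 },
  { name: 'Bob', salary: 40000 },
  { name: 'Grace', salary: 70000 },
];

// π_name(σ_{salary>50000}(Employees))
const pipeline = new Project(
  new Select(new Scan(employees), (row: Row) => Number(row['salary']) > 50000),
  ['name'],
);
const streamed = run(pipeline);
console.log(JSON.stringify(streamed)); // prints: [{"name":"Ada"},{"name":"Grace"}]
```

No operator here holds more than one row at a time, which is what lets the pipeline stop early if the executor stops calling `next()`.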

Pipeline Breakers:

Some operators can't stream—they need complete input before producing output:

Sort: Must see all tuples to determine order
Aggregation: Must see every input tuple (or every tuple of a group, when grouping) before it can emit that aggregate value
Hash Join build phase: Must build complete hash table

These "pipeline breakers" force materialization at that point in the tree.
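
A sort operator shows the effect directly. In the sketch below, which reuses the same assumed `open`/`next`/`close` interface, `Sort` drains its child inside `open()`; some implementations defer this to the first `next()` call, so treat the placement as a simplification. `CountingScan` is a hypothetical helper that records how many rows have been pulled.

```typescript
type Row = Record<string, string | number>;

interface Operator {
  open(): void;
  next(): Row | null;
  close(): void;
}

// Leaf scan that counts how many rows its parent has pulled so far.
class CountingScan implements Operator {
  pulled = 0;
  private pos = 0;
  private readonly rows: Row[];
  constructor(rows: Row[]) { this.rows = rows; }
  open(): void { this.pos = 0; this.pulled = 0; }
  next(): Row | null {
    const row = this.rows[this.pos];
    if (row === undefined) return null;
    this.pos++;
    this.pulled++;
    return row;
  }
  close(): void {}
}

// Pipeline breaker: buffers the whole input before it can return row one.
class Sort implements Operator {
  private buffer: Row[] = [];
  private pos = 0;
  private readonly child: Operator;
  private readonly key: string;
  constructor(child: Operator, key: string) {
    this.child = child;
    this.key = key;
  }
  open(): void {
    this.child.open();
    this.buffer = [];
    for (let row = this.child.next(); row !== null; row = this.child.next()) {
      this.buffer.push(row); // materialization happens here
    }
    const key = this.key;
    this.buffer.sort((a, b) => Number(a[key]) - Number(b[key])); // numeric ascending
    this.pos = 0;
  }
  next(): Row | null {
    const row = this.buffer[this.pos];
    if (row === undefined) return null;
    this.pos++;
    return row;
  }
  close(): void {
    this.buffer = [];
    this.child.close();
  }
}

const source = new CountingScan([
  { name: 'Ada', salary: 90000 },
  { name: 'Bob', salary: 40000 },
  { name: 'Grace', salary: 70000 },
]);
const sorted = new Sort(source, 'salary');

sorted.open();
const pulledBeforeFirstOutput = source.pulled; // all 3 rows were consumed already
const firstRow = sorted.next();                // lowest salary: Bob
sorted.close();
console.log(pulledBeforeFirstOutput, firstRow ? firstRow['name'] : null); // prints: 3 Bob
```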

Vectorized Execution

Modern systems often use vectorized execution, where Next() returns a batch of tuples rather than one. This reduces function call overhead and enables SIMD optimizations. The tree structure remains the same; only the granularity of data flow changes.
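
The change in granularity can be sketched by replacing `next()` with a `nextBatch()` that returns an array of rows. The interface and class names here are assumptions for illustration, and the example only shows the reduced number of calls; it does not attempt SIMD or columnar layouts.

```typescript
type Row = Record<string, string | number>;

interface BatchOperator {
  open(): void;
  nextBatch(): Row[] | null; // null signals that the input is exhausted
  close(): void;
}

// Leaf scan that hands out rows in fixed-size batches and counts its calls.
class BatchScan implements BatchOperator {
  calls = 0;
  private pos = 0;
  private readonly rows: Row[];
  private readonly batchSize: number;
  constructor(rows: Row[], batchSize: number) {
    if (batchSize < 1) throw new Error('batchSize must be at least 1');
    this.rows = rows;
    this.batchSize = batchSize;
  }
  open(): void { this.pos = 0; this.calls = 0; }
  nextBatch(): Row[] | null {
    this.calls++;
    if (this.pos >= this.rows.length) return null;
    const batch = this.rows.slice(this.pos, this.pos + this.batchSize);
    this.pos += batch.length;
    return batch;
  }
  close(): void {}
}

// Selection over batches: filters a whole batch per call.
class BatchSelect implements BatchOperator {
  private readonly child: BatchOperator;
  private readonly predicate: (row: Row) => boolean;
  constructor(child: BatchOperator, predicate: (row: Row) => boolean) {
    this.child = child;
    this.predicate = predicate;
  }
  open(): void { this.child.open(); }
  nextBatch(): Row[] | null {
    // Keep pulling until some rows survive the filter or the child is exhausted.
    for (let batch = this.child.nextBatch(); batch !== null; batch = this.child.nextBatch()) {
      const kept = batch.filter(this.predicate);
      if (kept.length > 0) return kept;
    }
    return null;
  }
  close(): void { this.child.close(); }
}

const batchScan = new BatchScan(
  [90000, 40000, 30000, 70000, 60000].map(salary => ({ salary })),
  2,
);
const batchSelect = new BatchSelect(batchScan, (row: Row) => Number(row['salary']) > 50000);

const collected: Row[] = [];
batchSelect.open();
for (let batch = batchSelect.nextBatch(); batch !== null; batch = batchSelect.nextBatch()) {
  collected.push(...batch);
}
batchSelect.close();

// Five rows reached the filter through four scan calls (three batches plus the final null);
// a tuple-at-a-time scan would need six calls for the same input.
console.log(collected.length, batchScan.calls); // prints: 3 4
```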

Tree Representation in Practice

Real database systems maintain rich metadata in their expression trees beyond simple operator labels. Understanding this metadata is essential for reading query plans.

Node Annotations:

Each tree node typically includes:

Operator Type: SELECT, JOIN, AGGREGATE, etc.
Operator Parameters: Predicates, columns, join conditions
Schema Information: Output columns and types
Cardinality Estimate: Expected number of output tuples
Cost Estimate: Expected execution cost (I/O, CPU)
Physical Operator: The chosen algorithm (hash join, merge join, etc.)
Properties: Ordering, partitioning, uniqueness guarantees

Example Annotated Node:

HashJoin
  Type: Inner Join
  Condition: employees.dept_id = departments.id
  Output Schema: (emp_id, name, salary, dept_id, dept_name)
  Estimated Rows: 850
  Estimated Cost: 1250.0
  Build Input: Departments (smaller)
  Probe Input: Employees (larger)

query_plan_example.sql
PostgreSQL EXPLAIN
-- Query
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM Employees e
JOIN Departments d ON e.dept_id = d.id
WHERE e.salary > 50000;
 
-- Query Plan (simplified, illustrative representation; not verbatim PostgreSQL output)
/*
Projection (name, dept_name)  (cost=0.00..1250.00 rows=850)
  └─ Hash Join (e.dept_id = d.id)  (cost=0.00..1200.00 rows=850)
       ├─ Hash (Departments)  (cost=0.00..50.00 rows=50)
       │    └─ Seq Scan on Departments  (cost=0.00..50.00 rows=50)
       └─ Seq Scan on Employees  (cost=0.00..1100.00 rows=900 after filter)
            Filter: salary > 50000
            Rows Examined: 10000
*/

Query Plan Visualization:

Databases display trees in various formats:

Indented Text: Each level indented, common in EXPLAIN output
Tree Diagrams: Visual trees with boxes and arrows
JSON/XML: Structured representation for programmatic access
Graphical Tools: Interactive visualizations (pgAdmin, MySQL Workbench)

Reading an EXPLAIN Plan:

Identify the root: The topmost operation produces the final result
Trace the data flow: Follow from leaves up to understand data sources
Note costs and cardinalities: Identify expensive operations
Check physical operators: Are appropriate algorithms chosen?
Look for warnings: Missing indexes, full scans, etc.

The Difference Between Logical and Physical Plans:

Logical Plan: Expression tree with relational algebra operators (JOIN, SELECT)
Physical Plan: Expression tree with executable algorithms (HASH_JOIN, INDEX_SCAN)

The optimizer converts logical to physical, choosing algorithms based on costs and constraints.

EXPLAIN as a Learning Tool

Running EXPLAIN on your queries is one of the best ways to understand expression trees. Start with simple queries to see simple trees. Gradually add complexity—joins, subqueries, aggregations—to see how trees grow. Compare EXPLAIN output before and after adding indexes to see how the tree structure changes.

Tree Transformations for Optimization

Query optimization is largely about transforming expression trees into equivalent but more efficient forms. The tree representation enables systematic application of transformation rules.

What Makes Trees Transformable:

Closedness under transformations: Transform a tree, get another valid tree
Local transformations: Most rules modify small subtrees
Composability: Multiple transformations can be applied sequentially
Reversibility: Transformations can be undone if not beneficial

Common Transformation Patterns:

Selection Pushdown:

Move selection closer to leaves to filter early:

Before:

σ_p(R ⋈ S)

After (when p involves only R):

σ_p(R) ⋈ S

Join Reordering:

Change the order of joins to minimize intermediate sizes:

Before:

(R ⋈ S) ⋈ T

After (if S ⋈ T is smaller):

R ⋈ (S ⋈ T)

Key Tree Transformation Rules
Transformation	When Applicable	Effect
Selection Pushdown	Selection predicate references only one child	Filter rows early, reduce intermediate sizes
Projection Pushdown	Projected columns subset of child's output	Narrow tuples early, reduce memory
Join Reordering	Joins are associative (inner joins)	Minimize intermediate join sizes
Selection Splitting	Conjunction of predicates	Push independent parts separately
Projection Pulling	Repeated projections	Combine into single projection
Join to Semijoin	Only need existence check	Avoid retrieving full joined tuples
Subquery Decorrelation	Correlated subquery	Convert to flat join structure

Transformation as Tree Rewriting:

Each transformation rule can be expressed as a pattern matching and replacement:

Rule: Selection Pushdown over Join
Pattern:
  σ_p(R ⋈ S) where p references only R
Replacement:
  σ_p(R) ⋈ S

Algorithm:
  For each node N in tree:
    If N matches pattern left-hand side:
      Replace N with pattern right-hand side
      Update parent/child pointers
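
The rule above can be sketched as a recursive rewrite over an immutable tree. This is an illustration under several assumptions: the `Plan` type is invented for the example, each selection lists the columns it references in `uses`, column names are unique across relations, and every join is an inner join (the rewrite is not safe for outer joins in general).

```typescript
type Plan =
  | { kind: 'relation'; name: string; columns: string[] }
  | { kind: 'select'; uses: string[]; text: string; input: Plan }
  | { kind: 'join'; left: Plan; right: Plan };

// Columns produced by a subtree (selection does not change the schema).
function outputColumns(plan: Plan): string[] {
  switch (plan.kind) {
    case 'relation': return plan.columns;
    case 'select': return outputColumns(plan.input);
    case 'join': return outputColumns(plan.left).concat(outputColumns(plan.right));
  }
}

// Applies σ_p(R ⋈ S) → σ_p(R) ⋈ S (or the mirror image) wherever the pattern matches.
function pushDownSelections(plan: Plan): Plan {
  switch (plan.kind) {
    case 'relation':
      return plan;
    case 'join':
      return { kind: 'join', left: pushDownSelections(plan.left), right: pushDownSelections(plan.right) };
    case 'select': {
      const { uses, text, input } = plan;
      if (input.kind === 'join') {
        const coveredBy = (side: Plan): boolean => {
          const available = outputColumns(side);
          return uses.every(column => available.indexOf(column) >= 0);
        };
        if (coveredBy(input.left)) {
          const pushed: Plan = { kind: 'select', uses, text, input: input.left };
          return pushDownSelections({ kind: 'join', left: pushed, right: input.right });
        }
        if (coveredBy(input.right)) {
          const pushed: Plan = { kind: 'select', uses, text, input: input.right };
          return pushDownSelections({ kind: 'join', left: input.left, right: pushed });
        }
      }
      // Pattern does not match: keep the selection here and rewrite below it.
      return { kind: 'select', uses, text, input: pushDownSelections(input) };
    }
  }
}

// Renders a plan as text so the rewrite is easy to inspect.
function show(plan: Plan): string {
  switch (plan.kind) {
    case 'relation': return plan.name;
    case 'select': return `σ[${plan.text}](${show(plan.input)})`;
    case 'join': return `(${show(plan.left)} ⋈ ${show(plan.right)})`;
  }
}

const employeesRel: Plan = { kind: 'relation', name: 'Employees', columns: ['name', 'salary', 'dept_id'] };
const departmentsRel: Plan = { kind: 'relation', name: 'Departments', columns: ['id', 'dept_name', 'budget'] };
const joined: Plan = { kind: 'join', left: employeesRel, right: departmentsRel };

const oneSided: Plan = { kind: 'select', uses: ['salary'], text: 'salary > 50000', input: joined };
const twoSided: Plan = { kind: 'select', uses: ['salary', 'budget'], text: 'salary > budget', input: joined };

console.log(show(pushDownSelections(oneSided))); // prints: (σ[salary > 50000](Employees) ⋈ Departments)
console.log(show(pushDownSelections(twoSided))); // prints: σ[salary > budget]((Employees ⋈ Departments))
```

Building new nodes instead of mutating pointers keeps the original tree available, which makes it easy to compare alternatives or discard a rewrite.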

The Optimizer's Task:

Given an input tree, the optimizer:

Enumerates applicable transformations
Generates alternative trees
Estimates cost of each alternative
Selects the lowest-cost tree (or uses heuristics)

This process might explore thousands of trees for complex queries, using dynamic programming or genetic algorithms to search efficiently.
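
The enumerate, estimate, and select loop can be sketched for the join ordering case. For clarity this example uses brute-force enumeration of left-deep join orders rather than dynamic programming or a genetic algorithm, and its cost model (the sum of estimated intermediate result sizes) and all statistics are made-up assumptions.

```typescript
// Hypothetical statistics: row counts per relation and selectivity per joined pair.
interface Stats {
  rows: Record<string, number>;
  selectivity: Record<string, number>; // key is the two names sorted and joined with '|'
}

function pairSelectivity(stats: Stats, a: string, b: string): number {
  const key = a < b ? `${a}|${b}` : `${b}|${a}`;
  return stats.selectivity[key] ?? 1; // no join predicate: behaves like a cross product
}

// All orderings of the input list (fine for a handful of relations only).
function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items.slice()];
  const out: T[][] = [];
  items.forEach((item, i) => {
    const rest = items.slice(0, i).concat(items.slice(i + 1));
    for (const tail of permutations(rest)) out.push([item].concat(tail));
  });
  return out;
}

// Cost of a left-deep plan: sum of the estimated sizes of every intermediate join result.
function costOfOrder(order: string[], stats: Stats): number {
  const joinedSoFar: string[] = [];
  let size = 0;
  let cost = 0;
  for (const relation of order) {
    const relationRows = stats.rows[relation] ?? 0;
    if (joinedSoFar.length === 0) {
      size = relationRows;
    } else {
      let selectivity = 1;
      for (const previous of joinedSoFar) selectivity *= pairSelectivity(stats, previous, relation);
      size = size * relationRows * selectivity;
      cost += size;
    }
    joinedSoFar.push(relation);
  }
  return cost;
}

// Enumerate alternatives, estimate each, keep the cheapest.
function bestOrder(relations: string[], stats: Stats): { order: string[]; cost: number } {
  let best = { order: relations, cost: Infinity };
  for (const order of permutations(relations)) {
    const cost = costOfOrder(order, stats);
    if (cost < best.cost) best = { order, cost };
  }
  return best;
}

const stats: Stats = {
  rows: { R: 1000, S: 100, T: 10 },
  selectivity: { 'R|S': 0.01, 'S|T': 0.1 },
};

const chosen = bestOrder(['R', 'S', 'T'], stats);
console.log(chosen.order.join(' ⋈ '), chosen.cost);
```

With these numbers, joining S and T first produces a small intermediate result, so the chosen order joins R last. The number of orders grows factorially with the number of relations, which is why real optimizers avoid plain enumeration.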

Transformation Correctness

Not all transformations preserve semantics in all cases. For example, pushing selection past outer join can change results. Optimizers must check applicability conditions carefully. A wrong transformation produces wrong results—the worst possible optimizer bug.

Tree Properties and Invariants

Expression trees carry properties and maintain invariants that help optimizers make decisions and ensure correctness.

Logical Properties:

Properties determined by the logical expression, independent of physical execution:

Output Schema: Columns and types produced by the subtree
Candidate Keys: Sets of attributes that uniquely identify tuples
Functional Dependencies: Relationships between attribute values
NOT NULL Constraints: Attributes guaranteed non-null
Cardinality Estimate: Expected number of output tuples

Physical Properties:

Properties that depend on how the operator is executed:

Ordering: Tuples sorted by certain attributes (important for merge join, ORDER BY)
Partitioning: How tuples are distributed across parallel executors
Locality: Where data physically resides

Property Propagation:

Properties propagate through the tree:

Selection preserves ordering, may affect cardinality
Projection may destroy ordering if the sort column is removed
Join combines properties from both inputs (complex rules)
Sort creates ordering property but destroys previous ordering

Example: Ordering Propagation:

For tree: ORDER BY salary(σ_dept='Eng'(Employees))

If Employees has index on (dept, salary) and the plan scans that index:
  - Index scan returns rows ordered by (dept, salary)
  - Selection preserves ordering (single dept value, so rows are ordered by salary)
  - ORDER BY salary is already satisfied: no additional sort needed
  
If Employees has no suitable index:
  - Scan returns unordered rows
  - Selection produces unordered result
  - ORDER BY requires explicit sort operation
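
The reasoning in this example can be sketched as a function that derives the ordering property of a subtree and a check for whether an explicit sort is still needed. The `Plan` variants and function names are assumptions for illustration, and only equality selections are modeled.

```typescript
type Plan =
  | { kind: 'seqScan'; table: string }
  | { kind: 'indexScan'; table: string; indexColumns: string[] }
  | { kind: 'selectEq'; column: string; value: string | number; input: Plan };

// Columns the subtree's output is known to be sorted by; an empty list means unordered.
function ordering(plan: Plan): string[] {
  switch (plan.kind) {
    case 'seqScan':
      return [];
    case 'indexScan':
      return plan.indexColumns;
    case 'selectEq': {
      // Filtering never reorders rows. A column pinned to one constant no longer
      // affects the order, so it can be dropped from the sort key.
      const pinned = plan.column;
      return ordering(plan.input).filter(column => column !== pinned);
    }
  }
}

// A sort is needed unless the required columns are a prefix of the available ordering.
function needsSort(plan: Plan, required: string[]): boolean {
  const available = ordering(plan);
  const satisfied = required.every((column, i) => available[i] === column);
  return !satisfied;
}

const withIndex: Plan = {
  kind: 'selectEq', column: 'dept', value: 'Eng',
  input: { kind: 'indexScan', table: 'Employees', indexColumns: ['dept', 'salary'] },
};
const withoutIndex: Plan = {
  kind: 'selectEq', column: 'dept', value: 'Eng',
  input: { kind: 'seqScan', table: 'Employees' },
};

console.log(needsSort(withIndex, ['salary']), needsSort(withoutIndex, ['salary'])); // prints: false true
```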

Key Property Categories

• Schema Properties: Column names, types, nullability; determine valid operations
• Cardinality Properties: Row count estimates; drive cost calculations
• Ordering Properties: Sort order; affects join and order-by efficiency
• Partitioning Properties: Data distribution; affects parallel execution
• Key Properties: Uniqueness; enables duplicate elimination optimizations
• Constraint Properties: Check constraints, FK; enable semantic optimizations

Interesting Orders:

The concept of "interesting orders" is crucial for optimization:

An ordering is "interesting" if it benefits a downstream operator
ORDER BY creates demand for a specific order
Merge join requires both inputs sorted on join column
Group by benefits from grouped/sorted input

The optimizer tracks which orderings are interesting and prefers plans that produce them, even at slight extra cost, to avoid expensive sort operations later.

Property-Based Optimization:

Modern optimizers use properties for:

Plan pruning: If a plan can't produce required properties, discard it early
Algorithm selection: Choose merge join if inputs are sorted; hash join otherwise
Enforcer insertion: Add sort/exchange operators when required properties aren't present
Cost adjustment: Factor in property mismatches when comparing plans

The Volcano and Cascades optimizer frameworks formalize property-based optimization, treating properties as first-class optimization dimensions.

Thinking in Properties

When analyzing queries, think about what properties each operation produces and requires. Does the join need sorted input? Does the output need to preserve ordering? This property-centric thinking helps you understand optimizer decisions and write queries that are easier to optimize.

Parallelism and Tree Structure

Expression trees naturally expose opportunities for parallel execution. Understanding these opportunities is essential for high-performance query processing.

Independent Subtrees:

Subtrees that share no data dependencies can execute concurrently:

           ⋈
          / \
         /   \
     σ(R)     σ(S)

The two selection subtrees (on R and on S) can run simultaneously on different processors. The join waits for both to complete, then combines results.

Pipeline Parallelism:

In pipelining, producer and consumer operators run concurrently:

Producer: Table Scan on Employees
  → Stream tuples to →
Consumer: Selection (salary > 50000)
  → Stream tuples to →
Consumer: Projection (name)

All three operators can be active simultaneously, each processing different tuples.

Partition Parallelism:

Data is partitioned, and the same operator runs on multiple partitions:

         Gather
        /  |  \
      σ    σ    σ
      |    |    |
   R(p1) R(p2) R(p3)

The relation R is split into partitions p1, p2, p3. Parallel selections run on each partition. Results are gathered at the end.
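
The scatter, select, gather shape can be sketched with three small functions. The code below runs the partitions one after another in a single thread, so it shows only the structure of the plan, not real concurrency; the function names and the round-robin split are assumptions for this example.

```typescript
type Row = Record<string, string | number>;

// Scatter: deal rows round-robin into n partitions (p1, p2, ...).
function partition(rows: Row[], n: number): Row[][] {
  if (n < 1 || Math.floor(n) !== n) throw new Error('partition count must be a positive integer');
  const parts: Row[][] = [];
  for (let i = 0; i < n; i++) parts.push([]);
  rows.forEach((row, i) => {
    const target = parts[i % n];
    if (target !== undefined) target.push(row);
  });
  return parts;
}

// The per-partition operator. Each call reads only its own partition, so a real
// engine could run every call on a separate worker without coordination.
function selectPartition(part: Row[], predicate: (row: Row) => boolean): Row[] {
  return part.filter(predicate);
}

// Gather: concatenate partial results. Row order across partitions is not preserved.
function gather(parts: Row[][]): Row[] {
  const out: Row[] = [];
  for (const part of parts) out.push(...part);
  return out;
}

const relationR: Row[] = [
  { name: 'Ada', salary: 90000 },
  { name: 'Bob', salary: 40000 },
  { name: 'Grace', salary: 70000 },
  { name: 'Linus', salary: 55000 },
  { name: 'Edsger', salary: 30000 },
];
const highPaid = (row: Row): boolean => Number(row['salary']) > 50000;

const partials = partition(relationR, 3).map(part => selectPartition(part, highPaid));
const gathered = gather(partials);

console.log(gathered.map(row => row['name']).sort().join(', ')); // prints: Ada, Grace, Linus
```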

Exchange Operators:

Distributed databases insert exchange operators into trees to manage data distribution:

Scatter/Partition: Split data and send to multiple nodes
Gather: Collect data from multiple nodes
Repartition: Redistribute data to align for join

            Gather
               |
            π(name)
               |
            ⋈ (on hash-partitioned dept_id)
           / \
   Repartition Repartition
       |           |
   σ(Employees) σ(Departments)
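
The plan above can be sketched in a few functions: both inputs are repartitioned on the join key so that matching rows land in the same partition, each partition is joined locally, and the results are gathered. This is a single-process illustration; the modulo hash on integer keys, the function names, and the sample data are all assumptions.

```typescript
type Row = Record<string, string | number>;

// Repartition: route each row by hashing its join key (here: integer key modulo n).
function repartition(rows: Row[], key: string, n: number): Row[][] {
  if (n < 1 || !Number.isInteger(n)) throw new Error('partition count must be a positive integer');
  const parts: Row[][] = [];
  for (let i = 0; i < n; i++) parts.push([]);
  for (const row of rows) {
    const value = Number(row[key]);
    if (!Number.isInteger(value)) throw new Error(`non-integer key in column ${key}`);
    const target = parts[Math.abs(value) % n];
    if (target !== undefined) target.push(row);
  }
  return parts;
}

// Local join inside one partition (nested loop for brevity).
function localJoin(left: Row[], right: Row[], leftKey: string, rightKey: string): Row[] {
  const out: Row[] = [];
  for (const l of left) {
    for (const r of right) {
      if (l[leftKey] === r[rightKey]) out.push({ ...l, ...r });
    }
  }
  return out;
}

// Repartition both inputs the same way, join partition by partition, then gather.
function distributedJoin(left: Row[], right: Row[], leftKey: string, rightKey: string, n: number): Row[] {
  const leftParts = repartition(left, leftKey, n);
  const rightParts = repartition(right, rightKey, n);
  const gatheredRows: Row[] = [];
  for (let i = 0; i < n; i++) {
    gatheredRows.push(...localJoin(leftParts[i] ?? [], rightParts[i] ?? [], leftKey, rightKey));
  }
  return gatheredRows;
}

const employeeRows: Row[] = [
  { name: 'Ada', dept_id: 1 },
  { name: 'Bob', dept_id: 2 },
  { name: 'Grace', dept_id: 3 },
  { name: 'Linus', dept_id: 2 },
];
const departmentRows: Row[] = [
  { id: 1, dept_name: 'Eng' },
  { id: 2, dept_name: 'Ops' },
  { id: 3, dept_name: 'Sales' },
];

const distributed = distributedJoin(employeeRows, departmentRows, 'dept_id', 'id', 2);
console.log(distributed.length); // prints: 4
```

Because both sides use the same routing function, a row with `dept_id = 2` and the department with `id = 2` always meet in the same partition, so no local join misses a match.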

Blockers and Parallelism:

Pipeline-breaking operators create synchronization points:

Before a sort, all input must be gathered
Before global aggregation, partial aggregates must be collected
Before broadcast join, the broadcast table must be fully distributed

These blockers limit parallelism but are sometimes unavoidable.

Tree Width and Parallelism:

Narrow trees (long chains) have limited parallelism
Wide trees (many sibling subtrees) have high parallelism potential
The ideal width depends on available processors and data distribution

Distributed Query Planning

In distributed query engines and databases (Spark SQL, Presto, CockroachDB), the expression tree is partitioned across nodes. The optimizer must consider network costs, data locality, and partition strategies. The tree structure remains the foundation, augmented with distribution metadata.

Summary: Expression Trees

We've explored expression trees—the fundamental representation of relational algebra queries. Let's consolidate the key insights:

Key Takeaways

• Structure: Trees have leaf nodes (base relations), internal nodes (operators), and a root (final result producer).
• Construction: Relational algebra expressions map directly to trees; nesting becomes tree depth.
• Evaluation: Bottom-up (materialized) or top-down (pipelined) execution strategies traverse the tree to compute results.
• Rich Metadata: Nodes carry schema, cost estimates, properties, and physical operator choices.
• Transformation: Optimization transforms trees via equivalence-preserving rewrites like selection pushdown and join reordering.
• Properties: Logical and physical properties (ordering, partitioning) propagate through trees and guide optimization.
• Parallelism: Tree structure exposes parallelism opportunities: independent subtrees, pipelining, and partition parallelism.

What's Next

With expression trees understood, we'll next explore query equivalence—the formal foundation that justifies tree transformations. We'll examine equivalence rules, the conditions under which they apply, and how they enable the optimization transformations we've previewed.

Page Complete

You now understand expression trees—the core representation that databases use for query processing. Every query plan you examine, every optimization the system applies, and every execution strategy chosen operates on this tree structure. Next, we'll explore the equivalence rules that make tree transformations valid.

4 / 5

Loading learning content...

Database Management SystemsRelational Algebra Overview

Relational Algebra Overview

LevelIntermediate

Duration60 mins

TopicRelational Algebra Overview

4 / 5

Expression Trees

The Shape of a Query

What You Will Learn

Tree Structure Fundamentals

An expression tree is a hierarchical data structure where:

Nodes represent operations or data sources
Edges represent data flow from producer to consumer
The root represents the final query result
Leaves represent base relations (tables)

Node Types:

Leaf Nodes (Operands):

Represent base relations: Employees, Departments, Orders
Have no children
Produce data by reading from storage
Schema comes from table definition

Internal Nodes (Operators):

Represent relational algebra operations: σ, π, ⋈, ∪, etc.
Have one child (unary) or two children (binary)
Consume input relations from children
Produce output relation for parent
Schema determined by operation and input schemas

The Root Node:

Special internal node at the top
Produces the final query result
Output goes to the client, not another operator

Edge Direction:

By convention, edges point FROM child TO parent, representing data flow upward. A join node has two incoming edges (from its two input relations) and one outgoing edge (to its parent or the result).

Converting Mermaid diagram...

Tree Properties:

Height: Number of edges from root to deepest leaf; indicates query complexity and nesting depth
Width: Maximum number of nodes at any level; indicates parallelism potential
Size: Total number of nodes; correlates with query complexity
Balance: Whether subtrees have similar heights; affects execution strategies

Formal Definition:

An expression tree T is a tuple (N, E, root, label) where:

N is a finite set of nodes
E ⊆ N × N is a set of directed edges
root ∈ N is the designated root node
label: N → (RelName ∪ OpName) assigns each node a relation name or operator
The graph (N, E) forms a tree rooted at root
Every leaf is labeled with a relation name
Every internal node is labeled with an operator

Trees vs DAGs

Building Expression Trees from Algebra

Translating a relational algebra expression into a tree is straightforward—the nested structure of the expression directly maps to the tree structure.

The Recursive Construction:

Tree(R) where R is a base relation:
  Create leaf node labeled R
  
Tree(Op(E)) where Op is a unary operator:
  Create internal node labeled Op
  Add Tree(E) as its child
  
Tree(E1 Op E2) where Op is a binary operator:
  Create internal node labeled Op
  Add Tree(E1) as left child
  Add Tree(E2) as right child

Example Construction:

For the expression:

π_name(σ_{salary>50000}(Employees ⋈_{dept_id=id} Departments))

Step-by-step:

Start from the outermost operator: π_name
Its child is σ_{salary>50000}(...)
Selection's child is Employees ⋈ Departments
Join's children are Employees and Departments (leaves)

Building bottom-up:

Step 1: Create leaf nodes
  node_E = Leaf(Employees)
  node_D = Leaf(Departments)
  
Step 2: Create join node
  node_J = Internal(⋈, condition=dept_id=id)
  node_J.left = node_E
  node_J.right = node_D
  
Step 3: Create selection node
  node_S = Internal(σ, predicate=salary>50000)
  node_S.child = node_J
  
Step 4: Create projection node (root)
  node_P = Internal(π, columns=[name])
  node_P.child = node_S
  
Result: Tree with root = node_P

Operator Arity and Tree Structure
Operator	Symbol	Arity	Tree Structure
Selection	σ	Unary	One child
Projection	π	Unary	One child
Rename	ρ	Unary	One child
Aggregation	𝒢	Unary	One child
Union	∪	Binary	Two children
Intersection	∩	Binary	Two children
Difference	−	Binary	Two children
Cartesian Product	×	Binary	Two children
Join	⋈	Binary	Two children
Division	÷	Binary	Two children

expression_tree.ts
TypeScript
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// Simplified expression tree representation
 
type OperatorType = 'SELECT' | 'PROJECT' | 'JOIN' | 'UNION' | 'PRODUCT' | 'RENAME';
 
interface TreeNode {
  type: 'leaf' | 'unary' | 'binary';
  operator?: OperatorType;
  relation?: string;  // For leaf nodes
  condition?: string; // For selection, join
  columns?: string[]; // For projection
  children: TreeNode[];
}
 
// Constructing the tree for:
// π_name(σ_salary>50000(Employees ⋈ Departments))
 
const tree: TreeNode = {
  type: 'unary',
  operator: 'PROJECT',
  columns: ['name'],
  children: [{
    type: 'unary',
    operator: 'SELECT',
    condition: 'salary > 50000',
    children: [{
      type: 'binary',
      operator: 'JOIN',
      condition: 'dept_id = id',
      children: [
        { type: 'leaf', relation: 'Employees', children: [] },
        { type: 'leaf', relation: 'Departments', children: [] }
      ]
    }]
  }]
};
 
// Tree traversal for display
function printTree(node: TreeNode, indent: number = 0): void {
  const prefix = '  '.repeat(indent);
  if (node.type === 'leaf') {
    console.log(`${prefix}📋 ${node.relation}`);
  } else {
    console.log(`${prefix}🔧 ${node.operator}${node.condition ? '(' + node.condition + ')' : ''}`);
    node.children.forEach(child => printTree(child, indent + 1));
  }
}

Parentheses Encode Structure

Evaluating Expression Trees

Expression trees aren't just representations—they're executable specifications. The database evaluator traverses the tree to compute the query result. Two primary evaluation strategies exist.

Bottom-Up (Materialized) Evaluation:

Start at the leaves and work upward:

Read base relations from storage (leaves produce data)
For each internal node, wait for all children to complete
Apply the operator to child results, producing a new relation
Pass the result upward to the parent
The root's output is the query result

Characteristics:

Simple to implement and reason about
Each intermediate result is fully materialized (stored)
Memory usage can be high for large intermediates
Natural for operators that need complete input (e.g., sort, aggregation)

Top-Down (Demand-Driven/Pipelined) Evaluation:

Start at the root and propagate requests downward:

Root requests its first result tuple
Root asks its child for input
Child propagates request to its children
Leaves fetch data on demand
Results flow back up tuple-by-tuple

Characteristics:

Also called "iterator model" or "Volcano model"
Tuples stream through operators without full materialization
Memory efficient for large results
Natural for producers connected directly to consumers

Bottom-Up Evaluation

•Materialize each intermediate result
•Wait for complete inputs before computing
•Higher memory usage
•Better for blocking operators
•Simpler to implement
•Each node runs once

Top-Down Evaluation

•Stream tuples through operators
•Produce results incrementally
•Lower memory footprint
•Better for filter-heavy queries
•More complex control flow
•Supports early termination (LIMIT)

The Iterator Interface:

Most modern databases use the iterator model, where each node implements three operations:

Open()   - Initialize the operator, recursively open children
Next()   - Return the next output tuple (or null if exhausted)
Close()  - Clean up resources, recursively close children

Evaluation Example:

For π_name(σ_{salary>50000}(Employees)):

1. Query executor calls Root.Open()
2. π.Open() calls σ.Open()
3. σ.Open() calls Employees.Open() (init table scan)

4. Executor calls Root.Next()
5. π.Next() calls σ.Next()
6. σ.Next() calls Employees.Next() repeatedly:
   - Returns each tuple, σ checks predicate
   - If salary > 50000, σ returns tuple to π
   - If not, σ calls Next() again, filtering internally
7. π extracts 'name' column, returns to executor
8. Executor receives row, sends to client
9. Repeat 4-8 until Next() returns null

Pipeline Breakers:

Some operators can't stream—they need complete input before producing output:

Sort: Must see all tuples to determine order
Aggregation without grouping: Must see all tuples to compute aggregate
Hash Join build phase: Must build complete hash table

These "pipeline breakers" force materialization at that point in the tree.

Vectorized Execution

Tree Representation in Practice

Real database systems maintain rich metadata in their expression trees beyond simple operator labels. Understanding this metadata is essential for reading query plans.

Node Annotations:

Each tree node typically includes:

Operator Type: SELECT, JOIN, AGGREGATE, etc.
Operator Parameters: Predicates, columns, join conditions
Schema Information: Output columns and types
Cardinality Estimate: Expected number of output tuples
Cost Estimate: Expected execution cost (I/O, CPU)
Physical Operator: The chosen algorithm (hash join, merge join, etc.)
Properties: Ordering, partitioning, uniqueness guarantees

Example Annotated Node:

HashJoin
  Type: Inner Join
  Condition: employees.dept_id = departments.id
  Output Schema: (emp_id, name, salary, dept_id, dept_name)
  Estimated Rows: 850
  Estimated Cost: 1250.0
  Build Input: Departments (smaller)
  Probe Input: Employees (larger)
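
One way to picture such a node in code is a small record type. The following is a sketch under assumptions: PlanNode, its field names, and render are invented for illustration, and the numbers are taken from the annotated node above and the simplified plan that follows, not from a real optimizer.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    operator: str                  # physical operator, e.g. "HashJoin"
    params: dict = field(default_factory=dict)   # predicates, join conditions
    output_schema: tuple = ()      # output column names
    est_rows: float = 0            # cardinality estimate
    est_cost: float = 0            # cost estimate
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented text lines, root first."""
    lines = [f"{'  ' * depth}{node.operator} (rows={node.est_rows:g}, cost={node.est_cost:g})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


departments = PlanNode("SeqScan", {"table": "Departments"}, ("id", "dept_name"), 50, 50.0)
employees = PlanNode(
    "SeqScan",
    {"table": "Employees", "filter": "salary > 50000"},
    ("emp_id", "name", "salary", "dept_id"),
    900,
    1100.0,
)
join = PlanNode(
    "HashJoin",
    {"type": "inner", "condition": "employees.dept_id = departments.id"},
    ("emp_id", "name", "salary", "dept_id", "dept_name"),
    850,
    1250.0,
    [departments, employees],  # build input first, then probe input
)
text = render(join)
```

Printing "\n".join(text) gives the kind of indented layout that EXPLAIN output uses.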

query_plan_example.sql
PostgreSQL EXPLAIN
-- Query
EXPLAIN ANALYZE
SELECT e.name, d.dept_name
FROM Employees e
JOIN Departments d ON e.dept_id = d.id
WHERE e.salary > 50000;
 
-- Query Plan (simplified representation)
/*
Projection (name, dept_name)
  └─ cost=0.00..1250.00 rows=850
  └─ Hash Join (e.dept_id = d.id)
       └─ cost=0.00..1200.00 rows=850
       └─ Hash (Departments)
       │    └─ cost=0.00..50.00 rows=50
       │    └─ Seq Scan on Departments
       │         └─ cost=0.00..50.00 rows=50
       └─ Seq Scan on Employees
            └─ Filter: salary > 50000
            └─ cost=0.00..1100.00 rows=900 (after filter)
            └─ Rows Examined: 10000
*/

Query Plan Visualization:

Databases display trees in various formats:

Indented Text: Each level indented, common in EXPLAIN output
Tree Diagrams: Visual trees with boxes and arrows
JSON/XML: Structured representation for programmatic access
Graphical Tools: Interactive visualizations (pgAdmin, MySQL Workbench)

Reading an EXPLAIN Plan:

Identify the root: The topmost operation produces the final result
Trace the data flow: Follow from leaves up to understand data sources
Note costs and cardinalities: Identify expensive operations
Check physical operators: Are appropriate algorithms chosen?
Look for warnings: Missing indexes, full scans, etc.
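
These reading steps can also be automated when the plan is available in structured form. The sketch below walks a hand-written dictionary shaped like the plan object PostgreSQL emits for EXPLAIN (FORMAT JSON), using its "Node Type", "Relation Name", "Plan Rows", "Total Cost", and "Plans" keys. The sample values are copied from the simplified plan above rather than from a real server, and walk and summarize are hypothetical helper names.

```python
def walk(node, depth=0):
    """Yield (depth, node) pairs in pre-order, root first."""
    yield depth, node
    for child in node.get("Plans", []):
        yield from walk(child, depth + 1)


def summarize(plan):
    """One indented line per node with its row and cost estimates."""
    lines = []
    for depth, node in walk(plan):
        label = f"{node['Node Type']} {node.get('Relation Name', '')}".rstrip()
        lines.append(
            f"{'  ' * depth}{label} "
            f"[rows={node['Plan Rows']}, total cost={node['Total Cost']}]"
        )
    return lines


# Hand-written sample; in real output this dict sits under the "Plan" key
# of the first element of the returned JSON array.
sample_plan = {
    "Node Type": "Hash Join",
    "Total Cost": 1250.0,
    "Plan Rows": 850,
    "Plans": [
        {
            "Node Type": "Hash",
            "Total Cost": 50.0,
            "Plan Rows": 50,
            "Plans": [
                {
                    "Node Type": "Seq Scan",
                    "Relation Name": "departments",
                    "Total Cost": 50.0,
                    "Plan Rows": 50,
                }
            ],
        },
        {
            "Node Type": "Seq Scan",
            "Relation Name": "employees",
            "Filter": "(salary > 50000)",
            "Total Cost": 1100.0,
            "Plan Rows": 900,
        },
    ],
}

summary = summarize(sample_plan)
seq_scans = [n["Relation Name"] for _, n in walk(sample_plan) if n["Node Type"] == "Seq Scan"]
```

Listing the sequential scans is a quick way to spot tables that might benefit from an index. Note that a node's total cost in PostgreSQL already includes the cost of its children, so costs should be compared between siblings rather than summed down the tree.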

The Difference Between Logical and Physical Plans:

Logical Plan: Expression tree with relational algebra operators (JOIN, SELECT)
Physical Plan: Expression tree with executable algorithms (HASH_JOIN, INDEX_SCAN)

The optimizer converts logical to physical, choosing algorithms based on costs and constraints.

Tree Transformations for Optimization

Query optimization is largely about transforming expression trees into equivalent but more efficient forms. The tree representation enables systematic application of transformation rules.

What Makes Trees Transformable:

Closure under transformations: Transforming a valid tree always yields another valid tree
Local transformations: Most rules modify small subtrees
Composability: Multiple transformations can be applied sequentially
Reversibility: Transformations can be undone if not beneficial

Common Transformation Patterns:

Selection Pushdown:

Move selection closer to leaves to filter early:

Before: σ_p(R ⋈ S)

After (when p involves only R): σ_p(R) ⋈ S

Join Reordering:

Change the order of joins to minimize intermediate sizes:

Before: (R ⋈ S) ⋈ T

After (if S ⋈ T is smaller): R ⋈ (S ⋈ T)

Key Tree Transformation Rules
Transformation	When Applicable	Effect
Selection Pushdown	Selection predicate references only one child	Filter rows early, reduce intermediate sizes
Projection Pushdown	Projected columns subset of child's output	Narrow tuples early, reduce memory
Join Reordering	Inner joins, which are associative and commutative	Minimize intermediate join sizes
Selection Splitting	Conjunction of predicates	Push independent parts separately
Projection Pulling	Repeated projections	Combine into single projection
Join to Semijoin	Only need existence check	Avoid retrieving full joined tuples
Subquery Decorrelation	Correlated subquery	Convert to flat join structure

Transformation as Tree Rewriting:

Each transformation rule can be expressed as a pattern matching and replacement:

Rule: Selection Pushdown over Join
Pattern:
  σ_p(R ⋈ S) where p references only R
Replacement:
  σ_p(R) ⋈ S

Algorithm:
  For each node N in tree:
    If N matches pattern left-hand side:
      Replace N with pattern right-hand side
      Update parent/child pointers
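
The rule and algorithm above can be sketched as a recursive rewrite over immutable tree nodes. This is an illustration under assumptions: it handles only inner joins, represents a predicate by the set of columns it references, and the names Rel, Select, Join, push_down, and show are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:
    name: str
    columns: frozenset


@dataclass(frozen=True)
class Select:
    pred_columns: frozenset  # columns the predicate references
    pred_text: str
    child: object


@dataclass(frozen=True)
class Join:  # inner join only
    left: object
    right: object


def columns_of(node):
    """Columns available in the output of a subtree."""
    if isinstance(node, Rel):
        return node.columns
    if isinstance(node, Select):
        return columns_of(node.child)
    return columns_of(node.left) | columns_of(node.right)


def push_down(node):
    """Rewrite σ_p(A ⋈ B) to σ_p(A) ⋈ B (or A ⋈ σ_p(B)) wherever p allows."""
    if isinstance(node, Rel):
        return node
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    child = push_down(node.child)
    if isinstance(child, Join):
        if node.pred_columns <= columns_of(child.left):
            pushed = Select(node.pred_columns, node.pred_text, child.left)
            return Join(push_down(pushed), child.right)
        if node.pred_columns <= columns_of(child.right):
            pushed = Select(node.pred_columns, node.pred_text, child.right)
            return Join(child.left, push_down(pushed))
    return Select(node.pred_columns, node.pred_text, child)  # pattern did not match


def show(node):
    if isinstance(node, Rel):
        return node.name
    if isinstance(node, Select):
        return f"σ[{node.pred_text}]({show(node.child)})"
    return f"({show(node.left)} ⋈ {show(node.right)})"


R = Rel("R", frozenset({"a", "b"}))
S = Rel("S", frozenset({"c", "d"}))
T = Rel("T", frozenset({"e"}))

one_sided = show(push_down(Select(frozenset({"a"}), "a > 5", Join(R, S))))
two_sided = show(push_down(Select(frozenset({"a", "c"}), "a = c", Join(R, S))))
nested = show(push_down(Select(frozenset({"a"}), "a > 5", Join(Join(R, S), T))))
```

Because nodes are immutable, the rewrite builds a new tree instead of updating parent and child pointers in place, which also keeps the original tree available for cost comparison.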

The Optimizer's Task:

Given an input tree, the optimizer:

Enumerates applicable transformations
Generates alternative trees
Estimates cost of each alternative
Selects the lowest-cost tree (or uses heuristics)

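This process might explore thousands of trees for complex queries. Optimizers typically use dynamic programming to search efficiently, and some fall back to randomized search such as genetic algorithms when the number of joins makes exhaustive enumeration impractical.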
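
The enumerate, estimate, and select loop can be illustrated with a deliberately naive search. The sketch below brute-forces every left-deep join order for three relations and scores each one by the total size of its intermediate results. The cardinalities, selectivities, and cost formula are made-up assumptions, and a real optimizer would use dynamic programming over subsets rather than trying every permutation.

```python
from itertools import permutations

# Hypothetical statistics
card = {"R": 10000, "S": 100, "T": 10}
# Selectivity of the join predicate between a pair of relations.
# Pairs without an entry have no predicate, so joining them is a cross product.
sel = {
    frozenset({"R", "S"}): 0.01,
    frozenset({"S", "T"}): 0.1,
}


def plan_cost(order):
    """Toy cost: sum of estimated intermediate result sizes for a left-deep plan."""
    joined = {order[0]}
    rows = card[order[0]]
    cost = 0.0
    for rel in order[1:]:
        selectivity = 1.0
        for other in joined:
            selectivity *= sel.get(frozenset({rel, other}), 1.0)
        rows = rows * card[rel] * selectivity
        cost += rows
        joined.add(rel)
    return cost


candidates = list(permutations(card))
best = min(candidates, key=plan_cost)
```

With these numbers, the orders that join S and T first win because their first intermediate result is tiny, while orders that start with R and T pay for a cross product.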

Tree Properties and Invariants

Expression trees carry properties and maintain invariants that help optimizers make decisions and ensure correctness.

Logical Properties:

Properties determined by the logical expression, independent of physical execution:

Output Schema: Columns and types produced by the subtree
Candidate Keys: Sets of attributes that uniquely identify tuples
Functional Dependencies: Relationships between attribute values
NOT NULL Constraints: Attributes guaranteed non-null
Cardinality Estimate: Expected number of output tuples

Physical Properties:

Properties that depend on how the operator is executed:

Ordering: Tuples sorted by certain attributes (important for merge join, ORDER BY)
Partitioning: How tuples are distributed across parallel executors
Locality: Where data physically resides

Property Propagation:

Properties propagate through the tree:

Selection preserves ordering and can only reduce cardinality
Projection may destroy ordering if the sort column is removed
Join combines properties from both inputs (complex rules)
Sort establishes a new ordering, replacing whatever ordering its input had

Example: Ordering Propagation:

For tree: ORDER BY salary(σ_dept='Eng'(Employees))

If Employees has index on (dept, salary):
  - Scan returns rows ordered by (dept, salary)
  - Selection preserves ordering (single dept value)
  - Order by salary satisfied—no additional sort needed!
  
If Employees has no suitable index:
  - Scan returns unordered rows
  - Selection produces unordered result
  - ORDER BY requires explicit sort operation
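
The reasoning in this example can be written as a small check. It is a sketch with simplifying assumptions: all orderings are ascending, an ordering is a list of column names, and sort_needed is a hypothetical helper name. Columns fixed to a single value by an equality selection are treated as constants and ignored on both sides.

```python
def sort_needed(available_order, equality_bound, required_order):
    """Return True if an explicit sort is still required.

    available_order: columns the input is sorted by (e.g. index key order)
    equality_bound:  columns pinned to one value by an equality selection
    required_order:  columns the consumer needs the rows sorted by
    """
    effective = [c for c in available_order if c not in equality_bound]
    needed = [c for c in required_order if c not in equality_bound]
    # The required columns must form a prefix of the effective ordering.
    return effective[: len(needed)] != needed


# Index on (dept, salary), selection dept = 'Eng', ORDER BY salary
with_index = sort_needed(["dept", "salary"], {"dept"}, ["salary"])
# No suitable index: the scan provides no ordering
without_index = sort_needed([], {"dept"}, ["salary"])
# Same index but no selection on dept: rows are sorted by dept first
without_selection = sort_needed(["dept", "salary"], set(), ["salary"])
```

The third case shows why the selection matters: without dept pinned to one value, an index on (dept, salary) does not deliver rows in salary order.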

Key Property Categories

•Schema Properties: Column names, types, nullability—determine valid operations
•Cardinality Properties: Row count estimates—drive cost calculations
•Ordering Properties: Sort order—affects join and order-by efficiency
•Partitioning Properties: Data distribution—affects parallel execution
•Key Properties: Uniqueness—enables duplicate elimination optimizations
•Constraint Properties: Check constraints, FK—enable semantic optimizations

Interesting Orders:

The concept of "interesting orders" is crucial for optimization:

An ordering is "interesting" if it benefits a downstream operator
ORDER BY creates demand for a specific order
Merge join requires both inputs sorted on join column
Group by benefits from grouped/sorted input

The optimizer tracks which orderings are interesting and prefers plans that produce them, even at slight extra cost, to avoid expensive sort operations later.

Property-Based Optimization:

Modern optimizers use properties for:

Plan pruning: If a plan can't produce required properties, discard it early
Algorithm selection: Choose merge join if inputs are sorted; hash join otherwise
Enforcer insertion: Add sort/exchange operators when required properties aren't present
Cost adjustment: Factor in property mismatches when comparing plans
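
Two of these uses, algorithm selection and enforcer insertion, fit in a few lines. The sketch below is a simplification under assumptions: plans are nested tuples, orderings are lists of column names, both join inputs use the same join column name, and the function names are invented. A real optimizer would compare costs rather than always preferring merge join on sorted inputs.

```python
def satisfies(provided, required):
    """An ordering satisfies a requirement if the requirement is a prefix of it."""
    return provided[: len(required)] == required


def enforce_order(plan, provided, required):
    """Insert a Sort enforcer only when the required ordering is missing."""
    if satisfies(provided, required):
        return plan, provided
    return ("Sort", tuple(required), plan), list(required)


def choose_join(left, left_order, right, right_order, join_col):
    """Pick merge join when both inputs are already sorted on the join column."""
    if satisfies(left_order, [join_col]) and satisfies(right_order, [join_col]):
        return ("MergeJoin", join_col, left, right)
    return ("HashJoin", join_col, left, right)


sorted_inputs = choose_join("E", ["dept_id"], "D", ["dept_id"], "dept_id")
unsorted_input = choose_join("E", [], "D", ["dept_id"], "dept_id")
already_ordered = enforce_order("IndexScan", ["dept", "salary"], ["dept"])
needs_enforcer = enforce_order("SeqScan", [], ["salary"])
```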

The Volcano and Cascades optimizer frameworks formalize property-based optimization, treating properties as first-class optimization dimensions.

Parallelism and Tree Structure

Expression trees naturally expose opportunities for parallel execution. Understanding these opportunities is essential for high-performance query processing.

Independent Subtrees:

Subtrees that share no data dependencies can execute concurrently:

           ⋈
          / \
         /   \
     σ(R)     σ(S)

The two selection subtrees (on R and on S) can run simultaneously on different processors. The join waits for both to complete, then combines results.

Pipeline Parallelism:

In pipelining, producer and consumer operators run concurrently:

Producer: Table Scan on Employees
  → Stream tuples to →
Consumer: Selection (salary > 50000)
  → Stream tuples to →
Consumer: Projection (name)

All three operators can be active simultaneously, each processing different tuples.

Partition Parallelism:

Data is partitioned, and the same operator runs on multiple partitions:

         Gather
        /  |  \
      σ    σ    σ
      |    |    |
   R(p1) R(p2) R(p3)

The relation R is split into partitions p1, p2, p3. Parallel selections run on each partition. Results are gathered at the end.
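
The same shape can be sketched with a thread pool: one selection task per partition and a gather step that concatenates the results. This only illustrates the structure. The function names and sample data are assumptions, and in CPython threads do not speed up CPU-bound filtering because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def select_partition(partition, predicate):
    """The σ operator applied to a single partition."""
    return [row for row in partition if predicate(row)]


def parallel_select(partitions, predicate, max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per partition; map returns results in partition order.
        parts = pool.map(lambda p: select_partition(p, predicate), partitions)
        # Gather: concatenate the per-partition outputs.
        return [row for part in parts for row in part]


# Hypothetical partitions p1, p2, p3 of relation R
partitions = [
    [{"salary": 40000}, {"salary": 60000}],
    [{"salary": 70000}],
    [{"salary": 10000}],
]
gathered = parallel_select(partitions, lambda r: r["salary"] > 50000)
```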

Exchange Operators:

Distributed databases insert exchange operators into trees to manage data distribution:

Scatter/Partition: Split data and send to multiple nodes
Gather: Collect data from multiple nodes
Repartition: Redistribute data to align for join

            Gather
               |
            π(name)
               |
            ⋈ (on hash-partitioned dept_id)
           / \
   Repartition Repartition
       |           |
   σ(Employees) σ(Departments)
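
A single-process sketch shows why repartitioning on the join key makes the join local. Both inputs are hashed on the key into the same number of buckets, so matching rows always land in the same bucket. The function names and sample rows are assumptions, and a distributed system would ship each bucket to a different node instead of looping over them.

```python
def repartition(rows, key, n):
    """Exchange step: route each row to a bucket chosen by hashing the join key."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets


def local_join(emps, depts):
    """Hash join within one partition: build on departments, probe with employees."""
    by_id = {d["id"]: d for d in depts}
    return [
        (e["name"], by_id[e["dept_id"]]["dept_name"])
        for e in emps
        if e["dept_id"] in by_id
    ]


def partitioned_join(employees, departments, n=2):
    emp_parts = repartition(employees, "dept_id", n)
    dept_parts = repartition(departments, "id", n)
    out = []  # Gather
    for emp_part, dept_part in zip(emp_parts, dept_parts):
        out.extend(local_join(emp_part, dept_part))
    return out


# Hypothetical sample data
employees = [
    {"name": "Ada", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": 3},  # no matching department
]
departments = [
    {"id": 1, "dept_name": "Eng"},
    {"id": 2, "dept_name": "Ops"},
]
joined = sorted(partitioned_join(employees, departments))
```

The result is the same as an unpartitioned join, but no partition ever needs rows from another one.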

Blockers and Parallelism:

Pipeline-breaking operators create synchronization points:

Before a sort, all input must be gathered
Before global aggregation, partial aggregates must be collected
Before broadcast join, the broadcast table must be fully distributed

These blockers limit parallelism but are sometimes unavoidable.

Tree Width and Parallelism:

Narrow trees (long chains) have limited parallelism
Wide trees (many sibling subtrees) have high parallelism potential
The ideal width depends on available processors and data distribution

Summary: Expression Trees

We've explored expression trees—the fundamental representation of relational algebra queries. Let's consolidate the key insights:

Key Takeaways

•Structure — Trees have leaf nodes (base relations), internal nodes (operators), and a root (final result producer).
•Construction — Relational algebra expressions map directly to trees; nesting becomes tree depth.
•Evaluation — Bottom-up (materialized) or top-down (pipelined) execution strategies traverse the tree to compute results.
•Rich Metadata — Nodes carry schema, cost estimates, properties, and physical operator choices.
•Transformation — Optimization transforms trees via equivalence-preserving rewrites like selection pushdown and join reordering.
•Properties — Logical and physical properties (ordering, partitioning) propagate through trees and guide optimization.
•Parallelism — Tree structure exposes parallelism opportunities: independent subtrees, pipelining, and partition parallelism.
