Relational algebra trees provide a mathematical representation of query semantics. Query graphs reveal structural relationships. But optimizers need more than either alone offers—they need a working representation rich enough to carry metadata annotations, support systematic transformation, and drive cost estimation.
Logical plans are this comprehensive representation. They combine the operational structure of algebra trees with the metadata annotations needed for optimization. A logical plan is not just a representation—it's the optimizer's workspace where queries are analyzed, transformed, and refined into efficient execution strategies.
By the end of this page, you will understand what logical plans are, how they differ from physical plans, the properties and annotations they carry, how optimizers transform them, and how they serve as the bridge between query understanding and query execution.
Before diving into logical plans, it's essential to understand the distinction between logical and physical plans. This separation is fundamental to query processing architecture.
Logical Plans: describe what the query computes, in terms of relational operations and their semantics, independent of any particular algorithm or access path.
Physical Plans: describe how the result is computed, naming a concrete algorithm, access method, and execution strategy for each operation.
| Logical Operation | Possible Physical Implementations |
|---|---|
| Join (⋈) | Nested Loop Join, Hash Join, Sort-Merge Join, Index Nested Loop |
| Selection (σ) | Filter, Index Scan with predicate, Bitmap Index Scan |
| Table Access | Sequential Scan, Index Scan, Index-Only Scan, Bitmap Heap Scan |
| Aggregation (γ) | Stream Aggregate (if sorted), Hash Aggregate, Partial + Final |
| Sorting (τ) | In-memory Quicksort, External Merge Sort, Index Scan (presorted) |
| Distinct (δ) | Sort + Deduplicate, Hash Deduplicate, Streaming (if sorted) |
| Union (∪) | Append + Deduplicate, or Merge (if sorted inputs) |
Why Separate?
This separation provides critical benefits:
Modularity: Logical optimization ("push selection before join") and physical optimization ("choose hash join over nested loop") are independent concerns.
Search space management: Logical transformations are typically correct regardless of data characteristics; physical choices depend on statistics.
Extensibility: New physical operators can be added without changing logical representation.
Portability: The same logical plan can generate different physical plans for different storage backends or execution environments.
Most optimizers work in phases: (1) Parse SQL to parse tree, (2) Convert to initial logical plan, (3) Apply logical transformations, (4) Generate physical plan alternatives, (5) Cost and select best physical plan. Logical plans are the output of step 2 and the workspace for step 3.
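These five phases can be sketched as a simple pipeline. This is a conceptual sketch only: every type and function name here is illustrative, not taken from any particular engine, and the costs are invented.

```typescript
// Conceptual optimizer pipeline; all types, names, and costs are illustrative.
type ParseTree = { kind: "parse"; sql: string };
type LogicalPlan = { kind: "logical"; op: string; children: LogicalPlan[] };
type PhysicalPlan = { kind: "physical"; op: string; cost: number };

function parse(sql: string): ParseTree {
  return { kind: "parse", sql };                            // (1) SQL -> parse tree
}
function toLogical(tree: ParseTree): LogicalPlan {
  return { kind: "logical", op: "Project", children: [] };  // (2) initial logical plan
}
function applyRules(plan: LogicalPlan): LogicalPlan {
  return plan;                                              // (3) logical transformations
}
function enumeratePhysical(plan: LogicalPlan): PhysicalPlan[] {
  return [                                                  // (4) physical alternatives
    { kind: "physical", op: "HashJoin", cost: 120 },
    { kind: "physical", op: "MergeJoin", cost: 95 },
  ];
}
function pickBest(alts: PhysicalPlan[]): PhysicalPlan {
  return alts.reduce((a, b) => (a.cost <= b.cost ? a : b)); // (5) cost & select
}

const best = pickBest(enumeratePhysical(applyRules(toLogical(parse("SELECT ...")))));
// best.op === "MergeJoin" in this toy example (cost 95 < 120)
```

The point of the sketch is the separation of concerns: steps 2 and 3 operate entirely on logical plans, and only steps 4 and 5 ever see physical operators or costs.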
A logical plan is typically represented as a tree (or DAG in advanced systems) of logical operators, each encapsulating an operation and its parameters.
```typescript
// Conceptual structure of a logical plan node
interface LogicalOperator {
  // Identity
  operatorType: LogicalOpType;   // JOIN, SELECT, PROJECT, AGGREGATE, etc.
  operatorId: string;            // Unique identifier in plan

  // Tree structure
  children: LogicalOperator[];   // Input operators (0 for leaves, 1+ for others)

  // Operation parameters (vary by type)
  // For SELECT:
  predicate?: Expression;
  // For PROJECT:
  projectList?: ColumnExpression[];
  // For JOIN:
  joinType?: 'INNER' | 'LEFT' | 'RIGHT' | 'FULL' | 'SEMI' | 'ANTI';
  joinCondition?: Expression;
  // For AGGREGATE:
  groupingKeys?: Column[];
  aggregateFunctions?: AggregateCall[];
  // For SCAN:
  tableName?: string;
  tableAlias?: string;

  // Schema information
  outputSchema: Schema;          // Columns produced

  // Statistical properties (derived)
  estimatedRowCount: number;
  estimatedDistinctValues: Map<Column, number>;
  nullableCols: Set<Column>;

  // Logical properties (derived)
  uniqueKeys: ColumnSet[];             // Column combinations guaranteed unique
  functionalDependencies: FDSet;       // A → B relationships
  sortOrdering: SortOrder[];           // If known to be sorted

  // Cost estimate (computed during optimization)
  estimatedCost?: Cost;
}

// Example: Logical plan for SELECT name FROM emp WHERE salary > 50000
const examplePlan: LogicalOperator = {
  operatorType: 'PROJECT',
  operatorId: 'proj-1',
  children: [{
    operatorType: 'SELECT',
    operatorId: 'sel-1',
    predicate: { type: 'comparison', left: 'salary', op: '>', right: 50000 },
    children: [{
      operatorType: 'SCAN',
      operatorId: 'scan-1',
      tableName: 'employees',
      children: [],
      outputSchema: ['id', 'name', 'salary', 'dept_id'],
      estimatedRowCount: 100000,
      // ...
    }],
    outputSchema: ['id', 'name', 'salary', 'dept_id'],
    estimatedRowCount: 15000,  // After selection
    // ...
  }],
  projectList: [{ column: 'name' }],
  outputSchema: ['name'],
  estimatedRowCount: 15000,
  // ...
};
```

Key Components Explained:
Operator Type: The logical operation being performed. This determines the operator's semantics regardless of physical implementation.
Children: Input operators whose output this operator consumes. Leaves (scans) have no children; binary operators (joins) have two; most others have one.
Operation Parameters: Type-specific details—predicates for selection, column lists for projection, join conditions and types for joins.
Output Schema: The columns this operator produces. Essential for type checking and propagating attributes.
Statistical Properties: Cardinality estimates, NDV (number of distinct values), null frequencies. These drive cost estimation.
Logical Properties: Invariants like unique keys, sort order, functional dependencies. Enable certain optimizations.
Some properties are specified (operation type, parameters). Others are derived by propagating through the plan—cardinality estimates propagate from leaves upward, transforming at each node based on selectivity estimates. This derivation is a core optimizer activity.
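As a concrete sketch of this bottom-up derivation, here is how a cardinality estimate might propagate through a SELECT node. The selectivity model is deliberately minimal and the node shape is illustrative, not any engine's actual representation:

```typescript
// Illustrative bottom-up cardinality derivation for SCAN and SELECT nodes.
interface PlanNode {
  opType: "SCAN" | "SELECT";
  children: PlanNode[];
  selectivity?: number;   // fraction of rows the predicate keeps (SELECT only)
  baseRows?: number;      // catalog row count (SCAN only)
}

function estimateRows(node: PlanNode): number {
  if (node.opType === "SCAN") {
    return node.baseRows ?? 0;                      // leaves: from catalog statistics
  }
  // non-leaves: derive from the child's estimate, scaled by selectivity
  const inputRows = estimateRows(node.children[0]);
  return Math.round(inputRows * (node.selectivity ?? 1));
}

// Same shape and numbers as the example plan above:
// 100000 base rows, 15% pass the salary predicate.
const plan: PlanNode = {
  opType: "SELECT",
  selectivity: 0.15,
  children: [{ opType: "SCAN", baseRows: 100000, children: [] }],
};
console.log(estimateRows(plan)); // 15000
```

Real optimizers derive selectivity from histograms and NDV statistics rather than taking it as an input, but the direction of information flow is the same: leaves are seeded from the catalog, and every interior node computes its estimate from its children.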
Databases define a set of logical operators corresponding to relational algebra operations and SQL constructs. While implementations vary, a core set appears universally.
Leaf operators also include inline row constructors such as VALUES (1, 'a'), (2, 'b') and table functions such as generate_series() or custom UDTFs.

Operator Relationships:
Logical operators form a well-defined algebra. Understanding their relationships enables transformations:
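A few of the standard identities, written in relational-algebra notation (here $p$ and $q$ are predicates, $R$, $S$, $T$ relations, and $L$ a projection list):

```latex
% Cascading and commuting selections
\sigma_{p \wedge q}(R) = \sigma_p(\sigma_q(R)) = \sigma_q(\sigma_p(R))

% Selection pushdown through join, when p references only attributes of R
\sigma_p(R \bowtie S) = \sigma_p(R) \bowtie S

% Join commutativity and associativity (inner joins)
R \bowtie S = S \bowtie R, \qquad (R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)

% Projection commutes with selection when the predicate's columns survive
\pi_L(\sigma_p(R)) = \sigma_p(\pi_L(R)) \quad \text{when } \mathrm{cols}(p) \subseteq L
```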
These identities form the foundation of logical transformation rules.
Each logical operator node carries logical properties—attributes that describe invariants of its output regardless of physical implementation. These properties enable optimizations and are propagated/computed throughout the plan.
```typescript
// Property propagation examples

// Unique key propagation through JOIN
function propagateUniquenessJoin(
  leftKeys: ColumnSet[],
  rightKeys: ColumnSet[],
  joinType: JoinType,
  joinCondition: Expression
): ColumnSet[] {
  // Inner join: both sides' unique keys are preserved
  // (assuming equi-join on key columns)
  if (joinType === 'INNER') {
    return [...leftKeys, ...rightKeys];
  }

  // Left outer: only left's uniqueness is preserved
  // (right side may have nulls = duplicates from left's perspective)
  if (joinType === 'LEFT') {
    return leftKeys;
  }

  // Full outer: neither side's uniqueness guaranteed
  return [];
}

// Sort order propagation through FILTER
function propagateSortFilter(
  inputSortOrder: SortSpec[],
  filterPredicate: Expression
): SortSpec[] {
  // Filtering preserves sort order (removes rows but keeps order)
  return inputSortOrder;
}

// Sort order propagation through PROJECT
function propagateSortProject(
  inputSortOrder: SortSpec[],
  projectList: Column[]
): SortSpec[] {
  // Only preserve if all sort columns are in output
  return inputSortOrder.filter(spec =>
    projectList.some(col => col.equals(spec.column))
  );
}
```

Property Computation Order:
Properties are computed in a bottom-up pass:
Leaf nodes (scans) get properties from catalog metadata: schema from table definition, cardinality from statistics, unique keys from primary/unique constraints.
Each operator computes its properties from its children's properties using operator-specific rules.
Root node has the final query output properties.
Some optimizations require top-down propagation as well—knowing what properties the parent needs can influence child decisions (e.g., requiring a specific sort order).
If you know a sort operator's input is already sorted on the required columns (via sort order property), you can eliminate the sort. If you know a column has a unique key, you can remove redundant DISTINCT. Properties turn potential optimizations into applicable optimizations.
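Both checks mentioned here can be sketched directly against the property representation. The types below are illustrative simplifications (real optimizers track sort direction, null ordering, and equivalence classes as well):

```typescript
// Illustrative property-driven simplifications.
type Column = string;
interface Props {
  sortOrdering: Column[];   // columns the output is known to be sorted on
  uniqueKeys: Column[][];   // column sets guaranteed unique
}

// Sort elimination: the required order is a prefix of the input's known order.
function sortIsRedundant(required: Column[], input: Props): boolean {
  return required.every((col, i) => input.sortOrdering[i] === col);
}

// DISTINCT removal: some unique key is fully contained in the output columns,
// so every output row is already distinct.
function distinctIsRedundant(outputCols: Column[], input: Props): boolean {
  return input.uniqueKeys.some(key => key.every(col => outputCols.includes(col)));
}

const props: Props = { sortOrdering: ["dept_id", "salary"], uniqueKeys: [["id"]] };
console.log(sortIsRedundant(["dept_id"], props));        // true: prefix of input order
console.log(distinctIsRedundant(["id", "name"], props)); // true: key 'id' is in output
console.log(distinctIsRedundant(["name"], props));       // false: no key survives
```

This is exactly what "properties turn potential optimizations into applicable optimizations" means: the rewrite rule is trivial; the work is in deriving and trusting the properties.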
The power of logical plans lies in transformation rules—systematic ways to rewrite plans while preserving semantics. Optimizers apply these rules to explore the space of equivalent plans, seeking lower-cost alternatives.
```sql
-- Original Plan (canonical translation)
--
-- PROJECT (c.name, o.total)
--   |
-- FILTER (c.region = 'West' AND o.status = 'completed')
--   |
-- JOIN (c.id = o.customer_id)
--  /       \
-- SCAN(c)  SCAN(o)

-- After Selection Pushdown
--
-- PROJECT (c.name, o.total)
--   |
-- JOIN (c.id = o.customer_id)
--  /         \
-- FILTER     FILTER
-- (region    (status =
--  = 'West')  'completed')
--  |           |
-- SCAN(c)    SCAN(o)

-- Why is this better?
-- Before: Join full tables, then filter result
--   - customers: 1M rows, orders: 5M rows
--   - Join produces 5M rows (assume 5 orders/customer)
--   - Filter reduces to 200K rows

-- After: Filter first, then join filtered results
--   - Filter customers: 1M → 200K ('West' region)
--   - Filter orders: 5M → 1M ('completed' status)
--   - Join filtered: 200K * 1M → ~200K (filtered customers have fewer orders)
-- Massive reduction in intermediate data!
```

Not all transformations are always valid. Selection pushdown through outer joins requires null-rejecting predicates. Join reordering doesn't apply to outer joins freely. Aggregate pushdown has complex validity requirements. Optimizers must verify conditions before applying rules.
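The null-rejecting condition can be sketched as a simple check. This is a heavy simplification (real optimizers reason over full expression trees and three-valued logic), and the predicate shapes here are illustrative:

```typescript
// Simplified validity check: may a predicate be pushed below an outer join
// to the null-producing side? Only if it is null-rejecting, i.e. a NULL
// input makes it false/unknown; otherwise pushdown changes the result.
type Pred =
  | { kind: "isNull"; column: string }
  | { kind: "compare"; column: string; op: string; value: number };

function isNullRejecting(p: Pred): boolean {
  // Comparisons evaluate to UNKNOWN on NULL, so NULL rows are filtered out.
  // IS NULL is *satisfied* by NULL, so it is NOT null-rejecting.
  return p.kind === "compare";
}

function canPushToNullableSide(p: Pred, joinType: "INNER" | "LEFT"): boolean {
  if (joinType === "INNER") return true;   // inner joins: pushdown always safe
  return isNullRejecting(p);               // outer joins: must reject NULLs
}

console.log(canPushToNullableSide(
  { kind: "compare", column: "status", op: "=", value: 1 }, "LEFT")); // true
console.log(canPushToNullableSide(
  { kind: "isNull", column: "status" }, "LEFT"));                     // false
```

The intuition: a null-rejecting predicate on the nullable side discards exactly the padding rows an outer join would have produced, so filtering early and filtering late agree; an IS NULL predicate does not, so it must stay above the join.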
Modern optimizers don't generate plans one at a time. Instead, they use sophisticated data structures to represent all equivalent plans simultaneously. The key abstraction is the equivalence class or group.
Core Concepts:
For example, (A ⋈ B) ⋈ C and A ⋈ (B ⋈ C) belong to the same group if they produce identical output.
```typescript
// Simplified memo structure concept

interface Group {
  groupId: number;
  logicalProps: LogicalProperties;   // Shared by all expressions
  expressions: GroupExpression[];    // All equivalent ways to compute
  bestPlan?: Winner;                 // Best physical plan found
}

interface GroupExpression {
  operator: LogicalOperator;
  childGroups: Group[];              // Children are groups, not expressions!
}

// Example: For query joining A, B, C
// The memo might contain:

// Group 0: Base table A
//   - Expression: TableScan('A')

// Group 1: Base table B
//   - Expression: TableScan('B')

// Group 2: Base table C
//   - Expression: TableScan('C')

// Group 3: Join result of A and B
//   - Expression 1: Join(Group0, Group1, A.id = B.a_id)
//   - Expression 2: Join(Group1, Group0, A.id = B.a_id)  // Commuted

// Group 4: Final result: join of Group3 and C
//   - Expression 1: Join(Group3, Group2, ...)  // (A⋈B)⋈C
//   - Expression 2: ...

// Meanwhile, alternative:
// Group 5: Join result of B and C
//   - Expression: Join(Group1, Group2, B.id = C.b_id)

// Group 4 can also include:
//   - Expression 3: Join(Group0, Group5, ...)  // A⋈(B⋈C)

// The memo compactly represents ALL valid join orderings!
```

Why Memo Structures Matter:
Compact representation: Exponentially many plans in polynomial space. Subplans are shared.
Avoid redundant work: Once a group is explored, its expressions are available to all parents.
Top-down search: Can prune entire groups if they can't beat current best cost.
Memoization: Store intermediate results (best costs for groups)—avoid recomputation.
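Memoization and cost-bound pruning can be shown together in a minimal sketch. The costs are toy numbers, and real Cascades-style optimizers additionally track required physical properties per search request:

```typescript
// Toy top-down optimization over a memo: each group caches its best cost,
// and exploration of a branch stops once it cannot beat the cost bound.
interface Expr { localCost: number; childGroups: Grp[] }
interface Grp { id: number; expressions: Expr[]; best?: number }

function optimize(group: Grp, costLimit: number): number | undefined {
  if (group.best !== undefined) return group.best;      // memoization
  let best: number | undefined;
  for (const expr of group.expressions) {
    let total = expr.localCost;
    for (const child of expr.childGroups) {
      if (total >= costLimit) { total = Infinity; break; } // prune this branch
      const childCost = optimize(child, costLimit - total);
      if (childCost === undefined) { total = Infinity; break; }
      total += childCost;
    }
    if (total < (best ?? costLimit)) best = total;
  }
  if (best !== undefined) group.best = best;            // cache the winner
  return best;
}

const a: Grp = { id: 0, expressions: [{ localCost: 10, childGroups: [] }] };
const b: Grp = { id: 1, expressions: [{ localCost: 5, childGroups: [] }] };
const join: Grp = {
  id: 2,
  expressions: [
    { localCost: 100, childGroups: [a, b] },  // e.g. one join flavor/order
    { localCost: 40, childGroups: [b, a] },   // e.g. a commuted alternative
  ],
};
console.log(optimize(join, Infinity)); // 55 (40 + 5 + 10)
```

Note that the second expression reuses the already-optimized groups for `a` and `b` rather than re-exploring them; that sharing is what makes the memo's exponential plan space tractable.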
This approach, pioneered in the Cascades optimizer framework, is used by SQL Server, CockroachDB, Apache Calcite, and many others.
The Cascades optimization framework (evolved from Volcano/EXODUS) introduced memo-based optimization with rules transforming groups. Most modern commercial and open-source databases use variants of this approach. Understanding memos helps you understand optimizer behavior and debug plan issues.
The culmination of logical planning is physical plan generation—selecting concrete algorithms for each logical operator. This transition considers:
| Logical Op | Physical Choice | Key Decision Factors |
|---|---|---|
| JOIN | Nested Loop | Small inner side, index available, low cardinality |
| JOIN | Hash Join | No useful sort, one side fits in memory, equality condition |
| JOIN | Merge Join | Both inputs sorted (or index), many rows, equality condition |
| AGGREGATE | Stream Agg | Input already sorted on grouping columns |
| AGGREGATE | Hash Agg | Unsorted input, enough memory for hash table |
| SORT | In-memory | Small data, fits in sort buffer |
| SORT | External | Large data, must spill to disk |
| SCAN | Seq Scan | No useful index, large fraction of table needed |
| SCAN | Index Scan | Selective predicate matching index, good clustering |
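The decision factors in this table can be sketched as a toy join chooser. Every threshold and rule here is invented for illustration; real planners compare full cost estimates rather than applying fixed cutoffs:

```typescript
// Toy physical-join chooser driven by the factors in the table above.
interface JoinInput {
  leftRows: number;
  rightRows: number;
  leftSorted: boolean;          // sorted on the join key
  rightSorted: boolean;
  equiJoin: boolean;            // equality join condition
  memoryRows: number;           // rows that fit in an in-memory hash table
  innerIndexAvailable: boolean; // usable index on the inner side
}

function chooseJoin(j: JoinInput): string {
  const smaller = Math.min(j.leftRows, j.rightRows);
  if (j.innerIndexAvailable && smaller < 1000) return "IndexNestedLoop";
  if (j.equiJoin && j.leftSorted && j.rightSorted) return "MergeJoin";
  if (j.equiJoin && smaller <= j.memoryRows) return "HashJoin";
  return "NestedLoop";          // fallback when nothing better applies
}

console.log(chooseJoin({
  leftRows: 1_000_000, rightRows: 50_000,
  leftSorted: false, rightSorted: false,
  equiJoin: true, memoryRows: 100_000, innerIndexAvailable: false,
})); // "HashJoin": equality condition, smaller side fits in memory
```

A real optimizer would cost each admissible alternative and keep the cheapest, but the inputs to that costing are exactly these: cardinalities, sort orders, memory, indexes, and the join condition's shape.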
```sql
-- Logical Plan:
--
-- PROJECT (name, total)
--   |
-- AGGREGATE (GROUP BY cust_id; SUM(amount))
--   |
-- JOIN (orders.cust_id = customers.id)
--  /         \
-- SCAN       SCAN
-- (orders)   (customers)

-- Possible Physical Plans:

-- Option A: Hash-based
--   SequentialScan(orders) -> HashAgg(cust_id, SUM)
--     -> HashJoin with SeqScan(customers) -> Project

-- Option B: Sort-based (if index on orders.cust_id)
--   IndexScan(orders, cust_id_idx) -> StreamAgg(cust_id, SUM)
--     -> MergeJoin with IndexScan(customers, pk_idx) -> Project

-- Option C: Nested Loop (if customers is tiny)
--   SeqScan(customers) -> NestedLoop with
--     [for each customer: IndexScan(orders) -> filtered agg] -> Project

-- The optimizer costs each option and picks lowest total cost.
-- Option depends on table sizes, indexes, memory, statistics.
```

Property Enforcement:
Physical planning may need to enforce properties required by parent operators: an explicit Sort inserted so a merge join receives ordered input, or an exchange operator inserted so a parallel plan receives the required data distribution.
The optimizer considers both: operators that deliver a required property as a natural by-product (an index scan produces sorted output) and enforcer operators added solely to establish it (an explicit Sort node).
Cost includes both the operator and any enforcement needed.
Some optimizers track 'interesting orders'—sort orders potentially useful later in the plan. Generating a sorted intermediate result may cost more now but avoids a sort later. This interplay of properties across plan levels requires global optimization, not just local greedy choices.
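The interesting-orders tradeoff is ultimately a global cost comparison. A toy numeric sketch, with all costs invented for illustration:

```typescript
// Toy illustration of 'interesting orders': a locally more expensive sorted
// access path can win globally because it lets a later merge join skip an
// explicit sort. All cost numbers below are made up.
const seqScan = 100;    // cheap, but output is unsorted
const indexScan = 140;  // pricier, but output arrives sorted on the join key
const sortCost = 80;    // explicit sort needed before a merge join otherwise
const mergeJoin = 50;

const planUnsorted = seqScan + sortCost + mergeJoin; // greedy local choice
const planSorted = indexScan + mergeJoin;            // globally cheaper
console.log(planUnsorted, planSorted); // 230 190
```

A purely greedy, node-by-node optimizer would pick the sequential scan (100 < 140) and never discover the cheaper overall plan; tracking interesting orders keeps the sorted alternative alive until the sort saving becomes visible.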
Logical plans are the heart of query optimization—the representation where queries are understood, transformed, and prepared for execution. They bridge the gap between declarative SQL and imperative physical execution.
What's Next:
With logical plans understood, we dive deeper into operator nodes—the individual building blocks of plans. We'll examine each operator's semantics, properties, and behavior in detail, building a comprehensive understanding of how queries decompose into executable operations.
You now understand logical plans—their structure, properties, transformations, and role in optimization. This knowledge is essential for understanding EXPLAIN output, diagnosing query performance, and appreciating how optimizers find efficient execution strategies. Next: Deep dive into Operator Nodes.