Heuristic Optimization

Level: Intermediate

Duration: 90 mins

Topic: Heuristic Optimization

Rule-Based Optimization

The Art of Query Transformation

Before databases learned to count, they learned to follow rules. In the early days of relational database systems, query optimizers didn't have sophisticated statistics about data distributions, couldn't estimate how many rows would flow through a join, and lacked the computational resources for exhaustive plan enumeration. Yet these systems still needed to transform inefficient queries into performant execution plans.

The solution was rule-based optimization: a deterministic approach that applies transformation rules to query plans based on algebraic equivalences and time-tested heuristics. Modern optimizers have added cost-based decision making, but rule-based transformations remain an essential early phase of most query optimization pipelines.

What You Will Learn

By the end of this page, you will understand how rule-based optimizers work, why they were the dominant paradigm in early relational systems, how transformation rules are organized and applied, and why these techniques remain indispensable even in cost-based optimizers. You'll gain insight into the fundamental principle that powers all query optimization: equivalent transformations.

The Rule-Based Paradigm

Rule-based optimization operates on a fundamentally different philosophy than cost-based optimization. Rather than asking "Which plan is cheapest?", it asks "Which transformations make a plan universally better?"

This paradigm is built on a critical insight: some query transformations are almost always beneficial, regardless of data characteristics. Filtering rows before joining them, for instance, reduces the amount of work in nearly every conceivable scenario. You don't need statistics to know this: relational algebra guarantees that the rewritten plan returns the same result, and the join simply has fewer rows to process.

Core Principles of Rule-Based Optimization

A rule-based optimizer embodies several key principles that distinguish it from cost-based approaches:

Foundational Principles

•Determinism — Given the same query, a rule-based optimizer always produces the same plan. There's no random element, no sampling variance, and no dependency on potentially stale statistics. This predictability was highly valued in early production systems.
•Algebraic Correctness — Every transformation preserves the semantics of the original query. The optimizer leverages relational algebra equivalences that guarantee the transformed query produces identical results to the original.
•Heuristic Ordering — Rules are applied in a carefully designed sequence. Early rules handle the most impactful transformations (like predicate pushdown), while later rules handle specialized optimizations.
•Local Optimality — Each rule improves the plan locally, making it better according to some well-understood metric. While this doesn't guarantee global optimality, it reliably improves most queries.
•Simplicity and Maintainability — Rules are explicit, documented, and debuggable. When queries perform unexpectedly, administrators can trace exactly which rules fired and why.

Historical Context: The Oracle RBO Era

The best-known rule-based optimizer was Oracle's RBO (Rule-Based Optimizer). It was Oracle's only optimizer until Oracle 7 introduced a cost-based alternative in the early 1990s, and it stayed in wide use through the late 1990s. Oracle's RBO used a numbered ranking system where each access path had a fixed priority:

Rank	Access Path	Description
1	Single row by ROWID	Direct physical access
2	Single row by cluster join	Using cluster key
3	Single row by hash cluster key with unique or primary key	Unique lookup in a hash cluster
4	Single row by unique or primary key	Unique index lookup
5	Clustered join	Join on a shared cluster key
...	...	...
15	Full table scan	Complete table read

The optimizer simply chose the access path with the lowest rank number, without considering how many rows would be returned or how selective conditions actually were. This approach worked remarkably well for many workloads but failed spectacularly when its assumptions didn't match reality.
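
The failure mode is easy to reproduce in a few lines. The sketch below is a toy model, not Oracle's code: the `AccessPath` type, the rank of 9 for the indexed path, and the `matching_fraction` figures are all illustrative assumptions. It shows a ranking-only chooser picking an index even when the predicate matches almost the whole table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    name: str
    rank: int                 # fixed priority; lower is preferred
    matching_fraction: float  # share of rows touched (never consulted below)


def choose_by_rank(paths):
    """Pick the path with the lowest rank, ignoring selectivity entirely."""
    return min(paths, key=lambda p: p.rank)


# Hypothetical candidates: the index rank (9) is an assumed value below 15.
candidates = [
    AccessPath("index on status", rank=9, matching_fraction=0.95),
    AccessPath("full table scan", rank=15, matching_fraction=1.0),
]

chosen = choose_by_rank(candidates)
print(chosen.name)  # index on status
```

The index wins purely because of its rank, even though reading 95% of the rows through an index is typically slower than scanning the table once.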

The Transition to CBO

Oracle introduced the Cost-Based Optimizer (CBO) in Oracle 7 and ended support for the RBO in Oracle Database 10g. During the years in between, the RBO remained available for backward compatibility, and many legacy applications explicitly requested RBO mode because their queries had been tuned assuming rule-based behavior. This illustrates how deeply optimization strategy affects application design.

Transformation Rules Architecture

A rule-based optimizer is essentially a pattern-matching system that recognizes suboptimal query structures and rewrites them into better forms. Each transformation rule has three components:

•Pattern — A template that matches against portions of the query plan
•Condition — Additional constraints that must be satisfied for the rule to apply
•Action — The transformation to perform when the pattern matches and conditions are met

The Anatomy of a Transformation Rule

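Consider a fundamental transformation rule: Selection Pushdown Past Join. This rule pushes a selection (filter) below an inner join whenever the selection references only columns from one side of the join. Outer joins need extra care, because filtering the null-supplying side before the join changes which rows are padded with NULLs.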

selection_pushdown_rule.pseudo
// RULE: Selection Pushdown Past Join
// =====================================
// Pattern:
//   σ[predicate](R ⋈ S)      (inner join only)
//
// Condition:
//   predicate references ONLY columns from R (or ONLY from S)
//
// Action:
//   If predicate uses only R's columns:
//       Transform to: (σ[predicate](R)) ⋈ S
//   If predicate uses only S's columns:
//       Transform to: R ⋈ (σ[predicate](S))

RULE SelectionPushdownPastJoin {
    MATCH (
        Select(
            predicate = P,
            input = Join(type = INNER, left = R, right = S, condition = J)
        )
    )

    // Case 1: P only uses R's columns, so filter R before joining
    WHEN (
        referencedColumns(P) ⊆ columns(R)
    )
    THEN REPLACE WITH (
        Join(
            type = INNER,
            left = Select(predicate = P, input = R),  // Push selection to R
            right = S,
            condition = J
        )
    )

    // Case 2: P only uses S's columns, so filter S before joining
    WHEN (
        referencedColumns(P) ⊆ columns(S)
    )
    THEN REPLACE WITH (
        Join(
            type = INNER,
            left = R,
            right = Select(predicate = P, input = S),  // Push selection to S
            condition = J
        )
    )
}
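
For readers who prefer something executable, here is a minimal Python sketch of the same rule. The `Scan`, `Select`, and `Join` classes and the `pred_columns` field are hypothetical stand-ins for a real plan representation, and `Join` is assumed to be an inner join.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class Select:
    predicate: str
    pred_columns: FrozenSet[str]  # columns the predicate references
    input: "Plan"


@dataclass(frozen=True)
class Join:  # assumed to be an inner join
    left: "Plan"
    right: "Plan"
    condition: str


Plan = Union[Scan, Select, Join]


def columns(plan: Plan) -> FrozenSet[str]:
    """Output columns of a plan node."""
    if isinstance(plan, Scan):
        return plan.columns
    if isinstance(plan, Select):
        return columns(plan.input)
    return columns(plan.left) | columns(plan.right)


def push_selection_past_join(plan: Plan) -> Plan:
    """Apply the rule at the root if it matches; otherwise return the plan unchanged."""
    if not (isinstance(plan, Select) and isinstance(plan.input, Join)):
        return plan  # pattern does not match
    join = plan.input
    if plan.pred_columns <= columns(join.left):
        pushed = Select(plan.predicate, plan.pred_columns, join.left)
        return Join(pushed, join.right, join.condition)
    if plan.pred_columns <= columns(join.right):
        pushed = Select(plan.predicate, plan.pred_columns, join.right)
        return Join(join.left, pushed, join.condition)
    return plan  # predicate spans both sides: condition fails


orders = Scan("orders", frozenset({"o.id", "o.customer_id", "o.total"}))
customers = Scan("customers", frozenset({"c.id", "c.country"}))
before = Select("c.country = 'DE'", frozenset({"c.country"}),
                Join(orders, customers, "o.customer_id = c.id"))
after = push_selection_past_join(before)
```

After the rewrite, `after` is a `Join` whose right input is the `Select` over `customers`; the `orders` side is untouched.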

Rule Categories

Transformation rules are organized into categories based on the type of optimization they perform. Understanding these categories helps in comprehending how a complete rule-based optimizer is structured:

Categories of Transformation Rules
Category	Purpose	Examples
Simplification	Eliminate redundant or unnecessary operations	Remove double negation, eliminate no-op projections, merge adjacent filters
Predicate Pushdown	Move filters closer to data sources	Push selections below joins, push filters into subqueries
Projection Pushdown	Reduce column set as early as possible	Eliminate unused columns before joins, prune intermediate schemas
Join Transformations	Reorder and restructure join trees	Apply commutativity, associativity; convert subqueries to joins
Subquery Decorrelation	Convert correlated subqueries to joins	Transform EXISTS to semi-join, IN to join with duplicate elimination
Aggregate Optimization	Optimize grouping operations	Push partial aggregates below joins, eliminate redundant grouping

Rule Application Strategies

Different optimizers apply transformation rules using different strategies:

Top-Down Application: Start from the root of the query plan and work downward. Each rule is tried at the current node before recursing into children. This approach ensures that high-level restructuring happens before low-level optimization.

Bottom-Up Application: Start from the leaves (table scans) and work upward. This ensures that base relations are optimized before considering how they combine.

Fixed-Point Iteration: Apply rules repeatedly until no more transformations are possible. This handles cases where one transformation enables another. For example, pushing a selection down might enable a join reordering that wasn't possible before.

Phased Application: Organize rules into phases, where all rules in a phase are applied to completion before moving to the next phase. This provides more control over the optimization sequence.
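
Two of these strategies combine naturally: a bottom-up pass wrapped in a fixed-point loop. The following sketch assumes a made-up plan encoding (nested tuples such as `("filter", predicate, child)`) and two toy rules; it is meant to show the control flow, not a real optimizer.

```python
def drop_true_filter(node):
    """("filter", "TRUE", child) -> child"""
    if isinstance(node, tuple) and node[:2] == ("filter", "TRUE"):
        return node[2]
    return node


def merge_filters(node):
    """("filter", p, ("filter", q, child)) -> ("filter", ("and", p, q), child)"""
    if (isinstance(node, tuple) and len(node) == 3 and node[0] == "filter"
            and isinstance(node[2], tuple) and len(node[2]) == 3
            and node[2][0] == "filter"):
        _, p, (_, q, child) = node
        return ("filter", ("and", p, q), child)
    return node


def rewrite_bottom_up(node, rules):
    """Rewrite children first, then try each rule once at this node."""
    if isinstance(node, tuple):
        node = tuple(rewrite_bottom_up(child, rules) for child in node)
    for rule in rules:
        node = rule(node)
    return node


def fixed_point(plan, rules, max_passes=100):
    """Repeat full passes until the plan stops changing (or a safety bound is hit)."""
    for _ in range(max_passes):
        new_plan = rewrite_bottom_up(plan, rules)
        if new_plan == plan:
            return plan
        plan = new_plan
    raise RuntimeError("rule set did not converge")


plan = ("filter", "a > 1",
        ("filter", "TRUE",
         ("filter", "b < 5", ("scan", "t"))))
result = fixed_point(plan, [drop_true_filter, merge_filters])
print(result)  # ('filter', ('and', 'a > 1', 'b < 5'), ('scan', 't'))
```

Dropping the `TRUE` filter is what makes the two remaining filters adjacent, so the merge rule can only fire after the first rule has done its work.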

The Power of Composability

Individual transformation rules are simple, but their power comes from composition. A sequence of simple rules can dramatically transform a query. This is similar to how simple algebraic identities combine to solve complex equations—each step is elementary, but the cumulative effect is profound.

Common Universal Transformations

Certain transformations are so broadly beneficial that they appear in nearly every optimizer. These "universal transformations" improve plans regardless of data distribution, so they can be applied without consulting statistics.

1. Constant Folding

Constant folding evaluates expressions involving only constants at compile time rather than execution time.

before_constant_folding.sql
-- Before: Expression evaluated for every row
SELECT *
FROM orders
WHERE total_price > 100 * 1.08 - 5
  AND order_date >= '2024-01-01'::date + 30;
 
-- The expressions "100 * 1.08 - 5" and 
-- "'2024-01-01' + 30" are computed repeatedly

after_constant_folding.sql
-- After: Constants pre-computed
SELECT *
FROM orders
WHERE total_price > 103.0
  AND order_date >= '2024-01-31';
 
-- Expressions evaluated once at
-- optimization time, not per row
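
A constant folder is a short recursive function over the expression tree. The sketch below assumes a hypothetical encoding: tuples of `(operator, left, right)`, `Decimal` values for constants (to mimic SQL's exact numeric arithmetic), and plain strings for column references.

```python
import operator
from decimal import Decimal

# Arithmetic operators the folder knows how to evaluate.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr):
    """Fold constant subtrees; column references (strings) are left alone."""
    if not isinstance(expr, tuple):
        return expr  # a constant or a column name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if op in OPS and isinstance(left, Decimal) and isinstance(right, Decimal):
        return OPS[op](left, right)  # both sides constant: evaluate now
    return (op, left, right)


# total_price > 100 * 1.08 - 5
predicate = (">", "total_price",
             ("-", ("*", Decimal("100"), Decimal("1.08")), Decimal("5")))
folded = fold(predicate)
print(folded)  # ('>', 'total_price', Decimal('103.00'))
```

The comparison itself is not folded because one side is a column; only the purely constant subtree collapses to a single value.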

2. Predicate Simplification

Predicate simplification applies Boolean algebra to reduce the complexity of WHERE clauses. This eliminates redundant conditions and identifies tautologies (always true) or contradictions (always false).

predicate_simplification_examples.sql
-- Contradiction Detection
-- Before: This query should return empty result
SELECT * FROM employees 
WHERE department_id = 10 AND department_id = 20;
-- After: Entire query can be eliminated (returns empty)
 
-- Tautology Elimination
-- Before: Redundant condition
SELECT * FROM orders WHERE status = 'SHIPPED' OR 1=1;
-- After: No predicate needed (returns all rows)
 
-- Redundant Conjunct Elimination
-- Before: Overlapping conditions
SELECT * FROM products WHERE price > 100 AND price > 50;
-- After: Only the stronger condition remains
SELECT * FROM products WHERE price > 100;
 
-- Absorption Law
-- Before: Complex expression
SELECT * FROM items WHERE (A AND B) OR A;
-- After: Simplified using absorption (A OR (A AND B) = A)
SELECT * FROM items WHERE A;
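
Several of these Boolean rewrites fit in one small function. The sketch below treats atomic predicates as opaque strings and uses a made-up tuple encoding (`("and", a, b)`, `("or", a, b)`, `("not", a)`), so it covers tautology elimination, idempotence, double negation, and absorption, but not rules that must understand the atoms themselves, such as spotting that `department_id = 10 AND department_id = 20` is a contradiction.

```python
def is_op(e, op):
    return isinstance(e, tuple) and e[0] == op


def simplify(e):
    """Simplify a Boolean expression built from atoms, True/False, and/or/not."""
    if not isinstance(e, tuple):
        return e
    if e[0] == "not":
        inner = simplify(e[1])
        if isinstance(inner, bool):
            return not inner
        if is_op(inner, "not"):
            return inner[1]  # double negation
        return ("not", inner)
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    # identity: constant that leaves the other operand unchanged
    # dominant: constant that decides the result on its own
    if op == "and":
        identity, dominant, dual = True, False, "or"
    else:
        identity, dominant, dual = False, True, "and"
    if a is dominant or b is dominant:
        return dominant
    if a is identity:
        return b
    if b is identity:
        return a
    if a == b:
        return a  # idempotence: A AND A = A
    if is_op(a, dual) and b in a[1:]:
        return b  # absorption: (A AND B) OR A = A
    if is_op(b, dual) and a in b[1:]:
        return a
    return (op, a, b)


tautology = simplify(("or", "status = 'SHIPPED'", True))
absorbed = simplify(("or", ("and", "A", "B"), "A"))
double_neg = simplify(("and", "price > 100", ("not", ("not", "price > 100"))))
print(tautology, absorbed, double_neg)  # True A price > 100
```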

3. Elimination of Unnecessary Operations

Rule-based optimizers identify and remove operations that have no effect on the query result:

Elimination Rules

•Identity Projection — A projection that includes all columns of its input can be removed entirely. π[a,b,c](R) where R has exactly columns {a, b, c} is equivalent to just R.
•Trivial Selection — A selection with a tautological predicate (WHERE TRUE or WHERE 1=1) can be eliminated, as it passes all rows unchanged.
•Empty Result Detection — If any point in the plan provably produces zero rows (contradictory predicate), the entire query can short-circuit to return empty.
•Redundant DISTINCT — If a query already produces unique rows (e.g., selecting only key columns), the DISTINCT keyword can be removed.
•Unnecessary Subqueries — Simple subqueries that don't provide additional semantics can be flattened into the main query.

4. View Expansion and Merging

When queries reference views, the optimizer must decide how to incorporate view definitions. Rule-based optimizers typically expand views inline and then merge the expanded query with the outer query:

view_merging.sql
-- View Definition
CREATE VIEW expensive_products AS
SELECT product_id, name, price, category_id
FROM products
WHERE price > 1000;
 
-- Query using the view
SELECT e.name, c.category_name
FROM expensive_products e
JOIN categories c ON e.category_id = c.id
WHERE e.category_id = 5;
 
-- After View Expansion (conceptual intermediate step)
SELECT e.name, c.category_name
FROM (
    SELECT product_id, name, price, category_id
    FROM products
    WHERE price > 1000
) e
JOIN categories c ON e.category_id = c.id
WHERE e.category_id = 5;
 
-- After View Merging (final optimized form)
SELECT p.name, c.category_name
FROM products p
JOIN categories c ON p.category_id = c.id
WHERE p.price > 1000 
  AND p.category_id = 5;
-- Predicates merged, subquery eliminated
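
One way to sanity-check a rewrite like this is to run both forms against a small dataset and compare the results. The sketch below uses Python's built-in `sqlite3` module; the table contents are invented for the test, and agreement on one dataset is evidence of equivalence, not proof.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT,
                           price REAL, category_id INTEGER);
    INSERT INTO categories VALUES (5, 'Cameras'), (6, 'Audio');
    INSERT INTO products VALUES
        (1, 'Pro Camera',     2500, 5),
        (2, 'Compact Camera',  300, 5),
        (3, 'Studio Monitor', 1800, 6);
    CREATE VIEW expensive_products AS
        SELECT product_id, name, price, category_id
        FROM products
        WHERE price > 1000;
""")

# Original query, written against the view.
via_view = conn.execute("""
    SELECT e.name, c.category_name
    FROM expensive_products e
    JOIN categories c ON e.category_id = c.id
    WHERE e.category_id = 5
""").fetchall()

# Hand-merged form, written against the base table.
merged = conn.execute("""
    SELECT p.name, c.category_name
    FROM products p
    JOIN categories c ON p.category_id = c.id
    WHERE p.price > 1000 AND p.category_id = 5
""").fetchall()

print(via_view == merged, via_view)  # True [('Pro Camera', 'Cameras')]
```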

View Merging Pitfalls

Not all views can be safely merged. Views containing DISTINCT, GROUP BY, LIMIT, or window functions often cannot be merged without changing semantics. Rule-based optimizers must recognize these patterns and avoid incorrect transformations.
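
A LIMIT view makes the danger concrete. In the sketch below (Python's `sqlite3`, with invented data and a hypothetical `top_two` view), naively merging the outer filter into the view body changes the answer, because the filter would then run before the LIMIT instead of after it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT,
                           price REAL, category_id INTEGER);
    INSERT INTO products VALUES
        (1, 'A', 3000, 1),
        (2, 'B', 2000, 1),
        (3, 'C', 1000, 5),
        (4, 'D',  500, 5);
    CREATE VIEW top_two AS
        SELECT * FROM products ORDER BY price DESC LIMIT 2;
""")

# Correct semantics: take the two most expensive products, then filter.
correct = conn.execute(
    "SELECT name FROM top_two WHERE category_id = 5"
).fetchall()

# Incorrect "merge": the filter now runs before the LIMIT.
wrongly_merged = conn.execute("""
    SELECT name FROM products
    WHERE category_id = 5
    ORDER BY price DESC LIMIT 2
""").fetchall()

print(correct)         # []
print(wrongly_merged)  # [('C',), ('D',)]
```

The two most expensive products are both in category 1, so the view-based query correctly returns nothing, while the merged form returns two rows.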

The Rule Catalog Structure

A production rule-based optimizer maintains a comprehensive catalog of transformation rules. This catalog is the heart of the optimizer—it encodes decades of database optimization wisdom. Let's examine how such a catalog is structured.

Rule Specification Format

Many modern optimizers represent rules declaratively rather than as imperative code. This allows rules to be reasoned about, combined, and verified for correctness. Here's a detailed, illustrative specification of a rule (the format is not tied to any particular system):

rule_specification.yaml
# Rule: JoinCommutativity
# Algebraic basis: R ⋈ S ≡ S ⋈ R
---
rule_id: JOIN_COMMUTE
name: "Join Commutativity"
category: JOIN_TRANSFORMATIONS
phase: LOGICAL_OPTIMIZATION
 
pattern:
  type: InnerJoin
  left: 
    bind: ?R  # Bind left operand to variable ?R
  right:
    bind: ?S  # Bind right operand to variable ?S
  condition:
    bind: ?cond  # Bind join condition to variable ?cond
 
preconditions:
  - type: InnerJoin  # Only applies to inner joins
  # Note: Left/right/semi joins are NOT commutative
 
action:
  type: InnerJoin
  left: ?S      # Swap operands
  right: ?R
  condition: ?cond  # Condition unchanged
 
postconditions:
  # Verify logical equivalence
  - schema_preserved: true  # Same column set; output column order is swapped
  - cardinality_preserved: true
 
metadata:
  cost_neutral: true  # Neither better nor worse inherently
  enables:
    - SELECTION_PUSHDOWN  # May enable other rules
    - JOIN_ASSOCIATIVITY
  priority: 50  # Medium priority in rule ordering

Rule Dependencies and Interactions

Rules don't operate in isolation—they interact in complex ways. Some rules enable others (cascading effects), while some rules compete for the same match (mutual exclusion). Understanding these interactions is crucial for optimizer design:

Rule Interaction Patterns
Pattern	Description	Example
Enabling	Rule A creates a pattern that Rule B can match	Selection pushdown creates adjacent selections that can be merged
Blocking	Rule A's transformation makes Rule B inapplicable	Converting subquery to join removes the subquery pattern
Competing	Both Rule A and Rule B match, only one can apply	Join commutativity vs. selection pushdown when both match
Cascading	Rule application triggers recursive application	Pushing selection down enables another pushdown
Oscillating	Rules can undo each other, causing infinite loops	Join commutativity applied repeatedly swaps back and forth

Confluence and Termination

Two critical properties that a rule set must possess:

Termination: The rule application process must eventually stop. Without termination, the optimizer runs forever. Oscillation is a common threat—rules like join commutativity (R ⋈ S → S ⋈ R) could apply indefinitely. Optimizers prevent this through:

•Rule priorities that break ties deterministically
•Transformation tracking to avoid re-applying the same transformation
•Fixed-point detection to recognize when the plan stops changing
•Depth or iteration limits as safety bounds
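
Transformation tracking and an iteration limit can be sketched in a few lines. The example below assumes a toy plan encoding, `("join", left, right)`, and a single commutativity rule; the `seen` set is what stops the rule from swapping the operands back and forth forever.

```python
def commute(plan):
    """R ⋈ S -> S ⋈ R for a ("join", left, right) tuple."""
    op, left, right = plan
    return (op, right, left)


def explore(plan, max_steps=10):
    """Generate alternatives with join commutativity without oscillating."""
    seen = {plan}      # transformation tracking
    frontier = [plan]
    steps = 0
    while frontier and steps < max_steps:  # iteration limit as a safety bound
        current = frontier.pop()
        candidate = commute(current)
        steps += 1
        if candidate not in seen:  # only keep plans we have not produced before
            seen.add(candidate)
            frontier.append(candidate)
    return seen, steps


alternatives, steps = explore(("join", "R", "S"))
print(sorted(alternatives), steps)
# [('join', 'R', 'S'), ('join', 'S', 'R')] 2
```

The second application of the rule regenerates the original plan, which is already in `seen`, so the frontier empties and the search stops after two steps.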

Confluence: Regardless of the order rules are applied, the same final result should be reached. This is harder to guarantee but desirable for predictability. Many optimizers don't achieve full confluence; they rely on carefully ordered rule phases to produce good (if not globally optimal) results.

The Volcano/Cascades Contribution

The Volcano and Cascades optimization frameworks, developed by Goetz Graefe and colleagues, popularized the treatment of "rules" as first-class citizens in optimizer architecture. Rather than hard-coding transformations, these systems allow rules to be added, removed, and modified independently of the search engine. This idea has influenced many modern query optimizers, including those in SQL Server and Apache Calcite. Others, such as PostgreSQL, instead follow the System R tradition of hand-coded transformations combined with bottom-up plan enumeration.

Rule-Based vs Cost-Based: The Spectrum

Modern optimizers don't fall neatly into "rule-based" or "cost-based" categories. Instead, they exist on a spectrum, using rules for some decisions and cost estimation for others. Understanding this spectrum helps clarify when each approach is appropriate.

When Rule-Based Excels

Rule-based optimization is the right choice when:

Rule-Based Sweet Spots

•Transformations are universally beneficial — Selection pushdown before joins is essentially always helpful. No cost model is needed to justify it.
•Statistics are unavailable or unreliable — For ad-hoc queries on new tables, temporary tables, or complex expressions, statistics may not exist. Rules provide reasonable defaults.
•Optimization time is critical — Cost-based optimization can be expensive, especially for complex queries. A single pass of rule application is roughly linear in plan size, and even repeated passes are usually far cheaper than enumerating and costing alternative plans.
•Determinism is required — Some applications require identical plans for identical queries. Rules provide this; cost-based optimizers may not, because plans can shift whenever statistics are refreshed or re-sampled.
•Logical correctness transformations — Expanding views, decorrelating subqueries, and normalizing predicates are logical steps that should happen regardless of cost.

When Cost-Based Dominates

Cost-based optimization is essential when:

Cost-Based Requirements

•Multiple viable alternatives exist — When three different join algorithms could work, only cost estimation can choose wisely.
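•Data distribution matters critically — Whether to use an index depends on selectivity. A predicate that matches a small fraction of rows usually favors an index lookup, while one that matches half the table usually favors a full table scan; the crossover point depends on the storage and the access pattern.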
•Join ordering decisions — With multiple tables, the join order dramatically affects performance. The difference between best and worst orders can be 10000× or more.
•Physical operator selection — Choosing between hash join, merge join, and nested loop join requires understanding input sizes and available memory.
•Parallel execution planning — How to partition work across cores or nodes depends on data sizes and system resources.

The Modern Hybrid Approach

Contemporary query optimizers use a two-phase architecture:

Phase 1: Rule-Based Logical Optimization

•Apply all safe, beneficial transformations
•Canonicalize the query representation
•Decorrelate subqueries, expand views, simplify predicates
•Produce a normalized logical plan

Phase 2: Cost-Based Physical Optimization

•Enumerate alternative physical plans
•Estimate costs using statistics
•Select the lowest-cost plan
•Handle join ordering, operator selection, and parallelism
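
Phase 2 can be illustrated in miniature with join ordering. Everything in this sketch is an assumption made for the example: the row counts in `ROWS`, the pairwise `SELECTIVITY` values, and the toy cost model (the total number of intermediate rows a left-deep plan produces). Real cost models are far more detailed.

```python
from itertools import permutations

# Hypothetical statistics: table sizes and pairwise join selectivities.
ROWS = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 10_000,
    frozenset({"customers", "regions"}): 1 / 10,
}


def join_selectivity(joined, new):
    """Combined selectivity between `new` and the tables already joined.

    A missing entry means no join predicate, i.e. a cross product (1.0).
    """
    sel = 1.0
    for table in joined:
        sel *= SELECTIVITY.get(frozenset({table, new}), 1.0)
    return sel


def plan_cost(order):
    """Toy cost: total intermediate rows produced by a left-deep plan."""
    joined, rows, cost = [order[0]], ROWS[order[0]], 0.0
    for table in order[1:]:
        rows = rows * ROWS[table] * join_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost


# Enumerate every left-deep order, estimate its cost, pick the cheapest.
costs = {order: plan_cost(order) for order in permutations(ROWS)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, costs[best])
print(worst, costs[worst])
```

Under these made-up numbers, the cheapest orders join the two small tables first and bring in `orders` last, while the most expensive ones start with a cross product between `orders` and `regions`. No fixed rule could have known that without the statistics.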

Rules as Plan Space Reduction

Rule-based transformations also serve a crucial purpose in cost-based optimization: they reduce the search space. By canonicalizing plans and eliminating obviously suboptimal structures, rules ensure the cost-based phase doesn't waste time evaluating plans that would never be chosen.

Implementation Considerations

Implementing a rule-based optimizer requires careful attention to engineering details. The conceptual simplicity of "apply transformation rules" belies significant implementation complexity.

Pattern Matching Efficiency

Rules match against portions of the query plan tree. With hundreds of rules and potentially large plans, naive pattern matching becomes prohibitively expensive. Production optimizers use several techniques:

pattern_matching_optimization.pseudo
// Naive approach: O(rules × nodes) match attempts per pass
function naiveOptimize(plan):
    changed = true
    while changed:
        changed = false
        for each rule in rules:
            for each node in plan.allNodes():
                if rule.matches(node):
                    plan = rule.apply(plan, node)
                    changed = true
                    break  // Plan changed: the node list is stale, so rescan
    return plan

// Optimized: Rule Indexing + Event-Driven Matching
function optimizedOptimize(plan):
    // Index rules by root operator type
    ruleIndex = buildRuleIndex(rules)  // {Join: [r1,r2], Select: [r3], ...}

    // Worklist of nodes to process
    worklist = new Queue(plan.allNodes())

    while not worklist.isEmpty():
        node = worklist.dequeue()
        if not plan.contains(node):
            continue  // Node was replaced by an earlier rewrite
        applicableRules = ruleIndex.getOrDefault(node.type, [])

        for rule in applicableRules:
            if rule.matches(node):
                newNode = rule.apply(node)
                plan = plan.replace(node, newNode)

                // Only reprocess affected nodes
                worklist.add(newNode)
                worklist.addAll(newNode.children)
                if newNode.parent exists:
                    worklist.add(newNode.parent)  // Parent patterns may now match
                break  // Re-evaluate this node's rules
    return plan

// Key optimizations:
// 1. Rule indexing by operator type
// 2. Worklist avoids re-scanning entire tree
// 3. Early termination on first match
// 4. Minimal reprocessing after transformation
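
A runnable version of the indexed worklist might look like the sketch below. The `Node` class, the two rules, and `RULE_INDEX` are hypothetical. To keep the example short, each rule inspects only the node it is given, so a rewrite never needs to requeue the parent; a fuller implementation would.

```python
from collections import deque


class Node:
    def __init__(self, op, *children, arg=None):
        self.op, self.arg, self.children = op, arg, list(children)


def drop_true_filter(node):
    """filter(TRUE, child) -> child; returns None when the rule does not match."""
    return node.children[0] if node.arg == "TRUE" else None


def drop_identity_projection(node):
    """project(*, child) -> child; returns None when the rule does not match."""
    return node.children[0] if node.arg == "*" else None


# Rules indexed by the operator at the root of their pattern.
RULE_INDEX = {
    "filter": [drop_true_filter],
    "project": [drop_identity_projection],
}


def optimize(root):
    holder = Node("root", root)      # sentinel so the real root can be replaced
    worklist = deque([(holder, 0)])  # (parent, child slot) pairs
    while worklist:
        parent, slot = worklist.popleft()
        node = parent.children[slot]
        for rule in RULE_INDEX.get(node.op, []):  # only rules for this operator
            replacement = rule(node)
            if replacement is not None:
                parent.children[slot] = replacement
                worklist.append((parent, slot))   # re-examine the new node
                break
        else:
            # No rule fired here: continue with the children.
            worklist.extend((node, i) for i in range(len(node.children)))
    return holder.children[0]


plan = Node("project",
            Node("filter",
                 Node("filter", Node("scan", arg="t"), arg="a > 1"),
                 arg="TRUE"),
            arg="*")
result = optimize(plan)
print(result.op, result.arg, result.children[0].op)  # filter a > 1 scan
```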

Maintaining Plan Invariants

As rules transform the plan, various invariants must be maintained:

Schema Consistency: Every operator must have correctly typed inputs and outputs. Rules must update column references when they restructure the plan.

Semantics Preservation: The transformed plan must produce identical results to the original. This requires careful handling of NULL values, duplicate rows, and ordering.

Property Propagation: Physical properties (like sort order or data distribution) must be tracked and updated as transformations occur.

Plan Properties Tracked During Optimization
Property	Description	Affected By
Output Schema	Column names, types, and nullability	Projection, join, rename operations
Sort Order	Ordering of output rows	Sort, index scan, merge join
Partitioning	How data is distributed	Shuffle, partition-aware joins
Cardinality Bounds	Min/max estimated row counts	Filters, joins, aggregation, LIMIT
Required Columns	Which columns are needed downstream	Projection pushdown analysis

Testing and Verification

Rule-based optimizers require extensive testing because bugs in transformation rules can silently produce incorrect results. Key testing strategies include:

Property-Based Testing: Generate random queries and verify that optimized plans produce the same results as unoptimized plans.

Regression Testing: Maintain a corpus of queries where optimal plans are known and verify that new rules don't regress performance.

Formal Verification: Some systems use proof assistants to mathematically verify that transformation rules preserve semantics. This is especially important for rules handling edge cases like NULL values.

Plan Diff Tools: Compare plans before and after changes to rule sets, flagging unexpected regressions.
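
The property-based approach can be sketched without any testing library. In the example below, the expression encoding, the `simplify` rule under test, and the generator are all illustrative assumptions. The check evaluates the original and simplified predicates under three-valued logic (with `None` standing for SQL's UNKNOWN) for every assignment of two variables.

```python
import itertools
import random


def evaluate(e, env):
    """Three-valued evaluation; None stands for SQL's UNKNOWN."""
    if isinstance(e, bool) or e is None:
        return e
    if isinstance(e, str):
        return env[e]
    if e[0] == "not":
        v = evaluate(e[1], env)
        return None if v is None else not v
    a, b = evaluate(e[1], env), evaluate(e[2], env)
    if e[0] == "and":
        if a is False or b is False:
            return False
        return None if (a is None or b is None) else True
    if a is True or b is True:  # "or"
        return True
    return None if (a is None or b is None) else False


def simplify(e):
    """Rule under test: fold TRUE/FALSE constants and remove double negation."""
    if not isinstance(e, tuple):
        return e
    if e[0] == "not":
        inner = simplify(e[1])
        if isinstance(inner, bool):
            return not inner
        if isinstance(inner, tuple) and inner[0] == "not":
            return inner[1]
        return ("not", inner)
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == "and":
        if a is False or b is False:
            return False
        if a is True:
            return b
        if b is True:
            return a
    else:
        if a is True or b is True:
            return True
        if a is False:
            return b
        if b is False:
            return a
    return (op, a, b)


def random_expr(rng, depth):
    """Generate a random Boolean expression over x, y, TRUE, and FALSE."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "y", True, False])
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", random_expr(rng, depth - 1))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def check(trials=500, seed=0):
    """Assert that simplify() preserves the result for every variable assignment."""
    rng = random.Random(seed)
    for _ in range(trials):
        e = random_expr(rng, depth=4)
        s = simplify(e)
        for x, y in itertools.product([True, False, None], repeat=2):
            env = {"x": x, "y": y}
            assert evaluate(e, env) == evaluate(s, env), (e, s, env)
    return trials


print(check(), "random predicates verified")
```

Including `None` among the test values is the important design choice: a rule that is only valid in two-valued logic would pass a True/False-only check and still be wrong for SQL.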

The NULL Landmine

Many seemingly correct rule transformations fail in the presence of NULL values. For example, the predicate x = x looks like a tautology, but it evaluates to UNKNOWN when x is NULL, so removing it changes which rows are returned. Likewise, A = B OR A <> B is not always TRUE, and NOT IN cannot be rewritten as an anti-join without accounting for NULLs in the subquery result. Rule writers must exhaustively consider NULL semantics, which accounts for many optimizer bugs in production systems.
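
The effect is easy to reproduce. The sketch below uses Python's `sqlite3` module with an invented one-column table `t`; it counts the rows that survive two predicates an overeager simplifier might treat as always true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (2), (NULL);
""")


def count(where):
    """Number of rows in t that satisfy the given WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM t WHERE {where}").fetchone()[0]


all_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
self_equal = count("x = x")           # looks like a tautology
either_way = count("x = 1 OR x <> 1") # looks like "A OR NOT A"

print(all_rows, self_equal, either_way)  # 3 2 2
```

Both predicates evaluate to UNKNOWN for the NULL row, and a WHERE clause keeps only rows where the predicate is TRUE, so that row is dropped. A rule that deleted either predicate would return three rows instead of two.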

Summary: Rule-Based Optimization

We've explored the foundations of rule-based query optimization, understanding how this deterministic, pattern-based approach transforms queries into more efficient forms. Let's consolidate the key insights:

Key Takeaways

•Rule-based optimization applies deterministic transformations — Given patterns in the query plan, rules rewrite the plan into equivalent but more efficient forms, without relying on statistics or cost estimation.
•Transformation rules leverage relational algebra equivalences — Every rule is grounded in mathematical identities that guarantee the transformed query produces identical results to the original.
•Rules are organized into categories — Simplification, predicate pushdown, projection pushdown, join transformations, and subquery decorrelation each address different aspects of optimization.
•Modern optimizers are hybrid — They use rule-based transformations for logical optimization and safe, universal improvements, then apply cost-based selection for decisions that depend on data characteristics.
•Rule application must terminate and ideally be confluent — Optimizers use priorities, tracking, and phase separation to ensure rule engines reach a stable, optimized state.
•Implementation requires careful engineering — Pattern matching efficiency, invariant maintenance, and thorough testing are essential for correct, performant optimizer implementations.

What's Next:

With the foundations of rule-based optimization established, we'll dive into the specific heuristics that drive query transformation. The next page explores Common Heuristics—the time-tested rules that optimizers use to improve virtually every query, from selection pushdown to join ordering strategies.

Page Complete

You now understand the rule-based optimization paradigm—its historical roots, architectural principles, and role in modern query optimization. This foundation prepares you to study the specific heuristic rules that drive practical query optimization.

1 / 5

Loading learning content...

Database Management SystemsHeuristic Optimization

Heuristic Optimization

LevelIntermediate

Duration90 mins

TopicHeuristic Optimization

1 / 5

Rule-Based Optimization

The Art of Query Transformation

What You Will Learn

The Rule-Based Paradigm

Core Principles of Rule-Based Optimization

A rule-based optimizer embodies several key principles that distinguish it from cost-based approaches:

Foundational Principles

•Determinism — Given the same query, a rule-based optimizer always produces the same plan. There's no random element, no sampling variance, and no dependency on potentially stale statistics. This predictability was highly valued in early production systems.
•Algebraic Correctness — Every transformation preserves the semantics of the original query. The optimizer leverages relational algebra equivalences that guarantee the transformed query produces identical results to the original.
•Heuristic Ordering — Rules are applied in a carefully designed sequence. Early rules handle the most impactful transformations (like predicate pushdown), while later rules handle specialized optimizations.
•Local Optimality — Each rule improves the plan locally, making it better according to some well-understood metric. While this doesn't guarantee global optimality, it reliably improves most queries.
•Simplicity and Maintainability — Rules are explicit, documented, and debuggable. When queries perform unexpectedly, administrators can trace exactly which rules fired and why.

Historical Context: The Oracle RBO Era

Rank	Access Path	Description
1	Single row by ROWID	Direct physical access
2	Single row by cluster join	Using cluster key
3	Single row by unique or primary key	Unique index lookup
4	Cluster hash	Hash cluster access
5	Cluster range	Cluster key range scan
...	...	...
15	Full table scan	Complete table read

The Transition to CBO

Transformation Rules Architecture

A rule-based optimizer is essentially a pattern-matching system that recognizes suboptimal query structures and rewrites them into better forms. Each transformation rule has three components:

Pattern — A template that matches against portions of the query plan
Condition — Additional constraints that must be satisfied for the rule to apply
Action — The transformation to perform when the pattern matches and conditions are met

The Anatomy of a Transformation Rule

selection_pushdown_rule.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// RULE: Selection Pushdown Past Join
// =====================================
// Pattern:
//   σ[predicate](R ⋈ S)
// 
// Condition:
//   predicate references ONLY columns from R (or ONLY from S)
//
// Action:
//   If predicate uses only R's columns:
//       Transform to: (σ[predicate](R)) ⋈ S
//   If predicate uses only S's columns:
//       Transform to: R ⋈ (σ[predicate](S))
 
RULE SelectionPushdownPastJoin {
    MATCH (
        Select(
            predicate = P,
            input = Join(left = R, right = S, condition = J)
        )
    )
    
    WHEN (
        referencedColumns(P) ⊆ columns(R)  // P only uses R's columns
    )
    
    THEN REPLACE WITH (
        Join(
            left = Select(predicate = P, input = R),  // Push selection to R
            right = S,
            condition = J
        )
    )
}

Rule Categories

Categories of Transformation Rules
Category	Purpose	Examples
Simplification	Eliminate redundant or unnecessary operations	Remove double negation, eliminate no-op projections, merge adjacent filters
Predicate Pushdown	Move filters closer to data sources	Push selections below joins, push filters into subqueries
Projection Pushdown	Reduce column set as early as possible	Eliminate unused columns before joins, prune intermediate schemas
Join Transformations	Reorder and restructure join trees	Apply commutativity, associativity; convert subqueries to joins
Subquery Decorrelation	Convert correlated subqueries to joins	Transform EXISTS to semi-join, IN to join with duplicate elimination
Aggregate Optimization	Optimize grouping operations	Push partial aggregates below joins, eliminate redundant grouping

Rule Application Strategies

Different optimizers apply transformation rules using different strategies:

Bottom-Up Application Start from the leaves (table scans) and work upward. This ensures that base relations are optimized before considering how they combine.

Phased Application Organize rules into phases, where all rules in a phase are applied to completion before moving to the next phase. This provides more control over the optimization sequence.

The Power of Composability

Common Universal Transformations

1. Constant Folding

Constant folding evaluates expressions involving only constants at compile time rather than execution time.

before_constant_folding.sql
1
2
3
4
5
6
7
8
-- Before: Expression evaluated for every row
SELECT *
FROM orders
WHERE total_price > 100 * 1.08 - 5
  AND order_date >= '2024-01-01'::date + 30;
 
-- The expressions "100 * 1.08 - 5" and 
-- "'2024-01-01' + 30" are computed repeatedly

after_constant_folding.sql
1
2
3
4
5
6
7
8
-- After: Constants pre-computed
SELECT *
FROM orders
WHERE total_price > 103.0
  AND order_date >= '2024-01-31';
 
-- Expressions evaluated once at
-- optimization time, not per row

2. Predicate Simplification

predicate_simplification_examples.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- Contradiction Detection
-- Before: This query should return empty result
SELECT * FROM employees 
WHERE department_id = 10 AND department_id = 20;
-- After: Entire query can be eliminated (returns empty)
 
-- Tautology Elimination
-- Before: Redundant condition
SELECT * FROM orders WHERE status = 'SHIPPED' OR 1=1;
-- After: No predicate needed (returns all rows)
 
-- Redundant Conjunct Elimination
-- Before: Overlapping conditions
SELECT * FROM products WHERE price > 100 AND price > 50;
-- After: Only the stronger condition remains
SELECT * FROM products WHERE price > 100;
 
-- Absorption Law
-- Before: Complex expression
SELECT * FROM items WHERE (A AND B) OR A;
-- After: Simplified using absorption (A OR (A AND B) = A)
SELECT * FROM items WHERE A;

3. Elimination of Unnecessary Operations

Rule-based optimizers identify and remove operations that have no effect on the query result:

Elimination Rules

•Identity Projection — A projection that includes all columns of its input can be removed entirely. π[a,b,c](R) where R has exactly columns {a, b, c} is equivalent to just R.
•Trivial Selection — A selection with a tautological predicate (WHERE TRUE or WHERE 1=1) can be eliminated, as it passes all rows unchanged.
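•Empty Result Detection — If any point in the plan provably produces zero rows (contradictory predicate), the empty result can be propagated upward, often short-circuiting the entire query. Operators that can still produce rows from an empty input, such as outer joins, unions, and scalar aggregates like COUNT(*), need separate handling.
•Redundant DISTINCT — If a query already produces unique rows (e.g., the select list includes every column of a primary or unique key of a single table), the DISTINCT keyword can be removed.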
•Unnecessary Subqueries — Simple subqueries that don't provide additional semantics can be flattened into the main query.
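
Two of the elimination rules above (identity projection and trivial selection) can be sketched on a tiny plan tree. This is a simplified illustration; the `Scan`, `Select`, and `Project` node classes are hypothetical, and the string `"TRUE"` stands in for a predicate that earlier simplification has reduced to a tautology.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Select:
    predicate: str   # "TRUE" marks a tautology left over from simplification
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Scan, Select, Project]


def output_columns(plan: Plan) -> Tuple[str, ...]:
    """A selection passes its child's columns through unchanged."""
    if isinstance(plan, Select):
        return output_columns(plan.child)
    return plan.columns


def eliminate(plan: Plan) -> Plan:
    """Remove trivial selections and identity projections, bottom-up."""
    if isinstance(plan, Scan):
        return plan
    child = eliminate(plan.child)
    if isinstance(plan, Select):
        return child if plan.predicate == "TRUE" else Select(plan.predicate, child)
    # Project: an identity only if it keeps the same columns in the same order
    if plan.columns == output_columns(child):
        return child
    return Project(plan.columns, child)


r = Scan("R", ("a", "b", "c"))
plan = Project(("a", "b", "c"), Select("TRUE", r))
simplified = eliminate(plan)
```

Here both wrappers disappear and `simplified` is just the scan of `R`, whereas a projection onto a strict subset of columns would be kept.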

4. View Expansion and Merging

view_merging.sql
-- View Definition
CREATE VIEW expensive_products AS
SELECT product_id, name, price, category_id
FROM products
WHERE price > 1000;
 
-- Query using the view
SELECT e.name, c.category_name
FROM expensive_products e
JOIN categories c ON e.category_id = c.id
WHERE e.category_id = 5;
 
-- After View Expansion (conceptual intermediate step)
SELECT e.name, c.category_name
FROM (
    SELECT product_id, name, price, category_id
    FROM products
    WHERE price > 1000
) e
JOIN categories c ON e.category_id = c.id
WHERE e.category_id = 5;
 
-- After View Merging (final optimized form)
SELECT p.name, c.category_name
FROM products p
JOIN categories c ON p.category_id = c.id
WHERE p.price > 1000 
  AND p.category_id = 5;
-- Predicates merged, subquery eliminated

View Merging Pitfalls

The Rule Catalog Structure

Rule Specification Format

rule_specification.yaml
# Rule: JoinCommutativity
# Algebraic basis: R ⋈ S ≡ S ⋈ R
---
rule_id: JOIN_COMMUTE
name: "Join Commutativity"
category: JOIN_TRANSFORMATIONS
phase: LOGICAL_OPTIMIZATION
 
pattern:
  type: InnerJoin
  left: 
    bind: ?R  # Bind left operand to variable ?R
  right:
    bind: ?S  # Bind right operand to variable ?S
  condition:
    bind: ?cond  # Bind join condition to variable ?cond
 
preconditions:
  - type: InnerJoin  # Only applies to inner joins
  # Note: Left/right/semi joins are NOT commutative
 
action:
  type: InnerJoin
  left: ?S      # Swap operands
  right: ?R
  condition: ?cond  # Condition unchanged
 
postconditions:
  # Verify logical equivalence
  - schema_preserved: true  # Same columns; their order may differ unless re-projected
  - cardinality_preserved: true
 
metadata:
  cost_neutral: true  # Neither better nor worse inherently
  enables:
    - SELECTION_PUSHDOWN  # May enable other rules
    - JOIN_ASSOCIATIVITY
  priority: 50  # Medium priority in rule ordering
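
A specification like this maps naturally onto a small data structure with a match predicate (pattern plus preconditions) and a rewrite function (action). The Python below is one possible encoding under that assumption; the `Rule` class, `apply_rule` helper, and plan node types are hypothetical and do not correspond to any real optimizer's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Join:
    kind: str        # "inner", "left", ...
    left: "Plan"
    right: "Plan"
    condition: str


Plan = Union[Table, Join]


@dataclass(frozen=True)
class Rule:
    rule_id: str
    matches: Callable[[Plan], bool]   # pattern + preconditions
    rewrite: Callable[[Plan], Plan]   # action


JOIN_COMMUTE = Rule(
    rule_id="JOIN_COMMUTE",
    # Precondition: only inner joins may have their operands swapped
    matches=lambda p: isinstance(p, Join) and p.kind == "inner",
    # Action: swap operands, keep the join condition unchanged
    rewrite=lambda p: Join(p.kind, p.right, p.left, p.condition),
)


def apply_rule(rule: Rule, plan: Plan) -> Optional[Plan]:
    """Return the rewritten plan, or None when the rule does not match."""
    return rule.rewrite(plan) if rule.matches(plan) else None


inner = Join("inner", Table("R"), Table("S"), "R.id = S.id")
swapped = apply_rule(JOIN_COMMUTE, inner)
```

Applying the rule to a left join returns `None`, mirroring the precondition in the specification.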

Rule Dependencies and Interactions

Rule Interaction Patterns
Pattern	Description	Example
Enabling	Rule A creates a pattern that Rule B can match	Selection pushdown creates adjacent selections that can be merged
Blocking	Rule A's transformation makes Rule B inapplicable	Converting subquery to join removes the subquery pattern
Competing	Both Rule A and Rule B match, only one can apply	Join commutativity vs. selection pushdown when both match
Cascading	Rule application triggers recursive application	Pushing selection down enables another pushdown
Oscillating	Rules can undo each other, causing infinite loops	Join commutativity applied repeatedly swaps back and forth

Confluence and Termination

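Two critical properties that a rule set should possess are termination (rule application eventually stops) and confluence (the final plan does not depend on the order in which rules are applied). Common mechanisms for achieving them in practice: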

Rule priorities that break ties deterministically
Transformation tracking to avoid re-applying the same transformation
Fixed-point detection to recognize when the plan stops changing
Depth or iteration limits as safety bounds
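
These mechanisms can be combined in a small driver loop. The sketch below is a simplified illustration: plans are any hashable values, each rule is a function returning a rewritten plan or `None`, and list order stands in for rule priority. The names `run_to_fixed_point` and `commute` are hypothetical.

```python
from typing import Callable, Hashable, List, Optional, Tuple

Rule = Callable[[Hashable], Optional[Hashable]]


def run_to_fixed_point(plan: Hashable, rules: List[Rule],
                       max_steps: int = 100) -> Tuple[Hashable, int]:
    """Apply rules in priority order until no rule produces a new plan.

    Returns the final plan and the number of rewrites performed.
    """
    seen = {plan}                    # transformation tracking
    steps = 0
    while steps < max_steps:         # iteration limit as a safety bound
        for rule in rules:           # list order = priority (deterministic ties)
            candidate = rule(plan)
            if candidate is not None and candidate not in seen:
                seen.add(candidate)
                plan = candidate
                steps += 1
                break                # restart from the highest-priority rule
        else:
            return plan, steps       # fixed point: no rule made progress
    return plan, steps


# Toy plans of the form ("join", left, right).
def commute(plan):
    """Join commutativity: on its own, this rule would swap back and forth forever."""
    if isinstance(plan, tuple) and plan[0] == "join":
        return ("join", plan[2], plan[1])
    return None


final_plan, rewrites = run_to_fixed_point(("join", "R", "S"), [commute])
```

With tracking in place, commutativity fires once and the second swap is rejected because it would recreate a plan that has already been seen, so the loop stops after a single rewrite.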

The Volcano/Cascades Contribution

Rule-Based vs Cost-Based: The Spectrum

When Rule-Based Excels

Rule-based optimization is the right choice when:

Rule-Based Sweet Spots

•Transformations are universally beneficial — Selection pushdown before joins is essentially always helpful. No cost model is needed to justify it.
•Statistics are unavailable or unreliable — For ad-hoc queries on new tables, temporary tables, or complex expressions, statistics may not exist. Rules provide reasonable defaults.
•Optimization time is critical — Cost-based search can be expensive, especially for queries with many joins, where the number of candidate plans grows exponentially. A fixed set of rewrite rules typically runs in time roughly proportional to plan size.
•Determinism is required — Some applications require identical plans for identical queries. Rules provide this; cost-based optimizers may not, because the chosen plan can change whenever statistics are refreshed or re-sampled.
•Logical correctness transformations — Expanding views, decorrelating subqueries, and normalizing predicates are logical steps that should happen regardless of cost.

When Cost-Based Dominates

Cost-based optimization is essential when:

Cost-Based Requirements

•Multiple viable alternatives exist — When three different join algorithms could work, only cost estimation can choose wisely.
•Data distribution matters critically — Whether to use an index depends on selectivity. A predicate matching around 5% of rows may favor an index scan, while one matching 50% usually favors a full table scan; the exact break-even point depends on the storage layout and the cost model.
•Join ordering decisions — With multiple tables, the join order dramatically affects performance. The difference between the best and worst orders can span several orders of magnitude.
•Physical operator selection — Choosing between hash join, merge join, and nested loop join requires understanding input sizes and available memory.
•Parallel execution planning — How to partition work across cores or nodes depends on data sizes and system resources.
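
To make the selectivity point concrete, here is a deliberately simple cost comparison. Everything in it is an assumption for illustration: the function name, the page count, the per-page cost constants, and above all the premise that matching rows are clustered so an index scan touches only a proportional share of the table's pages. Real cost models are considerably more detailed.

```python
def choose_access_path(selectivity: float, table_pages: int = 10_000,
                       seq_page_cost: float = 1.0,
                       random_page_cost: float = 4.0) -> str:
    """Pick the cheaper access path under a toy cost model.

    Assumes matching rows are clustered, so an index scan reads
    `selectivity * table_pages` pages, each at random-access cost.
    """
    full_scan_cost = table_pages * seq_page_cost
    index_scan_cost = selectivity * table_pages * random_page_cost
    return "index scan" if index_scan_cost < full_scan_cost else "full table scan"
```

Under these made-up constants, `choose_access_path(0.05)` selects the index scan and `choose_access_path(0.5)` selects the full table scan. No fixed rule can make that choice, because the right answer flips as the data changes.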

The Modern Hybrid Approach

Contemporary query optimizers use a two-phase architecture:

Phase 1: Rule-Based Logical Optimization

Apply all safe, beneficial transformations
Canonicalize the query representation
Decorrelate subqueries, expand views, simplify predicates
Produce a normalized logical plan

Phase 2: Cost-Based Physical Optimization

Enumerate alternative physical plans
Estimate costs using statistics
Select the lowest-cost plan
Handle join ordering, operator selection, and parallelism
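
The two phases can be sketched as a single function that first rewrites the logical plan with rules and then picks the cheapest physical alternative. This is a toy outline, not a real optimizer: the plan encoding (a tuple of predicate strings), the `optimize` signature, the enumerator, and the cost numbers are all invented for the example.

```python
from typing import Callable, Iterable, List


def optimize(logical_plan, rewrite_rules: List[Callable],
             enumerate_physical: Callable[..., Iterable],
             estimate_cost: Callable[..., float]):
    # Phase 1: apply each rule until it no longer changes the plan
    for rule in rewrite_rules:
        while (rewritten := rule(logical_plan)) != logical_plan:
            logical_plan = rewritten
    # Phase 2: enumerate physical alternatives and keep the cheapest
    return min(enumerate_physical(logical_plan), key=estimate_cost)


def drop_trivial_filter(plan):
    """Rule: remove a `TRUE` conjunct from a tuple of predicate strings."""
    return tuple(p for p in plan if p != "TRUE")


def enumerate_physical(plan):
    """Hypothetical enumerator: the same filters under two scan operators."""
    return [("full_scan", plan), ("index_scan", plan)]


TOY_COSTS = {"full_scan": 100.0, "index_scan": 20.0}  # made-up numbers

best = optimize(("price > 1000", "TRUE"), [drop_trivial_filter],
                enumerate_physical, lambda phys: TOY_COSTS[phys[0]])
```

The rule phase needs no cost information at all, while the final choice between scan operators depends entirely on the cost estimates supplied to the second phase.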


Rules as Plan Space Reduction

Implementation Considerations

Implementing a rule-based optimizer requires careful attention to engineering details. The conceptual simplicity of "apply transformation rules" belies significant implementation complexity.

Pattern Matching Efficiency

pattern_matching_optimization.pseudo
// Naive approach: O(rules × nodes) match attempts per pass
function naiveOptimize(plan):
    changed = true
    while changed:
        changed = false
        for each (rule, node) in rules × plan.allNodes():
            if rule.matches(node):
                plan = rule.apply(plan, node)
                changed = true
                break  // Node list is stale after a rewrite; rescan the new plan
    return plan

// Optimized: Rule Indexing + Event-Driven Matching
function optimizedOptimize(plan):
    // Index rules by root operator type
    ruleIndex = buildRuleIndex(rules)  // {Join: [r1,r2], Select: [r3], ...}

    // Worklist of nodes to process
    worklist = new Queue(plan.allNodes())

    while not worklist.isEmpty():
        node = worklist.dequeue()
        if not plan.contains(node):
            continue  // Node was removed by an earlier rewrite
        applicableRules = ruleIndex.getOrDefault(node.type, [])

        for rule in applicableRules:
            if rule.matches(node):
                newNode = rule.apply(node)
                plan.replace(node, newNode)  // Splice the result into the plan

                // Only reprocess affected nodes
                worklist.add(newNode)
                worklist.addAll(newNode.children)
                if newNode.parent exists:
                    worklist.add(newNode.parent)  // Parent's rules may now match
                break  // Re-evaluate rules against the rewritten node
    return plan

// Key optimizations:
// 1. Rule indexing by operator type
// 2. Worklist avoids re-scanning entire tree
// 3. Early termination on first match
// 4. Minimal reprocessing after transformation
// Note: both loops terminate only if the rule set itself cannot oscillate

Maintaining Plan Invariants

As rules transform the plan, various invariants must be maintained:

Schema Consistency: Every operator must have correctly typed inputs and outputs. Rules must update column references when they restructure the plan.

Semantics Preservation: The transformed plan must produce identical results to the original. This requires careful handling of NULL values, duplicate rows, and ordering.

Property Propagation: Physical properties (like sort order or data distribution) must be tracked and updated as transformations occur.

Plan Properties Tracked During Optimization
Property	Description	Affected By
Output Schema	Column names, types, and nullability	Projection, join, rename operations
Sort Order	Ordering of output rows	Sort, index scan, merge join
Partitioning	How data is distributed	Shuffle, partition-aware joins
Cardinality Bounds	Min/max estimated row counts	Filters, joins, aggregation, and LIMIT
Required Columns	Which columns are needed downstream	Projection pushdown analysis
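
The NULL handling mentioned under Semantics Preservation is worth checking mechanically, because SQL predicates follow three-valued logic. The sketch below models UNKNOWN as Python's `None` and brute-forces two identities; the helper names `and3`, `or3`, and `not3` are made up for this example.

```python
from itertools import product
from typing import Optional

TV = Optional[bool]            # None plays the role of SQL's UNKNOWN
VALUES = (True, False, None)


def and3(a: TV, b: TV) -> TV:
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: TV, b: TV) -> TV:
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: TV) -> TV:
    return None if a is None else not a


# Absorption, (A AND B) OR A = A, survives three-valued logic...
absorption_holds = all(
    or3(and3(a, b), a) is a for a, b in product(VALUES, repeat=2)
)

# ...but "A OR NOT A" is not a tautology: it is UNKNOWN when A is UNKNOWN.
excluded_middle_holds = all(or3(a, not3(a)) is True for a in VALUES)
```

The first check passes and the second fails, which is why a rule may simplify `(A AND B) OR A` to `A` but must not rewrite a predicate such as `status = 'X' OR status <> 'X'` to TRUE when `status` is nullable.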

Testing and Verification

Rule-based optimizers require extensive testing because bugs in transformation rules can silently produce incorrect results. Key testing strategies include:

Property-Based Testing: Generate random queries and verify that optimized plans produce the same results as unoptimized plans.

Regression Testing: Maintain a corpus of queries where optimal plans are known and verify that new rules don't regress performance.

Plan Diff Tools: Compare plans before and after changes to rule sets, flagging unexpected regressions.
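
The property-based strategy can be approximated with a small differential test: run a query and its rewritten form on randomly generated data and compare the results. The sketch below uses Python's standard `sqlite3` module with a hypothetical `products` table, and compares two hand-written SQL strings as a stand-in for "plan before the rule" and "plan after the rule". The function name `results_match` is an assumption.

```python
import random
import sqlite3


def results_match(original_sql: str, rewritten_sql: str,
                  trials: int = 50, seed: int = 0) -> bool:
    """Run both queries on random data and compare the results as multisets."""
    rng = random.Random(seed)
    for _ in range(trials):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price INTEGER)")
        # Always include one NULL price so NULL handling is exercised
        rows = [(0, None)] + [
            (i, rng.choice([None, rng.randint(0, 200)])) for i in range(1, 20)
        ]
        conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
        before = sorted(conn.execute(original_sql).fetchall(), key=repr)
        after = sorted(conn.execute(rewritten_sql).fetchall(), key=repr)
        conn.close()
        if before != after:
            return False
    return True


# A sound rewrite: redundant conjunct elimination
sound = results_match(
    "SELECT id FROM products WHERE price > 100 AND price > 50",
    "SELECT id FROM products WHERE price > 100",
)

# An unsound rewrite: treating "P OR NOT P" as a tautology ignores NULL prices
unsound = results_match(
    "SELECT id FROM products WHERE price > 100 OR NOT (price > 100)",
    "SELECT id FROM products",
)
```

The first pair agrees on every trial, while the second is caught because rows with a NULL price satisfy neither branch of the original predicate. A real harness would generate the queries as well as the data, and would compare the optimizer's actual output rather than hand-written SQL.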

The NULL Landmine

Summary: Rule-Based Optimization

Key Takeaways

•Rule-based optimization applies deterministic transformations — Given patterns in the query plan, rules rewrite the plan into equivalent but more efficient forms, without relying on statistics or cost estimation.
•Transformation rules leverage relational algebra equivalences — Every rule is grounded in mathematical identities that guarantee the transformed query produces identical results to the original.
•Rules are organized into categories — Simplification, predicate pushdown, projection pushdown, join transformations, and subquery decorrelation each address different aspects of optimization.
•Modern optimizers are hybrid — They use rule-based transformations for logical optimization and safe, universal improvements, then apply cost-based selection for decisions that depend on data characteristics.
•Rule application must terminate and ideally be confluent — Optimizers use priorities, tracking, and phase separation to ensure rule engines reach a stable, optimized state.
•Implementation requires careful engineering — Pattern matching efficiency, invariant maintenance, and thorough testing are essential for correct, performant optimizer implementations.

What's Next:
