Loading learning content...
Before databases learned to count, they learned to follow rules. In the early days of relational database systems, query optimizers didn't have sophisticated statistics about data distributions, couldn't estimate how many rows would flow through a join, and lacked the computational resources for exhaustive plan enumeration. Yet these systems still needed to transform inefficient queries into performant execution plans.
The solution was rule-based optimization—a deterministic approach that applies transformation rules to query plans based on algebraic equivalences and time-tested heuristics. While modern optimizers have evolved to include cost-based decision making, rule-based transformations remain the essential first phase of every query optimization pipeline.
By the end of this page, you will understand how rule-based optimizers work, why they were the dominant paradigm in early relational systems, how transformation rules are organized and applied, and why these techniques remain indispensable even in cost-based optimizers. You'll gain insight into the fundamental principle that powers all query optimization: equivalent transformations.
Rule-based optimization operates on a fundamentally different philosophy than cost-based optimization. Rather than asking "Which plan is cheapest?", it asks "Which transformations make a plan universally better?"
This paradigm is built on a critical insight: some query transformations are almost always beneficial, regardless of data characteristics. Filtering rows before joining them, for instance, reduces the amount of work in nearly every conceivable scenario. You don't need statistics to know this—it's algebraically guaranteed.
A rule-based optimizer embodies several key principles that distinguish it from cost-based approaches:
The most famous rule-based optimizer was Oracle's RBO (Rule-Based Optimizer), which dominated database optimization from the 1970s through the late 1990s. Oracle's RBO used a numbered ranking system where each access path had a fixed priority:
| Rank | Access Path | Description |
|---|---|---|
| 1 | Single row by ROWID | Direct physical access |
| 2 | Single row by cluster join | Using cluster key |
| 3 | Single row by unique or primary key | Unique index lookup |
| 4 | Cluster hash | Hash cluster access |
| 5 | Cluster range | Cluster key range scan |
| ... | ... | ... |
| 15 | Full table scan | Complete table read |
The optimizer simply chose the access path with the lowest rank number, without considering how many rows would be returned or how selective conditions actually were. This approach worked remarkably well for many workloads but failed spectacularly when its assumptions didn't match reality.
Oracle deprecated its RBO in favor of the Cost-Based Optimizer (CBO) starting with Oracle 7. However, the RBO remained available for backward compatibility for decades, and many legacy applications explicitly requested RBO mode because their queries had been tuned assuming rule-based behavior. This illustrates how deeply optimization strategy affects application design.
A rule-based optimizer is essentially a pattern-matching system that recognizes suboptimal query structures and rewrites them into better forms. Each transformation rule has three components:
Consider a fundamental transformation rule: Selection Pushdown Past Join. This rule pushes selection (filter) operations below join operations whenever the selection references only columns from one side of the join.
12345678910111213141516171819202122232425262728293031323334
// RULE: Selection Pushdown Past Join// =====================================// Pattern:// σ[predicate](R ⋈ S)// // Condition:// predicate references ONLY columns from R (or ONLY from S)//// Action:// If predicate uses only R's columns:// Transform to: (σ[predicate](R)) ⋈ S// If predicate uses only S's columns:// Transform to: R ⋈ (σ[predicate](S)) RULE SelectionPushdownPastJoin { MATCH ( Select( predicate = P, input = Join(left = R, right = S, condition = J) ) ) WHEN ( referencedColumns(P) ⊆ columns(R) // P only uses R's columns ) THEN REPLACE WITH ( Join( left = Select(predicate = P, input = R), // Push selection to R right = S, condition = J ) )}Transformation rules are organized into categories based on the type of optimization they perform. Understanding these categories helps in comprehending how a complete rule-based optimizer is structured:
| Category | Purpose | Examples |
|---|---|---|
| Simplification | Eliminate redundant or unnecessary operations | Remove double negation, eliminate no-op projections, merge adjacent filters |
| Predicate Pushdown | Move filters closer to data sources | Push selections below joins, push filters into subqueries |
| Projection Pushdown | Reduce column set as early as possible | Eliminate unused columns before joins, prune intermediate schemas |
| Join Transformations | Reorder and restructure join trees | Apply commutativity, associativity; convert subqueries to joins |
| Subquery Decorrelation | Convert correlated subqueries to joins | Transform EXISTS to semi-join, IN to join with duplicate elimination |
| Aggregate Optimization | Optimize grouping operations | Push partial aggregates below joins, eliminate redundant grouping |
Different optimizers apply transformation rules using different strategies:
Top-Down Application Start from the root of the query plan and work downward. Each rule is tried at the current node before recursing into children. This approach ensures that high-level restructuring happens before low-level optimization.
Bottom-Up Application Start from the leaves (table scans) and work upward. This ensures that base relations are optimized before considering how they combine.
Fixed-Point Iteration Apply rules repeatedly until no more transformations are possible. This handles cases where one transformation enables another. For example, pushing a selection down might enable a join reordering that wasn't possible before.
Phased Application Organize rules into phases, where all rules in a phase are applied to completion before moving to the next phase. This provides more control over the optimization sequence.
Individual transformation rules are simple, but their power comes from composition. A sequence of simple rules can dramatically transform a query. This is similar to how simple algebraic identities combine to solve complex equations—each step is elementary, but the cumulative effect is profound.
Certain transformations are so universally beneficial that they appear in every rule-based optimizer. These "universal transformations" improve plans regardless of data distribution, making them safe to apply unconditionally.
Constant folding evaluates expressions involving only constants at compile time rather than execution time.
12345678
-- Before: Expression evaluated for every rowSELECT *FROM ordersWHERE total_price > 100 * 1.08 - 5 AND order_date >= '2024-01-01'::date + 30; -- The expressions "100 * 1.08 - 5" and -- "'2024-01-01' + 30" are computed repeatedly12345678
-- After: Constants pre-computedSELECT *FROM ordersWHERE total_price > 103.0 AND order_date >= '2024-01-31'; -- Expressions evaluated once at-- optimization time, not per rowPredicate simplification applies Boolean algebra to reduce the complexity of WHERE clauses. This eliminates redundant conditions and identifies tautologies (always true) or contradictions (always false).
12345678910111213141516171819202122
-- Contradiction Detection-- Before: This query should return empty resultSELECT * FROM employees WHERE department_id = 10 AND department_id = 20;-- After: Entire query can be eliminated (returns empty) -- Tautology Elimination-- Before: Redundant conditionSELECT * FROM orders WHERE status = 'SHIPPED' OR 1=1;-- After: No predicate needed (returns all rows) -- Redundant Conjunct Elimination-- Before: Overlapping conditionsSELECT * FROM products WHERE price > 100 AND price > 50;-- After: Only the stronger condition remainsSELECT * FROM products WHERE price > 100; -- Absorption Law-- Before: Complex expressionSELECT * FROM items WHERE (A AND B) OR A;-- After: Simplified using absorption (A OR (A AND B) = A)SELECT * FROM items WHERE A;Rule-based optimizers identify and remove operations that have no effect on the query result:
π[a,b,c](R) where R has exactly columns {a, b, c} is equivalent to just R.WHERE TRUE or WHERE 1=1) can be eliminated, as it passes all rows unchanged.When queries reference views, the optimizer must decide how to incorporate view definitions. Rule-based optimizers typically expand views inline and then merge the expanded query with the outer query:
1234567891011121314151617181920212223242526272829
-- View DefinitionCREATE VIEW expensive_products ASSELECT product_id, name, price, category_idFROM productsWHERE price > 1000; -- Query using the viewSELECT e.name, c.category_nameFROM expensive_products eJOIN categories c ON e.category_id = c.idWHERE e.category_id = 5; -- After View Expansion (conceptual intermediate step)SELECT e.name, c.category_nameFROM ( SELECT product_id, name, price, category_id FROM products WHERE price > 1000) eJOIN categories c ON e.category_id = c.idWHERE e.category_id = 5; -- After View Merging (final optimized form)SELECT p.name, c.category_nameFROM products pJOIN categories c ON p.category_id = c.idWHERE p.price > 1000 AND p.category_id = 5;-- Predicates merged, subquery eliminatedNot all views can be safely merged. Views containing DISTINCT, GROUP BY, LIMIT, or window functions often cannot be merged without changing semantics. Rule-based optimizers must recognize these patterns and avoid incorrect transformations.
A production rule-based optimizer maintains a comprehensive catalog of transformation rules. This catalog is the heart of the optimizer—it encodes decades of database optimization wisdom. Let's examine how such a catalog is structured.
Modern rule-based optimizers represent rules declaratively rather than as imperative code. This allows rules to be reasoned about, combined, and verified for correctness. Here's a detailed specification of a rule:
1234567891011121314151617181920212223242526272829303132333435363738
# Rule: JoinCommutativity# Algebraic basis: R ⋈ S ≡ S ⋈ R---rule_id: JOIN_COMMUTEname: "Join Commutativity"category: JOIN_TRANSFORMATIONSphase: LOGICAL_OPTIMIZATION pattern: type: InnerJoin left: bind: ?R # Bind left operand to variable ?R right: bind: ?S # Bind right operand to variable ?S condition: bind: ?cond # Bind join condition to variable ?cond preconditions: - type: InnerJoin # Only applies to inner joins # Note: Left/right/semi joins are NOT commutative action: type: InnerJoin left: ?S # Swap operands right: ?R condition: ?cond # Condition unchanged postconditions: # Verify logical equivalence - schema_preserved: true - cardinality_preserved: true metadata: cost_neutral: true # Neither better nor worse inherently enables: - SELECTION_PUSHDOWN # May enable other rules - JOIN_ASSOCIATIVITY priority: 50 # Medium priority in rule orderingRules don't operate in isolation—they interact in complex ways. Some rules enable others (cascading effects), while some rules compete for the same match (mutual exclusion). Understanding these interactions is crucial for optimizer design:
| Pattern | Description | Example |
|---|---|---|
| Enabling | Rule A creates a pattern that Rule B can match | Selection pushdown creates adjacent selections that can be merged |
| Blocking | Rule A's transformation makes Rule B inapplicable | Converting subquery to join removes the subquery pattern |
| Competing | Both Rule A and Rule B match, only one can apply | Join commutativity vs. selection pushdown when both match |
| Cascading | Rule application triggers recursive application | Pushing selection down enables another pushdown |
| Oscillating | Rules can undo each other, causing infinite loops | Join commutativity applied repeatedly swaps back and forth |
Two critical properties that a rule set must possess:
Termination: The rule application process must eventually stop. Without termination, the optimizer runs forever. Oscillation is a common threat—rules like join commutativity (R ⋈ S → S ⋈ R) could apply indefinitely. Optimizers prevent this through:
Confluence: Regardless of the order rules are applied, the same final result should be reached. This is harder to guarantee but desirable for predictability. Many optimizers don't achieve full confluence; they rely on carefully ordered rule phases to produce good (if not globally optimal) results.
The Volcano and Cascades optimization frameworks, developed by Goetz Graefe, introduced the concept of "rules" as first-class citizens in optimizer architecture. Rather than hard-coding transformations, these systems allow rules to be added, removed, and modified declaratively. This idea influenced virtually every modern query optimizer, including those in PostgreSQL, SQL Server, and Apache Calcite.
Modern optimizers don't fall neatly into "rule-based" or "cost-based" categories. Instead, they exist on a spectrum, using rules for some decisions and cost estimation for others. Understanding this spectrum helps clarify when each approach is appropriate.
Rule-based optimization is the right choice when:
Cost-based optimization is essential when:
Contemporary query optimizers use a two-phase architecture:
Phase 1: Rule-Based Logical Optimization
Phase 2: Cost-Based Physical Optimization
Rule-based transformations also serve a crucial purpose in cost-based optimization: they reduce the search space. By canonicalizing plans and eliminating obviously suboptimal structures, rules ensure the cost-based phase doesn't waste time evaluating plans that would never be chosen.
Implementing a rule-based optimizer requires careful attention to engineering details. The conceptual simplicity of "apply transformation rules" belies significant implementation complexity.
Rules match against portions of the query plan tree. With hundreds of rules and potentially large plans, naive pattern matching becomes prohibitively expensive. Production optimizers use several techniques:
12345678910111213141516171819202122232425262728293031323334353637
// Naive approach: O(rules × nodes) matching per iterationfunction naiveOptimize(plan): changed = true while changed: changed = false for each rule in rules: for each node in plan.allNodes(): if rule.matches(node): plan = rule.apply(plan, node) changed = true // Optimized: Rule Indexing + Event-Driven Matchingfunction optimizedOptimize(plan): // Index rules by root operator type ruleIndex = buildRuleIndex(rules) // {Join: [r1,r2], Select: [r3], ...} // Worklist of nodes to process worklist = new Queue(plan.allNodes()) while not worklist.isEmpty(): node = worklist.dequeue() applicableRules = ruleIndex.get(node.type) for rule in applicableRules: if rule.matches(node): newNode = rule.apply(node) // Only reprocess affected nodes worklist.add(newNode) worklist.addAll(newNode.children) break // Re-evaluate this node's rules // Key optimizations:// 1. Rule indexing by operator type// 2. Worklist avoids re-scanning entire tree// 3. Early termination on first match// 4. Minimal reprocessing after transformationAs rules transform the plan, various invariants must be maintained:
Schema Consistency: Every operator must have correctly typed inputs and outputs. Rules must update column references when they restructure the plan.
Semantics Preservation: The transformed plan must produce identical results to the original. This requires careful handling of NULL values, duplicate rows, and ordering.
Property Propagation: Physical properties (like sort order or data distribution) must be tracked and updated as transformations occur.
| Property | Description | Affected By |
|---|---|---|
| Output Schema | Column names, types, and nullability | Projection, join, rename operations |
| Sort Order | Ordering of output rows | Sort, index scan, merge join |
| Partitioning | How data is distributed | Shuffle, partition-aware joins |
| Cardinality Bounds | Min/max estimated row counts | All filter operations |
| Required Columns | Which columns are needed downstream | Projection pushdown analysis |
Rule-based optimizers require exhaustive testing because bugs in transformation rules can silently produce incorrect results. Key testing strategies include:
Property-Based Testing: Generate random queries and verify that optimized plans produce the same results as unoptimized plans.
Regression Testing: Maintain a corpus of queries where optimal plans are known and verify that new rules don't regress performance.
Formal Verification: Some systems use proof assistants to mathematically verify that transformation rules preserve semantics. This is especially important for rules handling edge cases like NULL values.
Plan Diff Tools: Compare plans before and after changes to rule sets, flagging unexpected regressions.
Many seemingly correct rule transformations fail in the presence of NULL values. For example, NOT (A = B) is not equivalent to A <> B when either value could be NULL—one can return FALSE while the other returns UNKNOWN. Rule writers must exhaustively consider NULL semantics, which accounts for many optimizer bugs in production systems.
We've explored the foundations of rule-based query optimization, understanding how this deterministic, pattern-based approach transforms queries into more efficient forms. Let's consolidate the key insights:
What's Next:
With the foundations of rule-based optimization established, we'll dive into the specific heuristics that drive query transformation. The next page explores Common Heuristics—the time-tested rules that optimizers use to improve virtually every query, from selection pushdown to join ordering strategies.
You now understand the rule-based optimization paradigm—its historical roots, architectural principles, and role in modern query optimization. This foundation prepares you to study the specific heuristic rules that drive practical query optimization.