The search space of execution plans is incomprehensibly vast—potentially billions or trillions of possibilities for complex queries. Yet the query optimizer must navigate this space quickly, typically in milliseconds. How does it accomplish this seemingly impossible task?
The answer lies in plan enumeration—the systematic process of generating and evaluating candidate execution plans. But "systematic" doesn't mean "exhaustive." Modern optimizers employ sophisticated algorithms that intelligently explore the most promising regions of the search space while safely ignoring vast swaths of poor alternatives.
In this page, we'll examine the major enumeration strategies, from the elegant dynamic programming approach that guarantees optimality within its scope, to heuristic and randomized methods that trade guarantees for speed.
By the end of this page, you will understand the major plan enumeration strategies (exhaustive, dynamic programming, greedy, randomized), how dynamic programming exploits optimal substructure to find optimal join orders without enumerating every possible ordering, the role of 'interesting orders' in reducing enumeration complexity, and when different strategies are appropriate based on query characteristics.
Plan enumeration is the process of generating candidate execution plans from the search space. The fundamental challenge is:
Generate enough plans to find a good solution, but not so many that optimization becomes prohibitively expensive.
This framing reveals enumeration as a search problem: we seek the execution plan with minimum estimated cost, where a complete plan specifies the join order, the join algorithm for each join, and the access method for each base table.
Enumeration strategies fall into several categories based on their approach:
| Category | Approach | Optimality | Scalability |
|---|---|---|---|
| Exhaustive | Enumerate all valid plans | Guaranteed optimal | Infeasible beyond small queries |
| Dynamic Programming | Build optimal from subproblems | Optimal within scope | Polynomial for restricted shapes |
| Greedy / Heuristic | Make locally optimal choices | No guarantee | Scales to large queries |
| Randomized | Sample search space | Probabilistic quality | Handles complex queries |
| Hybrid | Combine multiple strategies | Depends on components | Adaptive to query complexity |
Each enumeration strategy has trade-offs. Production optimizers typically use hybrid approaches—dynamic programming for join ordering up to a threshold, heuristics for operator placement, and randomization for very large queries.
Dynamic programming (DP) is the dominant algorithm for join order enumeration in relational databases. Introduced in IBM's System R and refined over decades, DP exploits a key property of join ordering: optimal substructure.
Optimal substructure means:
The optimal plan for joining a set of tables S can be built from optimal plans for subsets of S.
Specifically, if the optimal plan for tables {A, B, C, D} joins the result of {A, B} with the result of {C, D}, then the sub-plans it uses for {A, B} and {C, D} must themselves be optimal: if a cheaper plan existed for either subset, substituting it would lower the total cost without affecting the rest of the plan.
This property enables bottom-up construction of the optimal plan.
The classic System R algorithm enumerates left-deep join orderings:
1. Initialize: For each single table t:
BestPlan[{t}] = best access method for t
2. For size = 2 to n: // n = number of tables
For each subset S of size 'size' that is connected:
For each t in S where S - {t} is connected:
For each join algorithm J:
plan = BestPlan[S - {t}] JOIN[J] AccessMethod[t]
if cost(plan) < cost(BestPlan[S]):
BestPlan[S] = plan
3. Return BestPlan[{all tables}]
Key Insight: Instead of evaluating n! orderings, we evaluate O(n × 2^n) subproblems—exponential, but much smaller than factorial for practical n.
```python
def dp_join_ordering(tables, join_conditions):
    """
    Dynamic programming algorithm for optimal join ordering.
    Returns the minimum-cost plan for joining all tables.
    """
    n = len(tables)

    # BestPlan maps table subsets to (cost, plan) tuples
    BestPlan = {}

    # Initialize single-table plans (base case)
    for t in tables:
        subset = frozenset([t])
        BestPlan[subset] = (
            estimate_scan_cost(t),
            AccessPlan(t, best_access_method(t))
        )

    # Build up optimal plans for larger subsets
    for size in range(2, n + 1):
        for subset in subsets_of_size(tables, size):
            if not is_connected(subset, join_conditions):
                continue  # Skip disconnected subsets

            best_cost = float('inf')
            best_plan = None

            # Try all ways to partition subset into two parts
            for left_subset in proper_subsets(subset):
                right_subset = subset - left_subset

                if not is_connected(left_subset, join_conditions):
                    continue
                if not is_connected(right_subset, join_conditions):
                    continue
                if not has_join_condition(left_subset, right_subset, join_conditions):
                    continue  # Would create Cartesian product

                # Get optimal subplans
                left_cost, left_plan = BestPlan.get(left_subset, (float('inf'), None))
                right_cost, right_plan = BestPlan.get(right_subset, (float('inf'), None))

                # Try different join algorithms
                for join_alg in [HASH_JOIN, MERGE_JOIN, NESTED_LOOP]:
                    join_cost = estimate_join_cost(
                        left_plan, right_plan, join_alg, join_conditions
                    )
                    total_cost = left_cost + right_cost + join_cost

                    if total_cost < best_cost:
                        best_cost = total_cost
                        best_plan = JoinPlan(left_plan, right_plan, join_alg)

            if best_plan:
                BestPlan[subset] = (best_cost, best_plan)

    # Return optimal plan for full table set
    return BestPlan[frozenset(tables)]
```
Time Complexity: O(3^n) for bushy trees, O(n × 2^n) for left-deep
Space Complexity: O(2^n) to store best plans for all subsets
Given the exponential complexity, DP becomes impractical for large n:
| Tables | 2^n Subsets | 3^n Operations | Typical Time |
|---|---|---|---|
| 4 | 16 | 81 | < 1 ms |
| 8 | 256 | 6,561 | ~1 ms |
| 12 | 4,096 | 531,441 | ~100 ms |
| 16 | 65,536 | 43M | ~10 sec |
| 20 | 1M | 3.5B | ~hours (impractical) |
Most systems switch to heuristics for queries joining more than 10-15 tables.
For most commercial systems, n ≈ 12-15 tables is the practical limit for exhaustive DP. Beyond this, optimization time becomes comparable to or exceeds query execution time, defeating the purpose. Systems fall back to heuristics, random sampling, or user hints.
Pure cost comparison misses an important consideration: the order of tuples in intermediate results matters. An intermediate result sorted on a particular attribute might enable cheaper subsequent operations.
Consider two plans for the same subset {A, B}:
- Plan 1: hash join, cheapest overall, but produces an unordered result
- Plan 2: sort-merge join, slightly more expensive, but produces output sorted on the join key
Pure DP would keep only Plan 1. But if we later need to merge-join this result with another table on the same attribute, or satisfy an ORDER BY on it, Plan 2 might be better overall because it avoids a later sort.
Instead of keeping one best plan per subset, keep the cheapest unordered plan plus the cheapest plan for each interesting order the subset can produce.
An interesting order is a sort order that might benefit later operations: join keys of subsequent joins, GROUP BY columns, and ORDER BY columns.
```python
from collections import defaultdict

def dp_with_interesting_orders(tables, query):
    """
    DP algorithm tracking interesting orders for each subset.
    """
    # Compute set of interesting orders from query
    interesting_orders = compute_interesting_orders(query)
    # Includes: join keys, GROUP BY cols, ORDER BY cols

    # BestPlan[subset][order] = (cost, plan)
    # order = None means unordered
    BestPlan = defaultdict(dict)

    # Initialize single-table plans
    for t in tables:
        subset = frozenset([t])

        # Unordered scan
        BestPlan[subset][None] = (scan_cost(t), SeqScan(t))

        # For each interesting order, check if index exists
        for order in interesting_orders:
            if has_index(t, order):
                cost = index_scan_cost(t, order)
                BestPlan[subset][order] = (cost, IndexScan(t, order))

    # Build up larger subsets
    for size in range(2, len(tables) + 1):
        for subset in connected_subsets_of_size(tables, size):
            # Try all join combinations
            for left, right in valid_partitions(subset):
                for l_order, (l_cost, l_plan) in BestPlan[left].items():
                    for r_order, (r_cost, r_plan) in BestPlan[right].items():

                        # Try merge join if inputs can be ordered appropriately
                        join_key = get_join_key(left, right)
                        if join_key in interesting_orders:
                            # Check if we can use merge join
                            if l_order == join_key or l_order is None:
                                if r_order == join_key or r_order is None:
                                    plan, cost = try_merge_join(
                                        l_plan, r_plan, l_order, r_order, join_key
                                    )
                                    update_best(BestPlan[subset], join_key, cost, plan)

                        # Hash join (destroys order)
                        plan, cost = try_hash_join(l_plan, r_plan)
                        update_best(BestPlan[subset], None, cost, plan)

                        # Nested loop (preserves outer order)
                        plan, cost = try_nested_loop(l_plan, r_plan)
                        update_best(BestPlan[subset], l_order, cost, plan)

    # Return best final plan, considering any required final ordering
    return select_final_plan(BestPlan[frozenset(tables)], query)
```
With k interesting orders, we keep up to k+1 plans per subset, increasing space to O(k × 2^n) and time proportionally. However, k is typically small (often ≤10), so this overhead is acceptable for the significant plan quality improvements.
Beyond ordering, some systems generalize to other physical properties, such as how intermediate results are partitioned and where they are located across nodes.
This generalization is essential for distributed query optimization.
A key benefit of tracking interesting orders is avoiding redundant sort operations. Without this, an optimizer might plan: (a) Hash join producing unordered result, (b) Sort for merge join, (c) Sort for ORDER BY. With interesting orders, a single sort can serve multiple purposes.
When DP becomes too expensive (many tables, complex predicates), optimizers fall back to heuristic enumeration—methods that don't guarantee optimality but produce reasonable plans quickly.
The simplest heuristic: greedily extend the join order by choosing the next table that minimizes cost increase:
1. Start with smallest table (or most selective)
2. Repeat until all tables joined:
a. For each remaining table t:
- Estimate cost of joining current result with t
b. Add the table that minimizes cost
3. Return the constructed order
Complexity: O(n²) comparisons—dramatically less than O(3^n) DP
Quality: Often 2-5× worse than optimal, occasionally much worse
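As a minimal sketch, this greedy strategy might look like the following, reusing the placeholder helpers from the DP example above (estimate_scan_cost, estimate_join_cost, AccessPlan, JoinPlan) and fixing the join algorithm to keep the sketch short:

```python
def greedy_join_ordering(tables, join_conditions):
    """
    Greedy left-deep join ordering: start from the cheapest single-table
    plan, then repeatedly join in the table whose estimated join cost
    is lowest. O(n^2) cost estimates instead of an exponential search.
    """
    remaining = set(tables)

    # Step 1: start with the cheapest table to scan
    start = min(remaining, key=estimate_scan_cost)
    remaining.remove(start)
    plan = AccessPlan(start, best_access_method(start))

    # Step 2: repeatedly add the locally cheapest next table
    while remaining:
        best_table, best_plan, best_cost = None, None, float('inf')
        for t in remaining:
            right = AccessPlan(t, best_access_method(t))
            # Fixed to hash join for brevity; a real version would also
            # try merge and nested-loop joins here.
            cost = estimate_join_cost(plan, right, HASH_JOIN, join_conditions)
            if cost < best_cost:
                best_table, best_plan, best_cost = t, JoinPlan(plan, right, HASH_JOIN), cost
        remaining.remove(best_table)
        plan = best_plan

    return plan
```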
Choose join order to minimize intermediate result sizes by joining most selective relationships first:
Intuition: Smaller intermediate results → less work at each step
Weakness: Ignores index availability, doesn't consider algorithm costs
Most optimizers apply certain rules as "no-brainer" improvements, avoiding the need to enumerate alternatives:
| Rule | Description | Rationale |
|---|---|---|
| Push Selections | Apply filters as early as possible | Reduce intermediate result sizes |
| Smallest Outer | Use smaller table as outer in nested loop | Fewer outer loop iterations |
| Build on Small | Use smaller table as build side in hash join | Smaller hash table fits memory |
| Avoid Cartesian | Never plan Cartesian products if alternatives exist | Cartesian products are disastrous |
| Use Indexes | Use index if selectivity < threshold (often 5-10%) | Index scan beats full scan for selective queries |
| Prefer Pipelining | Prefer operators that don't block (for first-row latency) | Better responsiveness |
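Two of these rules, sketched as simple decision functions; HashJoin, estimated_rows, and estimated_selectivity are hypothetical stand-ins for a real system's plan nodes and statistics:

```python
def choose_hash_join_sides(left_plan, right_plan):
    """'Build on Small': build the hash table on the smaller input."""
    if estimated_rows(left_plan) <= estimated_rows(right_plan):
        return HashJoin(build=left_plan, probe=right_plan)
    return HashJoin(build=right_plan, probe=left_plan)

def choose_access_method(table, predicate, threshold=0.05):
    """'Use Indexes': prefer an index scan only for selective predicates."""
    if has_index(table, predicate.column) and estimated_selectivity(predicate) < threshold:
        return IndexScan(table, predicate.column)
    return SeqScan(table)
```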
Heuristics make decisions based on local information, which can lead to globally poor choices:
Example: Table A has 1M rows, B has 100 rows, C has 10K rows. A greedy strategy joins B first because it is the cheapest immediate step, but if the join with C filters A far more aggressively, the globally better order starts with C, something the greedy choice cannot see.
Heuristics lack the global view that DP provides.
Well-designed optimizers use heuristics and DP together. Heuristics quickly eliminate obviously bad alternatives (like Cartesian products or unindexed filters), reducing the space that DP must search. This hybrid approach works better than either alone.
For very large queries where even heuristics struggle, randomized enumeration provides an alternative. Instead of systematically exploring the space, these methods sample it randomly, relying on probability to find good plans.
Start with an arbitrary valid plan and repeatedly make random modifications:
1. Generate initial plan P (random or heuristic)
2. Repeat for k iterations:
a. Generate neighbor plan P' by:
- Swapping two adjacent joins
- Changing one join algorithm
- Changing one access method
b. If cost(P') < cost(P): P = P'
3. Return best P found
This is essentially hill climbing in the cost landscape.
To escape local minima, simulated annealing occasionally accepts worse plans:
1. Generate initial plan P, set temperature T = T_initial
2. Repeat until T is cold:
a. Generate random neighbor P'
b. if cost(P') < cost(P):
P = P' // Always accept improvements
c. else:
Accept P' with probability exp(-(cost(P')-cost(P))/T)
d. T = cooling_rate × T
3. Return best P found
At high temperatures, worse plans are often accepted, enabling exploration. As temperature decreases, the algorithm focuses on refinement.
```python
import random
import math

def simulated_annealing_optimize(query, max_iterations=10000):
    """
    Use simulated annealing to find a good query plan.
    """
    # Start with heuristic plan
    current = generate_heuristic_plan(query)
    current_cost = estimate_cost(current)
    best = current
    best_cost = current_cost

    # Annealing parameters
    initial_temp = current_cost * 0.1  # 10% of initial plan cost
    cooling_rate = 0.9995
    temp = initial_temp

    for i in range(max_iterations):
        # Generate random neighbor
        neighbor = generate_neighbor(current)
        neighbor_cost = estimate_cost(neighbor)

        # Cost difference
        delta = neighbor_cost - current_cost

        # Accept or reject
        if delta < 0:
            # Improvement: always accept
            current = neighbor
            current_cost = neighbor_cost
            if current_cost < best_cost:
                best = current
                best_cost = current_cost
        else:
            # Worse plan: accept probabilistically
            acceptance_prob = math.exp(-delta / temp)
            if random.random() < acceptance_prob:
                current = neighbor
                current_cost = neighbor_cost

        # Cool down
        temp *= cooling_rate

        # Early termination if cold enough
        if temp < 0.01:
            break

    return best

def generate_neighbor(plan):
    """
    Generate a neighboring plan by random modification.
    """
    modifications = [
        swap_adjacent_joins,
        change_join_algorithm,
        change_access_method,
        rotate_join_subtree,
    ]
    modifier = random.choice(modifications)
    return modifier(plan)
```
Some optimizers use genetic algorithms (GAs) for plan enumeration: candidate plans (typically join orders) are encoded as chromosomes, a population of plans evolves through selection, crossover, and mutation, and the lowest-cost plans survive to the next generation.
GAs can explore diverse regions of the search space simultaneously.
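A toy sketch of the genetic approach, encoding each candidate as a permutation of tables; the population size, crossover scheme, and the cost_of_order callback are illustrative assumptions, not any system's actual implementation:

```python
import random

def genetic_join_ordering(tables, cost_of_order, population_size=32, generations=100):
    """
    Toy genetic search over join orders. Each 'chromosome' is a
    permutation of the tables; fitness is the estimated plan cost.
    """
    def random_order():
        order = list(tables)
        random.shuffle(order)
        return order

    def crossover(a, b):
        # Order crossover: keep a prefix of parent a, fill the rest
        # with parent b's tables in their relative order.
        cut = random.randint(1, len(a) - 1)
        head = a[:cut]
        return head + [t for t in b if t not in head]

    def mutate(order, rate=0.1):
        # Occasionally swap two positions to keep the population diverse
        if random.random() < rate and len(order) > 1:
            i, j = random.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
        return order

    population = [random_order() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=cost_of_order)              # cheapest first
        survivors = population[:population_size // 2]   # selection
        children = [
            mutate(crossover(random.choice(survivors), random.choice(survivors)))
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children

    return min(population, key=cost_of_order)
```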
| Scenario | Recommendation |
|---|---|
| < 10 tables, simple predicates | Full DP |
| 10-15 tables | Left-deep DP with pruning |
| 15-25 tables | Simulated annealing or genetic |
| 25+ tables or time-critical | Greedy + local improvement |
| Already slow query, optimization is cheap | More exploration acceptable |
PostgreSQL switches to its Genetic Query Optimizer (GEQO) when a query joins at least geqo_threshold FROM items (12 by default). GEQO doesn't guarantee optimality but produces reasonable plans for complex queries that would overwhelm exhaustive search.
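A sketch of how an optimizer might dispatch between strategies; the thresholds mirror the table above and are illustrative, not any particular system's defaults:

```python
def choose_enumeration_strategy(num_tables):
    """
    Pick an enumeration strategy from the number of joined tables,
    roughly following the thresholds in the recommendation table.
    """
    if num_tables < 10:
        return "full_dp"                      # exhaustive DP is affordable
    if num_tables <= 15:
        return "left_deep_dp_with_pruning"    # restrict shape, prune hard
    if num_tables <= 25:
        return "simulated_annealing"          # or a genetic search
    return "greedy_plus_local_improvement"    # very large joins
```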
Pruning eliminates portions of the search space without explicitly evaluating them. Effective pruning can reduce exponential search to tractable levels.
Maintain a bound on the best solution found so far. If a partial plan's cost already exceeds the bound, abandon it:
1. Initialize best_cost = ∞
2. During enumeration of partial plan P:
a. Compute lower_bound(P) = cost so far + min possible completion cost
b. If lower_bound(P) ≥ best_cost: PRUNE (skip this branch)
c. If P is complete and cost(P) < best_cost:
best_cost = cost(P)
Key Challenge: Computing a good lower bound. Loose bounds prune little; tight bounds are expensive to compute.
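A sketch of branch-and-bound enumeration under these assumptions; cost, lower_bound_completion, and extend_plan are hypothetical helpers (lower_bound_completion might, for instance, sum the minimum scan cost of each remaining table):

```python
def enumerate_with_bound(partial_plan, remaining_tables, best_cost_so_far):
    """
    Branch-and-bound: abandon a partial plan as soon as an optimistic
    lower bound on its total cost reaches the best complete plan found.
    """
    bound = cost(partial_plan) + lower_bound_completion(remaining_tables)
    if bound >= best_cost_so_far:
        return None  # PRUNE: this branch cannot beat the incumbent

    if not remaining_tables:
        return partial_plan  # complete plan, candidate new incumbent

    best = None
    for table in remaining_tables:
        extended = extend_plan(partial_plan, table)
        result = enumerate_with_bound(extended, remaining_tables - {table}, best_cost_so_far)
        if result is not None and cost(result) < best_cost_so_far:
            best, best_cost_so_far = result, cost(result)  # tighten the bound
    return best
```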
If plan A dominates plan B (A is better in all respects), B can be eliminated:
Plan A dominates Plan B if:
- cost(A) ≤ cost(B)
AND
- A's physical properties are at least as good as B's
(e.g., A is sorted on more useful columns)
Dominated plans can never be part of an optimal final plan.
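A small sketch of dominance-based pruning, assuming a hypothetical provided_orders helper that reports the set of sort orders a plan delivers:

```python
def dominates(plan_a, plan_b):
    """Plan A dominates B if it costs no more and its properties are at least as good."""
    return (cost(plan_a) <= cost(plan_b)
            and provided_orders(plan_a) >= provided_orders(plan_b))  # superset of orders

def prune_dominated(plans):
    """Keep only plans that are not strictly dominated by another candidate."""
    return [p for p in plans
            if not any(dominates(q, p) and not dominates(p, q) for q in plans)]
```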
Some plans can be identified as poor based on cardinality alone:
Cartesian Product Detection: Any plan producing a Cartesian product is almost certainly suboptimal if alternatives exist.
Result Size Bounds: If a partial plan produces an intermediate result larger than the full Cartesian product of remaining tables, it's likely suboptimal.
The most aggressive pruning: simply don't consider certain plan types:
| Restriction | Effect | Justification |
|---|---|---|
| Left-deep only | n! instead of 4^n | Pipelining, usually good enough |
| No cross products | Exponential reduction | Cross products rarely optimal |
| Limit join algorithms | 2^n reduction factor | If statistics favor one algorithm |
| Index-only for selective | Eliminate scan options | Clear selectivity threshold |
Aggressive pruning speeds optimization but risks missing the optimal plan:
Example: Sometimes a Cartesian product followed by a selective filter is optimal if the filter is very selective and other join paths are expensive. Blanket prohibition would miss this.
Good optimizers use adaptive pruning: more aggressive for complex queries, more thorough for simpler ones where exhaustive search is affordable.
Overly aggressive pruning can harm plan quality more than it helps optimization time. Modern systems often prefer to spend more time optimizing (within reason) rather than risking poor plans. Memory and CPU have gotten cheaper; bad query plans remain expensive.
Modern database systems use sophisticated enumerator architectures that combine multiple strategies.
The Cascades framework (used in SQL Server, derivatives in many systems) represents the state of the art:
Key Concepts: transformation rules that generate alternatives, a memo structure that groups equivalent expressions, and goal-driven (top-down) search over pairs of expressions and required physical properties.
Optimize(Goal G):
1. If memo contains solution for G: return it
2. Apply logical transformations to generate equivalent logical expressions
3. Apply physical transformations to generate physical plans
4. For each required input, recursively Optimize(sub-goal)
5. Cost complete plans, prune dominated ones
6. Memoize and return best plan for G
Memoization: Avoids recomputing overlapping subproblems across rules.
Extensibility: New operators/rules added without modifying core algorithm.
Flexibility: Mix of logical and physical optimization in unified framework.
Pruning Integration: Natural points for branch-and-bound.
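A heavily simplified sketch of this top-down, memoized search; apply_rule, physical_alternatives, inputs_of, and combine_cost are hypothetical placeholders for the framework's rule engine and cost model:

```python
def optimize(goal, memo, rules):
    """
    Simplified Cascades-style optimization: a 'goal' identifies a logical
    expression plus required properties; rules generate alternatives,
    inputs are optimized recursively, and results are memoized.
    """
    if goal in memo:                          # 1. reuse memoized result
        return memo[goal]

    best_cost, best_plan = float('inf'), None

    # 2-3. expand logical and physical alternatives for this goal
    for rule in rules:
        for physical in physical_alternatives(apply_rule(rule, goal)):
            # 4. recursively optimize each required input
            child_plans = [optimize(sub_goal, memo, rules)
                           for sub_goal in inputs_of(physical)]
            if any(p is None for p in child_plans):
                continue

            # 5. cost the complete alternative, keep the cheapest
            total = combine_cost(physical, child_plans)
            if total < best_cost:
                best_cost, best_plan = total, (physical, child_plans)

    memo[goal] = best_plan                    # 6. memoize and return
    return best_plan
```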
| System | Architecture | Key Features |
|---|---|---|
| PostgreSQL | Bottom-up DP + GEQO | Standard DP for small queries, genetic algorithm for large |
| MySQL | Greedy + exhaustive for small | Simple greedy for large joins, exhaustive ≤ 6 tables |
| SQL Server | Cascades variant | Top-down, transformation rules, extensive memoization |
| Oracle | Cost-based + transformations | Sophisticated transformation rules, adaptive optimization |
| Apache Calcite | Volcano/Cascades | Used by Hive, Flink, Druid; rule-based |
Cutting-edge systems perform adaptive optimization:
Compile-Time Adaptation: the optimizer adjusts its enumeration strategy and effort based on query complexity and the estimated cost of the query itself.
Runtime Adaptation: the plan can change during execution, for example by switching join algorithms or re-optimizing when observed cardinalities diverge from estimates.
Oracle's Adaptive Query Processing and SQL Server's Adaptive Joins exemplify runtime adaptation.
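A sketch of the runtime idea behind adaptive joins: buffer the outer input up to a threshold, then commit to an algorithm based on the observed row count. hash_join and index_nested_loop_join are hypothetical executors:

```python
import itertools

def adaptive_join(outer_rows, inner_table, join_key, threshold=1000):
    """
    Runtime-adaptive join sketch: peek at the actual outer row count
    before committing to a join algorithm.
    """
    outer_iter = iter(outer_rows)
    sample = list(itertools.islice(outer_iter, threshold + 1))
    all_outer = itertools.chain(sample, outer_iter)   # replay buffered rows

    if len(sample) > threshold:
        # Many outer rows observed: building a hash table pays off
        return hash_join(all_outer, inner_table, join_key)
    # Few outer rows: probe an index on inner_table per row instead
    return index_nested_loop_join(all_outer, inner_table, join_key)
```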
Emerging research applies machine learning to query optimization. ML models learn from past queries to predict good plans directly, potentially bypassing traditional enumeration. Systems like Neo, Bao, and Balsa show promising results, though production deployment remains challenging.
Plan enumeration is the core algorithmic challenge of query optimization—navigating an exponential search space to find efficient execution plans. We've explored the major strategies and their trade-offs.
What's Next:
Enumeration generates candidate plans, but how do we compare them? The next page examines Cost-Based Selection—the estimation techniques, cost models, and selection algorithms that determine which enumerated plan ultimately executes.
You now understand the major plan enumeration strategies used by query optimizers. This algorithmic foundation is essential for understanding how databases choose execution plans and why certain query patterns are more challenging to optimize than others.