Imagine you're planning a road trip visiting 10 cities. How many different routes could you take? The answer—over 3.6 million—hints at a combinatorial explosion that query optimizers face with every query. But the optimizer's challenge is far worse: it's not just choosing an order, but also selecting how to travel between each city, what vehicle to use, whether to make stops, and countless other decisions.
The search space is the complete set of all possible execution plans for a query. Understanding its structure, size, and properties is essential for understanding how optimizers work—and why they sometimes fail. In this page, we'll map this vast landscape and understand the mathematical principles that govern it.
By the end of this page, you will understand how the search space of execution plans is structured, why it grows explosively with query complexity, the key factors that determine search space size, and how different plan topologies (linear vs. bushy trees) affect optimization. This knowledge is foundational for understanding the enumeration and pruning strategies covered in subsequent pages.
The search space for a query is the set of all possible execution plans that correctly implement the query's semantics. Each point in this space represents a distinct way of executing the query—a unique combination of operators, algorithms, orderings, and access methods.
Formally, we can define the search space as:
$$S(Q) = \{\, P \mid P \text{ is a valid execution plan for query } Q \,\}$$
where a valid plan P is one that, for every database instance, produces exactly the result set defined by Q's semantics.
The search space is implicit—it's not explicitly enumerated in memory but is defined by the grammar of possible plan structures. The optimizer explores this space using generation rules that produce valid plans.
Every execution plan is defined by choices across multiple dimensions. These dimensions are largely independent, so the total search space is approximately the Cartesian product of the choices along each one:
| Dimension | Description | Example Choices | Typical Options |
|---|---|---|---|
| Join Order | Sequence in which tables are joined | R ⋈ S ⋈ T vs R ⋈ T ⋈ S | n! / 2 orderings |
| Join Algorithms | Physical algorithm for each join | Hash, Merge, Nested-Loop | 3-5 per join |
| Access Methods | How to read each base table | Table scan, Index scan, Index-only | 2-5 per table |
| Tree Shape | Overall structure of the join tree | Left-deep, right-deep, or bushy | Catalan(n-1) shapes |
| Operator Placement | Where to apply filters, projections | Push down vs. post-join | Multiple positions |
| Parallelism | Degree and partitioning strategy | Sequential, hash-partition, broadcast | Varies by system |
Because dimensions are largely independent, their choices multiply together. For a 4-table join (and thus 3 joins), with 10 candidate join orders, 3 algorithms per join, and 2 access methods per table, the space explodes: 10 × 3³ × 2⁴ = 10 × 27 × 16 = 4,320 plans. Each additional dimension multiplies the total.
Among all dimensions of the search space, join ordering typically dominates complexity. The order in which tables are joined can affect query performance by orders of magnitude, and the number of possible orderings grows super-exponentially with the number of tables.
Consider a simple example with three tables:
Query: Find orders from premium customers for electronics products.
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.tier = 'premium' AND p.category = 'electronics';
The filter on customers yields ~5,000 rows (10%). The filter on products yields ~1,000 rows (10%). But orders has no direct filter.
A plan that filters customers and products early and joins the small results with orders touches only tens of thousands of rows; a plan that joins the full tables first and filters at the end pushes millions of rows through every join. The resulting cost gap is approximately 20× in this example, and real-world differences can exceed 1000×. This enormous variance makes join order optimization the primary focus of most query optimizers.
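To make the 20× figure concrete, here is a back-of-the-envelope sketch. The orders row count of 1,000,000 is an assumed figure; the 10% selectivities follow the text, and orders are assumed uniform across customers and products:

```python
# Back-of-the-envelope comparison of two plans for the query above.
ORDERS = 1_000_000
SEL_PREMIUM = 0.10        # fraction of customers (hence of orders) that are premium
SEL_ELECTRONICS = 0.10    # fraction of products (hence of orders) that are electronics

# Good plan: push filters down, join the filtered customers with orders first.
join1_out = ORDERS * SEL_PREMIUM           # 100,000 rows survive the first join
join2_out = join1_out * SEL_ELECTRONICS    # 10,000 rows survive the second join
work_good = join1_out + join2_out          # tuples flowing between operators

# Bad plan: join the unfiltered tables first and filter at the very end.
work_bad = ORDERS + ORDERS                 # both joins pass ~1,000,000 rows each

print(f"good: {work_good:,.0f}  bad: {work_bad:,.0f}  ratio: {work_bad / work_good:.0f}x")
```

Under these assumptions the bad plan processes roughly 18× more intermediate tuples, in line with the approximate 20× figure above.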
For n tables, how many join orderings exist? The answer depends on whether we consider linear (left-deep/right-deep) or bushy trees:
Linear Trees (Left-Deep): In a left-deep tree, the inner relation of each join is always a base table. The number of left-deep orderings is simply the number of table permutations:
$$\text{Left-deep orderings} = n!$$
For commutative joins (where A ⋈ B = B ⋈ A), this reduces to n!/2.
Bushy Trees: Bushy trees allow arbitrary nesting—any intermediate result can join with any other. The number of distinct bushy tree shapes follows the Catalan numbers:
$$C_n = \frac{1}{n+1}\binom{2n}{n} = \frac{(2n)!}{(n+1)!n!}$$
A bushy tree over n tables has n−1 internal join nodes, so there are Catalan(n−1) distinct shapes. Multiplying by the n!/2 table orderings gives the bushy totals in the table below. The Catalan factor alone grows approximately as 4^n / n^1.5, so the product grows super-exponentially.
| Tables (n) | Left-Deep (n!/2) | Catalan(n-1) | Bushy Total | Comparison |
|---|---|---|---|---|
| 2 | 1 | 1 | 1 | Trivial |
| 3 | 3 | 2 | 6 | Manageable |
| 4 | 12 | 5 | 60 | Small |
| 5 | 60 | 14 | 840 | Moderate |
| 6 | 360 | 42 | 15,120 | Large |
| 8 | 20,160 | 429 | 8.6M | Challenging |
| 10 | 1.8M | 4,862 | 8.8B | Extreme |
| 12 | 239M | 58,786 | ~1.4×10^13 | Impossible |
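These counts can be verified directly. A minimal Python sketch, using the table's conventions (n!/2 left-deep orderings, Catalan(n−1) tree shapes):

```python
from math import comb, factorial

def catalan(n: int) -> int:
    """C_n = binom(2n, n) / (n + 1): binary tree shapes with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

def left_deep(n: int) -> int:
    """Left-deep orderings for n tables, halved for the commutative bottom join."""
    return factorial(n) // 2

def bushy(n: int) -> int:
    """Bushy total: Catalan(n - 1) tree shapes times the left-deep orderings."""
    return catalan(n - 1) * left_deep(n)

for n in (2, 3, 4, 5, 6, 8, 10, 12):
    print(f"{n:>2} tables: left-deep {left_deep(n):>14,}   bushy {bushy(n):>18,}")
```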
Even if we could evaluate one plan per nanosecond, exhaustively searching ~10^13 plans would take hours for a single query; at a more realistic microsecond per plan, it would take months. No optimization technique eliminates this fundamental explosion; we can only navigate it intelligently through pruning and heuristics.
Beyond join ordering, the optimizer must choose which algorithm to use for each operator. This creates another multiplicative factor in the search space.
Most database systems support multiple join algorithms, each optimal for different data characteristics:
| Algorithm | Best When | Strengths | Weaknesses |
|---|---|---|---|
| Nested Loop | Small outer, indexed inner | Simple, works with any join condition | O(n×m) worst case |
| Block Nested Loop | Medium tables, limited memory | Better cache utilization | Still O(n×m) |
| Index Nested Loop | Indexed inner table | Efficient for selective joins | Requires suitable index |
| Hash Join | Equi-joins, one small table | O(n+m) time | Requires memory for hash table |
| Sort-Merge Join | Pre-sorted data, range joins | Handles duplicates well | Requires sorting if not sorted |
| Grace Hash Join | Large tables, limited memory | Disk-based, predictable | Multiple passes over data |
For a query with k joins, if each has a choice of j algorithms, this dimension contributes j^k plans.
For each base table, the optimizer chooses how to retrieve data:
| Access Method | Description | Cost Characteristics |
|---|---|---|
| Sequential Scan | Read entire table | O(pages), good for large result sets |
| Index Scan | Use B-tree/hash index | O(log n + k), good for selective queries |
| Index-Only Scan | Answer from index alone | Fastest when applicable |
| Bitmap Scan | Combine multiple indexes | Good for OR conditions |
| TID Scan | Direct tuple access | Used for specific row access |
With an average of a access methods per table and t tables, this dimension contributes a^t plans.
The total algorithm selection space for a query with t tables and t-1 joins, with j join algorithms and a access methods:
$$\text{Algorithm Space} = a^t \times j^{t-1}$$
For 5 tables with 3 access methods each and 4 join algorithms: $$3^5 \times 4^4 = 243 \times 256 = 62,208 \text{ algorithm combinations}$$
This multiplies with join orderings. A 5-table query might have: $$60 \text{ (join orders)} \times 62,208 \text{ (algorithms)} = 3.7 \text{ million plans}$$
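A minimal sketch of the multiplication, under the same simplifying assumptions (left-deep orderings only, independent per-join and per-table choices):

```python
from math import factorial

def plan_space(tables: int, join_algos: int, access_methods: int) -> int:
    """Size of the left-deep plan space under the simplified model above."""
    join_orders = factorial(tables) // 2          # left-deep orderings, n!/2
    algo_choices = join_algos ** (tables - 1)     # one algorithm per join
    access_choices = access_methods ** tables     # one access path per table
    return join_orders * algo_choices * access_choices

print(f"{plan_space(5, 4, 3):,}")   # 60 * 256 * 243 = 3,732,480 plans
```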
In practice, many algorithm choices can be pruned early. If hash join always beats nested-loop for a particular join (given statistics), nested-loop is eliminated without full plan enumeration. This 'dominance pruning' dramatically reduces the effective search space.
The shape of the execution plan tree is itself a choice that affects both the search space size and the execution characteristics. Understanding plan topologies is crucial for optimization strategy.
In a left-deep tree, the right input of every join is always a base table. The left input is either a base table (for the first join) or the result of a previous join.
      ⋈
     / \
    ⋈   T4
   / \
  ⋈   T3
 / \
T1  T2
Characteristics:
- The inner (right) input of every join is a base table, so existing indexes can drive index nested-loop joins.
- Execution pipelines naturally up the left spine, with at most one intermediate result in flight.
- The ordering space is the n! permutation space discussed above.
The mirror image—left input of every join is a base table:
  ⋈
 / \
T1  ⋈
   / \
  T2  ⋈
     / \
    T3  T4
Characteristics:
- The mirror constraint: intermediate results always feed the right input.
- Well suited to pipelined hash joins, where hash tables are built from all the base tables and probed in a single pass, at the price of holding several hash tables in memory at once.
- Same ordering count as left-deep trees: n! permutations.
Bushy trees allow any structure—intermediate results can join with other intermediate results:
      ⋈
    /   \
   ⋈     ⋈
  / \   / \
 T1  T2 T3  T4
Characteristics:
- Independent subtrees (here T1 ⋈ T2 and T3 ⋈ T4) can execute in parallel.
- Joining two small intermediate results can beat any linear plan when the base tables are large but their filtered joins are small.
- The largest search space of the three: Catalan(n−1) tree shapes times table orderings.
Despite bushy trees' flexibility, most commercial optimizers focus on left-deep trees:
- The search space is far smaller: n! orderings instead of Catalan(n−1) × n!.
- A base-table inner input lets every join exploit existing indexes.
- Pipelined execution keeps at most one intermediate result materialized at a time, simplifying memory management.
However, modern analytical systems with abundant memory and parallel hardware increasingly explore bushy plans.
Beyond joins, the placement of other operators—selections, projections, aggregations—creates additional search space dimensions.
Filters can often be applied at multiple points in the plan:
Example Query:
SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id
WHERE c.country = 'USA' AND o.amount > 1000;
Placement Options:
- Push down: Filter(customers, country='USA') ⋈ Filter(orders, amount>1000)
- Post-join: Filter(customers ⋈ orders, country='USA' AND amount>1000)

Generally, pushing selections down reduces intermediate result sizes, improving performance. But there are exceptions: an expensive predicate (such as a user-defined function) can be cheaper to evaluate after a selective join has shrunk its input, which is why some optimizers treat predicate placement as a cost-based choice.
Projections remove unneeded columns. Like selections, they can be placed at various points:
Early Projection Benefits:
- Narrower tuples shrink intermediate results, hash tables, and sort runs, reducing memory pressure and I/O.
- More tuples fit per page and per cache line, improving throughput.
Early Projection Risks:
- Dropping a column that a later predicate, join, or ORDER BY still needs yields an incorrect plan, so the optimizer must track which columns remain live.
- In pipelined row-at-a-time execution, extra projection operators can add CPU overhead that outweighs the savings.
For queries with GROUP BY, aggregation can sometimes be pushed down:
SELECT dept_id, SUM(salary)
FROM employees e JOIN departments d ON e.dept_id = d.id
GROUP BY dept_id;
Options:
- Join, then aggregate: join employees with departments, then group the joined result by dept_id.
- Pre-aggregate (eager aggregation): group employees by dept_id first, then join the much smaller per-department partial sums with departments.
Pre-aggregation is legal only for decomposable aggregates (SUM, COUNT, MIN, MAX) and can dramatically reduce join input size.
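To see why decomposability matters, here is a toy sketch with assumed data, showing that pre-aggregating SUM below the join produces the same result as aggregating after it:

```python
from collections import defaultdict

# Toy data, assumed for illustration: (dept_id, salary) rows, and a set of
# department ids standing in for the departments table.
employees = [(1, 100), (1, 200), (2, 50), (2, 75), (2, 25)]
departments = {1, 2}

# Option 1: join first, aggregate after the join.
late = defaultdict(int)
for dept_id, salary in employees:
    if dept_id in departments:       # the join with departments
        late[dept_id] += salary      # the post-join aggregation

# Option 2: eager aggregation, legal because SUM is decomposable.
pre = defaultdict(int)
for dept_id, salary in employees:
    pre[dept_id] += salary           # aggregate first, shrinking the join input
early = {d: s for d, s in pre.items() if d in departments}

assert early == dict(late)           # identical results, smaller join input
print(early)                         # {1: 300, 2: 150}
```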
While operator placement is often handled by heuristic rewrite rules (Module 3: Heuristic Optimization), it also creates search space dimensions. Advanced optimizers consider placement alternatives during cost-based search, potentially finding placements that heuristics miss.
Not all mathematically possible plans are valid. Various constraints restrict the search space:
The join graph of a query defines which tables can be joined directly. A join between tables A and B is only possible if there's a join predicate connecting them.
-- Join Graph:
-- Customers ── Orders ── Products
-- (customers.id = orders.cust_id)
-- (orders.prod_id = products.id)
Without a direct predicate, joining Customers with Products first would create a Cartesian product—almost always disastrous. The optimizer typically constrains the search space to avoid intermediate Cartesian products.
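A sketch of that connectivity test, with the edge set mirroring the join graph above (the function and names are hypothetical illustrations):

```python
# Join predicates as undirected edges between tables.
JOIN_EDGES = {
    frozenset({"customers", "orders"}),   # customers.id = orders.cust_id
    frozenset({"orders", "products"}),    # orders.prod_id = products.id
}

def can_join(left: set[str], right: set[str]) -> bool:
    """True if some predicate connects a table in `left` to a table in `right`."""
    return any(frozenset({a, b}) in JOIN_EDGES for a in left for b in right)

print(can_join({"customers"}, {"orders"}))              # True
print(can_join({"customers"}, {"products"}))            # False: Cartesian product
print(can_join({"customers", "orders"}, {"products"}))  # True, via orders
```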
Outer joins (LEFT, RIGHT, FULL) are not freely reorderable like inner joins:
SELECT * FROM A LEFT JOIN B ON ... LEFT JOIN C ON ...;
The order of outer joins affects semantics: in general, A LEFT JOIN B LEFT JOIN C is not equivalent to A LEFT JOIN C LEFT JOIN B, because NULL-extended rows produced by one join can change what the next join matches. This constrains which orderings are semantically valid.
| Constraint Type | Example | Space Reduction |
|---|---|---|
| No Cartesian Products | Require join predicate connection | Often 90%+ reduction |
| Outer Join Order | Preserve semantic ordering | Significant for outer-heavy queries |
| Index Availability | Only indexed paths where indexes exist | Varies by schema |
| Memory Bounds | Exclude plans exceeding limits | Depends on statistics |
| Parallel Compatibility | Not all operators parallelize | Moderate reduction |
Far from being limitations, constraints dramatically help the optimizer by eliminating massive regions of the search space. A well-constrained query (many indexes, clear join structure, memory bounds) optimizes faster and more reliably than an unconstrained one.
Conceptualizing the search space helps developers understand optimization behavior. While the actual space has as many dimensions as there are choices, we can visualize simplified representations:
Imagine the search space as a mountainous terrain where:
- each point in the terrain is a complete execution plan,
- elevation is the plan's estimated cost (lower is better), and
- the optimizer's task is to find the lowest point it can reach within its time budget.
Landscape Characteristics:
- Many local minima: plans cheaper than all of their neighbors but far from the global optimum.
- Cliffs: a single change, such as swapping one join algorithm, can shift cost by orders of magnitude.
- Plateaus: broad regions of plans with nearly indistinguishable estimated costs.
Two plans are "neighbors" if one can be reached from the other by a single transformation: swapping two adjacent joins, changing one join's algorithm, or switching one table's access method.
The optimizer explores by moving through neighborhoods. Different search strategies (exhaustive, greedy, simulated annealing) navigate differently:
| Strategy | Behavior | Strengths | Risks |
|---|---|---|---|
| Exhaustive | Visit every point | Finds global optimum | Infeasible for large spaces |
| Greedy | Always move downhill | Fast convergence | Trapped in local minima |
| Hill Climbing | Move to best neighbor | Simple, effective | Local minima traps |
| Simulated Annealing | Probabilistic uphill moves | Escapes local minima | Slower convergence |
| Dynamic Programming | Optimal substructure | Polynomial for some dimensions | Requires problem structure |
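For intuition, here is a toy hill-climbing sketch over left-deep join orders, with assumed cardinalities and a deliberately crude cost function. It moves to the best neighbor (one adjacent swap) until no move improves the plan, which is exactly how it can get stuck in a local minimum:

```python
# Toy cost model: assumed per-table cardinalities; each join's output size is
# crudely estimated, and the plan pays for every intermediate result.
CARD = {"A": 1_000, "B": 500, "C": 10, "D": 2_000, "E": 40}

def cost(order: list[str]) -> int:
    rows, total = CARD[order[0]], 0
    for t in order[1:]:
        rows = min(rows, CARD[t]) * 2   # toy estimate of the join output size
        total += rows                   # pay for each intermediate result
    return total

def hill_climb(order: list[str]) -> list[str]:
    """Repeatedly move to the cheapest neighbor; stop at a local minimum."""
    while True:
        neighbors = []
        for i in range(len(order) - 1):
            n = order[:]
            n[i], n[i + 1] = n[i + 1], n[i]   # one adjacent-swap move
            neighbors.append(n)
        best = min(neighbors, key=cost)
        if cost(best) >= cost(order):
            return order                # no downhill move left
        order = best

print(hill_climb(["A", "B", "C", "D", "E"]))
```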
The 'elevation' in our metaphor—the cost function—is itself an estimate. The true landscape (actual execution time) differs from the estimated landscape (optimizer's cost model). This mismatch explains why optimizers sometimes choose 'optimal' plans that perform poorly: they're finding the minimum of the wrong function.
Understanding the search space informs how optimizers are designed:
The mathematical structure of different search dimensions enables different optimization strategies:
Join Ordering (exhibits optimal substructure): → Dynamic programming is effective because the best plan for joining tables {A,B,C} can be built from the best plans for {A,B}, {A,C}, and {B,C}; a minimal sketch follows this list.
Algorithm Selection (per-operator choice): → Can be decided locally once join order is fixed, often using simple cost comparison.
Operator Placement (rule-based transformations): → Heuristics work well because 'push down selections' is almost always beneficial.
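A minimal sketch of that optimal substructure, with hypothetical cardinalities and toy cost and cardinality estimates (real optimizers use far richer models):

```python
from itertools import combinations

CARD = {"A": 1_000, "B": 100, "C": 10, "D": 50}   # assumed base-table sizes

def best_plans(tables: str):
    """Dynamic programming over table subsets, smallest subsets first.

    The best plan for a set like {A,B,C} is assembled from the best plans
    already computed for its proper subsets, so each subset is solved once
    instead of re-exploring every complete plan.
    """
    best = {frozenset(t): (0, CARD[t], t) for t in tables}   # cost, rows, plan
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    lcost, lrows, lplan = best[left]
                    rcost, rrows, rplan = best[right]
                    cost = lcost + rcost + lrows + rrows   # toy: read both inputs
                    rows = min(lrows, rrows)               # toy join-size estimate
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, rows, (lplan, rplan))
    return best

print(best_plans("ABCD")[frozenset("ABCD")])   # cheapest bushy plan for A,B,C,D
```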
Real optimizers don't explore the full theoretical space. Reduction techniques include:
- Restricting the search to left-deep trees, shrinking the space from Catalan(n−1) × n! to n! plans.
- Forbidding intermediate Cartesian products unless the query demands them.
- Dominance pruning: discarding any partial plan already worse than a known alternative for the same table set.
- Falling back to heuristic or randomized search (greedy construction, genetic algorithms) once the number of tables exceeds a threshold.
For some queries, even reduced search spaces are too large:
Symptoms:
- Optimization (planning) time that rivals or exceeds execution time.
- The optimizer hitting its time or memory budget and falling back to heuristic mode.
- Plan quality degrading sharply as tables are added to the query.
Causes:
- Queries joining many tables, especially with dense join graphs.
- Views or generated SQL that expand into large join graphs.
- Rich predicate and index structure multiplying per-operator choices.
Responses:
- Decompose the query, materializing stable intermediate results.
- Constrain the search with hints or optimizer settings where the system supports them.
- Keep statistics current so that pruning can eliminate bad regions early.
An EXPLAIN output or optimizer trace showing excessive optimization time, fallback to heuristic mode, or dramatically suboptimal plans often indicates search space explosion. Recognizing this pattern is essential for effective query tuning.
The search space of possible execution plans is the arena in which query optimization takes place. Understanding its structure and growth patterns is essential for understanding optimizer behavior and limitations.
What's Next:
Now that we understand the vast landscape the optimizer must navigate, the next page examines Plan Enumeration—the algorithms and strategies optimizers use to systematically explore the search space. We'll see how dynamic programming, heuristics, and randomization combine to find good plans within time budgets.
You now understand the structure and size of the query optimization search space. This knowledge is foundational for understanding the enumeration algorithms, cost models, and selection strategies covered in the remaining pages of this module.