If there's one optimization that has the potential to transform hours-long queries into millisecond responses, it's join ordering—and join ordering is made possible by associativity.
Associativity tells us that (A ⋈ B) ⋈ C is equivalent to A ⋈ (B ⋈ C). Mathematically obvious, but the performance implications are staggering. The choice of which tables to join first can determine whether intermediate results are thousands of rows or billions. The wrong order can crash servers; the right order makes queries fly.
This page explores join associativity in depth: why it holds, how it interacts with join conditions, and most importantly, how optimizers exploit it to find efficient execution plans.
By the end of this page, you will understand: (1) the formal statement and proof of join associativity, (2) how it enables join reordering, (3) the concept of left-deep vs. bushy join trees, (4) join graph representations, and (5) the complexity of join ordering optimization.
Formal Statement
For natural join:
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
For theta join, associativity holds provided each condition references the right operands:
(R ⋈_{θ₁} S) ⋈_{θ₂∧θ₃} T ≡ R ⋈_{θ₁∧θ₃} (S ⋈_{θ₂} T)
where θ₁ involves only attributes of R and S, θ₂ only attributes of S and T, and θ₃ captures any conditions between R and T.
Why Does This Hold?
Intuitively, the result of joining R, S, and T contains tuples where every applicable join condition holds: r matches s, s matches t, and (where such a condition exists) r matches t.
The order in which we perform these pairwise joins doesn't affect which complete (R, S, T) combinations satisfy all conditions.
Proof Sketch:
Define the result of R ⋈ S ⋈ T as the set of all (r, s, t) tuples where r∈R, s∈S, t∈T and the relevant join conditions hold.
Whether we compute (R ⋈ S) first and then join with T, or compute (S ⋈ T) first and then join with R, we're selecting the same set of (r, s, t) combinations.
The intermediate groupings differ, but the final set is identical.
The key insight is that while (R ⋈ S) and (S ⋈ T) may have very different sizes, once joined with the third table, the final result is the same. The optimizer's job is to choose the grouping that minimizes intermediate sizes.
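The argument is easy to check mechanically. Below is a minimal Python sketch (tables as lists of dicts, natural join over shared attribute names; the data is made up for illustration) confirming that both groupings yield the same final set:

```python
# Tables as lists of dicts; natural join matches on shared attribute names.
def natural_join(r, s):
    out = []
    for a in r:
        for b in s:
            shared = set(a) & set(b)
            if all(a[k] == b[k] for k in shared):  # join condition holds
                out.append({**a, **b})             # merge the two tuples
    return out

R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}, {"b": 30, "c": 300}]
T = [{"c": 100, "d": 7}]

canon = lambda rows: sorted(sorted(row.items()) for row in rows)
assert canon(natural_join(natural_join(R, S), T)) == \
       canon(natural_join(R, natural_join(S, T)))  # (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
```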
Example: The Power of Reordering
Consider three tables:
Orders: 10M rows
Customers: 100K rows
Products: 1K rows

Query: Find orders with their customer and product details.
Plan 1: (Orders ⋈ Customers) ⋈ Products
Plan 2: (Customers ⋈ Orders) ⋈ Products (same as Plan 1, just commuted)
Plan 3: Orders ⋈ (Customers ⋈ Products)
The right plan depends on selectivities and join conditions. Associativity gives us the freedom to explore all options.
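To make the trade-off concrete, here is a rough Python sketch using the textbook estimate |R ⋈ S| ≈ |R| × |S| × selectivity; the foreign-key selectivities are assumptions for illustration, not measured values:

```python
# Rough comparison of intermediate sizes, using |R ⋈ S| ≈ |R| * |S| * sel.
# The selectivities below are illustrative assumptions (FK joins).
ORDERS, CUSTOMERS, PRODUCTS = 10_000_000, 100_000, 1_000

sel_order_customer = 1 / CUSTOMERS  # each order matches exactly one customer
# Plan 1: (Orders ⋈ Customers) ⋈ Products
inter_plan1 = ORDERS * CUSTOMERS * sel_order_customer      # 10M rows
# Plan 3: Orders ⋈ (Customers ⋈ Products)
# No customer-product condition exists, so this pair is a cross product.
inter_plan3 = CUSTOMERS * PRODUCTS                         # 100M rows

print(f"Plan 1 intermediate: {inter_plan1:,.0f} rows")
print(f"Plan 3 intermediate: {inter_plan3:,.0f} rows")
```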
When joining n tables, associativity generates different join tree shapes. Understanding these shapes is fundamental to join ordering optimization.
A left-deep tree always joins the next table to the "left" result:
((R₁ ⋈ R₂) ⋈ R₃) ⋈ R₄
      ⋈
     / \
    ⋈   R₄
   / \
  ⋈   R₃
 / \
R₁  R₂
Characteristics:
- The inner (right) input of every join is a base table, which suits index nested-loop joins
- Tuples flow up a single pipeline, matching the iterator execution model
- The search space is the n! possible orderings

A right-deep tree is the mirror of left-deep, nesting joins on the right:
R₁ ⋈ (R₂ ⋈ (R₃ ⋈ R₄))
  ⋈
 / \
R₁  ⋈
   / \
  R₂  ⋈
     / \
    R₃  R₄
Characteristics:
- With hash joins, every build side's hash table may need to be in memory simultaneously
- Lets the largest table stream (probe) through a chain of hash tables without materializing
- Same n! search space as left-deep
Bushy trees are balanced or irregular, with joins nested on both branches:
(R₁ ⋈ R₂) ⋈ (R₃ ⋈ R₄)
     ⋈
   /   \
  ⋈     ⋈
 / \   / \
R₁  R₂ R₃  R₄
Characteristics:
- Either input of a join may itself be a join result, so independent subtrees can execute in parallel
- The search space is far larger, which makes optimization much harder
- Most useful for large analytical (OLAP) queries
| Property | Left-Deep | Right-Deep | Bushy |
|---|---|---|---|
| Search space for n tables | n! | n! | Much larger: (2n-2)! / (n-1)! |
| Pipeline-friendly | Excellent | Good | Complex |
| Memory usage | Predictable | Higher (concurrent hash builds) | May need multiple buffers |
| Parallel potential | Limited | Limited | High |
| Optimizer complexity | Tractable | Tractable | Exponential |
| Common in practice | Most common | Rare | Used for OLAP |
Most OLTP optimizers restrict search to left-deep trees. This reduces the search space to n! orderings (still large, but manageable with dynamic programming). Left-deep plans work well with the iterator/volcano execution model where tuples flow one-at-a-time through the pipeline.
A useful way to visualize join structure is the join graph.
The join graph has:
- A node for each table in the query
- An edge between two tables for each join condition that references both
Example:
SELECT * FROM A, B, C, D
WHERE A.x = B.x
AND B.y = C.y
AND C.z = D.z;
Join graph:
A --- B --- C --- D
(x) (y) (z)
This is a chain (linear graph). Each table joins only with its neighbors.
Chain: Linear sequence of joins
A - B - C - D
Optimization: n! orderings, well-understood.
Star: One central table joins with all others
   B
   |
A--C--D   (C is the center)
   |
   E
Optimization: Central table often appears early in join order.
Clique: Every table joins with every other (fully connected)
A---B
|\ /|
| X |
|/ \|
C---D
Optimization: Maximum flexibility, complex optimization.
Acyclic (tree): No cycles in the join graph
    A
   / \
  B   C
 / \
D   E
Optimization: Special algorithms exist (Yannakakis, etc.).
If two tables have no edge (no join condition between them), joining them produces a Cartesian product—typically disastrous for performance.
Example:
SELECT * FROM A, B, C
WHERE A.x = C.x; -- No condition between A and B or B and C!
Join graph:
A --- C B (isolated)
Any join involving B produces a cross product, since B has no join condition linking it to the rest. Good optimizers:
1. Detect disconnected components in the join graph
2. Join within each component first
3. Delay the cross product until the component results are as small as possible
An accidental cross product can multiply result sizes catastrophically. If you're joining 10M rows with 100K rows without a condition, you get 1 trillion row pairs. Always verify your join graph is connected, with edges (conditions) between logically related tables.
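Checking connectivity is straightforward. A Python sketch (the input format is hypothetical: a set of table names plus edges derived from join predicates):

```python
# Sketch: detect a disconnected join graph before execution.
from collections import defaultdict

def is_connected(tables, edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(tables))]
    while stack:                      # simple depth-first search
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return seen == set(tables)

print(is_connected({"A", "B", "C"}, [("A", "C")]))   # False: B is isolated
```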
When reordering joins, the optimizer must correctly track and apply join predicates.
For a query with predicates P, every predicate in P must be applied exactly once, at the earliest join where all of its referenced tables are available.

When we transform (R ⋈_{p1} S) ⋈_{p2} T to R ⋈_{p1'} (S ⋈_{p2'} T), predicates may migrate:

Before: p1 is checked at the inner join (R with S); p2 at the outer join.
After: p2 is checked at the inner join (S with T); p1 moves up to the outer join.
Query:
SELECT * FROM R, S, T
WHERE R.a = S.a   -- predicate p1: joins R and S
  AND S.b = T.b   -- predicate p2: joins S and T
  AND R.c = T.c   -- predicate p3: joins R and T

Join graph: fully connected (R-S, S-T, R-T)

Plan 1: (R ⋈ S) ⋈ T
  - R ⋈ S with p1
  - Result ⋈ T with (p2 AND p3)

Plan 2: R ⋈ (S ⋈ T)
  - S ⋈ T with p2
  - R ⋈ Result with (p1 AND p3)

Plan 3: (R ⋈ T) ⋈ S
  - R ⋈ T with p3
  - Result ⋈ S with (p1 AND p2)

All three plans apply all predicates, just at different stages. The optimizer chooses based on selectivity and intermediate sizes.

Optimizers often compute the transitive closure of equality predicates:
WHERE R.a = S.a AND S.a = T.a
-- Implies: R.a = T.a (transitive)
Adding the implied R.a = T.a edge to the join graph creates more reordering options. This is called predicate inference and is a key optimization technique.
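One common way to implement this inference is a union-find structure over columns. A minimal Python sketch (column names and predicates are illustrative):

```python
# Sketch: derive implied equality predicates via union-find over columns.
from itertools import combinations

parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

explicit = {("R.a", "S.a"), ("S.a", "T.a")}   # predicates from the query
for a, b in explicit:
    union(a, b)

implied = [p for p in combinations(sorted(parent), 2)
           if find(p[0]) == find(p[1]) and p not in explicit]
print(implied)   # [('R.a', 'T.a')] -- the inferred join-graph edge
```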
Some orderings require temporary Cartesian products:
Join graph: R -- S T -- U
(no direct path from left to right)
To join all four, we must either:
1. Create a cross product between {R,S} and {T,U}
2. Or find implicit conditions
Optimizers delay Cartesian products as long as possible, but sometimes they're unavoidable. Associativity still applies; we choose the least costly position for the cross product.
Join ordering is computationally challenging. Let's quantify the problem.
Left-deep trees only: n! possible orderings.
Including bushy trees: (2n-2)! / (n-1)! possible plans (see the comparison table above).
With algorithm selection: multiply further by the physical join algorithms available at each of the n−1 joins (e.g., a factor of 3ⁿ⁻¹ with nested-loop, hash, and merge join).
For small n (typically ≤ 10-15), optimizers use dynamic programming: find the best plan for each single table, then for each pair, each triple, and so on, reusing optimal sub-plans at every step and keeping only the cheapest plan per subset (plus plans with interesting sort orders).
This is the System R algorithm (IBM, 1979), still the foundation of most optimizers.
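A compact Python sketch of the idea follows; the cost model is a deliberately naive stand-in, and the real System R algorithm additionally tracks physical properties such as sort order:

```python
import math
from itertools import combinations

def best_plan(tables, join_cost):
    """System R-style DP: best[S] = (cost, plan) for joining subset S."""
    best = {frozenset([t]): (0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            options = []
            for k in range(1, size):                     # every binary split
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    (lc, lp), (rc, rp) = best[left], best[right]
                    options.append((lc + rc + join_cost(left, right),
                                    f"({lp} ⋈ {rp})"))
            best[subset] = min(options)                  # keep cheapest plan
    return best[frozenset(tables)]

# Stand-in cost model (assumption): cost of a join = |left| x |right|,
# where each side's size is the product of its base-table cardinalities.
sizes = {"R": 1000, "S": 100, "T": 10}
prod = lambda s: math.prod(sizes[t] for t in s)
print(best_plan(list(sizes), lambda l, r: prod(l) * prod(r)))
```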
| Number of Tables | Left-Deep Plans (n!) | All Plans (approx) | DP States (2ⁿ) |
|---|---|---|---|
| 4 | 24 | ~100 | 16 |
| 6 | 720 | ~10,000 | 64 |
| 8 | 40,320 | ~2.5 million | 256 |
| 10 | 3.6 million | ~1 billion | 1,024 |
| 12 | 479 million | ~1 trillion | 4,096 |
| 15 | 1.3 trillion | Astronomical | 32,768 |
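The growth rates in the table are easy to reproduce; a quick sketch:

```python
# Reproduce the plan-count growth rates from the table above.
from math import factorial

for n in (4, 6, 8, 10, 12, 15):
    left_deep = factorial(n)                           # n! orderings
    bushy = factorial(2 * n - 2) // factorial(n - 1)   # (2n-2)!/(n-1)!
    dp_states = 2 ** n                                 # subsets to memoize
    print(f"{n:2d} tables: {left_deep:>16,} left-deep, "
          f"{bushy:.2e} bushy, {dp_states:,} DP states")
```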
When n exceeds ~10-15, exact optimization is impractical. Alternatives:
1. Greedy / Heuristic Algorithms: repeatedly join the pair with the smallest estimated result (see the sketch below)
2. Genetic Algorithms: evolve a population of candidate join orders (e.g., PostgreSQL's GEQO)
3. Simulated Annealing: randomized local search that occasionally accepts worse plans to escape local optima
4. Query Decomposition: break the query into smaller pieces that are optimized independently
5. User Hints: force a join order explicitly, e.g., /*+ LEADING(a, b, c) */ in Oracle

Production optimizers impose time limits on optimization. If the limit is reached, they return the best plan found so far. This ensures that optimization itself doesn't become a bottleneck, though it means very complex queries may get suboptimal plans.
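As a toy version of option 1 above, here is a greedy sketch in Python; the flat per-join selectivity is an assumption, and a real implementation would only consider pairs connected in the join graph:

```python
def greedy_order(cards, selectivity=0.01):
    """Greedy heuristic: always join the pair with the smallest estimated
    result. `cards` maps relation name -> row count; the flat per-join
    `selectivity` is an illustrative assumption."""
    rels = dict(cards)
    while len(rels) > 1:
        a, b = min(
            ((x, y) for x in rels for y in rels if x < y),
            key=lambda p: rels[p[0]] * rels[p[1]] * selectivity,
        )
        rels[f"({a} ⋈ {b})"] = rels[a] * rels[b] * selectivity
        del rels[a], rels[b]
    return next(iter(rels))

print(greedy_order({"orders": 10_000_000, "customers": 100_000,
                    "products": 1_000}))
# '((customers ⋈ products) ⋈ orders)' -- cheapest pairs joined first
```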
Unlike inner joins, outer joins have severely limited associativity. This restricts join reordering for queries with outer joins.
Does NOT hold in general:
(R ⟕ S) ⟕ T ≢ R ⟕ (S ⟕ T)
Example showing non-equivalence:
R = {1}
S = {2}
T = {1}
R ⟕ S = {(1, NULL)} -- 1 in R has no match in S
(R ⟕ S) ⟕ T = {(1, NULL, 1)} -- matches T on R's attribute
S ⟕ T = {(2, NULL)} -- 2 in S has no match in T
R ⟕ (S ⟕ T) = {(1, NULL, NULL)} -- different result!
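The counterexample can be replayed mechanically. A Python sketch with one-column relations, where the join with T compares against R's attribute as in the example above:

```python
# left_outer(r, s, cond) pads non-matching left rows with None (NULL).
def left_outer(r, s, cond):
    out = []
    for a in r:
        matches = [a + b for b in s if cond(a + b)]
        out.extend(matches if matches else [a + (None,) * len(s[0])])
    return out

R, S, T = [(1,)], [(2,)], [(1,)]

# (R ⟕ S) ⟕ T, second join compares R's attribute (position 0) with T's
lhs = left_outer(left_outer(R, S, lambda t: t[0] == t[1]),
                 T, lambda t: t[0] == t[2])
# R ⟕ (S ⟕ T), inner ⟕ on S/T, outer ⟕ still compares against T's value
rhs = left_outer(R, left_outer(S, T, lambda t: t[0] == t[1]),
                 lambda t: t[0] == t[2])
print(lhs)   # [(1, None, 1)]
print(rhs)   # [(1, None, None)] -- different result!
```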
Some specific transformations ARE valid:
1. Inner join associates with outer join (sometimes):
(R ⟕ S) ⋈ T ≡ (R ⋈ T) ⟕ S [if T only joins with R]
2. Outer join converts to inner when possible:
If a WHERE condition rejects NULL values from the outer join:
R ⟕ S WHERE S.a IS NOT NULL ≡ R ⋈ S
This enables normal associativity.
3. Full outer joins with inner:
(R ⟗ S) ⋈ T has limited transformations
When performance is critical and outer joins are needed, be aware that the optimizer has fewer reordering options. Consider restructuring queries to use inner joins where semantically possible, or explicitly order tables in the FROM clause as a hint to less sophisticated optimizers.
Modern query optimizers have evolved sophisticated techniques for join ordering.
Rather than committing to a join order before execution, some systems adapt at runtime:
1. Adaptive Join Ordering: monitor actual cardinalities mid-execution and reorder the remaining joins when estimates turn out to be wrong
2. Eddies (Research): route tuples through join operators dynamically, effectively choosing the order on the fly rather than once per query

A related feedback loop re-optimizes across executions:

First execution: Use estimates
Measure actual cardinalities
Second execution: Use measured values for better ordering
SQL Server, PostgreSQL 14+, and Oracle implement forms of this.
Recent research applies machine learning: learned cardinality estimators, and reinforcement-learning agents that choose join orders based on feedback from past query executions.
| Era | Technique | Strength | Weakness |
|---|---|---|---|
| 1970s | System R DP | Provably optimal for small n | Exponential in tables |
| 1980s-90s | Heuristics (Greedy) | Handles large n | May miss optimal |
| 2000s | Genetic/Randomized | Scales better | Non-deterministic |
| 2010s | Adaptive execution | Handles estimation errors | Runtime overhead and complexity |
| 2020s | ML-based | Learns from workload | Training data needed |
PostgreSQL uses the Genetic Query Optimizer (GEQO) for queries with 12+ tables:
1. Represent join orders as chromosomes
2. Generate initial random population
3. Evaluate fitness (estimated cost)
4. Crossover: combine good orderings
5. Mutation: random swaps
6. Repeat for N generations
7. Return best ordering found
Configuration: geqo_threshold controls when GEQO kicks in (default: 12 tables).
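For intuition, here is a toy genetic-algorithm sketch in the same spirit; this is not PostgreSQL's actual GEQO code, and the cost function and parameters are illustrative:

```python
import random

def evolve(tables, cost, pop_size=20, generations=50):
    """Chromosomes are left-deep join orders; lower cost = fitter."""
    pop = [random.sample(tables, len(tables)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(tables))
            child = a[:cut] + [t for t in b if t not in a[:cut]]  # crossover
            if random.random() < 0.1:             # mutation: swap two tables
                i, j = random.sample(range(len(tables)), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

# Toy fitness (assumption): sum of left-deep intermediate sizes, with
# intermediate size over-approximated as the running product of inputs.
sizes = {"A": 10_000, "B": 1_000, "C": 100, "D": 10}
def cost(order):
    total, acc = 0, 1
    for t in order:
        acc *= sizes[t]
        total += acc
    return total

print(evolve(list(sizes), cost))   # tends toward ['D', 'C', 'B', 'A']
```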
Join associativity enables the most impactful query transformation: choosing the order in which tables are joined. The key insights:
- Associativity and commutativity together let the optimizer consider any grouping; the final result is identical, but intermediate sizes differ enormously
- Left-deep trees keep optimization tractable and pipeline-friendly; bushy trees widen the search space and enable parallelism
- Join ordering is exponential: dynamic programming works for small n, heuristics and randomized search for large n
- Predicates must be tracked as joins reorder, and inferred (transitive) predicates expand the options
- Outer joins sharply restrict reordering
Module Complete
With this page, we conclude the Equivalence Rules module. You've learned how relational algebra equivalences—commutativity, associativity, distributivity—empower query optimizers to transform queries into efficient execution plans. Selection pushdown, projection pushdown, join commutativity, and join associativity are the core transformations that make declarative SQL viable for high-performance databases.
These principles remain constant across database systems, from PostgreSQL to Oracle, from MySQL to distributed systems like Snowflake. Master them, and you understand the heart of query optimization.
Congratulations! You now have comprehensive knowledge of equivalence rules—the algebraic foundation of query optimization. These rules transform naive query execution into highly efficient plans, enabling databases to handle billions of rows with sub-second response times.