If there's one optimization that has the potential to transform hours-long queries into millisecond responses, it's join ordering—and join ordering is made possible by associativity.
Associativity tells us that (A ⋈ B) ⋈ C is equivalent to A ⋈ (B ⋈ C). Mathematically obvious, but the performance implications are staggering. The choice of which tables to join first can determine whether intermediate results are thousands of rows or billions. The wrong order can crash servers; the right order makes queries fly.
This page explores join associativity in depth: why it holds, how it interacts with join conditions, and most importantly, how optimizers exploit it to find efficient execution plans.
By the end of this page, you will understand: (1) the formal statement and proof of join associativity, (2) how it enables join reordering, (3) the concept of left-deep vs. bushy join trees, (4) join graph representations, and (5) the complexity of join ordering optimization.
Formal Statement
For natural join:
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
For theta join, associativity holds provided each condition references the right operands:
(R ⋈_{θ₁} S) ⋈_{θ₂∧θ₃} T ≡ R ⋈_{θ₁∧θ₃} (S ⋈_{θ₂} T)
where θ₁ involves only attributes of R and S, θ₂ only attributes of S and T, and θ₃ captures any conditions between R and T.
Why Does This Hold?
Intuitively, the result of joining R, S, and T contains tuples where every applicable join condition holds: r matches s, s matches t, and (where such a condition exists) r matches t.
The order in which we perform these pairwise joins doesn't affect which complete (R, S, T) combinations satisfy all conditions.
Proof Sketch:
Define the result of R ⋈ S ⋈ T as the set of all (r, s, t) tuples where r∈R, s∈S, t∈T and the relevant join conditions hold.
Whether we compute (R ⋈ S) first and then join with T, or compute (S ⋈ T) first and then join with R, we're selecting the same set of (r, s, t) combinations.
The intermediate groupings differ, but the final set is identical.
The key insight is that while (R ⋈ S) and (S ⋈ T) may have very different sizes, once joined with the third table, the final result is the same. The optimizer's job is to choose the grouping that minimizes intermediate sizes.
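The argument is easy to check mechanically. Below is a minimal Python sketch (tables as lists of dicts, natural join over shared attribute names; the data is made up for illustration) confirming that both groupings yield the same final set:

```python
# Tables as lists of dicts; natural join matches on shared attribute names.
def natural_join(r, s):
    out = []
    for a in r:
        for b in s:
            shared = set(a) & set(b)
            if all(a[k] == b[k] for k in shared):  # join condition holds
                out.append({**a, **b})             # merge the two tuples
    return out

R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}, {"b": 30, "c": 300}]
T = [{"c": 100, "d": 7}]

canon = lambda rows: sorted(sorted(row.items()) for row in rows)
assert canon(natural_join(natural_join(R, S), T)) == \
       canon(natural_join(R, natural_join(S, T)))  # (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
```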
Example: The Power of Reordering
Consider three tables:
Orders: 10M rows
Customers: 100K rows
Products: 1K rows

Query: Find orders with their customer and product details.
Plan 1: (Orders ⋈ Customers) ⋈ Products
Plan 2: (Customers ⋈ Orders) ⋈ Products (same as Plan 1, just commuted)
Plan 3: Orders ⋈ (Customers ⋈ Products)
The right plan depends on selectivities and join conditions. Associativity gives us the freedom to explore all options.
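To make the trade-off concrete, here is a rough Python sketch using the textbook estimate |R ⋈ S| ≈ |R| × |S| × selectivity; the foreign-key selectivities are assumptions for illustration, not measured values:

```python
# Rough comparison of intermediate sizes, using |R ⋈ S| ≈ |R| * |S| * sel.
# The selectivities below are illustrative assumptions (FK joins).
ORDERS, CUSTOMERS, PRODUCTS = 10_000_000, 100_000, 1_000

sel_order_customer = 1 / CUSTOMERS  # each order matches exactly one customer
# Plan 1: (Orders ⋈ Customers) ⋈ Products
inter_plan1 = ORDERS * CUSTOMERS * sel_order_customer      # 10M rows
# Plan 3: Orders ⋈ (Customers ⋈ Products)
# No customer-product condition exists, so this pair is a cross product.
inter_plan3 = CUSTOMERS * PRODUCTS                         # 100M rows

print(f"Plan 1 intermediate: {inter_plan1:,.0f} rows")
print(f"Plan 3 intermediate: {inter_plan3:,.0f} rows")
```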
When joining n tables, associativity generates different join tree shapes. Understanding these shapes is fundamental to join ordering optimization.
A left-deep tree always joins the next table to the "left" result:
((R₁ ⋈ R₂) ⋈ R₃) ⋈ R₄
      ⋈
     / \
    ⋈   R₄
   / \
  ⋈   R₃
 / \
R₁  R₂
Characteristics:
- The inner (right) input of every join is a base table, which suits index nested-loop joins
- Tuples flow up a single pipeline, matching the iterator execution model
- The search space is the n! possible orderings

A right-deep tree is the mirror of left-deep, nesting joins on the right:
R₁ ⋈ (R₂ ⋈ (R₃ ⋈ R₄))
  ⋈
 / \
R₁  ⋈
   / \
  R₂  ⋈
     / \
    R₃  R₄
Characteristics:
- With hash joins, every build side's hash table may need to be in memory simultaneously
- Lets the largest table stream (probe) through a chain of hash tables without materializing
- Same n! search space as left-deep
Bushy trees are balanced or irregular, with joins nested on both branches:
(R₁ ⋈ R₂) ⋈ (R₃ ⋈ R₄)
     ⋈
   /   \
  ⋈     ⋈
 / \   / \
R₁  R₂ R₃  R₄
Characteristics:
- Either input of a join may itself be a join result, so independent subtrees can execute in parallel
- The search space is far larger, which makes optimization much harder
- Most useful for large analytical (OLAP) queries
| Property | Left-Deep | Right-Deep | Bushy |
|---|---|---|---|
| Search space for n tables | n! | n! | Much larger: (2n-2)! / (n-1)! |
| Pipeline-friendly | Excellent | Good | Complex |
| Memory usage | Predictable | Higher (concurrent hash builds) | May need multiple buffers |
| Parallel potential | Limited | Limited | High |
| Optimizer complexity | Tractable | Tractable | Exponential |
| Common in practice | Most common | Rare | Used for OLAP |
Most OLTP optimizers restrict search to left-deep trees. This reduces the search space to n! orderings (still large, but manageable with dynamic programming). Left-deep plans work well with the iterator/volcano execution model where tuples flow one-at-a-time through the pipeline.
A useful way to visualize join structure is the join graph.
The join graph has:
- A node for each table in the query
- An edge between two tables for each join condition that references both
Example:
SELECT * FROM A, B, C, D
WHERE A.x = B.x
AND B.y = C.y
AND C.z = D.z;
Join graph:
A --- B --- C --- D
(x) (y) (z)
This is a chain (linear graph). Each table joins only with its neighbors.
Chain: Linear sequence of joins
A - B - C - D
Optimization: n! orderings, well-understood.
Star: One central table joins with all others
   B
   |
A--C--D   (C is the center)
   |
   E
Optimization: Central table often appears early in join order.
Clique: Every table joins with every other (fully connected)
A---B
|\ /|
| X |
|/ \|
C---D
Optimization: Maximum flexibility, complex optimization.
Acyclic (tree): No cycles in the join graph
    A
   / \
  B   C
 / \
D   E
Optimization: Special algorithms exist (Yannakakis, etc.).
If two tables have no edge (no join condition between them), joining them produces a Cartesian product—typically disastrous for performance.
Example:
SELECT * FROM A, B, C
WHERE A.x = C.x; -- No condition between A and B or B and C!
Join graph:
A --- C B (isolated)
Any join involving B produces a cross product, since B has no join condition linking it to the rest. Good optimizers:
1. Detect disconnected components in the join graph
2. Join within each component first
3. Delay the cross product until the component results are as small as possible
An accidental cross product can multiply result sizes catastrophically. If you're joining 10M rows with 100K rows without a condition, you get 1 trillion row pairs. Always verify your join graph is connected, with edges (conditions) between logically related tables.
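Checking connectivity is straightforward. A Python sketch (the input format is hypothetical: a set of table names plus edges derived from join predicates):

```python
# Sketch: detect a disconnected join graph before execution.
from collections import defaultdict

def is_connected(tables, edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(tables))]
    while stack:                      # simple depth-first search
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return seen == set(tables)

print(is_connected({"A", "B", "C"}, [("A", "C")]))   # False: B is isolated
```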
When reordering joins, the optimizer must correctly track and apply join predicates.
For a query with predicates P, every predicate in P must be applied exactly once, at the earliest join where all of its referenced tables are available.

When we transform (R ⋈_{p1} S) ⋈_{p2} T to R ⋈_{p1'} (S ⋈_{p2'} T), predicates may migrate:

Before: p1 is checked at the inner join (R with S); p2 at the outer join.
After: p2 is checked at the inner join (S with T); p1 moves up to the outer join.
Query:
SELECT * FROM R, S, T
WHERE R.a = S.a   -- predicate p1: joins R and S
  AND S.b = T.b   -- predicate p2: joins S and T
  AND R.c = T.c   -- predicate p3: joins R and T

Join graph: fully connected (R-S, S-T, R-T)

Plan 1: (R ⋈ S) ⋈ T
  - R ⋈ S with p1
  - Result ⋈ T with (p2 AND p3)

Plan 2: R ⋈ (S ⋈ T)
  - S ⋈ T with p2
  - R ⋈ Result with (p1 AND p3)

Plan 3: (R ⋈ T) ⋈ S
  - R ⋈ T with p3
  - Result ⋈ S with (p1 AND p2)

All three plans apply all predicates, just at different stages. The optimizer chooses based on selectivity and intermediate sizes.

Optimizers often compute the transitive closure of equality predicates:
WHERE R.a = S.a AND S.a = T.a
-- Implies: R.a = T.a (transitive)
Adding the implied R.a = T.a edge to the join graph creates more reordering options. This is called predicate inference and is a key optimization technique.
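One common way to implement this inference is a union-find structure over columns. A minimal Python sketch (column names and predicates are illustrative):

```python
# Sketch: derive implied equality predicates via union-find over columns.
from itertools import combinations

parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

explicit = {("R.a", "S.a"), ("S.a", "T.a")}   # predicates from the query
for a, b in explicit:
    union(a, b)

implied = [p for p in combinations(sorted(parent), 2)
           if find(p[0]) == find(p[1]) and p not in explicit]
print(implied)   # [('R.a', 'T.a')] -- the inferred join-graph edge
```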
Some orderings require temporary Cartesian products:
Join graph: R -- S T -- U
(no direct path from left to right)
To join all four, we must either:
1. Create a cross product between {R,S} and {T,U}
2. Or find implicit conditions
Optimizers delay Cartesian products as long as possible, but sometimes they're unavoidable. Associativity still applies; we choose the least costly position for the cross product.
Join ordering is computationally challenging. Let's quantify the problem.
Left-deep trees only: n! possible orderings.
Including bushy trees: (2n-2)! / (n-1)! possible plans (see the comparison table above).
With algorithm selection: multiply further by the physical join algorithms available at each of the n−1 joins (e.g., a factor of 3ⁿ⁻¹ with nested-loop, hash, and merge join).
For small n (typically ≤ 10-15), optimizers use dynamic programming: find the best plan for each single table, then for each pair, each triple, and so on, reusing optimal sub-plans at every step and keeping only the cheapest plan per subset (plus plans with interesting sort orders).
This is the System R algorithm (IBM, 1979), still the foundation of most optimizers.
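A compact Python sketch of the idea follows; the cost model is a deliberately naive stand-in, and the real System R algorithm additionally tracks physical properties such as sort order:

```python
import math
from itertools import combinations

def best_plan(tables, join_cost):
    """System R-style DP: best[S] = (cost, plan) for joining subset S."""
    best = {frozenset([t]): (0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            options = []
            for k in range(1, size):                     # every binary split
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    (lc, lp), (rc, rp) = best[left], best[right]
                    options.append((lc + rc + join_cost(left, right),
                                    f"({lp} ⋈ {rp})"))
            best[subset] = min(options)                  # keep cheapest plan
    return best[frozenset(tables)]

# Stand-in cost model (assumption): cost of a join = |left| x |right|,
# where each side's size is the product of its base-table cardinalities.
sizes = {"R": 1000, "S": 100, "T": 10}
prod = lambda s: math.prod(sizes[t] for t in s)
print(best_plan(list(sizes), lambda l, r: prod(l) * prod(r)))
```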
| Number of Tables | Left-Deep Plans (n!) | All Plans (approx) | DP States (2ⁿ) |
|---|---|---|---|
| 4 | 24 | ~100 | 16 |
| 6 | 720 | ~10,000 | 64 |
| 8 | 40,320 | ~2.5 million | 256 |
| 10 | 3.6 million | ~1 billion | 1,024 |
| 12 | 479 million | ~1 trillion | 4,096 |
| 15 | 1.3 trillion | Astronomical | 32,768 |
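The growth rates in the table are easy to reproduce; a quick sketch:

```python
# Reproduce the plan-count growth rates from the table above.
from math import factorial

for n in (4, 6, 8, 10, 12, 15):
    left_deep = factorial(n)                           # n! orderings
    bushy = factorial(2 * n - 2) // factorial(n - 1)   # (2n-2)!/(n-1)!
    dp_states = 2 ** n                                 # subsets to memoize
    print(f"{n:2d} tables: {left_deep:>16,} left-deep, "
          f"{bushy:.2e} bushy, {dp_states:,} DP states")
```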
When n exceeds ~10-15, exact optimization is impractical. Alternatives:
1. Greedy / Heuristic Algorithms: repeatedly join the pair with the smallest estimated result (see the sketch below)
2. Genetic Algorithms: evolve a population of candidate join orders (e.g., PostgreSQL's GEQO)
3. Simulated Annealing: randomized local search that occasionally accepts worse plans to escape local optima
4. Query Decomposition: break the query into smaller pieces that are optimized independently
5. User Hints: force a join order explicitly, e.g., /*+ LEADING(a, b, c) */ in Oracle

Production optimizers impose time limits on optimization. If the limit is reached, they return the best plan found so far. This ensures that optimization itself doesn't become a bottleneck, though it means very complex queries may get suboptimal plans.
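As a toy version of option 1 above, here is a greedy sketch in Python; the flat per-join selectivity is an assumption, and a real implementation would only consider pairs connected in the join graph:

```python
def greedy_order(cards, selectivity=0.01):
    """Greedy heuristic: always join the pair with the smallest estimated
    result. `cards` maps relation name -> row count; the flat per-join
    `selectivity` is an illustrative assumption."""
    rels = dict(cards)
    while len(rels) > 1:
        a, b = min(
            ((x, y) for x in rels for y in rels if x < y),
            key=lambda p: rels[p[0]] * rels[p[1]] * selectivity,
        )
        rels[f"({a} ⋈ {b})"] = rels[a] * rels[b] * selectivity
        del rels[a], rels[b]
    return next(iter(rels))

print(greedy_order({"orders": 10_000_000, "customers": 100_000,
                    "products": 1_000}))
# '((customers ⋈ products) ⋈ orders)' -- cheapest pairs joined first
```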
Unlike inner joins, outer joins have severely limited associativity. This restricts join reordering for queries with outer joins.
Does NOT hold in general:
(R ⟕ S) ⟕ T ≢ R ⟕ (S ⟕ T)
Example showing non-equivalence:
R = {1}
S = {2}
T = {1}
R ⟕ S = {(1, NULL)} -- 1 in R has no match in S
(R ⟕ S) ⟕ T = {(1, NULL, 1)} -- matches T on R's attribute
S ⟕ T = {(2, NULL)} -- 2 in S has no match in T
R ⟕ (S ⟕ T) = {(1, NULL, NULL)} -- different result!
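The counterexample can be replayed mechanically. A Python sketch with one-column relations, where the join with T compares against R's attribute as in the example above:

```python
# left_outer(r, s, cond) pads non-matching left rows with None (NULL).
def left_outer(r, s, cond):
    out = []
    for a in r:
        matches = [a + b for b in s if cond(a + b)]
        out.extend(matches if matches else [a + (None,) * len(s[0])])
    return out

R, S, T = [(1,)], [(2,)], [(1,)]

# (R ⟕ S) ⟕ T, second join compares R's attribute (position 0) with T's
lhs = left_outer(left_outer(R, S, lambda t: t[0] == t[1]),
                 T, lambda t: t[0] == t[2])
# R ⟕ (S ⟕ T), inner ⟕ on S/T, outer ⟕ still compares against T's value
rhs = left_outer(R, left_outer(S, T, lambda t: t[0] == t[1]),
                 lambda t: t[0] == t[2])
print(lhs)   # [(1, None, 1)]
print(rhs)   # [(1, None, None)] -- different result!
```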
Some specific transformations ARE valid:
1. Inner join associates with outer join (sometimes):
(R ⟕ S) ⋈ T ≡ (R ⋈ T) ⟕ S [if T only joins with R]
2. Outer join converts to inner when possible:
If a WHERE condition rejects NULL values from the outer join:
R ⟕ S WHERE S.a IS NOT NULL ≡ R ⋈ S
This enables normal associativity.
3. Full outer joins with inner:
(R ⟗ S) ⋈ T has limited transformations
When performance is critical and outer joins are needed, be aware that the optimizer has fewer reordering options. Consider restructuring queries to use inner joins where semantically possible, or explicitly order tables in the FROM clause as a hint to less sophisticated optimizers.
Modern query optimizers have evolved sophisticated techniques for join ordering.
Rather than committing to a join order before execution, some systems adapt at runtime:
1. Adaptive Join Ordering: monitor actual cardinalities mid-execution and reorder the remaining joins when estimates turn out to be wrong
2. Eddies (Research): route tuples through join operators dynamically, effectively choosing the order on the fly rather than once per query

A related feedback loop re-optimizes across executions:

First execution: Use estimates
Measure actual cardinalities
Second execution: Use measured values for better ordering
SQL Server, PostgreSQL 14+, and Oracle implement forms of this.
Recent research applies machine learning: learned cardinality estimators, and reinforcement-learning agents that choose join orders based on feedback from past query executions.
| Era | Technique | Strength | Weakness |
|---|---|---|---|
| 1970s | System R DP | Provably optimal for small n | Exponential in tables |
| 1980s-90s | Heuristics (Greedy) | Handles large n | May miss optimal |
| 2000s | Genetic/Randomized | Scales better | Non-deterministic |
| 2010s | Adaptive execution | Handles estimation errors | Runtime overhead and complexity |
| 2020s | ML-based | Learns from workload | Training data needed |
PostgreSQL uses the Genetic Query Optimizer (GEQO) for queries with 12+ tables:
1. Represent join orders as chromosomes
2. Generate initial random population
3. Evaluate fitness (estimated cost)
4. Crossover: combine good orderings
5. Mutation: random swaps
6. Repeat for N generations
7. Return best ordering found
Configuration: geqo_threshold controls when GEQO kicks in (default: 12 tables).
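For intuition, here is a toy genetic-algorithm sketch in the same spirit; this is not PostgreSQL's actual GEQO code, and the cost function and parameters are illustrative:

```python
import random

def evolve(tables, cost, pop_size=20, generations=50):
    """Chromosomes are left-deep join orders; lower cost = fitter."""
    pop = [random.sample(tables, len(tables)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(tables))
            child = a[:cut] + [t for t in b if t not in a[:cut]]  # crossover
            if random.random() < 0.1:             # mutation: swap two tables
                i, j = random.sample(range(len(tables)), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

# Toy fitness (assumption): sum of left-deep intermediate sizes, with
# intermediate size over-approximated as the running product of inputs.
sizes = {"A": 10_000, "B": 1_000, "C": 100, "D": 10}
def cost(order):
    total, acc = 0, 1
    for t in order:
        acc *= sizes[t]
        total += acc
    return total

print(evolve(list(sizes), cost))   # tends toward ['D', 'C', 'B', 'A']
```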
Join associativity enables the most impactful query transformation: choosing the order in which tables are joined. The key insights:
- Associativity and commutativity together let the optimizer consider any grouping; the final result is identical, but intermediate sizes differ enormously
- Left-deep trees keep optimization tractable and pipeline-friendly; bushy trees widen the search space and enable parallelism
- Join ordering is exponential: dynamic programming works for small n, heuristics and randomized search for large n
- Predicates must be tracked as joins reorder, and inferred (transitive) predicates expand the options
- Outer joins sharply restrict reordering
Module Complete
With this page, we conclude the Equivalence Rules module. You've learned how relational algebra equivalences—commutativity, associativity, distributivity—empower query optimizers to transform queries into efficient execution plans. Selection pushdown, projection pushdown, join commutativity, and join associativity are the core transformations that make declarative SQL viable for high-performance databases.
These principles remain constant across database systems, from PostgreSQL to Oracle, from MySQL to distributed systems like Snowflake. Master them, and you understand the heart of query optimization.
Congratulations! You now have comprehensive knowledge of equivalence rules—the algebraic foundation of query optimization. These rules transform naive query execution into highly efficient plans, enabling databases to handle billions of rows with sub-second response times.