We've examined three nested loop join variants—simple nested loop, block nested loop, and index nested loop. Each has distinct performance characteristics that make it optimal under specific conditions. But how does a query optimizer decide which to use? How can a database administrator predict query performance?
The answer lies in cost modeling—the process of estimating the resources (I/O operations, CPU cycles, memory) required to execute a query using a particular physical plan. Modern query optimizers maintain detailed cost models that consider disk latencies, buffer pool sizes, index structures, and data statistics.
This page develops a comprehensive cost analysis framework for nested loop joins. We'll derive precise cost formulas, analyze break-even points between algorithms, examine sensitivity to key parameters, and establish practical decision criteria for algorithm selection.
By the end of this page, you will understand:

- Unified cost models for all nested loop variants
- How to calculate and compare join costs
- Sensitivity of costs to parameters like memory and selectivity
- Break-even points for algorithm selection
- Real-world cost tuning strategies
Before comparing algorithms, we must establish a common framework for measuring cost. Database cost models typically separate costs into distinct components:
Cost Components:

- I/O cost: disk blocks read or written, split into sequential and random accesses
- CPU cost: tuple comparisons and predicate evaluations
- Memory cost: buffer blocks consumed, which determines how often the inner relation must be rescanned
Standard Notation:
We'll use consistent notation throughout this analysis:
| Symbol | Description | Typical Value |
|---|---|---|
| b_R, b_S | Number of disk blocks in relations R, S | Varies |
| n_R, n_S | Number of tuples in relations R, S | Varies |
| f_R, f_S | Blocking factor (tuples per block) | 10-1000 |
| M | Available memory in blocks | 10-10,000 |
| H_I | Height of index on inner relation | 2-4 |
| t_seq | Time per sequential I/O (ms) | 0.5-5 (SSD) / 5-15 (HDD) |
| t_rand | Time per random I/O (ms) | 0.1-1 (SSD) / 10-15 (HDD) |
| t_cpu | Time per tuple comparison (μs) | 0.1-10 |
| sel | Join selectivity (matching pairs / total pairs) | 0-1 |
For traditional disk-based systems, I/O cost dominates (often 99%+ of execution time). For in-memory databases or SSD-based systems with hot caches, CPU cost becomes more significant. Modern cost models weight both based on storage characteristics.
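As a quick illustration of how these components combine, here is a minimal sketch. The function name and timing defaults are ours, drawn from the typical values in the notation table (1 ms sequential I/O, 10 ms random I/O, 0.1 µs per comparison):

```python
# A minimal sketch of a weighted cost model combining I/O and CPU components.
# Default latencies are illustrative HDD-style values from the table above.
def total_time_ms(seq_ios, rand_ios, comparisons,
                  t_seq_ms=1.0, t_rand_ms=10.0, t_cpu_us=0.1):
    io_ms = seq_ios * t_seq_ms + rand_ios * t_rand_ms
    cpu_ms = comparisons * t_cpu_us / 1_000  # microseconds -> milliseconds
    return io_ms, cpu_ms, io_ms + cpu_ms

# A BNLJ-style workload: 5,000 sequential I/Os, 10^9 tuple comparisons
io, cpu, total = total_time_ms(5_000, 0, 100_000 * 10_000)
# io ≈ 5,000 ms, cpu ≈ 100,000 ms: once I/O is cheap, CPU dominates
```

Note how the balance flips: with few enough I/Os, the CPU term becomes the larger component, which is exactly why in-memory systems weight CPU cost more heavily.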
Let's consolidate the cost formulas for each nested loop variant into a unified framework:
Simple Nested Loop Join (SNL):
```
// Simple Nested Loop Join Cost
// Tuple-at-a-time processing

I/O Cost:
C_io_snl = b_R + (n_R × b_S)
         = b_R + (b_R × f_R × b_S)
// Intuition: Read R once, scan S for every tuple in R

CPU Cost:
C_cpu_snl = n_R × n_S × t_cpu
// Intuition: Compare every pair of tuples

Total Time:
T_snl = (b_R × t_seq) + (n_R × b_S × t_seq) + (n_R × n_S × t_cpu)
```

Block Nested Loop Join (BNLJ):
```
// Block Nested Loop Join Cost
// Block-at-a-time processing with M memory blocks

I/O Cost:
C_io_bnlj = b_R + ⌈b_R / (M - 1)⌉ × b_S
// Intuition: Read R once, scan S for every block-chunk of R

Best Case (R fits in memory, M > b_R):
C_io_bnlj_best = b_R + b_S

CPU Cost:
C_cpu_bnlj = n_R × n_S × t_cpu   // Same as SNL

// With hash optimization on outer buffer:
C_cpu_bnlj_hash = n_R + n_S × (hash_probe_cost) ≈ O(n_R + n_S)

Total Time:
T_bnlj = (b_R × t_seq) + (⌈b_R/(M-1)⌉ × b_S × t_seq) + CPU_cost
```

Index Nested Loop Join (INLJ):
```
// Index Nested Loop Join Cost
// Index probe for each outer tuple

I/O Cost:
C_io_inlj = b_R + n_R × C_probe

Where C_probe (cost per index probe):
// B-tree, unclustered, unique key:
C_probe = H_I + 1                        // Height + 1 data page
// B-tree, clustered:
C_probe = H_I                            // No separate data page
// With matches_per_key > 1:
C_probe = H_I + ⌈matches_per_key / f_S⌉

// Adjusting for random I/O:
T_io_inlj = (b_R × t_seq) + (n_R × C_probe × t_rand)

CPU Cost:
C_cpu_inlj = n_R × (index_compare_cost + matches_per_key × t_cpu)
// Typically much lower than scan-based joins
```

The key insight in INLJ cost is the random I/O component. On spinning disks, t_rand ≈ 10-15 ms, while the effective t_seq can approach 0.1 ms per page during large sequential scans. That roughly 100x penalty means INLJ must do 100x fewer I/Os than BNLJ to break even on HDDs. On SSDs (t_rand ≈ 0.1-0.5 ms), the penalty is only 2-10x.
Let's compare the three variants with a concrete example:
Scenario:

- Outer relation R: n_R = 100,000 tuples in b_R = 1,000 blocks (f_R = 100)
- Inner relation S: n_S = 10,000 tuples in b_S = 200 blocks
- Memory: M = 52 blocks (51 available for the outer buffer)
- B-tree index on S's join key, C_probe = 4 I/Os per probe
- Timing: t_seq = 1 ms, t_rand = 10 ms, t_cpu = 0.1 μs

Cost Calculations:
| Algorithm | I/O Count | I/O Time | CPU Time | Total |
|---|---|---|---|---|
| Simple NLJ | 1,000 + 100,000 × 200 = 20,001,000 | ~20,000 sec | ~100 sec | ~5.6 hours |
| Block NLJ | 1,000 + ⌈1000/51⌉ × 200 = 5,000 | 5 sec | ~100 sec | ~105 sec |
| Index NLJ | 1,000 + 100,000 × 4 = 401,000 | 1,000 × 1ms + 400,000 × 10ms ≈ 4,001 sec | ~1 sec | ~67 min |
Analysis:
SNL is catastrophic — 20 million I/Os make it completely impractical.
BNLJ is excellent — With 51 blocks for outer buffer, we scan S only 20 times. 5,000 I/Os ≈ 5 seconds.
INLJ is costly here — 400,000 random I/Os at 10ms each = 67 minutes. The random I/O penalty destroys performance.
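These wall-clock figures can be sanity-checked with a few lines of arithmetic, using the scenario's illustrative latencies (t_seq = 1 ms, t_rand = 10 ms):

```python
# Sanity check of the wall-clock times in the comparison table.
T_SEQ_MS, T_RAND_MS = 1, 10  # illustrative HDD-like latencies

snl_ms  = (1_000 + 100_000 * 200) * T_SEQ_MS        # 20,001,000 ms ≈ 5.6 hours
bnlj_ms = 5_000 * T_SEQ_MS                          # 5,000 ms = 5 seconds
inlj_ms = 1_000 * T_SEQ_MS + 400_000 * T_RAND_MS    # 4,001,000 ms ≈ 67 minutes
```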
Why BNLJ Wins This Scenario:

- All of its I/O is sequential, avoiding the 10 ms random I/O penalty entirely.
- The 51-block outer buffer amortizes each scan of S across 51 blocks of R, so S is read only 20 times.
- The large outer (100,000 tuples) forces INLJ into 400,000 random probes, which no index can offset.
Now let's change parameters:
| Scenario Change | BNLJ Cost | INLJ Cost | Winner |
|---|---|---|---|
| SSD: t_rand = 0.5ms | 5,000 × 0.5ms = 2.5 sec | 401,000 × 0.5ms = 200 sec | BNLJ |
| Selective: only 1,000 outer tuples match | Still 5,000 I/Os | 1,000 × 4 = 4,000 random I/Os = 2 sec (SSD) | INLJ |
| Large S: 50,000 blocks | 1,000 + 20 × 50,000 = 1,001,000 I/Os | 1,000 + 100,000 × 4 = 401,000 I/Os | INLJ (if SSD) |
| S fits in memory: M = 250 | 1,000 + 200 = 1,200 I/Os | Still 401,000 I/Os | BNLJ |
The optimal algorithm depends on the specific combination of data sizes, memory, storage characteristics, and selectivity. Query optimizers must re-evaluate for each query with current statistics.
Let's derive the conditions under which one algorithm beats another:
BNLJ vs INLJ Break-Even:
Setting costs equal and solving for the crossover point:
```
// BNLJ vs INLJ Break-Even Analysis

BNLJ I/O Cost = b_R + ⌈b_R/(M-1)⌉ × b_S
INLJ I/O Cost = b_R + n_R × C_probe

// Set equal (ignoring ceiling):
⌈b_R/(M-1)⌉ × b_S = n_R × C_probe
(b_R / (M-1)) × b_S = n_R × C_probe

// Solving for n_R (outer cardinality where INLJ breaks even):
n_R_break = (b_R × b_S) / ((M-1) × C_probe)

// Example: b_R=1000, b_S=200, M=52, C_probe=4
n_R_break = (1000 × 200) / (51 × 4) ≈ 980 tuples
// If effective n_R < 980, INLJ wins (before random I/O penalty)

// Adjusting for random vs sequential I/O:
// INLJ effective cost = n_R × C_probe × (t_rand / t_seq)
// On HDD (ratio = 10): n_R_break = 980 / 10 = 98 tuples
// On SSD (ratio = 2):  n_R_break = 980 / 2 = 490 tuples
```

Interpretation:
On HDD, INLJ is only preferable when fewer than ~100 outer tuples need joining (after filtering). This is why INLJ excels for:

- Highly selective queries whose predicates leave only a handful of outer tuples
- Point lookups and foreign-key joins driven by a small outer result
- Correlated subqueries that must execute once per outer row
- LIMIT queries that need only the first few matches
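The break-even arithmetic above can be checked with a small calculator. This is a sketch; the function name is ours, and `rand_seq_ratio` models the random-I/O penalty (roughly 10 on HDD, 2 on SSD):

```python
# Sketch: outer cardinality below which INLJ beats BNLJ on I/O cost.
def inlj_breakeven_outer_rows(b_r, b_s, m, c_probe, rand_seq_ratio=1.0):
    # rand_seq_ratio penalizes INLJ's random probes relative to
    # BNLJ's sequential scans (t_rand / t_seq).
    return (b_r * b_s) / ((m - 1) * c_probe * rand_seq_ratio)

base = inlj_breakeven_outer_rows(1_000, 200, 52, 4)        # ≈ 980 tuples
hdd  = inlj_breakeven_outer_rows(1_000, 200, 52, 4, 10)    # ≈ 98 tuples
ssd  = inlj_breakeven_outer_rows(1_000, 200, 52, 4, 2)     # ≈ 490 tuples
```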
Memory Threshold for BNLJ Optimality:
BNLJ achieves optimal cost (one scan of each relation) when outer fits in memory:
```
// Memory threshold for optimal BNLJ

Optimal when: M - 1 >= b_R
Therefore:    M >= b_R + 1

// For our example: M >= 1001 blocks

// Cost when outer fits:
C_io_bnlj_optimal = b_R + b_S = 1000 + 200 = 1200 I/Os
                    (≈1.2 seconds at 1 ms per sequential I/O)

// This is the theoretical minimum for any nested loop variant
// (Must read both relations at least once)
```

BNLJ cost decreases rapidly as memory increases until the outer relation fits completely. Beyond this point, additional memory provides no benefit for BNLJ. The optimizer should consider reallocating excess memory to other operations.
Understanding how costs change with parameters helps predict performance and guide tuning:
Parameter Sensitivity:
| Parameter | Effect on SNL | Effect on BNLJ | Effect on INLJ |
|---|---|---|---|
| Double outer rows (n_R) | 2× (linear) | 1× (none, if blocks same) | 2× (linear in probes) |
| Double outer blocks (b_R) | 2× (linear) | 2× (linear) | ~1× (minimal) |
| Double inner blocks (b_S) | 2× (linear) | 2× (linear) | ~1× (minimal, if index cached) |
| Double memory (M) | 1× (none) | ~0.5× (halves scans) | 1× (none directly) |
| SSD vs HDD | ~10× faster | ~10× faster | ~20-50× faster |
| Clustered vs unclustered index | N/A | N/A | ~50% faster (no data page fetch) |
Graphical Intuition:
Imagine plotting cost against outer relation size:
```
Cost
|
|                      ___________ SNL  (slope = b_S per tuple)
|                 ___/
|            ___/      ___________ BNLJ (slope = b_S/(M-1) per block)
|       ___/      ___/
|      /     ___/      ___________ INLJ (slope = C_probe per tuple)
|     /  ___/     ___/
|/___/_______/
+--------------------------------- Outer Size
    ^               ^
    |               |
Small outer    Crossover points
```
Key observations:

- SNL climbs steepest: every additional outer tuple triggers a full scan of S.
- BNLJ's slope is b_S/(M-1) per outer block, so more memory flattens the line.
- INLJ starts cheapest for small outer sizes, but its per-tuple probe cost eventually crosses BNLJ's line.
- The crossover points shift with memory, probe cost, and the random/sequential I/O ratio.
Selectivity is often the most powerful lever. A WHERE clause that reduces outer cardinality by 100x can make INLJ 100x cheaper while barely affecting BNLJ (which still scans S fully). This is why the optimizer pushes selections down aggressively.
Real cost analysis considers multiple dimensions simultaneously:
Combined I/O and CPU Cost:
Modern optimizers combine costs using weighted formulas:
```
// PostgreSQL-style combined cost

Total_Cost = (seq_page_cost × sequential_reads)
           + (random_page_cost × random_reads)
           + (cpu_tuple_cost × tuples_processed)
           + (cpu_index_tuple_cost × index_entries)
           + (cpu_operator_cost × operator_evaluations)

// Default PostgreSQL values:
seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025

// Example: INLJ with 100,000 outer tuples, 4 I/Os per probe
Cost_INLJ = (1.0 × 1,000)        // Sequential read of outer
          + (4.0 × 400,000)      // Random index probes
          + (0.01 × 100,000)     // Outer tuple processing
          + (0.005 × 400,000)    // Index entries examined
          = 1,000 + 1,600,000 + 1,000 + 2,000 = 1,604,000 cost units
```

Memory Pressure Considerations:
The cost model must also account for memory pressure: when concurrent queries compete for buffers, the effective M available to BNLJ shrinks, and the cached index pages that INLJ relies on may be evicted.
Result Size Considerations:
The number of output tuples affects downstream costs:
```
// Output cardinality estimation
output_rows = n_R × n_S × selectivity

// For equality joins on keys:
selectivity ≈ 1 / max(distinct_R, distinct_S)

// Example: R has 100K tuples, 10K distinct join keys
// S has 10K tuples, 10K distinct join keys (all unique)
selectivity = 1 / 10,000 = 0.0001
output_rows = 100,000 × 10,000 × 0.0001 = 100,000

// Output materialization cost (if needed):
output_cost = output_rows × tuple_size / page_size × write_cost
```

PostgreSQL tracks both startup cost (time until first tuple) and total cost (time for all tuples). INLJ has low startup cost—it can emit results immediately. BNLJ's startup cost includes loading the first outer block. For LIMIT queries, startup cost matters more than total cost.
Database administrators can adjust cost model parameters to reflect actual hardware characteristics:
PostgreSQL Tuning:
```sql
-- View current cost settings
SHOW seq_page_cost;          -- Default: 1.0
SHOW random_page_cost;       -- Default: 4.0
SHOW cpu_tuple_cost;         -- Default: 0.01
SHOW effective_cache_size;   -- Expected OS cache size

-- Tune for SSD storage (random approaches sequential)
SET random_page_cost = 1.1;  -- Much closer to sequential

-- Tune for battery-backed cache with fast writes
SET seq_page_cost = 0.5;
SET random_page_cost = 0.5;

-- Tune for large memory systems
SET effective_cache_size = '64GB';  -- More optimistic caching assumptions

-- Per-query cost adjustment (testing)
SET LOCAL random_page_cost = 1.0;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```

MySQL/InnoDB Tuning:
```sql
-- MySQL cost model server configuration
SET GLOBAL optimizer_switch = 'index_merge=on,block_nested_loop=on';

-- Join buffer size (affects BNLJ block size)
SET SESSION join_buffer_size = 4 * 1024 * 1024;  -- 4MB

-- View cost constants (MySQL 8.0+)
SELECT * FROM mysql.server_cost;
SELECT * FROM mysql.engine_cost;

-- Adjust disk I/O cost for SSD
UPDATE mysql.engine_cost
SET cost_value = 0.25
WHERE cost_name = 'io_block_read_cost';

FLUSH OPTIMIZER_COSTS;
```

Cost tuning is often less impactful than ensuring accurate statistics. A 2x error in random_page_cost is manageable; a 100x error in row count estimation causes catastrophic plan choices. Prioritize ANALYZE/UPDATE STATISTICS over cost parameter tuning.
Synthesizing our analysis, here's a practical decision framework for selecting nested loop variants:
Decision Tree:
```
// Nested Loop Join Algorithm Selection

function SelectNLJVariant(R, S, join_pred, memory, indexes):

    // Check for index availability first
    index_on_S = find_suitable_index(S, join_pred)

    if index_on_S exists:
        // Calculate INLJ cost
        outer_rows = estimate_rows(R, predicates_on_R)
        inlj_cost = outer_rows × probe_cost(index_on_S)

        // Calculate BNLJ cost
        bnlj_cost = blocks(R) + ceil(blocks(R) / (memory-1)) × blocks(S)

        // Adjust for random vs sequential I/O
        if storage_is_ssd():
            adjustment_factor = 2
        else:
            adjustment_factor = 10

        if inlj_cost × adjustment_factor < bnlj_cost:
            return INDEX_NESTED_LOOP

    // No useful index, or BNLJ is cheaper
    if memory >= min(blocks(R), blocks(S)) + 1:
        // One relation fits; position smaller as outer
        if blocks(R) < blocks(S):
            return BLOCK_NESTED_LOOP(outer=R, inner=S)
        else:
            return BLOCK_NESTED_LOOP(outer=S, inner=R)
    else:
        // Neither fits; still prefer BNLJ over SNL
        // Position smaller as outer to minimize inner scans
        smaller = R if blocks(R) < blocks(S) else S
        return BLOCK_NESTED_LOOP(outer=smaller)
```

Quick Reference Rules:
| Condition | Recommended Algorithm | Reason |
|---|---|---|
| Outer < 100 rows, index on inner | INLJ | Few probes, even with random I/O |
| Outer fits in memory | BNLJ (outer=smaller) | Single scan of inner |
| Inner fits in memory | BNLJ (swap relations) | Single scan of inner |
| Neither fits, no useful index | BNLJ | Always better than SNL |
| Theta-join / complex predicate | BNLJ or SNL | Index may not help |
| SSD storage + large outer + index | Consider INLJ | Lower random I/O penalty |
| Correlated subquery | INLJ | Unavoidable per-row execution |
This decision framework covers nested loop variants only. Hash joins and merge joins often outperform all nested loop variants for large equi-joins. The optimizer considers all algorithms and chooses the lowest estimated cost across all options.
We've developed a comprehensive framework for analyzing and comparing nested loop join costs. Let's consolidate the essential insights:

- Total cost combines I/O and CPU components; on disk-based systems, I/O usually dominates.
- BNLJ cost falls as memory grows, bottoming out at b_R + b_S once the outer relation fits.
- INLJ cost scales with outer cardinality and pays a random I/O penalty (roughly 10x on HDD, 2x on SSD).
- Break-even points shift with memory, probe cost, storage type, and selectivity, so the optimizer must re-evaluate for each query.
What's Next:
With a solid understanding of costs, the final page explores when to use nested loop joins—examining scenarios where nested loops excel, when to prefer hash or merge joins, and practical guidance for query optimization and index design.
You now have a comprehensive framework for analyzing nested loop join costs. You can calculate costs, compare variants, understand break-even points, and tune cost parameters. Next, we'll synthesize this knowledge into practical guidance for choosing when to use nested loop joins.