We've examined three nested loop join variants—simple nested loop, block nested loop, and index nested loop. Each has distinct performance characteristics that make it optimal under specific conditions. But how does a query optimizer decide which to use? How can a database administrator predict query performance?
The answer lies in cost modeling—the process of estimating the resources (I/O operations, CPU cycles, memory) required to execute a query using a particular physical plan. Modern query optimizers maintain detailed cost models that consider disk latencies, buffer pool sizes, index structures, and data statistics.
This page develops a comprehensive cost analysis framework for nested loop joins. We'll derive precise cost formulas, analyze break-even points between algorithms, examine sensitivity to key parameters, and establish practical decision criteria for algorithm selection.
By the end of this page, you will understand:

- Unified cost models for all nested loop variants
- How to calculate and compare join costs
- Sensitivity of costs to parameters like memory and selectivity
- Break-even points for algorithm selection
- Real-world cost tuning strategies
Before comparing algorithms, we must establish a common framework for measuring cost. Database cost models typically separate costs into distinct components:
Cost Components:

- I/O cost: disk blocks read or written, split into sequential and random accesses
- CPU cost: tuple comparisons and predicate evaluations
- Memory cost: buffer blocks consumed, which determines how often the inner relation must be rescanned
Standard Notation:
We'll use consistent notation throughout this analysis:
| Symbol | Description | Typical Value |
|---|---|---|
| b_R, b_S | Number of disk blocks in relations R, S | Varies |
| n_R, n_S | Number of tuples in relations R, S | Varies |
| f_R, f_S | Blocking factor (tuples per block) | 10-1000 |
| M | Available memory in blocks | 10-10,000 |
| H_I | Height of index on inner relation | 2-4 |
| t_seq | Time per sequential I/O (ms) | 0.5-5 (SSD) / 5-15 (HDD) |
| t_rand | Time per random I/O (ms) | 0.1-1 (SSD) / 10-15 (HDD) |
| t_cpu | Time per tuple comparison (μs) | 0.1-10 |
| sel | Join selectivity (matching pairs / total pairs) | 0-1 |
For traditional disk-based systems, I/O cost dominates (often 99%+ of execution time). For in-memory databases or SSD-based systems with hot caches, CPU cost becomes more significant. Modern cost models weight both based on storage characteristics.
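As a quick illustration of how these components combine, here is a minimal sketch. The function name and timing defaults are ours, drawn from the typical values in the notation table (1 ms sequential I/O, 10 ms random I/O, 0.1 µs per comparison):

```python
# A minimal sketch of a weighted cost model combining I/O and CPU components.
# Default latencies are illustrative HDD-style values from the table above.
def total_time_ms(seq_ios, rand_ios, comparisons,
                  t_seq_ms=1.0, t_rand_ms=10.0, t_cpu_us=0.1):
    io_ms = seq_ios * t_seq_ms + rand_ios * t_rand_ms
    cpu_ms = comparisons * t_cpu_us / 1_000  # microseconds -> milliseconds
    return io_ms, cpu_ms, io_ms + cpu_ms

# A BNLJ-style workload: 5,000 sequential I/Os, 10^9 tuple comparisons
io, cpu, total = total_time_ms(5_000, 0, 100_000 * 10_000)
# io ≈ 5,000 ms, cpu ≈ 100,000 ms: once I/O is cheap, CPU dominates
```

Note how the balance flips: with few enough I/Os, the CPU term becomes the larger component, which is exactly why in-memory systems weight CPU cost more heavily.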
Let's consolidate the cost formulas for each nested loop variant into a unified framework:
Simple Nested Loop Join (SNL):
```
// Simple Nested Loop Join Cost
// Tuple-at-a-time processing

I/O Cost:
C_io_snl = b_R + (n_R × b_S)
         = b_R + (b_R × f_R × b_S)
// Intuition: Read R once, scan S for every tuple in R

CPU Cost:
C_cpu_snl = n_R × n_S × t_cpu
// Intuition: Compare every pair of tuples

Total Time:
T_snl = (b_R × t_seq) + (n_R × b_S × t_seq) + (n_R × n_S × t_cpu)
```

Block Nested Loop Join (BNLJ):
```
// Block Nested Loop Join Cost
// Block-at-a-time processing with M memory blocks

I/O Cost:
C_io_bnlj = b_R + ⌈b_R / (M - 1)⌉ × b_S
// Intuition: Read R once, scan S for every block-chunk of R

Best Case (R fits in memory, M > b_R):
C_io_bnlj_best = b_R + b_S

CPU Cost:
C_cpu_bnlj = n_R × n_S × t_cpu   // Same as SNL

// With hash optimization on outer buffer:
C_cpu_bnlj_hash = n_R + n_S × (hash_probe_cost) ≈ O(n_R + n_S)

Total Time:
T_bnlj = (b_R × t_seq) + (⌈b_R/(M-1)⌉ × b_S × t_seq) + CPU_cost
```

Index Nested Loop Join (INLJ):
```
// Index Nested Loop Join Cost
// Index probe for each outer tuple

I/O Cost:
C_io_inlj = b_R + n_R × C_probe

Where C_probe (cost per index probe):
// B-tree, unclustered, unique key:
C_probe = H_I + 1                        // Height + 1 data page
// B-tree, clustered:
C_probe = H_I                            // No separate data page
// With matches_per_key > 1:
C_probe = H_I + ⌈matches_per_key / f_S⌉

// Adjusting for random I/O:
T_io_inlj = (b_R × t_seq) + (n_R × C_probe × t_rand)

CPU Cost:
C_cpu_inlj = n_R × (index_compare_cost + matches_per_key × t_cpu)
// Typically much lower than scan-based joins
```

The key insight in INLJ cost is the random I/O component. On spinning disks, t_rand ≈ 10-15 ms, while the effective t_seq can approach 0.1 ms per page during large sequential scans. That roughly 100x penalty means INLJ must do 100x fewer I/Os than BNLJ to break even on HDDs. On SSDs (t_rand ≈ 0.1-0.5 ms), the penalty is only 2-10x.
Let's compare the three variants with a concrete example:
Scenario:

- Outer relation R: n_R = 100,000 tuples in b_R = 1,000 blocks (f_R = 100)
- Inner relation S: n_S = 10,000 tuples in b_S = 200 blocks
- Memory: M = 52 blocks (51 available for the outer buffer)
- B-tree index on S's join key, C_probe = 4 I/Os per probe
- Timing: t_seq = 1 ms, t_rand = 10 ms, t_cpu = 0.1 μs

Cost Calculations:
| Algorithm | I/O Count | I/O Time | CPU Time | Total |
|---|---|---|---|---|
| Simple NLJ | 1,000 + 100,000 × 200 = 20,001,000 | ~20,000 sec | ~100 sec | ~5.6 hours |
| Block NLJ | 1,000 + ⌈1000/51⌉ × 200 = 5,000 | 5 sec | ~100 sec | ~105 sec |
| Index NLJ | 1,000 + 100,000 × 4 = 401,000 | 1,000 × 1ms + 400,000 × 10ms ≈ 4,001 sec | ~1 sec | ~67 min |
Analysis:
SNL is catastrophic — 20 million I/Os make it completely impractical.
BNLJ is excellent — With 51 blocks for outer buffer, we scan S only 20 times. 5,000 I/Os ≈ 5 seconds.
INLJ is costly here — 400,000 random I/Os at 10ms each = 67 minutes. The random I/O penalty destroys performance.
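These wall-clock figures can be sanity-checked with a few lines of arithmetic, using the scenario's illustrative latencies (t_seq = 1 ms, t_rand = 10 ms):

```python
# Sanity check of the wall-clock times in the comparison table.
T_SEQ_MS, T_RAND_MS = 1, 10  # illustrative HDD-like latencies

snl_ms  = (1_000 + 100_000 * 200) * T_SEQ_MS        # 20,001,000 ms ≈ 5.6 hours
bnlj_ms = 5_000 * T_SEQ_MS                          # 5,000 ms = 5 seconds
inlj_ms = 1_000 * T_SEQ_MS + 400_000 * T_RAND_MS    # 4,001,000 ms ≈ 67 minutes
```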
Why BNLJ Wins This Scenario:

- All of its I/O is sequential, avoiding the 10 ms random I/O penalty entirely.
- The 51-block outer buffer amortizes each scan of S across 51 blocks of R, so S is read only 20 times.
- The large outer (100,000 tuples) forces INLJ into 400,000 random probes, which no index can offset.
Now let's change parameters:
| Scenario Change | BNLJ Cost | INLJ Cost | Winner |
|---|---|---|---|
| SSD: t_rand = 0.5ms | 5,000 × 0.5ms = 2.5 sec | 401,000 × 0.5ms = 200 sec | BNLJ |
| Selective: only 1,000 outer tuples match | Still 5,000 I/Os | 1,000 × 4 = 4,000 random I/Os = 2 sec (SSD) | INLJ |
| Large S: 50,000 blocks | 1,000 + 20 × 50,000 = 1,001,000 I/Os | 1,000 + 100,000 × 4 = 401,000 I/Os | INLJ (if SSD) |
| S fits in memory: M = 250 | 1,000 + 200 = 1,200 I/Os | Still 401,000 I/Os | BNLJ |
The optimal algorithm depends on the specific combination of data sizes, memory, storage characteristics, and selectivity. Query optimizers must re-evaluate for each query with current statistics.
Let's derive the conditions under which one algorithm beats another:
BNLJ vs INLJ Break-Even:
Setting costs equal and solving for the crossover point:
```
// BNLJ vs INLJ Break-Even Analysis

BNLJ I/O Cost = b_R + ⌈b_R/(M-1)⌉ × b_S
INLJ I/O Cost = b_R + n_R × C_probe

// Set equal (ignoring ceiling):
⌈b_R/(M-1)⌉ × b_S = n_R × C_probe
(b_R / (M-1)) × b_S = n_R × C_probe

// Solving for n_R (outer cardinality where INLJ breaks even):
n_R_break = (b_R × b_S) / ((M-1) × C_probe)

// Example: b_R=1000, b_S=200, M=52, C_probe=4
n_R_break = (1000 × 200) / (51 × 4) ≈ 980 tuples
// If effective n_R < 980, INLJ wins (before random I/O penalty)

// Adjusting for random vs sequential I/O:
// INLJ effective cost = n_R × C_probe × (t_rand / t_seq)
// On HDD (ratio = 10): n_R_break = 980 / 10 = 98 tuples
// On SSD (ratio = 2):  n_R_break = 980 / 2 = 490 tuples
```

Interpretation:
On HDD, INLJ is only preferable when fewer than ~100 outer tuples need joining (after filtering). This is why INLJ excels for:

- Highly selective queries whose predicates leave only a handful of outer tuples
- Point lookups and foreign-key joins driven by a small outer result
- Correlated subqueries that must execute once per outer row
- LIMIT queries that need only the first few matches
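The break-even arithmetic above can be checked with a small calculator. This is a sketch; the function name is ours, and `rand_seq_ratio` models the random-I/O penalty (roughly 10 on HDD, 2 on SSD):

```python
# Sketch: outer cardinality below which INLJ beats BNLJ on I/O cost.
def inlj_breakeven_outer_rows(b_r, b_s, m, c_probe, rand_seq_ratio=1.0):
    # rand_seq_ratio penalizes INLJ's random probes relative to
    # BNLJ's sequential scans (t_rand / t_seq).
    return (b_r * b_s) / ((m - 1) * c_probe * rand_seq_ratio)

base = inlj_breakeven_outer_rows(1_000, 200, 52, 4)        # ≈ 980 tuples
hdd  = inlj_breakeven_outer_rows(1_000, 200, 52, 4, 10)    # ≈ 98 tuples
ssd  = inlj_breakeven_outer_rows(1_000, 200, 52, 4, 2)     # ≈ 490 tuples
```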
Memory Threshold for BNLJ Optimality:
BNLJ achieves optimal cost (one scan of each relation) when outer fits in memory:
```
// Memory threshold for optimal BNLJ

Optimal when: M - 1 >= b_R
Therefore:    M >= b_R + 1

// For our example: M >= 1001 blocks

// Cost when outer fits:
C_io_bnlj_optimal = b_R + b_S = 1000 + 200 = 1200 I/Os
                    (≈1.2 seconds at 1 ms per sequential I/O)

// This is the theoretical minimum for any nested loop variant
// (Must read both relations at least once)
```

BNLJ cost decreases rapidly as memory increases until the outer relation fits completely. Beyond this point, additional memory provides no benefit for BNLJ. The optimizer should consider reallocating excess memory to other operations.
Understanding how costs change with parameters helps predict performance and guide tuning:
Parameter Sensitivity:
| Parameter | Effect on SNL | Effect on BNLJ | Effect on INLJ |
|---|---|---|---|
| Double outer rows (n_R) | 2× (linear) | 1× (none, if blocks same) | 2× (linear in probes) |
| Double outer blocks (b_R) | 2× (linear) | 2× (linear) | ~1× (minimal) |
| Double inner blocks (b_S) | 2× (linear) | 2× (linear) | ~1× (minimal, if index cached) |
| Double memory (M) | 1× (none) | ~0.5× (halves scans) | 1× (none directly) |
| SSD vs HDD | ~10× faster | ~10× faster | ~20-50× faster |
| Clustered vs unclustered index | N/A | N/A | ~50% faster (no data page fetch) |
Graphical Intuition:
Imagine plotting cost against outer relation size:
```
Cost
|
|                      ___________ SNL  (slope = b_S per tuple)
|                 ___/
|            ___/      ___________ BNLJ (slope = b_S/(M-1) per block)
|       ___/      ___/
|      /     ___/      ___________ INLJ (slope = C_probe per tuple)
|     /  ___/     ___/
|/___/_______/
+--------------------------------- Outer Size
    ^               ^
    |               |
Small outer    Crossover points
```
Key observations:

- SNL climbs steepest: every additional outer tuple triggers a full scan of S.
- BNLJ's slope is b_S/(M-1) per outer block, so more memory flattens the line.
- INLJ starts cheapest for small outer sizes, but its per-tuple probe cost eventually crosses BNLJ's line.
- The crossover points shift with memory, probe cost, and the random/sequential I/O ratio.
Selectivity is often the most powerful lever. A WHERE clause that reduces outer cardinality by 100x can make INLJ 100x cheaper while barely affecting BNLJ (which still scans S fully). This is why the optimizer pushes selections down aggressively.
Real cost analysis considers multiple dimensions simultaneously:
Combined I/O and CPU Cost:
Modern optimizers combine costs using weighted formulas:
```
// PostgreSQL-style combined cost

Total_Cost = (seq_page_cost × sequential_reads)
           + (random_page_cost × random_reads)
           + (cpu_tuple_cost × tuples_processed)
           + (cpu_index_tuple_cost × index_entries)
           + (cpu_operator_cost × operator_evaluations)

// Default PostgreSQL values:
seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025

// Example: INLJ with 100,000 outer tuples, 4 I/Os per probe
Cost_INLJ = (1.0 × 1,000)        // Sequential read of outer
          + (4.0 × 400,000)      // Random index probes
          + (0.01 × 100,000)     // Outer tuple processing
          + (0.005 × 400,000)    // Index entries examined
          = 1,000 + 1,600,000 + 1,000 + 2,000 = 1,604,000 cost units
```

Memory Pressure Considerations:
The cost model must also account for memory pressure: when concurrent queries compete for buffers, the effective M available to BNLJ shrinks, and the cached index pages that INLJ relies on may be evicted.
Result Size Considerations:
The number of output tuples affects downstream costs:
```
// Output cardinality estimation
output_rows = n_R × n_S × selectivity

// For equality joins on keys:
selectivity ≈ 1 / max(distinct_R, distinct_S)

// Example: R has 100K tuples, 10K distinct join keys
// S has 10K tuples, 10K distinct join keys (all unique)
selectivity = 1 / 10,000 = 0.0001
output_rows = 100,000 × 10,000 × 0.0001 = 100,000

// Output materialization cost (if needed):
output_cost = output_rows × tuple_size / page_size × write_cost
```

PostgreSQL tracks both startup cost (time until first tuple) and total cost (time for all tuples). INLJ has low startup cost—it can emit results immediately. BNLJ's startup cost includes loading the first outer block. For LIMIT queries, startup cost matters more than total cost.
Database administrators can adjust cost model parameters to reflect actual hardware characteristics:
PostgreSQL Tuning:
```sql
-- View current cost settings
SHOW seq_page_cost;          -- Default: 1.0
SHOW random_page_cost;       -- Default: 4.0
SHOW cpu_tuple_cost;         -- Default: 0.01
SHOW effective_cache_size;   -- Expected OS cache size

-- Tune for SSD storage (random approaches sequential)
SET random_page_cost = 1.1;  -- Much closer to sequential

-- Tune for battery-backed cache with fast writes
SET seq_page_cost = 0.5;
SET random_page_cost = 0.5;

-- Tune for large memory systems
SET effective_cache_size = '64GB';  -- More optimistic caching assumptions

-- Per-query cost adjustment (testing)
SET LOCAL random_page_cost = 1.0;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```

MySQL/InnoDB Tuning:
```sql
-- MySQL cost model server configuration
SET GLOBAL optimizer_switch = 'index_merge=on,block_nested_loop=on';

-- Join buffer size (affects BNLJ block size)
SET SESSION join_buffer_size = 4 * 1024 * 1024;  -- 4MB

-- View cost constants (MySQL 8.0+)
SELECT * FROM mysql.server_cost;
SELECT * FROM mysql.engine_cost;

-- Adjust disk I/O cost for SSD
UPDATE mysql.engine_cost
SET cost_value = 0.25
WHERE cost_name = 'io_block_read_cost';

FLUSH OPTIMIZER_COSTS;
```

Cost tuning is often less impactful than ensuring accurate statistics. A 2x error in random_page_cost is manageable; a 100x error in row count estimation causes catastrophic plan choices. Prioritize ANALYZE/UPDATE STATISTICS over cost parameter tuning.
Synthesizing our analysis, here's a practical decision framework for selecting nested loop variants:
Decision Tree:
```
// Nested Loop Join Algorithm Selection

function SelectNLJVariant(R, S, join_pred, memory, indexes):

    // Check for index availability first
    index_on_S = find_suitable_index(S, join_pred)

    if index_on_S exists:
        // Calculate INLJ cost
        outer_rows = estimate_rows(R, predicates_on_R)
        inlj_cost = outer_rows × probe_cost(index_on_S)

        // Calculate BNLJ cost
        bnlj_cost = blocks(R) + ceil(blocks(R) / (memory-1)) × blocks(S)

        // Adjust for random vs sequential I/O
        if storage_is_ssd():
            adjustment_factor = 2
        else:
            adjustment_factor = 10

        if inlj_cost × adjustment_factor < bnlj_cost:
            return INDEX_NESTED_LOOP

    // No useful index, or BNLJ is cheaper
    if memory >= min(blocks(R), blocks(S)) + 1:
        // One relation fits; position smaller as outer
        if blocks(R) < blocks(S):
            return BLOCK_NESTED_LOOP(outer=R, inner=S)
        else:
            return BLOCK_NESTED_LOOP(outer=S, inner=R)
    else:
        // Neither fits; still prefer BNLJ over SNL
        // Position smaller as outer to minimize inner scans
        smaller = R if blocks(R) < blocks(S) else S
        return BLOCK_NESTED_LOOP(outer=smaller)
```

Quick Reference Rules:
| Condition | Recommended Algorithm | Reason |
|---|---|---|
| Outer < 100 rows, index on inner | INLJ | Few probes, even with random I/O |
| Outer fits in memory | BNLJ (outer=smaller) | Single scan of inner |
| Inner fits in memory | BNLJ (swap relations) | Single scan of inner |
| Neither fits, no useful index | BNLJ | Always better than SNL |
| Theta-join / complex predicate | BNLJ or SNL | Index may not help |
| SSD storage + large outer + index | Consider INLJ | Lower random I/O penalty |
| Correlated subquery | INLJ | Unavoidable per-row execution |
This decision framework covers nested loop variants only. Hash joins and merge joins often outperform all nested loop variants for large equi-joins. The optimizer considers all algorithms and chooses the lowest estimated cost across all options.
We've developed a comprehensive framework for analyzing and comparing nested loop join costs. Let's consolidate the essential insights:

- Total cost combines I/O and CPU components; on disk-based systems, I/O usually dominates.
- BNLJ cost falls as memory grows, bottoming out at b_R + b_S once the outer relation fits.
- INLJ cost scales with outer cardinality and pays a random I/O penalty (roughly 10x on HDD, 2x on SSD).
- Break-even points shift with memory, probe cost, storage type, and selectivity, so the optimizer must re-evaluate for each query.
What's Next:
With a solid understanding of costs, the final page explores when to use nested loop joins—examining scenarios where nested loops excel, when to prefer hash or merge joins, and practical guidance for query optimization and index design.
You now have a comprehensive framework for analyzing nested loop join costs. You can calculate costs, compare variants, understand break-even points, and tune cost parameters. Next, we'll synthesize this knowledge into practical guidance for choosing when to use nested loop joins.