The block nested loop join's cost is proportional to the product of relation sizes divided by available memory. Even with generous buffers, joining two large relations requires reading the inner relation multiple times—thousands or millions of disk pages repeatedly fetched from storage.
But what if we could avoid scanning the inner relation entirely? What if, instead of examining every inner tuple, we could jump directly to the matching tuples using an index?
This is the essence of the index nested loop join (INLJ). By leveraging a B-tree, hash index, or other access structure on the join key of the inner relation, we transform the join from a brute-force scan into a series of targeted lookups. For highly selective joins—where each outer tuple matches only a few inner tuples—this can reduce execution time by orders of magnitude.
By the end of this page, you will understand how the index nested loop join uses indexes to accelerate joins, its cost model, when it outperforms scan-based joins, what index types are suitable, and the critical factors that influence its effectiveness.
Consider a join between Orders (10 million rows, 100,000 pages) and Customers (1 million rows, 50,000 pages) on Orders.customer_id = Customers.id. With BNLJ and a buffer pool of roughly 1,000 pages, we'd scan the Customers table about a hundred times.
But Customers likely has a primary key index on id. For each order, we can look up the matching customer with roughly 3-4 page reads to traverse the B-tree (one per level) plus 1 data page access. Instead of scanning the 50,000 pages of Customers repeatedly, we make 10 million targeted lookups at ~4 I/Os each.
The Transformation:
BNLJ Cost: b_orders + ⌈b_orders/M⌉ × b_customers
≈ 100,000 + (100 × 50,000) = 5,100,000 I/Os
INLJ Cost: b_orders + n_orders × (index_height + 1)
≈ 100,000 + (10,000,000 × 4) = 40,100,000 I/Os
BUT: with caching, upper index levels stay cached
Effective: 100,000 + (10,000,000 × 1.5) = 15,100,000 I/Os
Wait—15 million is worse than 5 million! What's happening?
INLJ trades sequential scanning for random index probes. Each probe is a random access, while a table scan reads pages sequentially, and on spinning disks a random read can be roughly 100x slower than a sequential one. INLJ wins when the join is selective enough that the total cost of its random probes falls below the cost of repeatedly scanning the inner relation.
The critical insight: INLJ shines when the number of outer tuples requiring lookup is small relative to the inner relation, or when the inner relation doesn't fit in memory for BNLJ but the index does.
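To make that tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python. The cost weights (a 100:1 random-to-sequential penalty, 1.5 page reads per probe after caching) are illustrative assumptions carried over from the discussion above, not measured values:

```python
# Rough break-even sketch: BNLJ's sequential rescans vs. INLJ's random probes.
# All constants below are assumptions for illustration, not measurements.

SEQ_COST = 1.0      # relative cost of one sequential page read
RAND_COST = 100.0   # relative cost of one random page read (spinning disk)

def bnlj_cost(b_outer, b_inner, buffer_pages):
    """Sequential I/O: scan the outer once, rescan the inner once per outer block."""
    passes = -(-b_outer // buffer_pages)   # ceiling division
    return SEQ_COST * (b_outer + passes * b_inner)

def inlj_cost(b_outer, probing_tuples, ios_per_probe=1.5):
    """Scan the outer sequentially, then pay one random probe per qualifying tuple."""
    return SEQ_COST * b_outer + RAND_COST * probing_tuples * ios_per_probe

# Full join of the running example: all 10M orders probe -> random I/O dominates.
print(f"BNLJ: {bnlj_cost(100_000, 50_000, 1_000):,.0f} cost units")   # ~5.1 million
print(f"INLJ: {inlj_cost(100_000, 10_000_000):,.0f} cost units")      # ~1.5 billion

# Selective query: only ~100 orders probe -> the probes are almost free.
print(f"INLJ (selective): {inlj_cost(100_000, 100):,.0f} cost units") # ~115 thousand
```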
The Selectivity Factor:
Let's reconsider with a selective predicate. Suppose we're joining orders from only one customer:
SELECT * FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE o.customer_id = 12345;
Now we have perhaps 100 orders (not 10 million). INLJ cost: roughly 100 probes × ~4 I/Os ≈ 400 I/Os beyond fetching the qualifying orders, and since every probe uses the same customer key, caching shrinks even that.
BNLJ would still scan all 50,000 pages of the Customers table. For this selective query, INLJ is faster by orders of magnitude.
INLJ replaces the inner scan with index lookups. The algorithm structure resembles simple nested loop, but the inner loop becomes an index probe:
Algorithm Structure:
```
// Index Nested Loop Join Algorithm
// R: Outer relation
// S: Inner relation with index on join key
// θ: Join predicate (includes equality on indexed key)
// index_S: Index on S.join_key

function IndexNestedLoopJoin(R, S, θ, index_S):
    result ← ∅
    for each tuple r in R:                      // OUTER LOOP
        // Extract join key from outer tuple
        key ← r[join_key]

        // Use index to find matching inner tuples
        matching_S_tuples ← index_S.lookup(key)

        for each tuple s in matching_S_tuples:  // MATCHED TUPLES ONLY
            if θ(r, s):                         // Evaluate residual predicate
                result ← result ∪ {r ⊕ s}
    return result
```

Execution Flow Detail:
Outer Iteration: Fetch the next tuple from the outer relation R. This can be a sequential scan or another index scan if there's a selection predicate on R.
Key Extraction: Extract the join key value(s) from the outer tuple. For composite keys, extract all relevant attributes.
Index Probing: Use the index on S to find all tuples where S.join_key matches the extracted key. For B-trees, this means descending from the root to the leaf level and then scanning the leaf entries that carry the key.
Tuple Retrieval: Fetch the actual data tuples from the inner relation. For non-covering indexes, this requires reading the data page containing each matching tuple.
Residual Predicate: If the join predicate has components beyond the indexed equality (e.g., range conditions, additional filters), evaluate them now.
Result Emission: Matching tuples are concatenated and added to the result.
Repeat: Continue for all outer tuples.
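Below is a minimal, runnable sketch of this flow in Python. The inner index is modeled as an in-memory dictionary from key to matching tuples, a stand-in for a real B-tree or hash index; the relation and column layouts are made up for illustration:

```python
from collections import defaultdict

def build_index(S, key_fn):
    """Stand-in for a persistent index: maps a join-key value to its inner tuples."""
    index = defaultdict(list)
    for s in S:
        index[key_fn(s)].append(s)
    return index

def index_nested_loop_join(R, index_S, outer_key_fn, residual=lambda r, s: True):
    """For each outer tuple, probe the index, then apply any residual predicate."""
    for r in R:                                     # outer loop
        for s in index_S.get(outer_key_fn(r), []):  # matched inner tuples only
            if residual(r, s):
                yield (*r, *s)                      # concatenate r and s

# Hypothetical data: Orders(order_id, customer_id) and Customers(id, name).
customers = [(1, "Ada"), (2, "Grace")]
orders = [(10, 1), (11, 1), (12, 2)]
cust_index = build_index(customers, key_fn=lambda c: c[0])
print(list(index_nested_loop_join(orders, cust_index, outer_key_fn=lambda o: o[1])))
# [(10, 1, 1, 'Ada'), (11, 1, 1, 'Ada'), (12, 2, 2, 'Grace')]
```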
The index must cover the join equality predicate. For R.a = S.b, we need an index on S.b. If only an index on R.a exists, the optimizer might swap the relations—but then R becomes inner and S becomes outer.
The INLJ cost model is more nuanced than BNLJ because it depends on index characteristics, data distribution, and caching behavior.
Basic Cost Formula:
Cost_INLJ = Cost_scan_R + n_R × Cost_per_probe
Where:
- Cost_scan_R: the I/O cost of producing the outer tuples (b_R pages for a full scan of R)
- n_R: the number of outer tuples that actually probe the index
- Cost_per_probe: the I/O cost of one index lookup plus retrieval of the matching inner tuples
Breaking Down Cost_per_probe:
| Component | B-Tree Cost | Hash Index Cost |
|---|---|---|
| Index traversal | Height_I (typically 2-4) | 1-2 (hash + overflow) |
| Leaf page scan | ⌈matching_entries / entries_per_page⌉ | N/A (direct match) |
| Data page retrieval | matching_tuples (worst case) | matching_tuples |
| Typical total (unique key) | 4 I/Os | 2-3 I/Os |
| With caching (upper nodes) | 1-2 I/Os | 1-2 I/Os |
Detailed Cost Model:
For a more precise analysis, we consider:
Cost_per_probe = C_index + C_data
C_index ≈ Height_I page reads uncached; often just 1 (the leaf) once the upper levels are cached
C_data = ⌈matching_tuples / tuples_per_page⌉   (clustered index)
       = matching_tuples                        (non-clustered, worst case)
Example Calculation:
Joining 100,000 orders (5,000 pages) to Products (500 pages) on order.product_id = product.id, where Products has a B-tree index on id (height 3) whose pages stay cached, so each probe costs about one data-page fetch:

Cost_INLJ = b_orders + n_orders × (1 data page fetch)
          = 5,000 + 100,000 × 1
          = 105,000 I/Os
Compare to BNLJ with 100 buffer pages:
Cost_BNLJ = 5,000 + ⌈5,000/99⌉ × 500
= 5,000 + 51 × 500
= 30,500 I/Os
BNLJ wins here! Why? Products is small (500 pages), so even 51 sequential rescans cost less I/O than 100,000 probes, and each probe touches a different product page with random I/O while BNLJ reads Products sequentially.
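The same comparison expressed as a small Python sketch of the two I/O-count formulas, using the page and row counts from this example (and assuming the index stays cached, so each probe costs one data-page read):

```python
import math

def inlj_io(b_outer, n_outer, ios_per_probe):
    """Scan the outer once, then pay ios_per_probe page reads per outer tuple."""
    return b_outer + n_outer * ios_per_probe

def bnlj_io(b_outer, b_inner, buffer_pages):
    """Scan the outer once; rescan the inner once per block of the outer."""
    return b_outer + math.ceil(b_outer / buffer_pages) * b_inner

print(inlj_io(b_outer=5_000, n_outer=100_000, ios_per_probe=1))  # 105000
print(bnlj_io(b_outer=5_000, b_inner=500, buffer_pages=99))      # 30500
```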
If the inner relation is clustered on the join key, consecutive outer tuples with similar keys access the same data pages. This turns random I/O into near-sequential I/O and dramatically improves cache hit rates. Always consider clustering when designing for INLJ workloads.
Given the competing costs of random I/O (INLJ) versus sequential I/O (BNLJ), the optimizer must carefully evaluate when INLJ is superior:
INLJ Favorable Conditions:
- Few outer tuples actually probe: a small outer relation or a selective predicate on it
- A usable index exists on the inner join key, ideally clustered or covering
- Each probe matches only a few inner tuples (unique or near-unique join keys)
- The inner relation is too large to keep in memory for BNLJ, but the index's upper levels cache well
- Storage with cheap random reads (SSDs)

INLJ Unfavorable Conditions:
- A large fraction of the outer relation must probe the index
- Many matches per key on a non-clustered index, scattering data-page reads
- The inner relation is small enough that a few sequential rescans (or one in-memory pass) are cheaper
- Spinning disks, where the random-I/O penalty is severe
As a rule of thumb, INLJ becomes favorable when the number of outer tuples × average matches per probe < 10-15% of inner relation size (for unclustered indexes) or < 30-40% (for clustered). SSDs shift these thresholds significantly higher.
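Purely as an illustration, the rule of thumb could be encoded like this; the thresholds are the rough percentages quoted above, and the SSD adjustment (doubling the threshold) is an assumption rather than a standard figure:

```python
def prefer_inlj(probing_tuples, avg_matches_per_probe, inner_rows,
                clustered=False, ssd=False):
    """Rough heuristic for when index probes beat rescanning the inner relation."""
    rows_touched = probing_tuples * avg_matches_per_probe
    threshold = 0.35 if clustered else 0.125   # midpoints of the 30-40% / 10-15% ranges
    if ssd:
        threshold *= 2                         # assumed: cheap random reads shift the break-even
    return rows_touched < threshold * inner_rows

print(prefer_inlj(100, 1, 1_000_000))          # True: the selective query from earlier
print(prefer_inlj(10_000_000, 1, 1_000_000))   # False: the full Orders-Customers join
```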
Different index structures affect INLJ performance in significant ways:
B-Tree Indexes:
- Support range predicates (<, >, BETWEEN) in addition to equality
- Typical height of 2-4 levels, and the upper levels cache well across probes

Hash Indexes:
- Equality lookups only; a probe usually costs 1-2 I/Os (hash bucket plus any overflow page)
- Cannot serve range or inequality join predicates

Clustered vs Non-Clustered:
- With a clustered index, all matches for a key sit on one or a few adjacent data pages, so C_data stays small
- With a non-clustered index, each matching tuple may require its own data-page read (the worst case in the cost model above)

Covering Indexes:
A covering index includes all columns needed by the query in its leaf nodes (via INCLUDE clause or composite key). No data page access is required:
```sql
-- Create covering index for a common join pattern
CREATE INDEX idx_orders_customer_covering
ON Orders(customer_id)
INCLUDE (order_date, total_amount, status);

-- This query uses index-only scan for inner relation
SELECT o.order_date, o.total_amount, c.name
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE c.region = 'APAC';
-- INLJ probes idx_orders_customer_covering
-- Never touches Orders data pages
```

When optimizing critical join queries, consider creating covering indexes that include frequently selected columns. The storage overhead is often justified by avoiding data page I/O during INLJ execution.
A powerful optimization combines BNLJ's blocking with INLJ's index usage. Instead of probing the index for each outer tuple individually, we collect a batch of keys and probe them together:
Batched Key Lookup Algorithm:
```
// Batched Key Lookup Join
// Combines BNLJ blocking with sorted index probes

function BatchedKeyLookupJoin(R, S, index_S, batch_size):
    result ← ∅
    for each batch in partition(R, batch_size):
        // Collect and sort keys from batch
        keys ← [r[join_key] for r in batch]
        sorted_keys ← sort(deduplicate(keys))

        // Create hash map: key → outer tuples with that key
        key_to_outer ← build_hashmap(batch, join_key)

        // Probe index in sorted key order (sequential-ish access)
        for each key in sorted_keys:
            matching_inner ← index_S.lookup(key)
            for each s in matching_inner:
                for each r in key_to_outer[key]:
                    result ← result ∪ {r ⊕ s}
    return result
```

Benefits of Batching:
- Duplicate keys within a batch are probed only once
- Probing in sorted key order turns scattered index and data-page accesses into near-sequential I/O
- Pages touched by a batch are read while they are still hot in the buffer pool, improving hit rates
MySQL's Batched Key Access (BKA):
MySQL implements this optimization as BKA (Batched Key Access), controlled by optimizer switches:
```sql
-- Enable Batched Key Access in MySQL
SET optimizer_switch = 'batched_key_access=on,mrr=on,mrr_cost_based=off';

-- BKA uses Multi-Range Read (MRR) for sorted access
EXPLAIN SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01';
-- Look for "Using join buffer (Batched Key Access)" in Extra
```

MRR is the underlying mechanism that sorts row IDs before data retrieval. By fetching data pages in physical order rather than logical order, MRR converts random I/O patterns into near-sequential access, which is particularly beneficial on spinning disks.
INLJ supports outer join semantics with the same modifications as simple nested loop:
Left Outer Join:
The outer relation (the side whose unmatched rows must be preserved) must be the driving relation. After probing, if no match is found, emit the outer tuple padded with NULLs:
```
// Index Nested Loop Left Outer Join
function INLJ_LeftOuter(R, S, index_S):
    result ← ∅
    for each tuple r in R:
        key ← r[join_key]
        matching_S ← index_S.lookup(key)

        if matching_S is empty:
            // No match found, preserve outer with NULLs
            result ← result ∪ {r ⊕ NULL_S}
        else:
            for each s in matching_S:
                result ← result ∪ {r ⊕ s}
    return result
```

Right Outer Join:
More problematic: we need to track which inner tuples were never matched. Options include keeping a set of matched inner row IDs and scanning S afterward for the unmatched ones (shown below), or letting the optimizer swap the inputs so that S drives the join as a left outer join, which requires an index on R instead.
```
// Index Nested Loop Right Outer Join (with post-processing)
function INLJ_RightOuter(R, S, index_S):
    result ← ∅
    matched_S_rowids ← new HashSet()

    for each tuple r in R:
        key ← r[join_key]
        matching_S ← index_S.lookup_with_rowids(key)
        for each (s, rowid) in matching_S:
            result ← result ∪ {r ⊕ s}
            matched_S_rowids.add(rowid)

    // Post-processing: scan S for unmatched
    for each (s, rowid) in S.scan_with_rowids():
        if rowid not in matched_S_rowids:
            result ← result ∪ {NULL_R ⊕ s}
    return result
```

Right outer joins with INLJ require either an additional full scan of S or memory proportional to |S| for tracking. This often negates INLJ's benefits, leading optimizers to prefer hash or merge joins for right outer operations.
Anti-Joins and Semi-Joins:
INLJ is particularly efficient for EXISTS and NOT EXISTS patterns:
-- Semi-join: EXISTS
SELECT * FROM Customers c
WHERE EXISTS (
SELECT 1 FROM Orders o WHERE o.customer_id = c.id
);
-- INLJ: For each customer, probe index once. Stop on first match.
-- Anti-join: NOT EXISTS
SELECT * FROM Customers c
WHERE NOT EXISTS (
SELECT 1 FROM Orders o WHERE o.customer_id = c.id
);
-- INLJ: For each customer, probe index. Emit if no matches found.
The early termination property (one match is enough for EXISTS, no matches needed for NOT EXISTS) makes INLJ highly efficient for these patterns.
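A compact sketch of that early-termination behavior, again modeling the index as a plain Python dictionary over hypothetical data:

```python
def index_semi_join(R, index_S, outer_key_fn):
    """EXISTS: emit each outer tuple as soon as one index probe finds a match."""
    for r in R:
        if index_S.get(outer_key_fn(r)):   # one probe; the first match is enough
            yield r

def index_anti_join(R, index_S, outer_key_fn):
    """NOT EXISTS: emit each outer tuple only if its probe finds no match."""
    for r in R:
        if not index_S.get(outer_key_fn(r)):
            yield r

# Hypothetical data: Customers(id, name) and an index of Orders keyed by customer_id.
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
orders_by_customer = {1: [(10, 1)], 2: [(12, 2)]}   # customer 3 has no orders

print(list(index_semi_join(customers, orders_by_customer, lambda c: c[0])))
# [(1, 'Ada'), (2, 'Grace')]
print(list(index_anti_join(customers, orders_by_customer, lambda c: c[0])))
# [(3, 'Edsger')]
```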
Query optimizers must carefully evaluate whether INLJ is the right choice for a given query:
Cardinality Estimation:
Accurate estimation of outer relation cardinality is critical. The INLJ cost formula is n_R × cost_per_probe—both multiplicands matter.
Common Estimation Errors:
| Estimation Error | Impact on INLJ Choice | Consequence |
|---|---|---|
| Underestimate outer rows | INLJ chosen incorrectly | Millions of probes when scan was cheaper |
| Overestimate outer rows | INLJ avoided incorrectly | Slow scan when probes were faster |
| Ignore correlation | Wrong join order | Larger relation driving smaller one |
| Stale statistics | Any of the above | Optimizer working with wrong numbers |
Index Availability:
The optimizer must check: whether an index exists on the inner relation's join key, whether the join predicate matches that index (an equality on the key, or a usable prefix of a composite key), whether the index is clustered or covering, and whether the index type supports the predicate at all (hash indexes handle equality only).
Cost Model Parameters:
```sql
-- PostgreSQL cost model parameters affecting INLJ decisions

-- Random page access cost (default 4.0)
-- Higher values penalize INLJ's random access pattern
SHOW random_page_cost;

-- Sequential page access cost (default 1.0)
-- Lower relative to random makes scans more attractive
SHOW seq_page_cost;

-- CPU cost per tuple processing
SHOW cpu_tuple_cost;

-- CPU cost per index entry processing
SHOW cpu_index_tuple_cost;

-- Adjust for SSD (random is closer to sequential)
SET random_page_cost = 1.1;  -- INLJ becomes more attractive
```

When you know the optimizer is making a suboptimal choice (perhaps due to stale statistics), you can use hints to force INLJ. In PostgreSQL: SET enable_hashjoin = off; SET enable_mergejoin = off;. In MySQL: use FORCE INDEX (idx_name) on the inner table.
We've explored the index nested loop join, a powerful optimization that transforms join execution from scans into targeted lookups. Let's consolidate the key insights:
- INLJ replaces the inner scan with index probes: Cost_INLJ = Cost_scan_R + n_R × Cost_per_probe
- It shines when few outer tuples probe (selective queries) or when the inner relation is too large to rescan cheaply but its index caches well
- Random versus sequential I/O is the crux: a handful of sequential rescans can beat millions of random probes, especially on spinning disks
- Index characteristics matter: clustered and covering indexes shrink the data-page cost of each probe, while hash indexes probe cheaply but handle only equality
- Batching (MySQL's BKA with MRR) sorts and deduplicates probe keys to restore near-sequential access
- INLJ handles left outer joins, semi-joins, and anti-joins naturally, but right outer joins need extra tracking that often erases its advantage
- Accurate cardinality estimates and cost parameters such as random_page_cost determine whether the optimizer chooses it
What's Next:
With the three nested loop variants covered—simple, block, and index—we need a framework to compare and choose between them. The next page develops a comprehensive cost analysis framework, combining I/O costs, CPU costs, and memory requirements to understand when each variant is optimal.
You now understand the index nested loop join—how it leverages indexes to accelerate matching, its cost model, when it outperforms scan-based alternatives, and the critical role of index design. Next, we'll synthesize this knowledge into a comprehensive cost analysis framework.