The block nested loop join's cost is proportional to the product of relation sizes divided by available memory. Even with generous buffers, joining two large relations requires reading the inner relation multiple times—thousands or millions of disk pages repeatedly fetched from storage.
But what if we could avoid scanning the inner relation entirely? What if, instead of examining every inner tuple, we could jump directly to the matching tuples using an index?
This is the essence of the index nested loop join (INLJ). By leveraging a B-tree, hash index, or other access structure on the join key of the inner relation, we transform the join from a brute-force scan into a series of targeted lookups. For highly selective joins—where each outer tuple matches only a few inner tuples—this can reduce execution time by orders of magnitude.
By the end of this page, you will understand how the index nested loop join uses indexes to accelerate joins, its cost model, when it outperforms scan-based joins, what index types are suitable, and the critical factors that influence its effectiveness.
Consider a join between Orders (10 million rows, 100,000 pages) and Customers (1 million rows, 50,000 pages) on Orders.customer_id = Customers.id. With BNLJ and a buffer pool of roughly 1,000 pages, we'd scan the Customers table about a hundred times.
But Customers likely has a primary key index on id. For each order, we can look up the matching customer with roughly 3-4 page reads to traverse the B-tree (one per level) plus 1 data page access. Instead of scanning the 50,000 pages of Customers repeatedly, we make 10 million targeted lookups at ~4 I/Os each.
The Transformation:
BNLJ Cost: b_orders + ⌈b_orders/M⌉ × b_customers
≈ 100,000 + (100 × 50,000) = 5,100,000 I/Os
INLJ Cost: b_orders + n_orders × (index_height + 1)
≈ 100,000 + (10,000,000 × 4) = 40,100,000 I/Os
BUT: with caching, upper index levels stay cached
Effective: 100,000 + (10,000,000 × 1.5) = 15,100,000 I/Os
Wait—15 million is worse than 5 million! What's happening?
INLJ trades sequential scanning for random index probes. Each probe is a random access, while a table scan reads pages sequentially, and on spinning disks a random read can be roughly 100x slower than a sequential one. INLJ wins when the join is selective enough that the total cost of its random probes falls below the cost of repeatedly scanning the inner relation.
The critical insight: INLJ shines when the number of outer tuples requiring lookup is small relative to the inner relation, or when the inner relation doesn't fit in memory for BNLJ but the index does.
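To make that tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python. The cost weights (a 100:1 random-to-sequential penalty, 1.5 page reads per probe after caching) are illustrative assumptions carried over from the discussion above, not measured values:

```python
# Rough break-even sketch: BNLJ's sequential rescans vs. INLJ's random probes.
# All constants below are assumptions for illustration, not measurements.

SEQ_COST = 1.0      # relative cost of one sequential page read
RAND_COST = 100.0   # relative cost of one random page read (spinning disk)

def bnlj_cost(b_outer, b_inner, buffer_pages):
    """Sequential I/O: scan the outer once, rescan the inner once per outer block."""
    passes = -(-b_outer // buffer_pages)   # ceiling division
    return SEQ_COST * (b_outer + passes * b_inner)

def inlj_cost(b_outer, probing_tuples, ios_per_probe=1.5):
    """Scan the outer sequentially, then pay one random probe per qualifying tuple."""
    return SEQ_COST * b_outer + RAND_COST * probing_tuples * ios_per_probe

# Full join of the running example: all 10M orders probe -> random I/O dominates.
print(f"BNLJ: {bnlj_cost(100_000, 50_000, 1_000):,.0f} cost units")   # ~5.1 million
print(f"INLJ: {inlj_cost(100_000, 10_000_000):,.0f} cost units")      # ~1.5 billion

# Selective query: only ~100 orders probe -> the probes are almost free.
print(f"INLJ (selective): {inlj_cost(100_000, 100):,.0f} cost units") # ~115 thousand
```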
The Selectivity Factor:
Let's reconsider with a selective predicate. Suppose we're joining orders from only one customer:
SELECT * FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE o.customer_id = 12345;
Now we have perhaps 100 orders (not 10 million). INLJ cost: roughly 100 probes × ~4 I/Os ≈ 400 I/Os beyond fetching the qualifying orders, and since every probe uses the same customer key, caching shrinks even that.
BNLJ would still scan all 50,000 pages of the Customers table. For this selective query, INLJ is faster by orders of magnitude.
INLJ replaces the inner scan with index lookups. The algorithm structure resembles simple nested loop, but the inner loop becomes an index probe:
Algorithm Structure:
```
// Index Nested Loop Join Algorithm
// R: Outer relation
// S: Inner relation with index on join key
// θ: Join predicate (includes equality on indexed key)
// index_S: Index on S.join_key

function IndexNestedLoopJoin(R, S, θ, index_S):
    result ← ∅
    for each tuple r in R:                      // OUTER LOOP
        // Extract join key from outer tuple
        key ← r[join_key]

        // Use index to find matching inner tuples
        matching_S_tuples ← index_S.lookup(key)

        for each tuple s in matching_S_tuples:  // MATCHED TUPLES ONLY
            if θ(r, s):                         // Evaluate residual predicate
                result ← result ∪ {r ⊕ s}
    return result
```

Execution Flow Detail:
Outer Iteration: Fetch the next tuple from the outer relation R. This can be a sequential scan or another index scan if there's a selection predicate on R.
Key Extraction: Extract the join key value(s) from the outer tuple. For composite keys, extract all relevant attributes.
Index Probing: Use the index on S to find all tuples where S.join_key matches the extracted key. For B-trees, this means descending from the root to the leaf level and then scanning the leaf entries that carry the key.
Tuple Retrieval: Fetch the actual data tuples from the inner relation. For non-covering indexes, this requires reading the data page containing each matching tuple.
Residual Predicate: If the join predicate has components beyond the indexed equality (e.g., range conditions, additional filters), evaluate them now.
Result Emission: Matching tuples are concatenated and added to the result.
Repeat: Continue for all outer tuples.
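Below is a minimal, runnable sketch of this flow in Python. The inner index is modeled as an in-memory dictionary from key to matching tuples, a stand-in for a real B-tree or hash index; the relation and column layouts are made up for illustration:

```python
from collections import defaultdict

def build_index(S, key_fn):
    """Stand-in for a persistent index: maps a join-key value to its inner tuples."""
    index = defaultdict(list)
    for s in S:
        index[key_fn(s)].append(s)
    return index

def index_nested_loop_join(R, index_S, outer_key_fn, residual=lambda r, s: True):
    """For each outer tuple, probe the index, then apply any residual predicate."""
    for r in R:                                     # outer loop
        for s in index_S.get(outer_key_fn(r), []):  # matched inner tuples only
            if residual(r, s):
                yield (*r, *s)                      # concatenate r and s

# Hypothetical data: Orders(order_id, customer_id) and Customers(id, name).
customers = [(1, "Ada"), (2, "Grace")]
orders = [(10, 1), (11, 1), (12, 2)]
cust_index = build_index(customers, key_fn=lambda c: c[0])
print(list(index_nested_loop_join(orders, cust_index, outer_key_fn=lambda o: o[1])))
# [(10, 1, 1, 'Ada'), (11, 1, 1, 'Ada'), (12, 2, 2, 'Grace')]
```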
The index must cover the join equality predicate. For R.a = S.b, we need an index on S.b. If only an index on R.a exists, the optimizer might swap the relations—but then R becomes inner and S becomes outer.
The INLJ cost model is more nuanced than BNLJ because it depends on index characteristics, data distribution, and caching behavior.
Basic Cost Formula:
Cost_INLJ = Cost_scan_R + n_R × Cost_per_probe
Where:
- Cost_scan_R: the I/O cost of producing the outer tuples (b_R pages for a full scan of R)
- n_R: the number of outer tuples that actually probe the index
- Cost_per_probe: the I/O cost of one index lookup plus retrieval of the matching inner tuples
Breaking Down Cost_per_probe:
| Component | B-Tree Cost | Hash Index Cost |
|---|---|---|
| Index traversal | Height_I (typically 2-4) | 1-2 (hash + overflow) |
| Leaf page scan | ⌈matching_entries / entries_per_page⌉ | N/A (direct match) |
| Data page retrieval | matching_tuples (worst case) | matching_tuples |
| Typical total (unique key) | 4 I/Os | 2-3 I/Os |
| With caching (upper nodes) | 1-2 I/Os | 1-2 I/Os |
Detailed Cost Model:
For a more precise analysis, we consider:
Cost_per_probe = C_index + C_data
C_index ≈ Height_I page reads uncached; often just 1 (the leaf) once the upper levels are cached
C_data = ⌈matching_tuples / tuples_per_page⌉   (clustered index)
       = matching_tuples                        (non-clustered, worst case)
Example Calculation:
Joining 100,000 orders (5,000 pages) to Products (500 pages) on order.product_id = product.id, where Products has a B-tree index on id (height 3) whose pages stay cached, so each probe costs about one data-page fetch:

Cost_INLJ = b_orders + n_orders × (1 data page fetch)
          = 5,000 + 100,000 × 1
          = 105,000 I/Os
Compare to BNLJ with 100 buffer pages:
Cost_BNLJ = 5,000 + ⌈5,000/99⌉ × 500
= 5,000 + 51 × 500
= 30,500 I/Os
BNLJ wins here! Why? Products is small (500 pages), so even 51 sequential rescans cost less I/O than 100,000 probes, and each probe touches a different product page with random I/O while BNLJ reads Products sequentially.
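The same comparison expressed as a small Python sketch of the two I/O-count formulas, using the page and row counts from this example (and assuming the index stays cached, so each probe costs one data-page read):

```python
import math

def inlj_io(b_outer, n_outer, ios_per_probe):
    """Scan the outer once, then pay ios_per_probe page reads per outer tuple."""
    return b_outer + n_outer * ios_per_probe

def bnlj_io(b_outer, b_inner, buffer_pages):
    """Scan the outer once; rescan the inner once per block of the outer."""
    return b_outer + math.ceil(b_outer / buffer_pages) * b_inner

print(inlj_io(b_outer=5_000, n_outer=100_000, ios_per_probe=1))  # 105000
print(bnlj_io(b_outer=5_000, b_inner=500, buffer_pages=99))      # 30500
```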
If the inner relation is clustered on the join key, consecutive outer tuples with similar keys access the same data pages. This turns random I/O into near-sequential I/O and dramatically improves cache hit rates. Always consider clustering when designing for INLJ workloads.
Given the competing costs of random I/O (INLJ) versus sequential I/O (BNLJ), the optimizer must carefully evaluate when INLJ is superior:
INLJ Favorable Conditions:
- Few outer tuples actually probe: a small outer relation or a selective predicate on it
- A usable index exists on the inner join key, ideally clustered or covering
- Each probe matches only a few inner tuples (unique or near-unique join keys)
- The inner relation is too large to keep in memory for BNLJ, but the index's upper levels cache well
- Storage with cheap random reads (SSDs)

INLJ Unfavorable Conditions:
- A large fraction of the outer relation must probe the index
- Many matches per key on a non-clustered index, scattering data-page reads
- The inner relation is small enough that a few sequential rescans (or one in-memory pass) are cheaper
- Spinning disks, where the random-I/O penalty is severe
As a rule of thumb, INLJ becomes favorable when the number of outer tuples × average matches per probe < 10-15% of inner relation size (for unclustered indexes) or < 30-40% (for clustered). SSDs shift these thresholds significantly higher.
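Purely as an illustration, the rule of thumb could be encoded like this; the thresholds are the rough percentages quoted above, and the SSD adjustment (doubling the threshold) is an assumption rather than a standard figure:

```python
def prefer_inlj(probing_tuples, avg_matches_per_probe, inner_rows,
                clustered=False, ssd=False):
    """Rough heuristic for when index probes beat rescanning the inner relation."""
    rows_touched = probing_tuples * avg_matches_per_probe
    threshold = 0.35 if clustered else 0.125   # midpoints of the 30-40% / 10-15% ranges
    if ssd:
        threshold *= 2                         # assumed: cheap random reads shift the break-even
    return rows_touched < threshold * inner_rows

print(prefer_inlj(100, 1, 1_000_000))          # True: the selective query from earlier
print(prefer_inlj(10_000_000, 1, 1_000_000))   # False: the full Orders-Customers join
```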
Different index structures affect INLJ performance in significant ways:
B-Tree Indexes:
- Support range predicates (<, >, BETWEEN) in addition to equality
- Typical height of 2-4 levels, and the upper levels cache well across probes

Hash Indexes:
- Equality lookups only; a probe usually costs 1-2 I/Os (hash bucket plus any overflow page)
- Cannot serve range or inequality join predicates

Clustered vs Non-Clustered:
- With a clustered index, all matches for a key sit on one or a few adjacent data pages, so C_data stays small
- With a non-clustered index, each matching tuple may require its own data-page read (the worst case in the cost model above)

Covering Indexes:
A covering index includes all columns needed by the query in its leaf nodes (via INCLUDE clause or composite key). No data page access is required:
```sql
-- Create covering index for a common join pattern
CREATE INDEX idx_orders_customer_covering
ON Orders(customer_id)
INCLUDE (order_date, total_amount, status);

-- This query uses index-only scan for inner relation
SELECT o.order_date, o.total_amount, c.name
FROM Customers c
JOIN Orders o ON c.id = o.customer_id
WHERE c.region = 'APAC';
-- INLJ probes idx_orders_customer_covering
-- Never touches Orders data pages
```

When optimizing critical join queries, consider creating covering indexes that include frequently selected columns. The storage overhead is often justified by avoiding data page I/O during INLJ execution.
A powerful optimization combines BNLJ's blocking with INLJ's index usage. Instead of probing the index for each outer tuple individually, we collect a batch of keys and probe them together:
Batched Key Lookup Algorithm:
```
// Batched Key Lookup Join
// Combines BNLJ blocking with sorted index probes

function BatchedKeyLookupJoin(R, S, index_S, batch_size):
    result ← ∅
    for each batch in partition(R, batch_size):
        // Collect and sort keys from batch
        keys ← [r[join_key] for r in batch]
        sorted_keys ← sort(deduplicate(keys))

        // Create hash map: key → outer tuples with that key
        key_to_outer ← build_hashmap(batch, join_key)

        // Probe index in sorted key order (sequential-ish access)
        for each key in sorted_keys:
            matching_inner ← index_S.lookup(key)
            for each s in matching_inner:
                for each r in key_to_outer[key]:
                    result ← result ∪ {r ⊕ s}
    return result
```

Benefits of Batching:
- Duplicate keys within a batch are probed only once
- Probing in sorted key order turns scattered index and data-page accesses into near-sequential I/O
- Pages touched by a batch are read while they are still hot in the buffer pool, improving hit rates
MySQL's Batched Key Access (BKA):
MySQL implements this optimization as BKA (Batched Key Access), controlled by optimizer switches:
```sql
-- Enable Batched Key Access in MySQL
SET optimizer_switch = 'batched_key_access=on,mrr=on,mrr_cost_based=off';

-- BKA uses Multi-Range Read (MRR) for sorted access
EXPLAIN SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01';
-- Look for "Using join buffer (Batched Key Access)" in Extra
```

MRR is the underlying mechanism that sorts row IDs before data retrieval. By fetching data pages in physical order rather than logical order, MRR converts random I/O patterns into near-sequential access, which is particularly beneficial on spinning disks.
INLJ supports outer join semantics with the same modifications as simple nested loop:
Left Outer Join:
The outer relation (the side whose unmatched rows must be preserved) must be the driving relation. After probing, if no match is found, emit the outer tuple padded with NULLs:
```
// Index Nested Loop Left Outer Join
function INLJ_LeftOuter(R, S, index_S):
    result ← ∅
    for each tuple r in R:
        key ← r[join_key]
        matching_S ← index_S.lookup(key)

        if matching_S is empty:
            // No match found, preserve outer with NULLs
            result ← result ∪ {r ⊕ NULL_S}
        else:
            for each s in matching_S:
                result ← result ∪ {r ⊕ s}
    return result
```

Right Outer Join:
More problematic: we need to track which inner tuples were never matched. Options include keeping a set of matched inner row IDs and scanning S afterward for the unmatched ones (shown below), or letting the optimizer swap the inputs so that S drives the join as a left outer join, which requires an index on R instead.
```
// Index Nested Loop Right Outer Join (with post-processing)
function INLJ_RightOuter(R, S, index_S):
    result ← ∅
    matched_S_rowids ← new HashSet()

    for each tuple r in R:
        key ← r[join_key]
        matching_S ← index_S.lookup_with_rowids(key)
        for each (s, rowid) in matching_S:
            result ← result ∪ {r ⊕ s}
            matched_S_rowids.add(rowid)

    // Post-processing: scan S for unmatched
    for each (s, rowid) in S.scan_with_rowids():
        if rowid not in matched_S_rowids:
            result ← result ∪ {NULL_R ⊕ s}
    return result
```

Right outer joins with INLJ require either an additional full scan of S or memory proportional to |S| for tracking. This often negates INLJ's benefits, leading optimizers to prefer hash or merge joins for right outer operations.
Anti-Joins and Semi-Joins:
INLJ is particularly efficient for EXISTS and NOT EXISTS patterns:
-- Semi-join: EXISTS
SELECT * FROM Customers c
WHERE EXISTS (
SELECT 1 FROM Orders o WHERE o.customer_id = c.id
);
-- INLJ: For each customer, probe index once. Stop on first match.
-- Anti-join: NOT EXISTS
SELECT * FROM Customers c
WHERE NOT EXISTS (
SELECT 1 FROM Orders o WHERE o.customer_id = c.id
);
-- INLJ: For each customer, probe index. Emit if no matches found.
The early termination property (one match is enough for EXISTS, no matches needed for NOT EXISTS) makes INLJ highly efficient for these patterns.
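A compact sketch of that early-termination behavior, again modeling the index as a plain Python dictionary over hypothetical data:

```python
def index_semi_join(R, index_S, outer_key_fn):
    """EXISTS: emit each outer tuple as soon as one index probe finds a match."""
    for r in R:
        if index_S.get(outer_key_fn(r)):   # one probe; the first match is enough
            yield r

def index_anti_join(R, index_S, outer_key_fn):
    """NOT EXISTS: emit each outer tuple only if its probe finds no match."""
    for r in R:
        if not index_S.get(outer_key_fn(r)):
            yield r

# Hypothetical data: Customers(id, name) and an index of Orders keyed by customer_id.
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
orders_by_customer = {1: [(10, 1)], 2: [(12, 2)]}   # customer 3 has no orders

print(list(index_semi_join(customers, orders_by_customer, lambda c: c[0])))
# [(1, 'Ada'), (2, 'Grace')]
print(list(index_anti_join(customers, orders_by_customer, lambda c: c[0])))
# [(3, 'Edsger')]
```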
Query optimizers must carefully evaluate whether INLJ is the right choice for a given query:
Cardinality Estimation:
Accurate estimation of outer relation cardinality is critical. The INLJ cost formula is n_R × cost_per_probe—both multiplicands matter.
Common Estimation Errors:
| Estimation Error | Impact on INLJ Choice | Consequence |
|---|---|---|
| Underestimate outer rows | INLJ chosen incorrectly | Millions of probes when scan was cheaper |
| Overestimate outer rows | INLJ avoided incorrectly | Slow scan when probes were faster |
| Ignore correlation | Wrong join order | Larger relation driving smaller one |
| Stale statistics | Any of the above | Optimizer working with wrong numbers |
Index Availability:
The optimizer must check: whether an index exists on the inner relation's join key, whether the join predicate matches that index (an equality on the key, or a usable prefix of a composite key), whether the index is clustered or covering, and whether the index type supports the predicate at all (hash indexes handle equality only).
Cost Model Parameters:
```sql
-- PostgreSQL cost model parameters affecting INLJ decisions

-- Random page access cost (default 4.0)
-- Higher values penalize INLJ's random access pattern
SHOW random_page_cost;

-- Sequential page access cost (default 1.0)
-- Lower relative to random makes scans more attractive
SHOW seq_page_cost;

-- CPU cost per tuple processing
SHOW cpu_tuple_cost;

-- CPU cost per index entry processing
SHOW cpu_index_tuple_cost;

-- Adjust for SSD (random is closer to sequential)
SET random_page_cost = 1.1;  -- INLJ becomes more attractive
```

When you know the optimizer is making a suboptimal choice (perhaps due to stale statistics), you can use hints to force INLJ. In PostgreSQL: SET enable_hashjoin = off; SET enable_mergejoin = off;. In MySQL: use FORCE INDEX (idx_name) on the inner table.
We've explored the index nested loop join, a powerful optimization that transforms join execution from scans into targeted lookups. Let's consolidate the key insights:
- INLJ replaces the inner scan with index probes: Cost_INLJ = Cost_scan_R + n_R × Cost_per_probe
- It shines when few outer tuples probe (selective queries) or when the inner relation is too large to rescan cheaply but its index caches well
- Random versus sequential I/O is the crux: a handful of sequential rescans can beat millions of random probes, especially on spinning disks
- Index characteristics matter: clustered and covering indexes shrink the data-page cost of each probe, while hash indexes probe cheaply but handle only equality
- Batching (MySQL's BKA with MRR) sorts and deduplicates probe keys to restore near-sequential access
- INLJ handles left outer joins, semi-joins, and anti-joins naturally, but right outer joins need extra tracking that often erases its advantage
- Accurate cardinality estimates and cost parameters such as random_page_cost determine whether the optimizer chooses it
What's Next:
With the three nested loop variants covered—simple, block, and index—we need a framework to compare and choose between them. The next page develops a comprehensive cost analysis framework, combining I/O costs, CPU costs, and memory requirements to understand when each variant is optimal.
You now understand the index nested loop join—how it leverages indexes to accelerate matching, its cost model, when it outperforms scan-based alternatives, and the critical role of index design. Next, we'll synthesize this knowledge into a comprehensive cost analysis framework.