The simple nested loop join's fatal flaw is clear: for every tuple in the outer relation, the entire inner relation is scanned from disk. When the outer relation contains millions of tuples, this means millions of complete scans of the inner relation—an astronomical I/O cost that renders the algorithm impractical for large datasets.
The block nested loop join (BNLJ) addresses this inefficiency with an elegant insight: instead of processing outer tuples one at a time, we can load a block of outer tuples into memory and then scan the inner relation just once to find matches for all outer tuples in that block. This simple change transforms the cost equation from per-tuple to per-block, often reducing I/O by orders of magnitude.
This optimization is so fundamental that when database practitioners refer to 'nested loop join,' they often mean the block variant rather than the simple tuple-at-a-time version. Understanding BNLJ is essential for query tuning, memory configuration, and appreciating how query optimizers make join algorithm decisions.
By the end of this page, you will understand how the block nested loop join works, how to calculate its I/O cost, how buffer pool size affects performance, the relationship between outer block size and efficiency, and when BNLJ becomes the preferred join method.
The key observation behind BNLJ is that the inner relation scan is the expensive part. Each scan reads b_S blocks from disk. The simple nested loop scans S once per outer tuple, giving n_R scans. BNLJ scans S once per outer block, giving ⌈b_R / B⌉ scans, where B is the number of blocks we can fit in the outer buffer.
The Transformation:
Simple NLJ: Scans of S = n_R (one per outer tuple)
Block NLJ: Scans of S = ⌈b_R / B⌉ (one per outer block)
If R contains 1 million tuples across 10,000 blocks, and we allocate 1,000 blocks for the outer buffer:
Simple NLJ: 1,000,000 scans of S (one per outer tuple)
Block NLJ: ⌈10,000 / 1,000⌉ = 10 scans of S
This is a 100,000x reduction in inner relation scans!
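The arithmetic is easy to check. A minimal Python sketch of the two scan counts (function names are illustrative):

```python
import math

def inner_scans_simple(n_R: int) -> int:
    """Simple NLJ: one scan of S per outer tuple."""
    return n_R

def inner_scans_bnlj(b_R: int, outer_buffer_blocks: int) -> int:
    """Block NLJ: one scan of S per outer chunk of buffered blocks."""
    return math.ceil(b_R / outer_buffer_blocks)

# The example from the text: 1M tuples in 10,000 blocks, 1,000-block buffer
simple = inner_scans_simple(1_000_000)   # 1,000,000 scans of S
bnlj = inner_scans_bnlj(10_000, 1_000)   # 10 scans of S
print(simple // bnlj)                    # 100000
```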
Conceptual Model:
Think of it as batch processing. Instead of asking "does this one outer tuple match any inner tuple?" repeatedly, we ask "which of these thousand outer tuples match this inner tuple?" Once we have the inner tuple in memory, checking against a thousand outer tuples costs almost nothing compared to the disk I/O that brought it in.
BNLJ trades memory for I/O savings. The more memory allocated to the outer buffer, the fewer times we scan the inner relation. This is one of the clearest examples in database systems of the space-time tradeoff—investing in RAM to save disk operations.
The BNLJ algorithm introduces a third loop level (conceptually) or restructures the nesting to process blocks rather than individual tuples.
Algorithm Structure:
```
// Block Nested Loop Join Algorithm
// R: Outer relation
// S: Inner relation
// θ: Join predicate
// B: Total buffer blocks available to the join (B ≥ 2)

function BlockNestedLoopJoin(R, S, θ, B):
    result ← ∅
    outer_buffer ← allocate_blocks(B - 1)  // Reserve 1 block for inner
    inner_buffer ← allocate_blocks(1)

    for each block_chunk in partition(R, B - 1):   // OUTER BLOCK LOOP
        // Load B-1 blocks of R into outer_buffer
        load_into_buffer(outer_buffer, block_chunk)

        for each block s_block in S:               // INNER BLOCK LOOP
            load_into_buffer(inner_buffer, s_block)

            // Match each inner tuple against all outer tuples in buffer
            for each tuple s in inner_buffer:      // INNER TUPLE LOOP
                for each tuple r in outer_buffer:  // OUTER TUPLE LOOP
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}

    return result
```

Detailed Execution Flow:
1. Buffer Allocation: Allocate available memory. Reserve one block for loading inner relation pages; use the remaining blocks for the outer relation buffer.
2. Outer Block Loading: Read B-1 blocks from the outer relation R into the outer buffer. These blocks contain potentially thousands of tuples.
3. Inner Scan Start: Begin scanning the inner relation S from the first block.
4. Inner Block Processing: Load one block of S into the inner buffer.
5. Cross-Matching: For each tuple s in the inner block, iterate through all tuples in the outer buffer, evaluating θ(r, s). This is efficient because both are in memory.
6. Result Emission: Matching pairs are concatenated and emitted.
7. Inner Scan Continuation: Move to the next block of S and repeat steps 4-6.
8. Inner Scan Completion: When S is fully scanned, we've found all matches for the current outer block.
9. Outer Block Advancement: Load the next B-1 blocks of R and repeat from step 3.
10. Termination: When all blocks of R have been processed, the join is complete.
We reserve just one block for the inner relation because we're scanning it sequentially. More inner buffers don't help—the inner is read completely regardless. All extra memory goes to the outer buffer to reduce the number of inner scans required.
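The flow above can be simulated in a few lines of Python. This sketch models blocks as fixed-size lists of tuples and counts block reads; all names (`blocks`, `block_nested_loop_join`) are illustrative, not any real engine's API:

```python
from itertools import islice

def blocks(relation, block_size):
    """Partition a list of tuples into fixed-size 'disk blocks'."""
    it = iter(relation)
    while chunk := list(islice(it, block_size)):
        yield chunk

def block_nested_loop_join(R, S, theta, B, block_size=2):
    """BNLJ sketch: B - 1 blocks buffer the outer relation; the inner
    relation is re-scanned once per outer chunk. Returns (result, io_count)."""
    result, io = [], 0
    outer_blocks = list(blocks(R, block_size))
    inner_blocks = list(blocks(S, block_size))
    chunk_size = B - 1                       # blocks reserved for outer buffer
    for i in range(0, len(outer_blocks), chunk_size):
        chunk = outer_blocks[i:i + chunk_size]
        io += len(chunk)                     # read the outer chunk
        outer_buffer = [r for blk in chunk for r in blk]
        for s_block in inner_blocks:         # one full scan of S per chunk
            io += 1
            for s in s_block:
                for r in outer_buffer:
                    if theta(r, s):
                        result.append(r + s)
    return result, io

# Tiny demo: equi-join on the first field
R = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
S = [(2, 'x'), (3, 'y'), (5, 'z'), (2, 'w')]
rows, io = block_nested_loop_join(R, S, lambda r, s: r[0] == s[0], B=2)
# io = b_R + ceil(b_R / (B-1)) * b_S = 2 + 2 * 2 = 6 block reads
```

With B=2 the outer buffer holds one block, so S (2 blocks) is scanned twice: 6 block reads, matching the cost formula.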
Precise I/O cost analysis is essential for query optimization. Let's derive the BNLJ cost formula and examine how it improves on the simple NLJ.
Defining Variables:
b_R = number of blocks in the outer relation R
b_S = number of blocks in the inner relation S
n_R = number of tuples in R
M = number of buffer blocks available to the join (M - 1 for the outer buffer, 1 for the inner)
Cost Formula:
Cost_BNLJ = b_R + ⌈b_R / (M - 1)⌉ × b_S
Breaking down the formula: the b_R term reads every outer block exactly once; the ⌈b_R / (M - 1)⌉ × b_S term accounts for one full scan of S (b_S blocks) per outer chunk of M - 1 blocks.
Cost Comparison:
| Metric | Simple NLJ | Block NLJ (M=102 blocks) |
|---|---|---|
| Formula | b_R + n_R × b_S | b_R + ⌈b_R/101⌉ × b_S |
| R: 10K blocks, S: 5K blocks, n_R = 1M | 10K + 1M × 5K = 5,000,010,000 | 10K + ⌈10K/101⌉ × 5K = 10K + 100 × 5K = 510,000 |
| Improvement factor | — | ~9,800x fewer I/Os |
| At 10ms per I/O | ~1.6 years | ~85 minutes |
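A quick way to verify the table's numbers is to code the two cost formulas directly (a sketch; names are illustrative):

```python
import math

def cost_simple_nlj(b_R: int, n_R: int, b_S: int) -> int:
    """Simple NLJ: read R once, then scan S once per outer tuple."""
    return b_R + n_R * b_S

def cost_bnlj(b_R: int, b_S: int, M: int) -> int:
    """Block NLJ: read R once, then scan S once per (M-1)-block outer chunk."""
    return b_R + math.ceil(b_R / (M - 1)) * b_S

simple = cost_simple_nlj(10_000, 1_000_000, 5_000)   # 5,000,010,000 I/Os
blocked = cost_bnlj(10_000, 5_000, 102)              # 510,000 I/Os
print(simple / blocked)                              # ~9,800x improvement
```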
Impact of Memory Allocation:
The cost formula reveals a critical relationship between memory and I/O:
| Available Memory (M) | Outer Blocks | Inner Scans | Total I/O Cost |
|---|---|---|---|
| 3 blocks (minimum) | 2 | 5,000 | 10,000 + 5,000 × 5,000 = 25,010,000 |
| 11 blocks | 10 | 1,000 | 10,000 + 1,000 × 5,000 = 5,010,000 |
| 102 blocks | 101 | 100 | 10,000 + 100 × 5,000 = 510,000 |
| 1,001 blocks | 1,000 | 10 | 10,000 + 10 × 5,000 = 60,000 |
| 10,001 blocks (R fits) | 10,000 | 1 | 10,000 + 1 × 5,000 = 15,000 |
When the outer relation fits entirely in the buffer (M ≥ b_R + 1), the inner relation is scanned exactly once. This achieves the optimal cost of b_R + b_S—a single pass over each relation. This is why query optimizers position the smaller relation as outer when memory permits.
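The memory sweep in the table above can be reproduced with the cost formula (illustrative sketch):

```python
import math

def cost_bnlj(b_R: int, b_S: int, M: int) -> int:
    """I/O cost: read R once, scan S once per outer chunk of M-1 blocks."""
    return b_R + math.ceil(b_R / (M - 1)) * b_S

b_R, b_S = 10_000, 5_000
for M in (3, 11, 102, 1_001, 10_001):
    print(M, cost_bnlj(b_R, b_S, M))
# Once M - 1 >= b_R, S is scanned exactly once: cost = b_R + b_S = 15,000
```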
Asymptotic Analysis:
The key insight: BNLJ converts a problem that scales with tuples into one that scales with blocks, and then further divides by available memory. Each doubling of memory approximately halves the cost.
Effective buffer management is crucial for BNLJ performance. Several strategies optimize memory usage beyond the basic algorithm:
1. Outer Buffer Maximization:
Allocate as much memory as possible to the outer buffer. The relationship is clear from the cost formula—larger outer buffer = fewer inner scans. In multi-query environments, this means the join operator must negotiate for buffer pool space.
```
// Buffer allocation strategy for BNLJ
function allocate_buffers(available_memory, b_R, b_S):
    // Strategy 1: Fit outer relation entirely
    if available_memory > b_R + 1:
        outer_buffer ← b_R
        inner_buffer ← 1
        // Result: Just 1 scan of S

    // Strategy 2: Fit inner relation entirely (alternative)
    else if available_memory > b_S + 1:
        // Better to swap: S becomes outer, R becomes inner
        return swap_and_allocate(available_memory, b_S, b_R)

    // Strategy 3: Neither fits; maximize outer buffer
    else:
        outer_buffer ← available_memory - 1
        inner_buffer ← 1

    return (outer_buffer, inner_buffer)
```

2. Relation Ordering Decision:
The cost is asymmetric with respect to which relation is outer vs. inner:
Cost(R outer) = b_R + ⌈b_R/(M-1)⌉ × b_S
Cost(S outer) = b_S + ⌈b_S/(M-1)⌉ × b_R
When neither fits in memory, the smaller relation should be outer. This minimizes the number of scans of the larger relation.
| Configuration | R (1,000 blocks) as Outer | R as Inner (S = 5,000 as Outer) |
|---|---|---|
| Outer scans | 1 | 1 |
| Inner scans | ⌈1000/101⌉ = 10 | ⌈5000/101⌉ = 50 |
| Total I/O | 1,000 + 10×5,000 = 51,000 | 5,000 + 50×1,000 = 55,000 |
| Winner | ✓ (~7% fewer I/Os) | — |
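The ordering decision reduces to evaluating the cost formula both ways (sketch; names are illustrative):

```python
import math

def cost_bnlj(b_outer: int, b_inner: int, M: int) -> int:
    """Read the outer once; scan the inner once per (M-1)-block outer chunk."""
    return b_outer + math.ceil(b_outer / (M - 1)) * b_inner

b_R, b_S, M = 1_000, 5_000, 102
r_outer = cost_bnlj(b_R, b_S, M)   # 1,000 + 10 * 5,000 = 51,000
s_outer = cost_bnlj(b_S, b_R, M)   # 5,000 + 50 * 1,000 = 55,000
print(min((r_outer, "R outer"), (s_outer, "S outer")))  # (51000, 'R outer')
```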
3. Double Buffering:
Overlap I/O with computation by using two buffers for each role—while processing one buffer's contents, the next buffer is loaded asynchronously:
```
// Double buffering for BNLJ
function BNLJ_double_buffered(R, S, θ, M):
    // Allocate: (M-2)/2 blocks for each outer buffer, 2 blocks for inner
    outer_buf_A, outer_buf_B ← allocate_blocks((M-2)/2)
    inner_buf_A, inner_buf_B ← allocate_blocks(1), allocate_blocks(1)

    // Initial async load
    async_load(outer_buf_A, next_outer_chunk())

    for each outer_chunk in R:
        // Wait for current buffer, start loading next
        wait(outer_buf_A)
        start_async_load(outer_buf_B, next_outer_chunk())

        // Process inner relation against outer_buf_A
        async_load(inner_buf_A, first_inner_block())
        for each s_block in S:
            async_load(inner_buf_B, next_inner_block())  // Prefetch next block
            wait(inner_buf_A)
            process_matches(outer_buf_A, inner_buf_A, θ)
            swap(inner_buf_A, inner_buf_B)

        // Swap buffers for next iteration
        swap(outer_buf_A, outer_buf_B)
```

In production systems, the join operator doesn't directly manage disk I/O. Instead, it requests pages from the buffer pool, which handles caching, eviction, and prefetching. The operator's 'buffer allocation' is really a memory grant from the query executor, determining how many pages it can pin simultaneously.
While BNLJ dramatically reduces I/O, the in-memory matching phase still performs O(buffer_size × inner_block_size) comparisons per inner block. For large buffers, this CPU cost becomes significant. Several optimizations address this:
1. In-Memory Hash Index on Outer Buffer:
For equality joins, build a temporary hash table on the outer buffer's join key. When processing inner tuples, probe the hash table instead of scanning the entire buffer:
```
// BNLJ with in-memory hash optimization (for equality joins)
function HashOptimizedBNLJ(R, S, join_key, M):
    result ← ∅

    for each outer_chunk in partition(R, M - 2):
        // Build hash table on outer chunk
        hash_table ← new HashMap()
        for each r in outer_chunk:
            key ← r[join_key]
            hash_table.append(key, r)   // Handle duplicates

        // Probe with inner relation
        for each s_block in S:
            for each s in s_block:
                key ← s[join_key]
                matching_r_tuples ← hash_table.get(key)
                for each r in matching_r_tuples:
                    result ← result ∪ {r ⊕ s}

    return result
```

This transforms the matching phase from O(n × m) to O(n + m) for each outer block, a substantial improvement.
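A runnable Python sketch of this idea, using an in-memory dict per outer chunk (all names and the tiny sample relations are illustrative):

```python
from collections import defaultdict
from itertools import islice

def chunks(relation, size):
    """Yield fixed-size chunks of tuples, standing in for buffered blocks."""
    it = iter(relation)
    while chunk := list(islice(it, size)):
        yield chunk

def hash_bnlj(R, S, key_R, key_S, chunk_tuples):
    """Equality BNLJ: hash each outer chunk on its join key, then probe
    with every inner tuple instead of scanning the whole buffer."""
    result = []
    for outer_chunk in chunks(R, chunk_tuples):
        table = defaultdict(list)              # join key -> outer tuples
        for r in outer_chunk:
            table[r[key_R]].append(r)          # duplicates kept in the list
        for s in S:                            # one pass over the inner relation
            for r in table.get(s[key_S], ()):  # O(1) probe per inner tuple
                result.append(r + s)
    return result

customers = [(1, 'Ada'), (2, 'Bob')]
orders = [(1, 'pen'), (2, 'ink'), (1, 'pad')]
joined = hash_bnlj(customers, orders, 0, 0, chunk_tuples=1)
```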
2. Bloom Filter Pre-Check:
Build a Bloom filter on the outer buffer's join keys. Before probing the hash table, check the Bloom filter—if the inner key is definitely not present, skip the hash probe:
```
// BNLJ with Bloom filter optimization
function BloomOptimizedBNLJ(R, S, join_key, M):
    for each outer_chunk in partition(R, M - 3):
        hash_table ← build_hash_table(outer_chunk, join_key)
        bloom_filter ← build_bloom_filter(outer_chunk, join_key)

        for each s in S:
            key ← s[join_key]

            // Fast path: Bloom filter says "definitely not present"
            if not bloom_filter.may_contain(key):
                continue

            // Slow path: might be present, check hash table
            matching_r_tuples ← hash_table.get(key)
            emit_matches(matching_r_tuples, s)
```

3. Short-Circuit Predicate Evaluation:

For compound predicates such as R.a = S.a AND expensive_function(R.b, S.b), check equality before calling the function.

For SSDs and NVMe storage where I/O is fast, or when data is cached in memory, the in-memory matching cost becomes the dominant factor. Modern query optimizers consider both I/O and CPU costs when choosing join algorithms.
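The Bloom pre-check relies on a filter with one-sided error: `may_contain` can return a false positive but never a false negative. A minimal, illustrative Python implementation (the bit count and hash scheme are arbitrary choices):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per key, derived from
    blake2b with different salts. No false negatives; rare false positives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for k in (10, 20, 30):                   # join keys of the outer chunk
    bf.add(k)
```

Probing `bf.may_contain(key)` before the hash table lookup lets most non-matching inner tuples be rejected with a few bit tests.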
BNLJ naturally supports outer join semantics with minor modifications. The key challenge is tracking which tuples have found matches.
Left Outer Join:
Track which outer tuples in the current buffer have matched. After scanning the entire inner relation, emit unmatched outer tuples with NULL padding:
```
// Block Nested Loop Left Outer Join
function BNLJ_LeftOuter(R, S, θ, M):
    result ← ∅

    for each outer_chunk in partition(R, M - 1):
        // Track matches for each tuple in the chunk
        matched ← new BitSet(size(outer_chunk))

        for each s_block in S:
            for each s in s_block:
                for i, r in enumerate(outer_chunk):
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}
                        matched.set(i)

        // Emit unmatched outer tuples after full inner scan
        for i, r in enumerate(outer_chunk):
            if not matched.get(i):
                result ← result ∪ {r ⊕ NULL_S}

    return result
```

Right Outer Join:
More complex—we must track which inner tuples have been matched across all outer blocks. One approach: maintain a global bit vector for the inner relation:
```
// Block Nested Loop Right Outer Join
function BNLJ_RightOuter(R, S, θ, M):
    result ← ∅

    // Global match tracking for inner relation
    inner_matched ← new BitSet(|S|)

    for each outer_chunk in partition(R, M - 1):
        inner_tuple_index ← 0
        for each s_block in S:
            for each s in s_block:
                for each r in outer_chunk:
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}
                        inner_matched.set(inner_tuple_index)
                inner_tuple_index += 1   // Advance once per inner tuple

    // Final pass: emit unmatched inner tuples
    inner_tuple_index ← 0
    for each s in S:
        if not inner_matched.get(inner_tuple_index):
            result ← result ∪ {NULL_R ⊕ s}
        inner_tuple_index += 1

    return result
```

Right outer and full outer joins require a bit vector proportional to the inner relation's tuple count. For very large inner relations (billions of rows), this overhead becomes significant—potentially requiring on-disk tracking structures or alternative algorithms.
Full Outer Join:
Combines both techniques—track matches for outer (per block) and inner (globally). After processing, emit unmatched from both sides.
Optimization for Right Outer:
An alternative approach reverses the roles: swap R and S, perform a left outer join with S as outer, then swap the attributes back. This works when swapping is semantically equivalent (which it is for outer joins).
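The left outer variant's match tracking can be sketched in runnable Python (simplified tuple-count chunking; all names are illustrative):

```python
from itertools import islice

def chunks(relation, size):
    """Yield fixed-size chunks of tuples, standing in for buffered blocks."""
    it = iter(relation)
    while chunk := list(islice(it, size)):
        yield chunk

NULL = (None, None)   # NULL padding matching the inner schema (illustrative)

def bnlj_left_outer(R, S, theta, chunk_tuples):
    """Left outer BNLJ: a per-chunk bitmap records which outer tuples matched;
    unmatched ones are emitted with NULL padding after the full inner scan."""
    result = []
    for chunk in chunks(R, chunk_tuples):
        matched = [False] * len(chunk)
        for s in S:
            for i, r in enumerate(chunk):
                if theta(r, s):
                    result.append(r + s)
                    matched[i] = True
        for i, r in enumerate(chunk):
            if not matched[i]:
                result.append(r + NULL)
    return result

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'x'), (2, 'y')]
rows = bnlj_left_outer(R, S, lambda r, s: r[0] == s[0], chunk_tuples=2)
```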
Understanding BNLJ's performance envelope helps identify when it's the right choice and how to tune it:
I/O Patterns:
BNLJ's reads are almost entirely sequential: the outer relation is read once from start to finish, and the inner relation is read sequentially on each scan. Sequential access benefits from OS and storage read-ahead, so each block read is typically cheaper than a random I/O.
Sensitivity Analysis:
| Parameter | Impact on Cost | Practical Implication |
|---|---|---|
| Outer size (b_R) | Linear: cost ~ b_R × (b_S/M) | Smaller outer is better |
| Inner size (b_S) | Linear: cost ~ b_S × (b_R/M) | Inner size directly affects cost |
| Memory (M) | Inverse: cost ~ 1/M | Doubling memory halves cost |
| Selectivity | No direct impact | All tuples scanned regardless |
| Join key skew | No impact on I/O | May affect hash-optimized variants |
Comparison with Alternatives:
| Algorithm | Best Case Cost | Memory Requirement | Join Predicate Support |
|---|---|---|---|
| Block NLJ | b_R + b_S (R fits) | 2+ blocks | Any predicate |
| Hash Join | ~3 × (b_R + b_S) | min(b_R, b_S) blocks | Equality only |
| Sort-Merge | b_R + b_S + sort costs | ~√(b_R) | Equality only |
| Index NLJ | b_R + n_R × (tree height) | ~2 blocks | Indexed attributes |
BNLJ beats hash join when the outer relation fits (or nearly fits) in memory, because it avoids the partition/build overhead. Query optimizers calculate the expected costs of each algorithm and choose the cheapest. You can often influence this by adjusting work_mem (PostgreSQL) or similar memory parameters.
Production database systems implement BNLJ with additional sophistication:
PostgreSQL's Nested Loop:
PostgreSQL implements nested loop joins with optional materialization of the inner relation. If the inner is a subquery result or the optimizer expects multiple scans, it materializes to a temp file:
```sql
-- PostgreSQL EXPLAIN showing nested loop with materialization
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.foreign_id;

-- Output might show:
-- Nested Loop  (cost=0.00..1234.56 rows=1000)
--   ->  Seq Scan on large_table l
--   ->  Materialize  (cost=0.00..12.50 rows=100)
--         ->  Seq Scan on small_table s
```

MySQL's Block Nested Loop:
MySQL explicitly implements BNLJ using the join_buffer_size parameter. Each join in a multi-table query can use up to this amount of memory:
```sql
-- MySQL configuration for BNLJ
SET join_buffer_size = 256 * 1024;  -- 256KB per join

-- Check execution plan
EXPLAIN FORMAT=JSON
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'EMEA';

-- Look for "using join buffer (Block Nested Loop)" in the Extra field
```

Modern query executors can adapt at runtime. If actual cardinalities differ from estimates, the executor may switch strategies mid-query—for example, falling back to BNLJ if a hash table grows too large to fit in the allocated memory.
We've examined the block nested loop join—a crucial optimization over the simple nested loop that makes nested loop joins practical for medium-sized relations. Let's consolidate the key points:
What's Next:
Even BNLJ scans the inner relation multiple times. The next page introduces the index nested loop join, which eliminates inner scans entirely by using an index to directly locate matching tuples. When a suitable index exists, this can reduce join cost from O(b_R × b_S / M) to O(b_R + n_R × log(n_S))—often orders of magnitude faster.
You now understand the block nested loop join—how it reduces I/O through blocking, its cost model, buffer management strategies, and real-world implementation. Next, we'll explore how indexes can further accelerate nested loop joins.