The simple nested loop join's fatal flaw is clear: for every tuple in the outer relation, the entire inner relation is scanned from disk. When the outer relation contains millions of tuples, this means millions of complete scans of the inner relation—an astronomical I/O cost that renders the algorithm impractical for large datasets.
The block nested loop join (BNLJ) addresses this inefficiency with an elegant insight: instead of processing outer tuples one at a time, we can load a block of outer tuples into memory and then scan the inner relation just once to find matches for all outer tuples in that block. This simple change transforms the cost equation from per-tuple to per-block, often reducing I/O by orders of magnitude.
This optimization is so fundamental that when database practitioners refer to 'nested loop join,' they often mean the block variant rather than the simple tuple-at-a-time version. Understanding BNLJ is essential for query tuning, memory configuration, and appreciating how query optimizers make join algorithm decisions.
By the end of this page, you will understand how the block nested loop join works, how to calculate its I/O cost, how buffer pool size affects performance, the relationship between outer block size and efficiency, and when BNLJ becomes the preferred join method.
The key observation behind BNLJ is that the inner relation scan is the expensive part. Each scan reads b_S blocks from disk. The simple nested loop scans S once per outer tuple, giving n_R scans. BNLJ scans S once per outer block, giving ⌈b_R / B⌉ scans, where B is the number of blocks we can fit in the outer buffer.
The Transformation:
Simple NLJ: Scans of S = n_R (one per outer tuple)
Block NLJ: Scans of S = ⌈b_R / B⌉ (one per outer block)
If R contains 1 million tuples across 10,000 blocks, and we allocate 1,000 blocks for the outer buffer:
Simple NLJ: 1,000,000 scans of S (one per outer tuple)
Block NLJ: ⌈10,000 / 1,000⌉ = 10 scans of S
This is a 100,000x reduction in inner relation scans!
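The arithmetic is easy to check. A minimal Python sketch of the two scan counts (function names are illustrative):

```python
import math

def inner_scans_simple(n_R: int) -> int:
    """Simple NLJ: one scan of S per outer tuple."""
    return n_R

def inner_scans_bnlj(b_R: int, outer_buffer_blocks: int) -> int:
    """Block NLJ: one scan of S per outer chunk of buffered blocks."""
    return math.ceil(b_R / outer_buffer_blocks)

# The example from the text: 1M tuples in 10,000 blocks, 1,000-block buffer
simple = inner_scans_simple(1_000_000)   # 1,000,000 scans of S
bnlj = inner_scans_bnlj(10_000, 1_000)   # 10 scans of S
print(simple // bnlj)                    # 100000
```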
Conceptual Model:
Think of it as batch processing. Instead of asking "does this one outer tuple match any inner tuple?" repeatedly, we ask "which of these thousand outer tuples match this inner tuple?" Once we have the inner tuple in memory, checking against a thousand outer tuples costs almost nothing compared to the disk I/O that brought it in.
BNLJ trades memory for I/O savings. The more memory allocated to the outer buffer, the fewer times we scan the inner relation. This is one of the clearest examples in database systems of the space-time tradeoff—investing in RAM to save disk operations.
The BNLJ algorithm introduces a third loop level (conceptually) or restructures the nesting to process blocks rather than individual tuples.
Algorithm Structure:
```
// Block Nested Loop Join Algorithm
// R: Outer relation
// S: Inner relation
// θ: Join predicate
// B: Total buffer blocks available to the join (B ≥ 2)

function BlockNestedLoopJoin(R, S, θ, B):
    result ← ∅
    outer_buffer ← allocate_blocks(B - 1)  // Reserve 1 block for inner
    inner_buffer ← allocate_blocks(1)

    for each block_chunk in partition(R, B - 1):   // OUTER BLOCK LOOP
        // Load B-1 blocks of R into outer_buffer
        load_into_buffer(outer_buffer, block_chunk)

        for each block s_block in S:               // INNER BLOCK LOOP
            load_into_buffer(inner_buffer, s_block)

            // Match each inner tuple against all outer tuples in buffer
            for each tuple s in inner_buffer:      // INNER TUPLE LOOP
                for each tuple r in outer_buffer:  // OUTER TUPLE LOOP
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}

    return result
```

Detailed Execution Flow:
1. Buffer Allocation: Allocate available memory. Reserve one block for loading inner relation pages; use the remaining blocks for the outer relation buffer.
2. Outer Block Loading: Read B-1 blocks from the outer relation R into the outer buffer. These blocks contain potentially thousands of tuples.
3. Inner Scan Start: Begin scanning the inner relation S from the first block.
4. Inner Block Processing: Load one block of S into the inner buffer.
5. Cross-Matching: For each tuple s in the inner block, iterate through all tuples in the outer buffer, evaluating θ(r, s). This is efficient because both are in memory.
6. Result Emission: Matching pairs are concatenated and emitted.
7. Inner Scan Continuation: Move to the next block of S and repeat steps 4-6.
8. Inner Scan Completion: When S is fully scanned, we've found all matches for the current outer block.
9. Outer Block Advancement: Load the next B-1 blocks of R and repeat from step 3.
10. Termination: When all blocks of R have been processed, the join is complete.
We reserve just one block for the inner relation because we're scanning it sequentially. More inner buffers don't help—the inner is read completely regardless. All extra memory goes to the outer buffer to reduce the number of inner scans required.
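The flow above can be simulated in a few lines of Python. This sketch models blocks as fixed-size lists of tuples and counts block reads; all names (`blocks`, `block_nested_loop_join`) are illustrative, not any real engine's API:

```python
from itertools import islice

def blocks(relation, block_size):
    """Partition a list of tuples into fixed-size 'disk blocks'."""
    it = iter(relation)
    while chunk := list(islice(it, block_size)):
        yield chunk

def block_nested_loop_join(R, S, theta, B, block_size=2):
    """BNLJ sketch: B - 1 blocks buffer the outer relation; the inner
    relation is re-scanned once per outer chunk. Returns (result, io_count)."""
    result, io = [], 0
    outer_blocks = list(blocks(R, block_size))
    inner_blocks = list(blocks(S, block_size))
    chunk_size = B - 1                       # blocks reserved for outer buffer
    for i in range(0, len(outer_blocks), chunk_size):
        chunk = outer_blocks[i:i + chunk_size]
        io += len(chunk)                     # read the outer chunk
        outer_buffer = [r for blk in chunk for r in blk]
        for s_block in inner_blocks:         # one full scan of S per chunk
            io += 1
            for s in s_block:
                for r in outer_buffer:
                    if theta(r, s):
                        result.append(r + s)
    return result, io

# Tiny demo: equi-join on the first field
R = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
S = [(2, 'x'), (3, 'y'), (5, 'z'), (2, 'w')]
rows, io = block_nested_loop_join(R, S, lambda r, s: r[0] == s[0], B=2)
# io = b_R + ceil(b_R / (B-1)) * b_S = 2 + 2 * 2 = 6 block reads
```

With B=2 the outer buffer holds one block, so S (2 blocks) is scanned twice: 6 block reads, matching the cost formula.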
Precise I/O cost analysis is essential for query optimization. Let's derive the BNLJ cost formula and examine how it improves on the simple NLJ.
Defining Variables:
b_R = number of blocks in the outer relation R
b_S = number of blocks in the inner relation S
n_R = number of tuples in R
M = number of buffer blocks available to the join (M - 1 for the outer buffer, 1 for the inner)
Cost Formula:
Cost_BNLJ = b_R + ⌈b_R / (M - 1)⌉ × b_S
Breaking down the formula: the b_R term reads every outer block exactly once; the ⌈b_R / (M - 1)⌉ × b_S term accounts for one full scan of S (b_S blocks) per outer chunk of M - 1 blocks.
Cost Comparison:
| Metric | Simple NLJ | Block NLJ (M=102 blocks) |
|---|---|---|
| Formula | b_R + n_R × b_S | b_R + ⌈b_R/101⌉ × b_S |
| R: 10K blocks, S: 5K blocks, n_R = 1M | 10K + 1M × 5K = 5,000,010,000 | 10K + ⌈10K/101⌉ × 5K = 10K + 100 × 5K = 510,000 |
| Improvement factor | — | ~9,800x fewer I/Os |
| At 10ms per I/O | ~1.6 years | ~85 minutes |
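A quick way to verify the table's numbers is to code the two cost formulas directly (a sketch; names are illustrative):

```python
import math

def cost_simple_nlj(b_R: int, n_R: int, b_S: int) -> int:
    """Simple NLJ: read R once, then scan S once per outer tuple."""
    return b_R + n_R * b_S

def cost_bnlj(b_R: int, b_S: int, M: int) -> int:
    """Block NLJ: read R once, then scan S once per (M-1)-block outer chunk."""
    return b_R + math.ceil(b_R / (M - 1)) * b_S

simple = cost_simple_nlj(10_000, 1_000_000, 5_000)   # 5,000,010,000 I/Os
blocked = cost_bnlj(10_000, 5_000, 102)              # 510,000 I/Os
print(simple / blocked)                              # ~9,800x improvement
```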
Impact of Memory Allocation:
The cost formula reveals a critical relationship between memory and I/O:
| Available Memory (M) | Outer Blocks | Inner Scans | Total I/O Cost |
|---|---|---|---|
| 3 blocks (minimum) | 2 | 5,000 | 10,000 + 5,000 × 5,000 = 25,010,000 |
| 11 blocks | 10 | 1,000 | 10,000 + 1,000 × 5,000 = 5,010,000 |
| 102 blocks | 101 | 100 | 10,000 + 100 × 5,000 = 510,000 |
| 1,001 blocks | 1,000 | 10 | 10,000 + 10 × 5,000 = 60,000 |
| 10,001 blocks (R fits) | 10,000 | 1 | 10,000 + 1 × 5,000 = 15,000 |
When the outer relation fits entirely in the buffer (M ≥ b_R + 1), the inner relation is scanned exactly once. This achieves the optimal cost of b_R + b_S—a single pass over each relation. This is why query optimizers position the smaller relation as outer when memory permits.
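The memory sweep in the table above can be reproduced with the cost formula (illustrative sketch):

```python
import math

def cost_bnlj(b_R: int, b_S: int, M: int) -> int:
    """I/O cost: read R once, scan S once per outer chunk of M-1 blocks."""
    return b_R + math.ceil(b_R / (M - 1)) * b_S

b_R, b_S = 10_000, 5_000
for M in (3, 11, 102, 1_001, 10_001):
    print(M, cost_bnlj(b_R, b_S, M))
# Once M - 1 >= b_R, S is scanned exactly once: cost = b_R + b_S = 15,000
```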
Asymptotic Analysis:
The key insight: BNLJ converts a problem that scales with tuples into one that scales with blocks, and then further divides by available memory. Each doubling of memory approximately halves the cost.
Effective buffer management is crucial for BNLJ performance. Several strategies optimize memory usage beyond the basic algorithm:
1. Outer Buffer Maximization:
Allocate as much memory as possible to the outer buffer. The relationship is clear from the cost formula—larger outer buffer = fewer inner scans. In multi-query environments, this means the join operator must negotiate for buffer pool space.
```
// Buffer allocation strategy for BNLJ
function allocate_buffers(available_memory, b_R, b_S):
    // Strategy 1: Fit outer relation entirely
    if available_memory > b_R + 1:
        outer_buffer ← b_R
        inner_buffer ← 1
        // Result: Just 1 scan of S

    // Strategy 2: Fit inner relation entirely (alternative)
    else if available_memory > b_S + 1:
        // Better to swap: S becomes outer, R becomes inner
        return swap_and_allocate(available_memory, b_S, b_R)

    // Strategy 3: Neither fits; maximize outer buffer
    else:
        outer_buffer ← available_memory - 1
        inner_buffer ← 1

    return (outer_buffer, inner_buffer)
```

2. Relation Ordering Decision:
The cost is asymmetric with respect to which relation is outer vs. inner:
Cost(R outer) = b_R + ⌈b_R/(M-1)⌉ × b_S
Cost(S outer) = b_S + ⌈b_S/(M-1)⌉ × b_R
When neither fits in memory, the smaller relation should be outer. This minimizes the number of scans of the larger relation.
| Configuration | R (1,000 blocks) as Outer | R as Inner (S = 5,000 as Outer) |
|---|---|---|
| Outer scans | 1 | 1 |
| Inner scans | ⌈1000/101⌉ = 10 | ⌈5000/101⌉ = 50 |
| Total I/O | 1,000 + 10×5,000 = 51,000 | 5,000 + 50×1,000 = 55,000 |
| Winner | ✓ (~7% fewer I/Os) | — |
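The ordering decision reduces to evaluating the cost formula both ways (sketch; names are illustrative):

```python
import math

def cost_bnlj(b_outer: int, b_inner: int, M: int) -> int:
    """Read the outer once; scan the inner once per (M-1)-block outer chunk."""
    return b_outer + math.ceil(b_outer / (M - 1)) * b_inner

b_R, b_S, M = 1_000, 5_000, 102
r_outer = cost_bnlj(b_R, b_S, M)   # 1,000 + 10 * 5,000 = 51,000
s_outer = cost_bnlj(b_S, b_R, M)   # 5,000 + 50 * 1,000 = 55,000
print(min((r_outer, "R outer"), (s_outer, "S outer")))  # (51000, 'R outer')
```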
3. Double Buffering:
Overlap I/O with computation by using two buffers for each role—while processing one buffer's contents, the next buffer is loaded asynchronously:
```
// Double buffering for BNLJ
function BNLJ_double_buffered(R, S, θ, M):
    // Allocate: (M-2)/2 blocks for each outer buffer, 2 blocks for inner
    outer_buf_A, outer_buf_B ← allocate_blocks((M-2)/2)
    inner_buf_A, inner_buf_B ← allocate_blocks(1), allocate_blocks(1)

    // Initial async load
    async_load(outer_buf_A, next_outer_chunk())

    for each outer_chunk in R:
        // Wait for current buffer, start loading next
        wait(outer_buf_A)
        start_async_load(outer_buf_B, next_outer_chunk())

        // Process inner relation against outer_buf_A
        async_load(inner_buf_A, first_inner_block())
        for each s_block in S:
            async_load(inner_buf_B, next_inner_block())  // Prefetch next block
            wait(inner_buf_A)
            process_matches(outer_buf_A, inner_buf_A, θ)
            swap(inner_buf_A, inner_buf_B)

        // Swap buffers for next iteration
        swap(outer_buf_A, outer_buf_B)
```

In production systems, the join operator doesn't directly manage disk I/O. Instead, it requests pages from the buffer pool, which handles caching, eviction, and prefetching. The operator's 'buffer allocation' is really a memory grant from the query executor, determining how many pages it can pin simultaneously.
While BNLJ dramatically reduces I/O, the in-memory matching phase still performs O(buffer_size × inner_block_size) comparisons per inner block. For large buffers, this CPU cost becomes significant. Several optimizations address this:
1. In-Memory Hash Index on Outer Buffer:
For equality joins, build a temporary hash table on the outer buffer's join key. When processing inner tuples, probe the hash table instead of scanning the entire buffer:
```
// BNLJ with in-memory hash optimization (for equality joins)
function HashOptimizedBNLJ(R, S, join_key, M):
    result ← ∅

    for each outer_chunk in partition(R, M - 2):
        // Build hash table on outer chunk
        hash_table ← new HashMap()
        for each r in outer_chunk:
            key ← r[join_key]
            hash_table.append(key, r)   // Handle duplicates

        // Probe with inner relation
        for each s_block in S:
            for each s in s_block:
                key ← s[join_key]
                matching_r_tuples ← hash_table.get(key)
                for each r in matching_r_tuples:
                    result ← result ∪ {r ⊕ s}

    return result
```

This transforms the matching phase from O(n × m) to O(n + m) for each outer block, a substantial improvement.
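A runnable Python sketch of this idea, using an in-memory dict per outer chunk (all names and the tiny sample relations are illustrative):

```python
from collections import defaultdict
from itertools import islice

def chunks(relation, size):
    """Yield fixed-size chunks of tuples, standing in for buffered blocks."""
    it = iter(relation)
    while chunk := list(islice(it, size)):
        yield chunk

def hash_bnlj(R, S, key_R, key_S, chunk_tuples):
    """Equality BNLJ: hash each outer chunk on its join key, then probe
    with every inner tuple instead of scanning the whole buffer."""
    result = []
    for outer_chunk in chunks(R, chunk_tuples):
        table = defaultdict(list)              # join key -> outer tuples
        for r in outer_chunk:
            table[r[key_R]].append(r)          # duplicates kept in the list
        for s in S:                            # one pass over the inner relation
            for r in table.get(s[key_S], ()):  # O(1) probe per inner tuple
                result.append(r + s)
    return result

customers = [(1, 'Ada'), (2, 'Bob')]
orders = [(1, 'pen'), (2, 'ink'), (1, 'pad')]
joined = hash_bnlj(customers, orders, 0, 0, chunk_tuples=1)
```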
2. Bloom Filter Pre-Check:
Build a Bloom filter on the outer buffer's join keys. Before probing the hash table, check the Bloom filter—if the inner key is definitely not present, skip the hash probe:
```
// BNLJ with Bloom filter optimization
function BloomOptimizedBNLJ(R, S, join_key, M):
    for each outer_chunk in partition(R, M - 3):
        hash_table ← build_hash_table(outer_chunk, join_key)
        bloom_filter ← build_bloom_filter(outer_chunk, join_key)

        for each s in S:
            key ← s[join_key]

            // Fast path: Bloom filter says "definitely not present"
            if not bloom_filter.may_contain(key):
                continue

            // Slow path: might be present, check hash table
            matching_r_tuples ← hash_table.get(key)
            emit_matches(matching_r_tuples, s)
```

3. Short-Circuit Predicate Evaluation:

For compound predicates such as R.a = S.a AND expensive_function(R.b, S.b), check equality before calling the function.

For SSDs and NVMe storage where I/O is fast, or when data is cached in memory, the in-memory matching cost becomes the dominant factor. Modern query optimizers consider both I/O and CPU costs when choosing join algorithms.
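The Bloom pre-check relies on a filter with one-sided error: `may_contain` can return a false positive but never a false negative. A minimal, illustrative Python implementation (the bit count and hash scheme are arbitrary choices):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per key, derived from
    blake2b with different salts. No false negatives; rare false positives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for k in (10, 20, 30):                   # join keys of the outer chunk
    bf.add(k)
```

Probing `bf.may_contain(key)` before the hash table lookup lets most non-matching inner tuples be rejected with a few bit tests.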
BNLJ naturally supports outer join semantics with minor modifications. The key challenge is tracking which tuples have found matches.
Left Outer Join:
Track which outer tuples in the current buffer have matched. After scanning the entire inner relation, emit unmatched outer tuples with NULL padding:
```
// Block Nested Loop Left Outer Join
function BNLJ_LeftOuter(R, S, θ, M):
    result ← ∅

    for each outer_chunk in partition(R, M - 1):
        // Track matches for each tuple in the chunk
        matched ← new BitSet(size(outer_chunk))

        for each s_block in S:
            for each s in s_block:
                for i, r in enumerate(outer_chunk):
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}
                        matched.set(i)

        // Emit unmatched outer tuples after full inner scan
        for i, r in enumerate(outer_chunk):
            if not matched.get(i):
                result ← result ∪ {r ⊕ NULL_S}

    return result
```

Right Outer Join:
More complex—we must track which inner tuples have been matched across all outer blocks. One approach: maintain a global bit vector for the inner relation:
```
// Block Nested Loop Right Outer Join
function BNLJ_RightOuter(R, S, θ, M):
    result ← ∅

    // Global match tracking for inner relation
    inner_matched ← new BitSet(|S|)

    for each outer_chunk in partition(R, M - 1):
        inner_tuple_index ← 0
        for each s_block in S:
            for each s in s_block:
                for each r in outer_chunk:
                    if θ(r, s):
                        result ← result ∪ {r ⊕ s}
                        inner_matched.set(inner_tuple_index)
                inner_tuple_index += 1   // Advance once per inner tuple

    // Final pass: emit unmatched inner tuples
    inner_tuple_index ← 0
    for each s in S:
        if not inner_matched.get(inner_tuple_index):
            result ← result ∪ {NULL_R ⊕ s}
        inner_tuple_index += 1

    return result
```

Right outer and full outer joins require a bit vector proportional to the inner relation's tuple count. For very large inner relations (billions of rows), this overhead becomes significant—potentially requiring on-disk tracking structures or alternative algorithms.
Full Outer Join:
Combines both techniques—track matches for outer (per block) and inner (globally). After processing, emit unmatched from both sides.
Optimization for Right Outer:
An alternative approach reverses the roles: swap R and S, perform a left outer join with S as outer, then swap the attributes back. This works when swapping is semantically equivalent (which it is for outer joins).
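The left outer variant's match tracking can be sketched in runnable Python (simplified tuple-count chunking; all names are illustrative):

```python
from itertools import islice

def chunks(relation, size):
    """Yield fixed-size chunks of tuples, standing in for buffered blocks."""
    it = iter(relation)
    while chunk := list(islice(it, size)):
        yield chunk

NULL = (None, None)   # NULL padding matching the inner schema (illustrative)

def bnlj_left_outer(R, S, theta, chunk_tuples):
    """Left outer BNLJ: a per-chunk bitmap records which outer tuples matched;
    unmatched ones are emitted with NULL padding after the full inner scan."""
    result = []
    for chunk in chunks(R, chunk_tuples):
        matched = [False] * len(chunk)
        for s in S:
            for i, r in enumerate(chunk):
                if theta(r, s):
                    result.append(r + s)
                    matched[i] = True
        for i, r in enumerate(chunk):
            if not matched[i]:
                result.append(r + NULL)
    return result

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'x'), (2, 'y')]
rows = bnlj_left_outer(R, S, lambda r, s: r[0] == s[0], chunk_tuples=2)
```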
Understanding BNLJ's performance envelope helps identify when it's the right choice and how to tune it:
I/O Patterns:
BNLJ's reads are almost entirely sequential: the outer relation is read once from start to finish, and the inner relation is read sequentially on each scan. Sequential access benefits from OS and storage read-ahead, so each block read is typically cheaper than a random I/O.
Sensitivity Analysis:
| Parameter | Impact on Cost | Practical Implication |
|---|---|---|
| Outer size (b_R) | Linear: cost ~ b_R × (b_S/M) | Smaller outer is better |
| Inner size (b_S) | Linear: cost ~ b_S × (b_R/M) | Inner size directly affects cost |
| Memory (M) | Inverse: cost ~ 1/M | Doubling memory halves cost |
| Selectivity | No direct impact | All tuples scanned regardless |
| Join key skew | No impact on I/O | May affect hash-optimized variants |
Comparison with Alternatives:
| Algorithm | Best Case Cost | Memory Requirement | Join Predicate Support |
|---|---|---|---|
| Block NLJ | b_R + b_S (R fits) | 2+ blocks | Any predicate |
| Hash Join | ~3 × (b_R + b_S) | min(b_R, b_S) blocks | Equality only |
| Sort-Merge | b_R + b_S + sort costs | ~√(b_R) | Equality only |
| Index NLJ | b_R + n_R × (tree height) | ~2 blocks | Indexed attributes |
BNLJ beats hash join when the outer relation fits (or nearly fits) in memory, because it avoids the partition/build overhead. Query optimizers calculate the expected costs of each algorithm and choose the cheapest. You can often influence this by adjusting work_mem (PostgreSQL) or similar memory parameters.
Production database systems implement BNLJ with additional sophistication:
PostgreSQL's Nested Loop:
PostgreSQL implements nested loop joins with optional materialization of the inner relation. If the inner is a subquery result or the optimizer expects multiple scans, it materializes to a temp file:
```sql
-- PostgreSQL EXPLAIN showing nested loop with materialization
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.foreign_id;

-- Output might show:
-- Nested Loop  (cost=0.00..1234.56 rows=1000)
--   ->  Seq Scan on large_table l
--   ->  Materialize  (cost=0.00..12.50 rows=100)
--         ->  Seq Scan on small_table s
```

MySQL's Block Nested Loop:
MySQL explicitly implements BNLJ using the join_buffer_size parameter. Each join in a multi-table query can use up to this amount of memory:
```sql
-- MySQL configuration for BNLJ
SET join_buffer_size = 256 * 1024;  -- 256KB per join

-- Check execution plan
EXPLAIN FORMAT=JSON
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'EMEA';

-- Look for "using join buffer (Block Nested Loop)" in the Extra field
```

Modern query executors can adapt at runtime. If actual cardinalities differ from estimates, the executor may switch strategies mid-query—for example, falling back to BNLJ if a hash table grows too large to fit in the allocated memory.
We've examined the block nested loop join—a crucial optimization over the simple nested loop that makes nested loop joins practical for medium-sized relations. Let's consolidate the key points:
What's Next:
Even BNLJ scans the inner relation multiple times. The next page introduces the index nested loop join, which eliminates inner scans entirely by using an index to directly locate matching tuples. When a suitable index exists, this can reduce join cost from O(b_R × b_S / M) to O(b_R + n_R × log(n_S))—often orders of magnitude faster.
You now understand the block nested loop join—how it reduces I/O through blocking, its cost model, buffer management strategies, and real-world implementation. Next, we'll explore how indexes can further accelerate nested loop joins.