The hash join algorithm stands as one of the most important advances in relational database implementation. Unlike nested-loop joins that require O(n × m) comparisons, or sort-merge joins that demand complete sorting of both inputs, hash join exploits the power of hashing to achieve near-linear join performance under favorable conditions.
At the heart of every hash join lies a deceptively simple insight: if we can build a hash table from one relation and probe it with tuples from the other, we reduce join complexity from quadratic to linear. The build phase is where this magic begins—it transforms raw tuples into a searchable, constant-time-access structure that makes the entire algorithm possible.
By the end of this page, you will understand how the build phase constructs hash tables from the smaller relation, why relation selection matters critically for performance, how hash functions and bucket organization work, and what happens when memory constraints challenge the basic algorithm. This knowledge is essential for understanding query optimizer decisions and diagnosing join performance issues.
The build phase is the first of two phases in any hash join algorithm. Its purpose is elegantly simple: read tuples from one relation (called the build relation) and construct an in-memory hash table indexed by the join attribute(s).
Consider a join between tables Orders (10 million rows) and Customers (100,000 rows) on customer_id. The build phase would:

1. Select Customers as the build relation, since it is by far the smaller input.
2. Scan Customers once, extracting customer_id from each tuple.
3. Hash each customer_id and insert the tuple into the corresponding bucket of an in-memory hash table.
Once complete, we have a structure that can answer the question "Does any customer have ID = X?" in O(1) average time—a dramatic improvement over scanning through all customers repeatedly.
The terminology is intentional and meaningful. We 'build' the hash table once, then 'probe' it many times. The build relation contributes structure; the probe relation contributes lookups. This asymmetry is fundamental to hash join's efficiency—we invest upfront work in building to save repeated work during probing.
```sql
-- Conceptual representation of the build phase
-- For a join: SELECT * FROM Orders O JOIN Customers C ON O.customer_id = C.customer_id

-- STEP 1: The optimizer identifies Customers as the smaller relation
-- Customers: 100,000 rows × 200 bytes = ~20 MB
-- Orders: 10,000,000 rows × 150 bytes = ~1.5 GB

-- STEP 2: Build phase processes Customers
-- Pseudocode:
--   hash_table = new HashMap()
--
--   FOR each tuple t in Customers:
--       key = t.customer_id
--       bucket = hash(key) mod num_buckets
--       hash_table[bucket].append(t)
--
--   RETURN hash_table

-- STEP 3: The resulting hash table structure
-- Bucket 0: [Customer(id=1000, name='Acme Corp', ...), Customer(id=5000, ...)]
-- Bucket 1: [Customer(id=1001, name='Beta Inc', ...)]
-- Bucket 2: []  -- empty bucket (sparse is normal)
-- Bucket 3: [Customer(id=1003, ...), Customer(id=7003, ...), ...]
-- ...
-- Bucket N-1: [...]
```

Choosing which relation to build from is perhaps the most consequential decision in hash join execution. The choice directly determines memory requirements, I/O patterns, and whether the join can execute efficiently or must fall back to more expensive strategies.
The fundamental rule is clear: always build from the smaller relation. But 'smaller' requires careful definition—we mean the relation that produces fewer bytes after applying any preceding operators (selections, projections).
| Scenario | Build from Small | Build from Large | Performance Impact |
|---|---|---|---|
| Memory usage | Hash table fits in memory | Hash table exceeds memory | 10-100× difference |
| I/O pattern | Single pass over build relation | Requires partitioning/spilling | 5-50× more I/O |
| CPU efficiency | Cache-friendly hash table | Cache thrashing, TLB misses | 2-10× slower |
| Probe efficiency | Fast constant-time lookups | Same (if in memory) | Similar |
| Overall cost | O(|R| + |S|) I/O | O(|R| + |S|) × partition factor | Massive difference |
The mathematics of selection:
Suppose we join relations R (|R| = 1,000,000 tuples, 100 bytes/tuple) and S (|S| = 10,000 tuples, 200 bytes/tuple):

- Size of R: 1,000,000 × 100 bytes = 100 MB
- Size of S: 10,000 × 200 bytes = 2 MB
If available memory is 50 MB, building from R requires partitioning and multiple passes, while building from S completes in a single, efficient pass. The 50× size difference translates to dramatically different execution strategies.
Optimizers estimate relation sizes using statistics (histograms, cardinality estimates). When estimates are wrong—which happens frequently with complex predicates, correlated columns, or stale statistics—the optimizer may choose the wrong build relation. This is a common cause of query performance regression and why database professionals monitor join operation metrics carefully.
The hash function used during the build phase is critical to performance. A well-designed hash function distributes tuples uniformly across buckets, minimizing collisions and ensuring O(1) average-case lookup during the probe phase. A poor hash function creates 'hot' buckets that degrade performance to O(n).
Key properties of effective join hash functions:

- Uniformity: keys spread evenly across buckets, even for clustered or sequential key values
- Speed: a few nanoseconds per key, since the function runs once per build tuple and once per probe tuple
- Determinism: the same key always maps to the same bucket
- Avalanche behavior: small changes in the key flip many output bits, preventing clustering
```cpp
// Common hash functions used in database systems

// 1. Multiplicative Hashing (simple, fast)
// Based on Knuth's suggestion using the golden ratio
uint64_t multiplicative_hash(uint64_t key, int log2_num_buckets) {
    const uint64_t A = 11400714819323198485ULL;  // 2^64 / φ
    return (key * A) >> (64 - log2_num_buckets);
}

// 2. MurmurHash3 (excellent distribution, widely used)
// PostgreSQL, MySQL, and many systems use variants
uint32_t murmur3_hash(const void* key, int len, uint32_t seed) {
    // Processes input in 4-byte chunks with mixing
    // Provides excellent avalanche properties
    // ~10-15 cycles for typical keys
    // ... implementation details omitted for brevity
}

// 3. CRC32 Hardware-Accelerated (Intel CRC32C instruction)
// Modern CPUs have dedicated instructions making this very fast
uint32_t crc32c_hash(uint64_t key) {
    // Uses the _mm_crc32_u64 intrinsic on x86 (SSE4.2);
    // sustains roughly one result per cycle on modern CPUs
    return (uint32_t)_mm_crc32_u64(0, key);
}

// 4. Composite Key Hashing
// For joins on multiple columns (composite keys)
uint64_t composite_hash(const Tuple& t, const std::vector<int>& key_columns) {
    uint64_t hash = 0;
    for (int col : key_columns) {
        // Combine per-column hashes with multiply-and-shift mixing
        hash ^= hash_column(t, col);
        hash = hash * 31 + (hash >> 17);  // mixing step
    }
    return hash;
}
```

Modern database systems heavily exploit CPU-specific hash instructions. Intel's CRC32C instruction, ARM's CRC32 intrinsics, and SIMD-based hashing (using AVX2/AVX-512) can compute hashes several times faster than portable software implementations. This hardware acceleration is one reason hash join performance has improved dramatically on modern hardware.
The hash table constructed during the build phase must support two operations efficiently:

1. Insertion during the build phase: adding each build tuple under its join-key hash, including multiple tuples with the same key.
2. Lookup during the probe phase: given a probe key, returning all build tuples whose key matches.
Database systems use several hash table designs, each with different tradeoffs for memory efficiency, cache behavior, and collision handling:
```cpp
// Chained Hash Table Implementation
struct ChainedHashTable {
    struct Entry {
        Tuple tuple;
        Entry* next;
    };

    Entry** buckets;     // Array of bucket heads
    size_t num_buckets;
    size_t num_entries;

    void insert(const Tuple& t, uint64_t key) {
        size_t bucket = hash(key) % num_buckets;
        Entry* entry = new Entry{t, buckets[bucket]};
        buckets[bucket] = entry;  // Prepend to chain
        num_entries++;
    }

    vector<Tuple*> find(uint64_t key) {
        vector<Tuple*> results;
        size_t bucket = hash(key) % num_buckets;
        Entry* curr = buckets[bucket];
        while (curr) {
            if (get_key(curr->tuple) == key) {
                results.push_back(&curr->tuple);
            }
            curr = curr->next;
        }
        return results;
    }
};

// Linear Probing Hash Table Implementation
struct LinearProbingHashTable {
    struct Slot {
        bool occupied;
        uint64_t key;
        Tuple tuple;
    };

    Slot* slots;
    size_t capacity;
    size_t size;

    void insert(const Tuple& t, uint64_t key) {
        size_t idx = hash(key) % capacity;
        while (slots[idx].occupied) {
            idx = (idx + 1) % capacity;  // Linear probe
        }
        slots[idx] = {true, key, t};
        size++;
    }

    // Note: More complex for duplicate keys
    Tuple* find(uint64_t key) {
        size_t idx = hash(key) % capacity;
        while (slots[idx].occupied) {
            if (slots[idx].key == key) {
                return &slots[idx].tuple;
            }
            idx = (idx + 1) % capacity;
        }
        return nullptr;  // Not found
    }
};
```

Memory layout considerations:
Modern systems optimize hash table layouts for cache efficiency. A typical CPU cache line is 64 bytes—if we can fit bucket headers or multiple hash entries within a single cache line, we dramatically reduce memory access latency.
Bucket sizing strategy:
The number of buckets is typically chosen as:

num_buckets ≈ num_tuples / target_chain_length

rounded up to a power of two (making the modulo a cheap bit-mask) or to a prime (helping break up regular key patterns). For example, building from 100,000 tuples with target chain length 2:

- Ideal bucket count: 100,000 / 2 = 50,000
- Rounded up to a power of two: 65,536 (2^16)
Memory management during the build phase is fundamental to hash join performance. The ideal case—where the entire build relation fits in memory—enables the simplest and fastest execution. Real systems must handle the full spectrum from tiny tables to tables exceeding available RAM by orders of magnitude.
| Scenario | Memory Status | System Response | Performance |
|---|---|---|---|
| Tiny build | < 10% of buffer pool | Simple in-memory hash table | Optimal: single pass |
| Comfortable fit | 50-70% of buffer pool | In-memory with headroom for probing | Optimal: single pass |
| Tight fit | 80-95% of buffer pool | In-memory but monitor carefully | Good: single pass, some pressure |
| Slight overflow | 100-200% of buffer pool | Partition with few passes | Degraded: 2-3 passes |
| Massive overflow | 500% of buffer pool | Recursive partitioning | Severely degraded: many passes |
Memory components during build:

- Bucket array: one pointer per bucket
- Tuple storage: the tuples themselves, plus a chain pointer per tuple
- Cached hash values: many systems store each tuple's hash to speed probe-phase comparisons
- Alignment and allocator overhead

Total memory estimation:

```
memory_required = num_buckets × 8               // Bucket array
                + num_tuples × (tuple_size + 8)  // Tuples + chain pointers
                + num_tuples × 8                 // Stored hash values (if used)
                + alignment_overhead             // ~5-10% typical
```
Production database systems assess memory requirements before beginning the build phase. They use cardinality estimates and average tuple sizes to predict hash table size. If predicted size exceeds available memory, the system switches to a partitioned (Grace or Hybrid) strategy from the start—avoiding the costly scenario of building partially, running out of memory, and having to restart with a different strategy.
```cpp
// Memory estimation before build phase
struct BuildMemoryEstimate {
    size_t estimated_tuples;
    size_t avg_tuple_bytes;
    size_t join_key_bytes;

    size_t estimate_hash_table_size() {
        // Bucket array: use ~2x tuples for a low load factor
        size_t num_buckets = next_prime(estimated_tuples * 2);
        size_t bucket_array = num_buckets * sizeof(void*);

        // Tuple storage: tuple data + next pointer + cached hash
        size_t tuple_storage = estimated_tuples *
            (avg_tuple_bytes + sizeof(void*) + sizeof(uint64_t));

        // Add 10% for fragmentation and alignment
        size_t total = (bucket_array + tuple_storage) * 1.1;
        return total;
    }

    bool fits_in_memory(size_t available_memory) {
        size_t required = estimate_hash_table_size();
        // Reserve 20% for probe-phase working memory
        return required < available_memory * 0.8;
    }

    int estimate_partitions_needed(size_t available_memory) {
        size_t required = estimate_hash_table_size();
        size_t usable = available_memory * 0.8;
        // Each partition should fit comfortably; add one for safety margin
        return (required / usable) + 2;
    }
};
```

Let's trace through the complete build phase execution, understanding each step in detail:
Phase 1: Initialization
Before reading any tuples, the system prepares the hash table infrastructure:
```cpp
// Build phase initialization
class HashJoinBuildOperator {
    HashTable* hash_table;
    vector<int> join_key_columns;
    size_t tuples_processed;

    void initialize(const PlanNode& build_input, size_t available_memory) {
        // 1. Get cardinality estimate from optimizer
        size_t estimated_rows = build_input.estimated_cardinality();
        size_t tuple_width = build_input.output_schema().row_width();

        // 2. Determine hash table sizing
        size_t memory_needed = estimate_memory(estimated_rows, tuple_width);
        if (memory_needed > available_memory) {
            // Will need partitioning - handled by Hybrid/Grace strategies
            throw NeedPartitioningException(memory_needed, available_memory);
        }

        // 3. Allocate hash table with appropriate size
        size_t num_buckets = compute_bucket_count(estimated_rows);
        hash_table = new HashTable(num_buckets);

        // 4. Identify join key columns from query plan
        join_key_columns = build_input.join_keys();
        tuples_processed = 0;
    }
};
```

Phase 2: Tuple Processing
The core loop reads tuples from the build relation and inserts them into the hash table:
```cpp
void process_build_relation(Operator* build_child) {
    Tuple tuple;

    // Request tuples from child operator (may be scan, filter, prior join, etc.)
    while (build_child->next(&tuple)) {
        // 1. Extract join key value(s)
        JoinKey key = extract_key(tuple, join_key_columns);

        // 2. Compute hash value
        uint64_t hash_value = compute_hash(key);

        // 3. Insert into hash table
        // Many implementations store the hash value alongside the tuple
        // to avoid recomputing during probe phase comparisons
        hash_table->insert(hash_value, key, tuple);
        tuples_processed++;

        // 4. Optional: Check memory pressure periodically
        if (tuples_processed % 10000 == 0) {
            if (memory_pressure_detected()) {
                // Some systems can switch to hybrid approach mid-build
                trigger_partial_partition_flush();
            }
        }
    }

    // Build phase complete
    log_build_complete(tuples_processed, hash_table->memory_used());
}
```

Modern analytical databases process tuples in batches (vectors) of 1,000-10,000 tuples rather than one at a time. Vectorized hash computation and insertion exploit CPU pipelining, SIMD instructions, and cache prefetching for 2-5× better performance than tuple-at-a-time processing. The fundamental algorithm remains the same, but the implementation is heavily optimized.
Phase 3: Finalization
After all tuples are inserted, the hash table may undergo final optimizations (systems vary; these are common steps):

- Building a Bloom filter over the build keys so that probe tuples with no possible match can be rejected cheaply
- Compacting or reordering entries to improve cache locality during probing
- Recording actual tuple counts and memory use, both for logging and to validate the optimizer's estimates
NULL values in join keys require special consideration during the build phase. The SQL standard specifies that NULL ≠ NULL in join comparisons, which affects how we handle NULL-containing tuples.
Key insight: A tuple with NULL in a join key column can never match any probe tuple (since NULL ≠ anything, including NULL). Therefore, for standard inner joins, NULL-key tuples need not be inserted into the hash table at all.
```cpp
// NULL handling during build phase
void insert_tuple(const Tuple& tuple, JoinType join_type) {
    JoinKey key = extract_key(tuple, join_key_columns);

    // Check for NULL in any key column
    if (key.has_null()) {
        if (join_type == JoinType::INNER) {
            // Skip: NULL can never match
            null_tuples_skipped++;
            return;
        } else if (join_type == JoinType::LEFT_OUTER ||
                   join_type == JoinType::FULL_OUTER) {
            // Must preserve for potential non-matching output
            null_key_tuples.push_back(tuple);
            return;
        }
        // RIGHT_OUTER: build side doesn't need NULL tracking
    }

    // Normal insertion for non-NULL keys
    uint64_t hash_value = compute_hash(key);
    hash_table->insert(hash_value, key, tuple);
}

// For FULL OUTER JOIN, we also need to track which build tuples matched
struct BuildTupleEntry {
    Tuple tuple;
    bool matched;  // Set during probe phase
};
```

For composite join keys (e.g., JOIN ON a.x = b.x AND a.y = b.y), a NULL in ANY key column prevents matching. However, some systems support non-standard semantics where NULLs match NULLs. The IS NOT DISTINCT FROM operator (or <=> in MySQL) provides this behavior and requires different NULL handling during build.
Understanding the cost of the build phase is essential for query optimization and performance analysis. The cost model considers both I/O and CPU components:
| Component | Cost Formula | Typical Values | Notes |
|---|---|---|---|
| Read build relation | ⌈|R| / page_size⌉ page I/Os | Sequential read | ~100 MB/s HDD, ~500 MB/s SATA SSD |
| Hash computation | |R| × hash_cost | ~10-50 ns/tuple | Hardware-accelerated on modern CPUs |
| Hash table insertion | |R| × insert_cost | ~50-200 ns/tuple | Depends on hash table design |
| Memory allocation | Variable | Amortized ~10 ns/tuple | Pre-allocation reduces overhead |
| Key extraction | |R| × extract_cost | ~20-100 ns/tuple | More for wide/variable columns |
I/O Cost Dominance:
For disk-resident data, I/O cost typically dominates. Reading a 1 GB build relation from a spinning disk takes on the order of ten seconds or more, while CPU processing of the same data completes in a few seconds. With SSDs the gap narrows, but I/O often remains the bottleneck for cold data.
CPU Cost Visibility:
For cached data or very small relations, CPU cost becomes visible:

- Key extraction: ~20-100 ns per tuple
- Hash computation: ~10-50 ns per tuple
- Hash table insertion: ~50-200 ns per tuple

Together these put per-tuple CPU cost on the order of 100-300 ns, which becomes the bottleneck once I/O is off the critical path.
Total Build Phase Cost (in-memory case):
```
Build_cost = (|R| / page_size) × I/O_latency   // Read relation
           + |R| × (hash_time + insert_time)   // CPU processing
           + allocation_overhead               // Memory setup
```
For example, building from 100,000 tuples (50 MB):

- I/O: 50 MB read sequentially at ~500 MB/s (SSD) ≈ 100 ms
- CPU: 100,000 tuples × ~300 ns ≈ 30 ms
- Total: roughly 130 ms, well under a second
This is why hash join is fast—the build phase is typically a single sequential scan plus linear-time hash table construction.
Build phase cost can explode when: (1) the build relation doesn't fit in memory, requiring partitioning and multiple passes; (2) extreme data skew causes certain hash buckets to become very long chains; (3) poor cardinality estimates lead to hash table resize operations mid-build. Monitoring actual vs. estimated build sizes helps identify these issues.
The build phase establishes the foundation for efficient hash join execution. Let's consolidate the essential concepts:

- Always build from the smaller input, measured in bytes after selections and projections, not in raw row counts
- A uniform, fast hash function keeps buckets balanced and lookups O(1) on average
- Hash table design (chaining vs. open addressing, cache-line-aware layouts) drives CPU efficiency
- Estimate memory before building; switch to a partitioned strategy up front if the table won't fit
- NULL join keys can be skipped for inner joins but must be preserved for outer joins
- In-memory build cost is essentially one sequential scan plus linear-time insertion
What's Next:
With the hash table constructed, we're ready for the probe phase—where the larger relation is scanned and matched against our carefully built structure. The probe phase is where the investment in build phase optimization pays dividends, enabling constant-time lookups for each probe tuple.
You now understand the build phase of hash join algorithms: relation selection, hash function design, hash table structures, memory management, NULL handling, and cost analysis. This foundation is essential for understanding why hash join is the dominant join algorithm in modern database systems.