The hash join algorithm stands as one of the most important advances in relational database implementation. Unlike nested-loop joins that require O(n × m) comparisons, or sort-merge joins that demand complete sorting of both inputs, hash join exploits the power of hashing to achieve near-linear join performance under favorable conditions.
At the heart of every hash join lies a deceptively simple insight: if we can build a hash table from one relation and probe it with tuples from the other, we reduce join complexity from quadratic to linear. The build phase is where this magic begins—it transforms raw tuples into a searchable, constant-time-access structure that makes the entire algorithm possible.
By the end of this page, you will understand how the build phase constructs hash tables from the smaller relation, why relation selection matters critically for performance, how hash functions and bucket organization work, and what happens when memory constraints challenge the basic algorithm. This knowledge is essential for understanding query optimizer decisions and diagnosing join performance issues.
The build phase is the first of two phases in any hash join algorithm. Its purpose is elegantly simple: read tuples from one relation (called the build relation) and construct an in-memory hash table indexed by the join attribute(s).
Consider a join between tables Orders (10 million rows) and Customers (100,000 rows) on customer_id. The build phase would:

1. Select Customers as the build relation, since it is by far the smaller input.
2. Scan Customers once, extracting customer_id from each tuple.
3. Hash each customer_id and insert the tuple into the corresponding bucket of an in-memory hash table.
Once complete, we have a structure that can answer the question "Does any customer have ID = X?" in O(1) average time—a dramatic improvement over scanning through all customers repeatedly.
The terminology is intentional and meaningful. We 'build' the hash table once, then 'probe' it many times. The build relation contributes structure; the probe relation contributes lookups. This asymmetry is fundamental to hash join's efficiency—we invest upfront work in building to save repeated work during probing.
```sql
-- Conceptual representation of the build phase
-- For a join: SELECT * FROM Orders O JOIN Customers C ON O.customer_id = C.customer_id

-- STEP 1: The optimizer identifies Customers as the smaller relation
-- Customers: 100,000 rows × 200 bytes = ~20 MB
-- Orders: 10,000,000 rows × 150 bytes = ~1.5 GB

-- STEP 2: Build phase processes Customers
-- Pseudocode:
--   hash_table = new HashMap()
--
--   FOR each tuple t in Customers:
--       key = t.customer_id
--       bucket = hash(key) mod num_buckets
--       hash_table[bucket].append(t)
--
--   RETURN hash_table

-- STEP 3: The resulting hash table structure
-- Bucket 0: [Customer(id=1000, name='Acme Corp', ...), Customer(id=5000, ...)]
-- Bucket 1: [Customer(id=1001, name='Beta Inc', ...)]
-- Bucket 2: []  -- empty bucket (sparse is normal)
-- Bucket 3: [Customer(id=1003, ...), Customer(id=7003, ...), ...]
-- ...
-- Bucket N-1: [...]
```

Choosing which relation to build from is perhaps the most consequential decision in hash join execution. The choice directly determines memory requirements, I/O patterns, and whether the join can execute efficiently or must fall back to more expensive strategies.
The fundamental rule is clear: always build from the smaller relation. But 'smaller' requires careful definition—we mean the relation that produces fewer bytes after applying any preceding operators (selections, projections).
| Scenario | Build from Small | Build from Large | Performance Impact |
|---|---|---|---|
| Memory usage | Hash table fits in memory | Hash table exceeds memory | 10-100× difference |
| I/O pattern | Single pass over build relation | Requires partitioning/spilling | 5-50× more I/O |
| CPU efficiency | Cache-friendly hash table | Cache thrashing, TLB misses | 2-10× slower |
| Probe efficiency | Fast constant-time lookups | Same (if in memory) | Similar |
| Overall cost | O(|R| + |S|) I/O | O(|R| + |S|) × partition factor | Massive difference |
The mathematics of selection:
Suppose we join relations R (|R| = 1,000,000 tuples, 100 bytes/tuple) and S (|S| = 10,000 tuples, 200 bytes/tuple):

- Size of R: 1,000,000 × 100 bytes = 100 MB
- Size of S: 10,000 × 200 bytes = 2 MB
If available memory is 50 MB, building from R requires partitioning and multiple passes, while building from S completes in a single, efficient pass. The 50× size difference translates to dramatically different execution strategies.
Optimizers estimate relation sizes using statistics (histograms, cardinality estimates). When estimates are wrong—which happens frequently with complex predicates, correlated columns, or stale statistics—the optimizer may choose the wrong build relation. This is a common cause of query performance regression and why database professionals monitor join operation metrics carefully.
The hash function used during the build phase is critical to performance. A well-designed hash function distributes tuples uniformly across buckets, minimizing collisions and ensuring O(1) average-case lookup during the probe phase. A poor hash function creates 'hot' buckets that degrade performance to O(n).
Key properties of effective join hash functions:

- Uniformity: keys spread evenly across buckets, even for clustered or sequential key values
- Speed: a few nanoseconds per key, since the function runs once per build tuple and once per probe tuple
- Determinism: the same key always maps to the same bucket
- Avalanche behavior: small changes in the key flip many output bits, preventing clustering
```cpp
// Common hash functions used in database systems

// 1. Multiplicative Hashing (simple, fast)
// Based on Knuth's suggestion using the golden ratio
uint64_t multiplicative_hash(uint64_t key, int log2_num_buckets) {
    const uint64_t A = 11400714819323198485ULL;  // 2^64 / φ
    return (key * A) >> (64 - log2_num_buckets);
}

// 2. MurmurHash3 (excellent distribution, widely used)
// PostgreSQL, MySQL, and many systems use variants
uint32_t murmur3_hash(const void* key, int len, uint32_t seed) {
    // Processes input in 4-byte chunks with mixing
    // Provides excellent avalanche properties
    // ~10-15 cycles for typical keys
    // ... implementation details omitted for brevity
}

// 3. CRC32 Hardware-Accelerated (Intel CRC32C instruction)
// Modern CPUs have dedicated instructions making this very fast
uint32_t crc32c_hash(uint64_t key) {
    // Uses the _mm_crc32_u64 intrinsic on x86 (SSE4.2);
    // sustains roughly one result per cycle on modern CPUs
    return (uint32_t)_mm_crc32_u64(0, key);
}

// 4. Composite Key Hashing
// For joins on multiple columns (composite keys)
uint64_t composite_hash(const Tuple& t, const std::vector<int>& key_columns) {
    uint64_t hash = 0;
    for (int col : key_columns) {
        // Combine per-column hashes with multiply-and-shift mixing
        hash ^= hash_column(t, col);
        hash = hash * 31 + (hash >> 17);  // mixing step
    }
    return hash;
}
```

Modern database systems heavily exploit CPU-specific hash instructions. Intel's CRC32C instruction, ARM's CRC32 intrinsics, and SIMD-based hashing (using AVX2/AVX-512) can compute hashes several times faster than portable software implementations. This hardware acceleration is one reason hash join performance has improved dramatically on modern hardware.
The hash table constructed during the build phase must support two operations efficiently:

1. Insertion during the build phase: adding each build tuple under its join-key hash, including multiple tuples with the same key.
2. Lookup during the probe phase: given a probe key, returning all build tuples whose key matches.
Database systems use several hash table designs, each with different tradeoffs for memory efficiency, cache behavior, and collision handling:
```cpp
// Chained Hash Table Implementation
struct ChainedHashTable {
    struct Entry {
        Tuple tuple;
        Entry* next;
    };

    Entry** buckets;     // Array of bucket heads
    size_t num_buckets;
    size_t num_entries;

    void insert(const Tuple& t, uint64_t key) {
        size_t bucket = hash(key) % num_buckets;
        Entry* entry = new Entry{t, buckets[bucket]};
        buckets[bucket] = entry;  // Prepend to chain
        num_entries++;
    }

    vector<Tuple*> find(uint64_t key) {
        vector<Tuple*> results;
        size_t bucket = hash(key) % num_buckets;
        Entry* curr = buckets[bucket];
        while (curr) {
            if (get_key(curr->tuple) == key) {
                results.push_back(&curr->tuple);
            }
            curr = curr->next;
        }
        return results;
    }
};

// Linear Probing Hash Table Implementation
struct LinearProbingHashTable {
    struct Slot {
        bool occupied;
        uint64_t key;
        Tuple tuple;
    };

    Slot* slots;
    size_t capacity;
    size_t size;

    void insert(const Tuple& t, uint64_t key) {
        size_t idx = hash(key) % capacity;
        while (slots[idx].occupied) {
            idx = (idx + 1) % capacity;  // Linear probe
        }
        slots[idx] = {true, key, t};
        size++;
    }

    // Note: More complex for duplicate keys
    Tuple* find(uint64_t key) {
        size_t idx = hash(key) % capacity;
        while (slots[idx].occupied) {
            if (slots[idx].key == key) {
                return &slots[idx].tuple;
            }
            idx = (idx + 1) % capacity;
        }
        return nullptr;  // Not found
    }
};
```

Memory layout considerations:
Modern systems optimize hash table layouts for cache efficiency. A typical CPU cache line is 64 bytes—if we can fit bucket headers or multiple hash entries within a single cache line, we dramatically reduce memory access latency.
Bucket sizing strategy:
The number of buckets is typically chosen as:

num_buckets ≈ num_tuples / target_chain_length

rounded up to a power of two (making the modulo a cheap bit-mask) or to a prime (helping break up regular key patterns). For example, building from 100,000 tuples with target chain length 2:

- Ideal bucket count: 100,000 / 2 = 50,000
- Rounded up to a power of two: 65,536 (2^16)
Memory management during the build phase is fundamental to hash join performance. The ideal case—where the entire build relation fits in memory—enables the simplest and fastest execution. Real systems must handle the full spectrum from tiny tables to tables exceeding available RAM by orders of magnitude.
| Scenario | Memory Status | System Response | Performance |
|---|---|---|---|
| Tiny build | < 10% of buffer pool | Simple in-memory hash table | Optimal: single pass |
| Comfortable fit | 50-70% of buffer pool | In-memory with headroom for probing | Optimal: single pass |
| Tight fit | 80-95% of buffer pool | In-memory but monitor carefully | Good: single pass, some pressure |
| Slight overflow | 100-200% of buffer pool | Partition with few passes | Degraded: 2-3 passes |
| Massive overflow | 500% of buffer pool | Recursive partitioning | Severely degraded: many passes |
Memory components during build:

- Bucket array: one pointer per bucket
- Tuple storage: the tuples themselves, plus a chain pointer per tuple
- Cached hash values: many systems store each tuple's hash to speed probe-phase comparisons
- Alignment and allocator overhead

Total memory estimation:

```
memory_required = num_buckets × 8               // Bucket array
                + num_tuples × (tuple_size + 8)  // Tuples + chain pointers
                + num_tuples × 8                 // Stored hash values (if used)
                + alignment_overhead             // ~5-10% typical
```
Production database systems assess memory requirements before beginning the build phase. They use cardinality estimates and average tuple sizes to predict hash table size. If predicted size exceeds available memory, the system switches to a partitioned (Grace or Hybrid) strategy from the start—avoiding the costly scenario of building partially, running out of memory, and having to restart with a different strategy.
```cpp
// Memory estimation before build phase
struct BuildMemoryEstimate {
    size_t estimated_tuples;
    size_t avg_tuple_bytes;
    size_t join_key_bytes;

    size_t estimate_hash_table_size() {
        // Bucket array: use ~2x tuples for a low load factor
        size_t num_buckets = next_prime(estimated_tuples * 2);
        size_t bucket_array = num_buckets * sizeof(void*);

        // Tuple storage: tuple data + next pointer + cached hash
        size_t tuple_storage = estimated_tuples *
            (avg_tuple_bytes + sizeof(void*) + sizeof(uint64_t));

        // Add 10% for fragmentation and alignment
        size_t total = (bucket_array + tuple_storage) * 1.1;
        return total;
    }

    bool fits_in_memory(size_t available_memory) {
        size_t required = estimate_hash_table_size();
        // Reserve 20% for probe-phase working memory
        return required < available_memory * 0.8;
    }

    int estimate_partitions_needed(size_t available_memory) {
        size_t required = estimate_hash_table_size();
        size_t usable = available_memory * 0.8;
        // Each partition should fit comfortably; add one for safety margin
        return (required / usable) + 2;
    }
};
```

Let's trace through the complete build phase execution, understanding each step in detail:
Phase 1: Initialization
Before reading any tuples, the system prepares the hash table infrastructure:
```cpp
// Build phase initialization
class HashJoinBuildOperator {
    HashTable* hash_table;
    vector<int> join_key_columns;
    size_t tuples_processed;

    void initialize(const PlanNode& build_input, size_t available_memory) {
        // 1. Get cardinality estimate from optimizer
        size_t estimated_rows = build_input.estimated_cardinality();
        size_t tuple_width = build_input.output_schema().row_width();

        // 2. Determine hash table sizing
        size_t memory_needed = estimate_memory(estimated_rows, tuple_width);
        if (memory_needed > available_memory) {
            // Will need partitioning - handled by Hybrid/Grace strategies
            throw NeedPartitioningException(memory_needed, available_memory);
        }

        // 3. Allocate hash table with appropriate size
        size_t num_buckets = compute_bucket_count(estimated_rows);
        hash_table = new HashTable(num_buckets);

        // 4. Identify join key columns from query plan
        join_key_columns = build_input.join_keys();
        tuples_processed = 0;
    }
};
```

Phase 2: Tuple Processing
The core loop reads tuples from the build relation and inserts them into the hash table:
```cpp
void process_build_relation(Operator* build_child) {
    Tuple tuple;

    // Request tuples from child operator (may be scan, filter, prior join, etc.)
    while (build_child->next(&tuple)) {
        // 1. Extract join key value(s)
        JoinKey key = extract_key(tuple, join_key_columns);

        // 2. Compute hash value
        uint64_t hash_value = compute_hash(key);

        // 3. Insert into hash table
        // Many implementations store the hash value alongside the tuple
        // to avoid recomputing during probe phase comparisons
        hash_table->insert(hash_value, key, tuple);
        tuples_processed++;

        // 4. Optional: Check memory pressure periodically
        if (tuples_processed % 10000 == 0) {
            if (memory_pressure_detected()) {
                // Some systems can switch to hybrid approach mid-build
                trigger_partial_partition_flush();
            }
        }
    }

    // Build phase complete
    log_build_complete(tuples_processed, hash_table->memory_used());
}
```

Modern analytical databases process tuples in batches (vectors) of 1,000-10,000 tuples rather than one at a time. Vectorized hash computation and insertion exploit CPU pipelining, SIMD instructions, and cache prefetching for 2-5× better performance than tuple-at-a-time processing. The fundamental algorithm remains the same, but the implementation is heavily optimized.
Phase 3: Finalization
After all tuples are inserted, the hash table may undergo final optimizations (systems vary; these are common steps):

- Building a Bloom filter over the build keys so that probe tuples with no possible match can be rejected cheaply
- Compacting or reordering entries to improve cache locality during probing
- Recording actual tuple counts and memory use, both for logging and to validate the optimizer's estimates
NULL values in join keys require special consideration during the build phase. The SQL standard specifies that NULL ≠ NULL in join comparisons, which affects how we handle NULL-containing tuples.
Key insight: A tuple with NULL in a join key column can never match any probe tuple (since NULL ≠ anything, including NULL). Therefore, for standard inner joins, NULL-key tuples need not be inserted into the hash table at all.
```cpp
// NULL handling during build phase
void insert_tuple(const Tuple& tuple, JoinType join_type) {
    JoinKey key = extract_key(tuple, join_key_columns);

    // Check for NULL in any key column
    if (key.has_null()) {
        if (join_type == JoinType::INNER) {
            // Skip: NULL can never match
            null_tuples_skipped++;
            return;
        } else if (join_type == JoinType::LEFT_OUTER ||
                   join_type == JoinType::FULL_OUTER) {
            // Must preserve for potential non-matching output
            null_key_tuples.push_back(tuple);
            return;
        }
        // RIGHT_OUTER: build side doesn't need NULL tracking
    }

    // Normal insertion for non-NULL keys
    uint64_t hash_value = compute_hash(key);
    hash_table->insert(hash_value, key, tuple);
}

// For FULL OUTER JOIN, we also need to track which build tuples matched
struct BuildTupleEntry {
    Tuple tuple;
    bool matched;  // Set during probe phase
};
```

For composite join keys (e.g., JOIN ON a.x = b.x AND a.y = b.y), a NULL in ANY key column prevents matching. However, some systems support non-standard semantics where NULLs match NULLs. The IS NOT DISTINCT FROM operator (or <=> in MySQL) provides this behavior and requires different NULL handling during build.
Understanding the cost of the build phase is essential for query optimization and performance analysis. The cost model considers both I/O and CPU components:
| Component | Cost Formula | Typical Values | Notes |
|---|---|---|---|
| Read build relation | ⌈|R| / page_size⌉ page I/Os | Sequential read | ~100 MB/s HDD, ~500 MB/s SATA SSD |
| Hash computation | |R| × hash_cost | ~10-50 ns/tuple | Hardware-accelerated on modern CPUs |
| Hash table insertion | |R| × insert_cost | ~50-200 ns/tuple | Depends on hash table design |
| Memory allocation | Variable | Amortized ~10 ns/tuple | Pre-allocation reduces overhead |
| Key extraction | |R| × extract_cost | ~20-100 ns/tuple | More for wide/variable columns |
I/O Cost Dominance:
For disk-resident data, I/O cost typically dominates. Reading a 1 GB build relation from a spinning disk takes on the order of ten seconds or more, while CPU processing of the same data completes in a few seconds. With SSDs the gap narrows, but I/O often remains the bottleneck for cold data.
CPU Cost Visibility:
For cached data or very small relations, CPU cost becomes visible:

- Key extraction: ~20-100 ns per tuple
- Hash computation: ~10-50 ns per tuple
- Hash table insertion: ~50-200 ns per tuple

Together these put per-tuple CPU cost on the order of 100-300 ns, which becomes the bottleneck once I/O is off the critical path.
Total Build Phase Cost (in-memory case):
```
Build_cost = (|R| / page_size) × I/O_latency   // Read relation
           + |R| × (hash_time + insert_time)   // CPU processing
           + allocation_overhead               // Memory setup
```
For example, building from 100,000 tuples (50 MB):

- I/O: 50 MB read sequentially at ~500 MB/s (SSD) ≈ 100 ms
- CPU: 100,000 tuples × ~300 ns ≈ 30 ms
- Total: roughly 130 ms, well under a second
This is why hash join is fast—the build phase is typically a single sequential scan plus linear-time hash table construction.
Build phase cost can explode when: (1) the build relation doesn't fit in memory, requiring partitioning and multiple passes; (2) extreme data skew causes certain hash buckets to become very long chains; (3) poor cardinality estimates lead to hash table resize operations mid-build. Monitoring actual vs. estimated build sizes helps identify these issues.
The build phase establishes the foundation for efficient hash join execution. Let's consolidate the essential concepts:

- Always build from the smaller input, measured in bytes after selections and projections, not in raw row counts
- A uniform, fast hash function keeps buckets balanced and lookups O(1) on average
- Hash table design (chaining vs. open addressing, cache-line-aware layouts) drives CPU efficiency
- Estimate memory before building; switch to a partitioned strategy up front if the table won't fit
- NULL join keys can be skipped for inner joins but must be preserved for outer joins
- In-memory build cost is essentially one sequential scan plus linear-time insertion
What's Next:
With the hash table constructed, we're ready for the probe phase—where the larger relation is scanned and matched against our carefully built structure. The probe phase is where the investment in build phase optimization pays dividends, enabling constant-time lookups for each probe tuple.
You now understand the build phase of hash join algorithms: relation selection, hash function design, hash table structures, memory management, NULL handling, and cost analysis. This foundation is essential for understanding why hash join is the dominant join algorithm in modern database systems.