The basic hash join algorithm assumes a simple premise: the build relation fits in available memory. This assumption enables single-pass processing and O(1) lookups—the foundation of hash join efficiency. But real-world data regularly violates this assumption.
A 100 GB table joined against a 10 GB table cannot build a 10+ GB hash table when only 2 GB of memory is available. Without careful handling, the system would either crash, thrash to disk, or produce incorrect results. Partition handling solves this problem through divide-and-conquer: split both relations into smaller partitions that individually fit in memory, then process each partition pair independently.
By the end of this page, you will understand why partitioning is necessary, how partition-based hash joins work, the guarantees that make partitioning correct, memory allocation strategies, handling of partition overflow, and the I/O cost implications. This knowledge is essential for understanding how databases handle joins of arbitrary scale.
When the build relation is larger than available memory, three problematic scenarios can occur:
Scenario 1: Hash table allocation fails. The system attempts to allocate memory for the hash table and receives an out-of-memory error. The query fails immediately.
Scenario 2: Uncontrolled spilling to disk. The operating system's virtual memory swaps hash table pages to disk. Random access patterns cause severe thrashing—potentially 1000× slower than in-memory execution.
Scenario 3: Partial hash table. Only part of the build relation fits. Probes against missing entries produce incorrect results (missing matches).
The fundamental constraint:
Let M = available memory in pages, B = block size.
Let |R| = size of build relation in pages, |S| = size of probe relation in pages.

For single-pass hash join:

f × |R| ≤ M

where f is a fudge factor (typically 1.1 to 1.3) accounting for hash table overhead.

When this inequality doesn't hold, we need a different strategy.
| Build Size vs Memory | Strategy | Passes Required | Complexity |
|---|---|---|---|
| < M | Simple hash join | 1 pass each relation | O(|R| + |S|) |
| M to 2M | Partition (small) | 2 passes each | O(2 × (|R| + |S|)) |
| 2M to M² | Partition (standard) | 2-3 passes each | O(2-3 × (|R| + |S|)) |
| > M² | Recursive partition | Multiple passes | O(k × (|R| + |S|)), k ≥ 3 |
With M pages of memory, we can create at most M-1 output partitions (one page per partition buffer plus one page for input). Each partition can grow up to M pages in the subsequent phase. This means single-level partitioning handles build relations up to M × (M-1) ≈ M² pages. Beyond this, recursive partitioning is necessary—a key threshold in hash join theory.
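A small helper makes the threshold concrete (the function name is ours; it assumes a fan-out of M − 1 partitions per pass and a base case of a build side that fits in M pages):

```cpp
#include <cstdint>

// Number of partitioning passes needed for a build relation of
// `build_pages` pages given `M` pages of memory. Each pass multiplies
// the manageable size by the fan-out (M - 1); zero passes means the
// build side fits in memory directly.
int partitioning_levels(uint64_t build_pages, uint64_t M) {
    int levels = 0;
    uint64_t capacity = M;       // 0 passes: build fits in memory
    while (build_pages > capacity) {
        levels++;
        capacity *= (M - 1);     // each pass adds a factor of M - 1
    }
    return levels;
}
```

With M = 1000, zero passes handle up to 1,000 pages, one partitioning pass handles up to 1000 × 999 = 999,000 pages (≈ M²), and only beyond that is recursive partitioning required.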
Partition-based hash join splits the problem into manageable pieces. The core insight: if we partition both relations using the same hash function on the join key, matching tuples are guaranteed to be in corresponding partitions.
This guarantee is crucial. A tuple from R with key value k will hash to partition i. Any matching tuple from S with the same key k will also hash to partition i. Therefore, we only need to compare tuples within matching partition pairs—partition i of R against partition i of S.
The two-phase algorithm:

Phase 1: Partition. Scan R, hash each tuple's join key with h₁, and append the tuple to one of n partition files on disk. Repeat for S using the same hash function.

Phase 2: Join. For each partition i, load partition i of R, build an in-memory hash table over it (using a second hash function h₂), then probe that table with every tuple from partition i of S.
```cpp
// Partitioned hash join algorithm
class PartitionedHashJoin {
    int num_partitions;
    vector<PartitionFile> build_partitions;
    vector<PartitionFile> probe_partitions;
    HashFunction partition_hash;  // hash₁
    HashFunction join_hash;       // hash₂

    void execute(Relation* R, Relation* S, ResultSet* output) {
        // Phase 1: Partition both relations
        partition_relation(R, build_partitions);
        partition_relation(S, probe_partitions);

        // Phase 2: Join corresponding partitions
        for (int i = 0; i < num_partitions; i++) {
            if (build_partitions[i].empty()) {
                // No build tuples in this partition = no output
                continue;
            }

            // Build hash table from partition i of R
            HashTable* ht = build_hash_table(build_partitions[i]);

            // Probe with partition i of S
            probe_and_output(probe_partitions[i], ht, output);

            delete ht;
        }
    }

private:
    void partition_relation(Relation* rel, vector<PartitionFile>& partitions) {
        // Allocate one buffer page per partition
        vector<Page*> buffers(num_partitions);
        for (int i = 0; i < num_partitions; i++) {
            buffers[i] = allocate_page();
        }

        // Scan relation and partition
        for (Tuple& tuple : *rel) {
            int p = partition_hash(tuple.join_key) % num_partitions;
            if (buffers[p]->is_full()) {
                // Flush buffer to disk
                partitions[p].write(buffers[p]);
                buffers[p]->clear();
            }
            buffers[p]->add(tuple);
        }

        // Flush remaining tuples
        for (int i = 0; i < num_partitions; i++) {
            if (!buffers[i]->empty()) {
                partitions[i].write(buffers[i]);
            }
            free_page(buffers[i]);
        }
    }

    HashTable* build_hash_table(PartitionFile& partition) {
        HashTable* ht = new HashTable();
        for (Page* page : partition.pages()) {
            for (Tuple& tuple : *page) {
                // Use second hash function for hash table
                uint64_t h = join_hash(tuple.join_key);
                ht->insert(h, tuple);
            }
        }
        return ht;
    }
};
```

Partitioned hash join's correctness rests on a fundamental property of hash functions: determinism. The same key always produces the same hash value, and thus maps to the same partition.
Formal guarantee:
Let R and S be relations with join attribute A. Let h be a hash function and p(k) = h(k) mod n be the partition function.
For any two tuples r ∈ R and s ∈ S: If r.A = s.A, then p(r.A) = p(s.A)
Proof: Since r.A = s.A, and h is a function (deterministic), h(r.A) = h(s.A). Therefore h(r.A) mod n = h(s.A) mod n, which means p(r.A) = p(s.A). ∎
This means matching tuples always land in the same partition. No matches are missed; no cross-partition comparisons are needed.
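The guarantee can be exercised directly in code. `toy_hash` below is an illustrative 64-bit mixer (a Murmur-style finalizer), not a production hash; any deterministic function gives the same property:

```cpp
#include <cstdint>

// Illustrative deterministic 64-bit mixer (Murmur-style finalizer).
uint64_t toy_hash(uint64_t key) {
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return key;
}

// p(k) = h(k) mod n, exactly the partition function from the proof.
int partition_of(uint64_t key, int n) {
    return (int)(toy_hash(key) % (uint64_t)n);
}
```

Because `toy_hash` is a pure function, two tuples with equal join keys always map to the same partition, regardless of which relation they come from.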
We use one hash function (h₁) for partitioning and a different one (h₂) for the in-partition hash table. Why? If we used the same function, all tuples in partition i would have the same h mod n value, causing poor distribution in the hash table. Using a different function ensures good distribution within each partition's hash table.
```cpp
// Using two different hash functions
class HashFunctions {
    // Hash function 1: for partitioning
    // Uses one set of mixing constants
    static uint64_t partition_hash(const JoinKey& key) {
        uint64_t h = key.as_uint64();
        // MurmurHash-style mixing
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    // Hash function 2: for in-partition hash table
    // Uses different constants or algorithm
    static uint64_t table_hash(const JoinKey& key) {
        uint64_t h = key.as_uint64();
        // Different mixing pattern
        h *= 0x9e3779b97f4a7c15ULL;  // Based on golden ratio
        h ^= h >> 30;
        h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 27;
        h *= 0x94d049bb133111ebULL;
        h ^= h >> 31;
        return h;
    }
};

// Example showing why different functions matter
void demonstrate_hash_difference() {
    // All keys that map to partition 5 (out of 16) using partition_hash
    // have partition_hash(key) mod 16 == 5
    // If we used partition_hash for the hash table with 32 buckets:
    // All these keys would map to buckets 5 or 21 only!
    // (because hash mod 16 == 5 implies hash mod 32 is 5 or 21)
    // Using table_hash distributes these keys across all 32 buckets
}
```

NULL handling in partitioning:
NULL join keys present a special case. Since NULL ≠ NULL in SQL, tuples with NULL keys never match. These tuples can be dropped immediately during partitioning for inner joins, or routed to a dedicated NULL partition for outer joins, where they must still appear in the output with NULL padding.
The key insight is that NULL-key tuples need consistent handling across build and probe partitioning—if they're excluded, they must be excluded from both sides.
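A minimal routing sketch (the names `route_tuple` and `NullPolicy` are ours; it assumes n-way partitioning with an extra NULL bucket at index n):

```cpp
#include <cstdint>
#include <optional>

enum class NullPolicy { DropForInnerJoin, DedicatedPartition };

// Returns the target partition index, or -1 when the tuple is dropped.
// The same policy must be applied to both build and probe sides.
int route_tuple(std::optional<uint64_t> join_key, int n, NullPolicy policy) {
    if (!join_key.has_value()) {
        // NULL never equals NULL in SQL: inner joins may drop the tuple,
        // outer joins keep it in a dedicated bucket (index n here).
        return policy == NullPolicy::DropForInnerJoin ? -1 : n;
    }
    // Normal key: illustrative modulo partitioning.
    return (int)(*join_key % (uint64_t)n);
}
```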
Choosing the right number of partitions is a critical decision that balances multiple factors:
Too few partitions: individual partitions may exceed available memory, forcing recursive partitioning or overflow handling in the join phase.

Too many partitions: each partition needs its own output buffer during the partitioning phase, so buffers shrink, writes become smaller and less sequential, and more partition files must be managed.
The goal: Choose n such that each partition of the build relation fits in memory, with some safety margin.
Partition count formula:
Given:
- |R| = build relation size in pages
- M = available memory in pages
- f = fudge factor for hash table overhead (typically 1.1 to 1.3)
Minimum partitions needed:
n ≥ f × |R| / M
Maximum practical partitions (given output buffers during partitioning):
n ≤ M - 1 (one page per partition buffer, leaving one for input)
Example calculation: suppose |R| = 10,000 pages, M = 1000 pages, and f = 1.2.

Minimum: n ≥ 1.2 × 10,000 / 1000 = 12 partitions Maximum: n ≤ 999 (plenty of room)
We might choose n = 16 (power of 2 for efficient modulo), giving expected partition size of 625 pages, well within the 1000-page memory limit.
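The "efficient modulo" point is visible directly in code: for a power-of-two n, `hash % n` reduces to a bit mask, removing an integer division from the innermost partitioning loop. (The helper name is ours.)

```cpp
#include <cstdint>

// Partition selection for a power-of-two partition count:
// hash % n == hash & (n - 1) when n is a power of two.
int partition_pow2(uint64_t hash, int n) {
    return (int)(hash & (uint64_t)(n - 1));
}
```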
Partition count decisions are made before reading data, based on estimated relation sizes. When estimates are wrong (often 10× or more for complex queries), partitions may be too large (requiring recursive partitioning) or too small (wasting resources). Adaptive approaches that adjust during execution are increasingly common in modern systems.
```cpp
// Partition count determination
class PartitionPlanner {
    size_t available_memory;  // In bytes
    size_t page_size;
    double fudge_factor;

    int compute_partition_count(
        size_t estimated_build_size,   // bytes
        size_t estimated_tuple_width
    ) {
        size_t memory_pages = available_memory / page_size;
        size_t build_pages = (estimated_build_size + page_size - 1) / page_size;

        // Account for hash table overhead
        // Hash table typically needs 1.5-2x the raw data size
        size_t ht_pages = build_pages * 2;

        // Minimum partitions to ensure each fits in memory
        int min_partitions = (int)ceil(fudge_factor * ht_pages / memory_pages);

        // Maximum partitions limited by buffer pool allocation
        // Need one buffer page per partition during partitioning phase
        int max_partitions = memory_pages - 2;  // Reserve pages for I/O

        // Choose power of 2 for efficient modulo
        int n = 1;
        while (n < min_partitions) n *= 2;

        // Clamp to valid range
        return min(n, max_partitions);
    }

    // Validate that partitions will fit
    bool validate_partition_plan(int n, size_t build_size) {
        size_t expected_partition_size = build_size / n;
        size_t ht_size = expected_partition_size * 2;  // 2x for hash table

        // Allow for skew: largest partition might be 2-3× average
        size_t worst_case = ht_size * 3;

        return worst_case < available_memory * 0.9;  // 90% threshold
    }
};
```

Partitioning adds I/O cost compared to simple hash join. Understanding this cost is essential for optimizer decisions about join strategies.
Cost breakdown for two-pass partitioned hash join:
Phase 1 (Partitioning):
- Read R and write its partitions: |R| + |R| page I/Os
- Read S and write its partitions: |S| + |S| page I/Os

Total Phase 1: 2|R| + 2|S| page I/Os

Phase 2 (Join):
- Read every partition of R once to build hash tables: |R| page I/Os
- Read every partition of S once to probe: |S| page I/Os

Total Phase 2: |R| + |S| page I/Os
Grand total: 3|R| + 3|S| page I/Os
Compare to simple hash join (when build fits in memory): |R| + |S| page I/Os, one sequential read of each relation.
Partitioning costs 3× the I/O of simple hash join.
| Join Type | I/O Cost | When Appropriate |
|---|---|---|
| Simple hash join | |R| + |S| | Build relation fits in memory |
| 2-pass partitioned | 3 × (|R| + |S|) | Build > memory, but partition fits |
| 3-pass partitioned | 5 × (|R| + |S|) | Need recursive partitioning |
| Sort-merge join | 3 × (|R| + |S|) | When data is nearly sorted |
| Block nested loop | |S| × ⌈|R|/M⌉ + |R| | When very limited memory |
While partitioned hash join uses 3× the page I/Os of simple hash join, most of these are sequential. Partition files are written and read sequentially. On HDDs, sequential I/O can be 100× faster than random I/O. On SSDs, the gap is smaller (3-10×) but still significant. This makes partitioned hash join practical even at 3× the page count.
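A back-of-envelope model makes the point concrete. The device numbers used below (0.1 ms per sequential page, 10 ms per random page, roughly HDD-class) are illustrative assumptions, not measurements:

```cpp
#include <cstdint>

// Total I/O time for a given number of page accesses at a fixed
// per-page cost in milliseconds. Sequential vs random access is
// modeled purely through the per-page cost.
double total_seconds(uint64_t pages, double ms_per_page) {
    return (double)pages * ms_per_page / 1000.0;
}
```

For |R| + |S| = 1,100,000 pages: partitioned execution touches 3 × 1.1M pages sequentially, about 330 seconds at 0.1 ms/page, while uncontrolled swapping over the same 1.1M pages at 10 ms per random page would take about 11,000 seconds.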
```cpp
// I/O cost estimation for partitioned hash join
struct IOCostEstimate {
    size_t build_pages;
    size_t probe_pages;
    size_t memory_pages;
    double io_cost_per_page;   // milliseconds
    double random_io_penalty;  // multiplier for random vs sequential

    // Simple hash join (if build fits)
    double simple_hash_join_cost() {
        if (build_pages * 2 > memory_pages) {
            return INFINITY;  // Won't fit
        }
        // Sequential read of both relations
        return (build_pages + probe_pages) * io_cost_per_page;
    }

    // Two-pass partitioned hash join
    double partitioned_hash_join_cost() {
        // Phase 1: read + write both relations
        double phase1 = 2 * (build_pages + probe_pages) * io_cost_per_page;
        // Phase 2: read both relations again
        double phase2 = (build_pages + probe_pages) * io_cost_per_page;
        return phase1 + phase2;
    }

    // Block nested loop for comparison
    double block_nested_loop_cost() {
        size_t blocks_of_R = (build_pages + memory_pages - 1) / (memory_pages - 1);
        // Read R once, read S once per block of R
        double cost = build_pages * io_cost_per_page;
        cost += probe_pages * blocks_of_R * io_cost_per_page;
        // Add penalty for random access pattern
        cost *= random_io_penalty * 0.3;  // Partial random factor
        return cost;
    }
};

// Example: join 100,000-page R with 1,000,000-page S, 10,000-page memory
// Simple hash join: not possible (R doesn't fit)
// Partitioned: 3 × (100K + 1M) × 1ms = 3,300 seconds
// BNL: 100K × 1ms + 1M × 10 × 1ms = 10,100 seconds
// Partitioned hash join wins by 3×
```

Even with careful partition count selection, individual partitions may overflow memory. This happens due to:
- Data skew: many tuples share the same join key and land in one partition
- Inaccurate cardinality estimates: the build relation is larger than planned for
- Unlucky hash distribution: some partitions simply receive more keys than average
Systems must detect and handle overflow gracefully.
```cpp
// Recursive partitioning for overflow
void process_partition_recursive(
    PartitionFile& build_partition,
    PartitionFile& probe_partition,
    int depth,
    ResultSet* output)
{
    // Read build partition and check size
    size_t partition_size = build_partition.size_bytes();

    if (partition_size <= available_memory * 0.8) {
        // Base case: partition fits - do simple hash join
        HashTable* ht = build_hash_table(build_partition);
        probe_and_output(probe_partition, ht, output);
        delete ht;
        return;
    }

    // Overflow case: need to re-partition
    if (depth > MAX_RECURSION_DEPTH) {
        // Extreme skew detected - fall back to nested loop
        log_warning("Partition overflow at max depth, using nested loop");
        nested_loop_join(build_partition, probe_partition, output);
        return;
    }

    // Recursive case: re-partition this partition into sub-partitions
    int sub_partitions = compute_sub_partition_count(partition_size);
    vector<PartitionFile> build_subparts(sub_partitions);
    vector<PartitionFile> probe_subparts(sub_partitions);

    // Use different hash function for sub-partitioning
    HashFunction sub_hash = get_hash_function(depth + 1);

    // Re-partition build side
    for (Tuple& t : build_partition) {
        int sp = sub_hash(t.join_key) % sub_partitions;
        build_subparts[sp].append(t);
    }

    // Re-partition probe side
    for (Tuple& t : probe_partition) {
        int sp = sub_hash(t.join_key) % sub_partitions;
        probe_subparts[sp].append(t);
    }

    // Recursively process sub-partitions
    for (int i = 0; i < sub_partitions; i++) {
        if (!build_subparts[i].empty()) {
            process_partition_recursive(
                build_subparts[i],
                probe_subparts[i],
                depth + 1,
                output
            );
        }
    }
}
```

If many tuples share the exact same join key value, no amount of re-partitioning helps—they'll always end up together. For example, if 1 million orders reference customer_id = 12345, partitioning always puts those 1 million tuples together. The fallback to nested loop (or other strategies like building a partial hash table) handles this edge case.
Effective memory allocation during partitioning significantly impacts performance. The challenge: balance between I/O efficiency (larger buffers) and partition count (more partitions require more buffers).
Buffer allocation during partitioning:

With M pages of memory and n partitions:
- 1 page is reserved for scanning the input relation
- the remaining M − 1 pages are divided among the n partition output buffers, about ⌊(M − 1)/n⌋ pages each

Larger output buffers produce fewer, longer sequential writes; smaller buffers permit more partitions but force more frequent, smaller flushes to disk.
Trade-off analysis:

With 1000 pages of memory and 100 partitions, each partition buffer holds ~10 pages before flushing. With a build relation of 10,000 pages, each partition receives ~100 pages, so each buffer flushes about 10 times, and every flush is a 10-page sequential write rather than a single-page write.
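The flush arithmetic above can be checked with a tiny helper (the function name is ours; it assumes uniform key distribution):

```cpp
#include <cstdint>

// Expected number of buffer flushes per partition during the
// partitioning phase, assuming tuples spread evenly across partitions.
uint64_t flushes_per_partition(uint64_t build_pages, uint64_t n,
                               uint64_t buffer_pages) {
    uint64_t partition_pages = (build_pages + n - 1) / n;  // pages per partition
    return (partition_pages + buffer_pages - 1) / buffer_pages;
}
```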
| Strategy | I/O Pattern | Pros | Cons |
|---|---|---|---|
| Single page per partition | Many small writes | Simple, supports many partitions | Poor I/O efficiency |
| Multi-page buffers | Fewer, larger writes | Good sequential I/O | Limits partition count |
| Dynamic allocation | Varies | Adapts to actual usage | Complex management |
| Double buffering | Overlapped I/O | Maximum throughput | 2× memory overhead |
```cpp
// Buffer allocation strategies for partitioning
class PartitionBufferManager {
    size_t total_memory;
    int num_partitions;

    // Strategy 1: Fixed equal allocation
    vector<Buffer> fixed_equal_allocation() {
        size_t input_reserved = 4 * PAGE_SIZE;
        size_t available = total_memory - input_reserved;
        size_t per_partition = available / num_partitions;

        vector<Buffer> buffers;
        for (int i = 0; i < num_partitions; i++) {
            buffers.push_back(Buffer(per_partition));
        }
        return buffers;
    }

    // Strategy 2: Dynamic with stealing
    class DynamicBufferPool {
        size_t total;
        size_t allocated;
        vector<Buffer*> partition_buffers;

        void write_and_potentially_expand(int partition, const Tuple& t) {
            Buffer* buf = partition_buffers[partition];
            if (buf->is_full()) {
                // Flush to disk
                flush_buffer(partition);
                // Try to expand if under-average allocation
                size_t avg = total / num_partitions;
                if (buf->size < avg && allocated < total) {
                    expand_buffer(buf, min(avg, total - allocated));
                }
            }
            buf->add(t);
        }
    };

    // Strategy 3: Double buffering for async I/O
    class DoubleBufferPartition {
        Buffer* active;    // Currently filling
        Buffer* flushing;  // Being written to disk
        AsyncWriter* writer;

        void add(const Tuple& t) {
            if (active->is_full()) {
                // Wait for previous flush to complete
                writer->wait();
                // Swap buffers
                swap(active, flushing);
                // Start async write of full buffer
                writer->write_async(flushing);
            }
            active->add(t);
        }
    };
};
```

Modern systems overlap computation and I/O using asynchronous writes. While one buffer is being written to disk, another receives new tuples. This hides I/O latency behind CPU work, achieving near-maximum disk throughput. Linux's io_uring, Windows IOCP, and similar mechanisms enable this with minimal overhead.
Data skew is the nemesis of partitioned hash join. When join key values have highly non-uniform distribution, some partitions grow much larger than others, negating the benefits of partitioning.
Common skew scenarios:
- Heavy hitters: a few key values (a flagship product, a system account) carry a large share of all tuples
- Power-law distributions: joins on user_id or similar keys in social and web data follow Zipf-like distributions
- Default and sentinel values: placeholder keys such as 0 or -1 concentrate millions of tuples on a single value
```cpp
// Heavy hitter detection and special handling
class SkewAwarePartitioner {
    unordered_map<JoinKey, size_t> key_counts;
    size_t skew_threshold;
    set<JoinKey> heavy_hitters;

    // Phase 0: Sample to identify heavy hitters
    void detect_heavy_hitters(Relation* R, double sample_rate) {
        size_t sample_count = 0;
        for (Tuple& t : *R) {
            if (random() < sample_rate) {
                key_counts[t.join_key]++;
                sample_count++;
            }
        }

        // Keys appearing > 1% of samples are heavy hitters
        size_t threshold = sample_count / 100;
        for (auto& [key, count] : key_counts) {
            if (count > threshold) {
                heavy_hitters.insert(key);
            }
        }

        log_info("Detected {} heavy hitters", heavy_hitters.size());
    }

    // Modified partitioning with heavy hitter handling
    void partition_with_skew_handling(Relation* R,
                                      vector<PartitionFile>& normal_partitions,
                                      vector<PartitionFile>& heavy_partitions) {
        for (Tuple& t : *R) {
            if (heavy_hitters.count(t.join_key)) {
                // Heavy hitter: use separate partition strategy
                // Option 1: Dedicated partition per heavy key
                // Option 2: Round-robin across multiple partitions
                int hh_partition = hash_heavy_hitter(t.join_key);
                heavy_partitions[hh_partition].append(t);
            } else {
                // Normal key: standard hash partitioning
                int partition = partition_hash(t.join_key) % num_partitions;
                normal_partitions[partition].append(t);
            }
        }
    }

    // Join heavy hitter partitions with broadcast
    void join_heavy_hitters(vector<PartitionFile>& build_heavy,
                            vector<PartitionFile>& probe_heavy,
                            ResultSet* output) {
        // For each heavy hitter, all matching probe tuples are in one partition
        // But we might replicate build tuples for parallel processing
        for (const JoinKey& hh_key : heavy_hitters) {
            // Simple approach: build hash table from heavy build tuples
            // and probe all matching probe tuples
            join_single_heavy_hitter(hh_key, build_heavy, probe_heavy, output);
        }
    }
};
```

Skew mitigation strategies are often workload-specific. A star schema with dimension tables rarely shows skew (dimension keys are uniformly distributed). But OLTP workloads joining on user_id or account_id frequently exhibit power-law distributions. Production systems often allow hints or automatic mode selection based on observed statistics.
Partition handling transforms hash join from a memory-bound algorithm into one that scales to arbitrary data sizes. The cost is additional I/O, but the ability to handle joins of any scale makes this trade-off essential.
What's Next:
We've established the fundamentals of partition handling. The next page explores Grace Hash Join—a specific algorithm that systematically applies partitioning with optimal I/O behavior, and has influenced virtually all modern hash join implementations.
You now understand partition handling in hash joins: why partitioning is necessary, how it preserves correctness, partition count determination, I/O cost analysis, overflow handling, memory allocation strategies, and skew mitigation. This knowledge is essential for understanding how databases scale join operations to any data size.