The probe phase is where hash join's investment pays off. Having constructed a hash table in the build phase, we now scan the probe relation and perform constant-time lookups for each tuple. This is the moment when hash join demonstrates its power: what would be O(n × m) comparisons in a naive nested loop becomes O(n + m) operations.
The probe phase largely determines join performance. Efficient probe operations can process millions of tuples per second, while subtle inefficiencies—cache misses, poor memory access patterns, excessive hash collisions—can degrade performance by orders of magnitude. Understanding probe phase mechanics is essential for diagnosing join performance issues and appreciating optimizer decisions.
By the end of this page, you will understand how probe phase lookups work, why probe efficiency depends on build phase quality, how bloom filters accelerate negative lookups, the mechanics of output generation, and the special considerations for outer joins. This knowledge enables you to trace join execution paths and identify performance bottlenecks.
The probe phase follows a straightforward pattern: for each tuple in the probe relation, look up matching tuples in the hash table and output combined results. The elegance lies in the O(1) lookup cost per probe tuple.
Core probe algorithm:
```cpp
// Basic probe phase implementation
class HashJoinProbeOperator {
    HashTable* hash_table;           // Built in previous phase
    Operator* probe_child;           // Source of probe tuples
    vector<int> join_key_columns;    // Which columns form the join key

    // Current state for iterator model
    Tuple current_probe_tuple;
    vector<Tuple*>::iterator match_iterator;
    vector<Tuple*> current_matches;
    bool probe_exhausted;

    void initialize(HashTable* ht, Operator* probe_source) {
        hash_table = ht;
        probe_child = probe_source;
        probe_exhausted = false;
        // Load first probe tuple
        advance_probe();
    }

    bool next(Tuple* output) {
        while (!probe_exhausted) {
            // Check if we have more matches for current probe tuple
            if (match_iterator != current_matches.end()) {
                // Combine probe tuple with matching build tuple
                combine_tuples(current_probe_tuple, **match_iterator, output);
                ++match_iterator;
                return true;
            }
            // No more matches; advance to next probe tuple
            if (!advance_probe()) {
                probe_exhausted = true;
                return false;
            }
        }
        return false;
    }

private:
    bool advance_probe() {
        if (!probe_child->next(&current_probe_tuple)) {
            return false;  // Probe relation exhausted
        }
        // Look up matches in hash table
        JoinKey key = extract_key(current_probe_tuple, join_key_columns);
        if (key.has_null()) {
            // NULL keys never match; skip to next probe tuple
            current_matches.clear();
        } else {
            uint64_t hash_value = compute_hash(key);
            current_matches = hash_table->find_all(hash_value, key);
        }
        match_iterator = current_matches.begin();
        return true;
    }
};
```

The code above demonstrates the Volcano-style iterator model where each operator's next() call produces one output tuple. This enables pipelining—results flow to parent operators as they're produced. However, a single probe tuple may produce multiple output tuples (when multiple build tuples share the same key), requiring stateful iteration within the probe operator.
The probe lookup operation is deceptively simple in concept but nuanced in implementation. Each lookup involves hashing, bucket location, and key comparison—each step with performance implications.
Step 1: Hash Computation
The probe tuple's join key must be hashed using the identical hash function as the build phase. Using a different function—or even a different seed—would cause lookups to check wrong buckets, producing incorrect (missing) results.
Step 2: Bucket Location
The hash value maps to a bucket: bucket_index = hash_value mod num_buckets. This is a single arithmetic operation, but accessing the bucket itself may cause a cache miss if the bucket array is large and access patterns are random.
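As a small illustration (assuming a power-of-two bucket count, a common implementation choice not stated in the text above), the modulo can be replaced by a bit mask so the hot path avoids an integer division; the helper name is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Maps a hash value to a bucket index.
// If num_buckets is a power of two, hash % num_buckets reduces to a mask.
inline size_t bucket_index(uint64_t hash_value, size_t num_buckets) {
    return hash_value & (num_buckets - 1);  // valid only when num_buckets == 2^k
}
```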
Step 3: Key Comparison
Within the bucket, we must compare the probe key against each stored key. This is necessary because:
- Multiple distinct keys can land in the same bucket (the bucket index is only hash_value mod num_buckets).
- Even identical hash values can come from different keys (hash collisions), so matching hashes alone do not prove matching keys.
Optimization: Store hash values alongside tuples
Comparing full keys (especially strings or composite keys) is expensive. A common optimization: store the full hash value with each tuple in the hash table. During probe, first compare hash values (cheap integer comparison), and only compare full keys if hash values match. This filters most non-matches cheaply.
```cpp
// Optimized probe lookup with stored hash values
vector<Tuple*> HashTable::find_all(uint64_t probe_hash, const JoinKey& probe_key) {
    vector<Tuple*> matches;

    // Locate bucket
    size_t bucket_idx = probe_hash % num_buckets;
    HashEntry* entry = buckets[bucket_idx];

    // Traverse chain
    while (entry != nullptr) {
        // First: cheap hash comparison
        if (entry->cached_hash == probe_hash) {
            // Hash matches - might be real match, might be hash collision
            // Second: full key comparison (more expensive)
            if (keys_equal(entry->key, probe_key)) {
                matches.push_back(&entry->tuple);
            }
        }
        entry = entry->next;
    }
    return matches;
}

// Comparison function for different key types
bool keys_equal(const JoinKey& a, const JoinKey& b) {
    if (a.type != b.type) return false;

    switch (a.type) {
        case KeyType::INTEGER:
            return a.int_value == b.int_value;

        case KeyType::STRING:
            // String comparison is expensive -
            // check length first as quick filter
            if (a.str_value.length() != b.str_value.length()) return false;
            return a.str_value == b.str_value;

        case KeyType::COMPOSITE:
            for (size_t i = 0; i < a.num_columns; i++) {
                if (!columns_equal(a.columns[i], b.columns[i])) return false;
            }
            return true;

        default:
            return false;
    }
}
```

| Factor | Impact | Mitigation |
|---|---|---|
| Hash computation | ~10-50 CPU cycles | Hardware-accelerated functions |
| Bucket access cache miss | ~100-300 cycles (L3/RAM) | Prefetching, smaller bucket array |
| Chain traversal | ~10-50 cycles per entry + cache miss per pointer dereference | Keep chains short, use contiguous storage |
| Key comparison | ~5-500 cycles depending on key type | Store hash values, compare hash first |
| Match output construction | ~50-200 cycles per output | Avoid deep copies where possible |
One of the most powerful probe phase optimizations is the Bloom filter—a probabilistic data structure that can quickly answer "Is this key definitely NOT in the hash table?" with zero false negatives.
The problem it solves:
In many joins, a significant fraction of probe tuples have no matching build tuples. For these non-matching probes, we still pay the full cost of hashing, bucket access, and (empty) chain traversal. If we could identify non-matches earlier, we could skip the expensive hash table lookup entirely.
How Bloom filters work:
A Bloom filter is a bit array of m bits, initially all zeros. During the build phase, for each inserted key:
- Compute k hash values for the key (in the sketch below, they are derived from the primary hash via double hashing).
- Set the k corresponding bit positions to 1.
During the probe phase, for each probe key:
- Compute the same k hash values.
- If any of the k bits is 0, the key is definitely not in the hash table, so the lookup can be skipped entirely.
- If all k bits are 1, the key might be present, so proceed with the real hash table lookup.
The key insight: false positives are possible, but false negatives are not. A Bloom filter never claims a key is absent when it's present.
```cpp
// Bloom filter integration with hash join
class BloomFilter {
    vector<bool> bits;
    size_t num_bits;
    int num_hash_functions;

public:
    BloomFilter(size_t expected_elements, double fp_rate = 0.01) {
        // Optimal sizing formulas
        // m = -n * ln(p) / (ln(2)^2)
        num_bits = static_cast<size_t>(
            -static_cast<double>(expected_elements) * log(fp_rate) / (log(2) * log(2))
        );
        // k = (m/n) * ln(2)
        num_hash_functions = static_cast<int>(
            (static_cast<double>(num_bits) / expected_elements) * log(2)
        );
        bits.resize(num_bits, false);
    }

    void insert(uint64_t primary_hash) {
        // Use double hashing to generate k hash values
        uint64_t h1 = primary_hash;
        uint64_t h2 = hash_secondary(primary_hash);
        for (int i = 0; i < num_hash_functions; i++) {
            size_t bit_pos = (h1 + i * h2) % num_bits;
            bits[bit_pos] = true;
        }
    }

    bool might_contain(uint64_t primary_hash) {
        uint64_t h1 = primary_hash;
        uint64_t h2 = hash_secondary(primary_hash);
        for (int i = 0; i < num_hash_functions; i++) {
            size_t bit_pos = (h1 + i * h2) % num_bits;
            if (!bits[bit_pos]) {
                return false;  // Definitely not in set
            }
        }
        return true;  // Might be in set
    }
};

// Enhanced probe phase with Bloom filter
bool advance_probe_with_bloom() {
    while (probe_child->next(&current_probe_tuple)) {
        JoinKey key = extract_key(current_probe_tuple, join_key_columns);
        if (key.has_null()) continue;  // Skip NULLs

        uint64_t hash_value = compute_hash(key);

        // Quick Bloom filter check before expensive hash table lookup
        if (!bloom_filter->might_contain(hash_value)) {
            // Definitely no match - skip hash table lookup entirely
            probe_tuples_bloom_filtered++;
            continue;
        }

        // Might have match - proceed with real lookup
        current_matches = hash_table->find_all(hash_value, key);
        if (!current_matches.empty()) {
            match_iterator = current_matches.begin();
            return true;
        }
        // False positive case: Bloom said maybe, but no actual match
    }
    return false;
}
```

Bloom filters trade a small amount of memory for a large reduction in lookup cost. A 1% false positive rate requires ~10 bits per element. For 1 million build tuples, that's ~1.25 MB—negligible compared to the main hash table but providing significant speedup when many probe tuples don't match. Analytical databases with highly selective joins often see 80-90% of probe tuples eliminated by Bloom filters.
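Plugging the numbers from the paragraph above into the sizing formulas gives a quick sanity check (a worked example for $n = 10^6$ build keys and $p = 0.01$):

$$
m = \frac{-n \ln p}{(\ln 2)^2} = \frac{10^6 \cdot \ln 100}{0.4805} \approx 9.6 \times 10^6 \text{ bits} \approx 1.2\ \text{MB},
\qquad
k = \frac{m}{n}\ln 2 \approx 9.6 \times 0.693 \approx 7
$$

which matches the 1% row of the table below.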
| False Positive Rate | Bits per Element | Hash Functions | Use Case |
|---|---|---|---|
| 10% | 4.8 | 3 | Quick filter, memory-constrained |
| 1% | 9.6 | 7 | Standard applications |
| 0.1% | 14.4 | 10 | High-selectivity joins |
| 0.01% | 19.2 | 13 | Very large ratio of non-matches |
When a probe tuple matches one or more build tuples, the join produces output tuples. This output generation step is often overlooked but can significantly impact performance, especially for wide tables or joins with high match rates.
Output tuple composition:
The output schema typically includes:
- The selected columns from the probe-side tuple.
- The selected columns from the matched build-side tuple.
The physical representation depends on the execution model:
Materialized output: Copy all column values into a new tuple buffer. Safe but expensive for wide tuples.
Reference-based output: Store pointers to source tuples plus the matched row. Cheaper but requires source tuples to remain valid.
Column-position output: In columnar execution, output column pointers plus selection vectors. Most efficient for analytical workloads.
```cpp
// Different output generation strategies

// Strategy 1: Materialized (copy all data)
void combine_tuples_materialized(
        const Tuple& probe, const Tuple& build,
        Tuple* output, const OutputSchema& schema) {
    output->allocate(schema.total_bytes());
    size_t offset = 0;

    // Copy selected probe columns
    for (int col : schema.probe_columns) {
        size_t col_size = probe.column_size(col);
        memcpy(output->data() + offset, probe.column_data(col), col_size);
        offset += col_size;
    }

    // Copy selected build columns
    for (int col : schema.build_columns) {
        size_t col_size = build.column_size(col);
        memcpy(output->data() + offset, build.column_data(col), col_size);
        offset += col_size;
    }
}

// Strategy 2: Reference-based (just store pointers)
struct JoinOutputRef {
    const Tuple* probe_tuple;
    const Tuple* build_tuple;
    // Columns accessed on-demand through pointers
};

// Strategy 3: Vectorized columnar output
struct VectorizedJoinOutput {
    // For each output column, points to source column + selection vector
    struct ColumnBinding {
        Column* source_column;
        SelectionVector* selection;   // Which rows matched
        bool from_probe;              // vs from build
        int source_column_index;
    };
    vector<ColumnBinding> output_columns;
    size_t num_output_rows;
};
```

Modern analytical databases practice 'late materialization'—deferring the assembly of complete output tuples until absolutely necessary. If a downstream operator only needs certain columns, or if additional filters might eliminate tuples, avoiding early materialization saves substantial memory bandwidth. This is especially powerful when join output feeds directly into aggregation that only needs a subset of columns.
Handling multiple matches:
When a probe tuple matches multiple build tuples, the join produces multiple outputs. This is common in many-to-many relationships and requires careful handling:
Probe tuple P1 matches build tuples B1, B2, B3
→ Outputs: (P1, B1), (P1, B2), (P1, B3)
The iterator model handles this by maintaining match state across next() calls. The current probe tuple is retained until all matches are emitted, then we advance to the next probe tuple.
For batch/vectorized execution, the output may be larger than the input batch, requiring output buffer management and potentially splitting output across multiple downstream calls.
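A minimal sketch of that buffer management, with all type and function names hypothetical: matches for each probe row are drained into fixed-size output batches, and a cursor records where to resume when a batch fills up mid-row.

```cpp
#include <cstddef>
#include <vector>

struct BuildRow;                                   // a matched build-side row
struct OutputBatch {
    std::vector<size_t>          probe_row;        // index into the probe batch
    std::vector<const BuildRow*> build_row;        // matched build-side row
    size_t capacity = 1024;
    bool full() const { return probe_row.size() >= capacity; }
};

// Resume point so one probe batch can fill several output batches.
struct ProbeCursor {
    size_t probe_idx = 0;   // next probe row to (re)visit
    size_t match_idx = 0;   // next match of that row to emit
};

// Returns true if output remains and the caller should supply a fresh batch.
bool emit_matches(const std::vector<std::vector<const BuildRow*>>& matches_per_row,
                  ProbeCursor& cursor, OutputBatch& out) {
    for (; cursor.probe_idx < matches_per_row.size(); ++cursor.probe_idx) {
        const auto& matches = matches_per_row[cursor.probe_idx];
        for (; cursor.match_idx < matches.size(); ++cursor.match_idx) {
            if (out.full()) return true;           // pause: output batch is full
            out.probe_row.push_back(cursor.probe_idx);
            out.build_row.push_back(matches[cursor.match_idx]);
        }
        cursor.match_idx = 0;                      // move to the next probe row
    }
    return false;                                  // probe batch fully emitted
}
```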
Outer joins require additional bookkeeping during the probe phase to ensure non-matching tuples are properly included in the output. The specific requirements depend on which outer join variant is being executed.
```cpp
// LEFT OUTER JOIN probe phase
bool probe_left_outer_next(Tuple* output) {
    while (!probe_exhausted) {
        if (match_iterator != current_matches.end()) {
            // Normal match - output joined tuple
            combine_tuples(current_probe_tuple, **match_iterator, output);
            ++match_iterator;
            return true;
        }

        // Check if we have any matches for current probe tuple
        if (current_matches.empty() && !current_probe_processed) {
            // No matches - output probe with NULLs for build columns
            combine_tuple_with_nulls(current_probe_tuple, output);
            current_probe_processed = true;
            return true;
        }

        // Advance to next probe tuple
        if (!advance_probe()) {
            probe_exhausted = true;
            return false;
        }
        current_probe_processed = false;
    }
    return false;
}

// RIGHT OUTER JOIN - requires tracking matched build tuples
struct HashEntryWithMatch {
    HashEntry base;
    bool matched;   // Set to true when this tuple matches a probe tuple
};

bool probe_right_outer_next(Tuple* output) {
    // First phase: normal probing with match tracking
    while (!probe_exhausted) {
        if (match_iterator != current_matches.end()) {
            // Mark build tuple as matched
            (*match_iterator)->matched = true;
            combine_tuples(current_probe_tuple, **match_iterator, output);
            ++match_iterator;
            return true;
        }
        if (!advance_probe()) {
            probe_exhausted = true;
            // Start second phase: emit unmatched build tuples
            unmatched_iterator = hash_table->begin();
        }
    }

    // Second phase: emit unmatched build tuples with NULL probe columns
    while (unmatched_iterator != hash_table->end()) {
        if (!unmatched_iterator->matched) {
            combine_nulls_with_tuple(unmatched_iterator->tuple, output);
            ++unmatched_iterator;
            return true;
        }
        ++unmatched_iterator;
    }
    return false;
}

// FULL OUTER JOIN combines both approaches
// - Track matched build tuples (like RIGHT OUTER)
// - Output non-matching probe tuples (like LEFT OUTER)
// - After probe: emit unmatched build tuples
```

RIGHT OUTER and FULL OUTER joins require storing a matched flag for each build tuple. For large build relations, this adds significant memory overhead. Some systems use bitmap representations (1 bit per tuple) to minimize this cost. Others avoid hash join entirely for outer joins when the preserved side is too large.
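A minimal sketch of the bitmap approach mentioned above (names are illustrative, not from any particular engine): match state is kept in one bit per build tuple, indexed by the tuple's insertion position, which is assumed stable for the life of the join.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit of "matched" state per build tuple.
class MatchBitmap {
    std::vector<uint64_t> words;
public:
    explicit MatchBitmap(size_t num_build_tuples)
        : words((num_build_tuples + 63) / 64, 0) {}

    void set(size_t tuple_index) {
        words[tuple_index / 64] |= (uint64_t{1} << (tuple_index % 64));
    }
    bool test(size_t tuple_index) const {
        return (words[tuple_index / 64] >> (tuple_index % 64)) & 1;
    }
};
```

At 1 bit per tuple, tracking 100 million build rows costs roughly 12.5 MB, versus about 100 MB for a one-byte flag per tuple.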
Semi-joins and anti-joins are specialized join variants that can be executed more efficiently than full joins because they have simpler output requirements.
Semi-join (EXISTS semantics): Return probe tuples that have at least one match. Anti-join (NOT EXISTS semantics): Return probe tuples that have no matches.
Both variants output only probe-side columns—no joining of build columns is needed. This enables significant optimizations, shown in the sketches below:
- Probing can stop at the first match; there is no need to collect or iterate over all matching build tuples.
- The build side only has to answer existence questions, so a key-only hash table (storing keys but not full tuples) suffices.
- For anti-joins, a negative Bloom filter answer is enough to emit the probe tuple without touching the hash table at all.
```cpp
// Optimized semi-join probe
bool probe_semi_join_next(Tuple* output) {
    while (probe_child->next(&current_probe_tuple)) {
        JoinKey key = extract_key(current_probe_tuple, join_key_columns);
        if (key.has_null()) continue;

        uint64_t hash_value = compute_hash(key);

        // Just check existence - don't need actual tuples
        if (hash_table->contains(hash_value, key)) {
            // Output probe tuple directly (no joining)
            *output = current_probe_tuple;
            return true;
        }
    }
    return false;
}

// Optimized anti-join probe
bool probe_anti_join_next(Tuple* output) {
    while (probe_child->next(&current_probe_tuple)) {
        JoinKey key = extract_key(current_probe_tuple, join_key_columns);

        // NULL-key tuples: never match, so always included in anti-join output
        if (key.has_null()) {
            *output = current_probe_tuple;
            return true;
        }

        uint64_t hash_value = compute_hash(key);

        // Use Bloom filter for quick negative check
        if (!bloom_filter->might_contain(hash_value)) {
            // Definitely no match - include in anti-join output
            *output = current_probe_tuple;
            return true;
        }

        // Bloom says maybe - check hash table
        if (!hash_table->contains(hash_value, key)) {
            *output = current_probe_tuple;
            return true;
        }
        // Match found - exclude from anti-join output
    }
    return false;
}

// For semi-join, we can use a key-only hash table
class KeyOnlyHashTable {
    // Stores only keys, not full tuples
    // ~50-90% memory reduction for wide tables
    vector<KeySet> buckets;
    size_t num_buckets;

    void insert(uint64_t hash, const JoinKey& key) {
        buckets[hash % num_buckets].insert(key);
    }

    bool contains(uint64_t hash, const JoinKey& key) {
        return buckets[hash % num_buckets].contains(key);
    }
};
```

Query optimizers automatically convert appropriate patterns to semi-join or anti-join. An EXISTS subquery becomes a semi-join; NOT EXISTS becomes an anti-join. Understanding these conversions helps you write queries that the optimizer can execute efficiently.
Probe phase performance depends on several factors, each contributing to the overall throughput of tuples processed per second.
| Factor | Good Performance | Poor Performance | Diagnostic |
|---|---|---|---|
| Selectivity | Many non-matches (Bloom helps) | Most tuples match | Compare probe count vs output count |
| Match cardinality | 1:1 or low (1:few) | 1:many (fanout) | Check output rows / probe rows ratio |
| Hash table cache fit | Fits in L3 cache | Exceeds cache significantly | Monitor cache miss rates |
| Chain length | Average 1-2 entries | Long chains (>10) | Bucket size distribution |
| Key comparison cost | Integers, short strings | Wide composite, long strings | Profile key comparison time |
Probe phase cost model:
```
Probe_cost = |S| × (hash_cost + lookup_cost + match_cost × avg_matches)
           + |S| × tuple_read_cost
           + output_rows × output_cost

Where:
- |S|          = probe relation cardinality
- lookup_cost  = bucket_access + avg_chain_length × comparison_cost
- avg_matches  = |output| / |S|  (expected matches per probe)
```
Example calculation:
Join producing 5M output rows from 10M probe tuples (50% match rate, 1:1 on average):
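One way to put rough numbers on this, using illustrative per-operation costs drawn from the ranges in the per-lookup cost table above (assumed values, not measurements):

```
|S| = 10,000,000 probe tuples, output = 5,000,000 rows  →  avg_matches = 0.5

Assumed costs: hash_cost ≈ 30 cycles, bucket_access ≈ 100 cycles,
avg_chain_length ≈ 1.5, comparison_cost ≈ 20 cycles,
match_cost ≈ 100 cycles, tuple_read_cost ≈ 50 cycles, output_cost ≈ 100 cycles

lookup_cost ≈ 100 + 1.5 × 20          ≈ 130 cycles
per-probe   ≈ 30 + 130 + 0.5 × 100    ≈ 210 cycles
Probe_cost  ≈ 10M × 210 + 10M × 50 + 5M × 100
            ≈ 2.1 + 0.5 + 0.5 billion ≈ 3.1 billion cycles
            ≈ roughly one second on a single ~3 GHz core
```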
If the hash table exceeds the cache, lookup cost could be 3-5× higher due to cache misses.
Advanced implementations use software prefetching to hide memory latency. By computing hashes for upcoming tuples and issuing prefetch instructions for their buckets, the CPU can overlap memory access with processing. This technique can improve probe throughput by 2-3× for hash tables larger than cache.
```cpp
// Prefetching-optimized probe phase (simplified)
void probe_with_prefetch(Tuple* probe_batch, size_t batch_size) {
    constexpr size_t PREFETCH_DISTANCE = 4;  // Look ahead 4 tuples

    // Compute hashes for the whole batch up front
    vector<uint64_t> hashes(batch_size);
    for (size_t i = 0; i < batch_size; i++) {
        hashes[i] = compute_hash(extract_key(probe_batch[i]));
    }

    // Issue prefetches for the first PREFETCH_DISTANCE buckets
    for (size_t i = 0; i < min(PREFETCH_DISTANCE, batch_size); i++) {
        _mm_prefetch(hash_table->bucket_ptr(hashes[i]), _MM_HINT_T0);
    }

    // Process tuples with a rolling prefetch window
    for (size_t i = 0; i < batch_size; i++) {
        // Prefetch the bucket for tuple i + PREFETCH_DISTANCE
        if (i + PREFETCH_DISTANCE < batch_size) {
            _mm_prefetch(
                hash_table->bucket_ptr(hashes[i + PREFETCH_DISTANCE]),
                _MM_HINT_T0
            );
        }
        // Process tuple i (its bucket should be in cache by now)
        process_probe_tuple(probe_batch[i], hashes[i]);
    }
}
```

Understanding probe phase behavior is essential for diagnosing join performance issues. Production database systems expose metrics that reveal execution characteristics:
```
-- PostgreSQL EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) output for hash join
Hash Join  (cost=3152.00..14829.00 rows=50000 width=200) (actual time=45.2..512.3 rows=47832 loops=1)
  Hash Cond: (orders.customer_id = customers.customer_id)
  -- Probe phase metrics visible in parent Hash Join node:
  --   actual rows=47832 indicates output cardinality
  --   actual time includes probe phase
  ->  Seq Scan on orders  (cost=0.00..8334.00 rows=500000 width=150) (actual time=0.02..95.1 rows=500000 loops=1)
      -- This is the probe relation: 500K rows
      -- 500K probes produced 47.8K matches (~9.5% match rate)
  ->  Hash  (cost=1769.00..1769.00 rows=100000 width=50) (actual time=44.9..44.9 rows=100000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 6231kB
        -- Build phase: 100K rows, single batch (fit in memory)
        -- 131072 buckets for 100K rows ≈ 0.76 tuples per bucket
        ->  Seq Scan on customers  (cost=0.00..1769.00 rows=100000 width=50) (actual time=0.01..15.2 rows=100000 loops=1)

-- Key insights from these metrics:
-- - Build relation (customers): 100K rows, fit in a ~6 MB hash table
-- - Probe relation (orders): 500K rows
-- - Output: 47.8K rows (many orders have no matching customer, or were filtered earlier)
-- - Build time ~45ms, total time ~512ms, so the probe took ~467ms
-- - Probe throughput: ~1.07 million rows/second
```

When the estimated row count differs significantly from the actual row count, it indicates cardinality estimation errors. For hash joins, this can mean: (1) the wrong relation was chosen as the build side, (2) the hash table was sized poorly, (3) downstream operators receive unexpected volumes. Monitoring actual vs. estimated rows throughout the plan helps identify the source of estimation errors.
The probe phase is where hash join's efficiency materializes—where the upfront investment in building a hash table translates to fast, constant-time lookups. Let's consolidate the essential concepts:
- Each probe tuple is hashed with the same function used during the build, its bucket is located, and candidate entries are filtered by cached hash values before full key comparison.
- Bloom filters cheaply eliminate probe tuples that cannot match, which pays off most in selective joins.
- Output generation can materialize tuples, reference source tuples, or defer assembly entirely (late materialization).
- Outer joins add NULL-extension and matched-flag bookkeeping; semi-joins and anti-joins need only existence checks.
- Probe throughput is dominated by memory behavior: hash table cache fit, chain length, and key comparison cost.
What's Next:
So far we've assumed the build relation fits comfortably in memory. Reality is often less accommodating—very large relations may exceed available memory by orders of magnitude. The next page explores partition handling: techniques for handling hash joins when data exceeds memory, including the crucial concepts of partitioning and multi-pass processing.
You now understand the probe phase of hash join algorithms: lookup mechanics, Bloom filter optimization, output generation, outer join handling, semi-join/anti-join optimizations, and performance analysis. This knowledge is essential for understanding query execution plans and diagnosing join performance issues.