Query optimization is fundamentally a cost-minimization problem. To choose between nested loop join, hash join, and sort-merge join, the optimizer must accurately estimate the cost of each alternative. This requires precise mathematical models that capture the real-world performance characteristics of each algorithm.
Sort-merge join's cost is uniquely structured: high upfront sorting cost, minimal merge cost. This makes it an excellent choice when inputs are already sorted or when sorted output is required. Understanding the cost formula empowers you to predict when sort-merge join excels and when other methods win.
By the end of this page, you will understand the complete cost model for sort-merge join: I/O costs for sorting and merging, CPU costs for comparisons and tuple manipulation, how pre-sorted input dramatically reduces cost, the effect of buffer memory on performance, and direct comparison with alternative join methods.
Database cost models quantify performance using several metrics. Understanding these metrics is essential before diving into sort-merge join's specific costs:
Primary Cost Metrics:
| Metric | Symbol | Description | Typical Magnitude |
|---|---|---|---|
| Page I/O | 1 page | Read or write a single disk page | ~0.1ms (SSD), ~10ms (HDD) |
| Sequential Scan | — | Read N pages sequentially | ~N × 0.01ms (SSD), N × 0.1ms (HDD) |
| Random Access | — | Seek + read one page | ~0.1ms (SSD), ~10ms (HDD) |
| Tuple Comparison | — | Compare two sort keys | ~100 CPU cycles |
| Tuple Copy | — | Copy tuple between buffers | ~1000 CPU cycles |
| Hash Computation | — | Compute hash of join key | ~200 CPU cycles |
Why I/O Dominates:
For magnetic disks (HDD), a page read takes ~10 ms. In that time, a modern 4 GHz CPU executes ~40 million cycles. Even with SSDs (~0.1 ms per random access), a single I/O costs as much as roughly 100,000 simple CPU operations.
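A back-of-the-envelope sketch of this gap, using the latency figures above (the 4 GHz clock rate is an illustrative assumption):

```python
# Rough comparison of I/O latency vs. CPU throughput.
# Assumed figures: 4 GHz CPU, 10 ms HDD page read, 0.1 ms SSD random access.
CPU_HZ = 4e9
HDD_READ_S = 10e-3
SSD_READ_S = 0.1e-3

cycles_per_hdd_read = CPU_HZ * HDD_READ_S  # CPU cycles elapsed per HDD page read
cycles_per_ssd_read = CPU_HZ * SSD_READ_S  # CPU cycles elapsed per SSD random read

print(f"Cycles per HDD page read:   {cycles_per_hdd_read:,.0f}")  # 40,000,000
print(f"Cycles per SSD random read: {cycles_per_ssd_read:,.0f}")  # 400,000
```

Even in the SSD case, one random I/O costs hundreds of thousands of cycles, which is why counting page I/Os is a reasonable first-order cost model.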
Consequently, cost models primarily count page I/O operations. CPU costs are considered only when I/O ceases to be the bottleneck—for example, in in-memory databases, on very fast SSDs, or when the data is already resident in the buffer pool.
Standard Notation:
We use the following notation throughout this analysis:

- b_R, b_S — number of pages (blocks) occupied by relations R and S
- |R|, |S| — number of tuples in R and S
- M — number of buffer pages available in memory

We use 'page' and 'block' interchangeably. Both refer to the unit of disk I/O (typically 4-16 KB). The notation b_R means 'blocks of R' — the number of disk pages needed to store relation R.
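The page count b_R follows directly from the tuple count, tuple size, and page size; a minimal sketch (8 KB pages and no per-page overhead are simplifying assumptions):

```python
import math

def pages_needed(num_tuples: int, tuple_size: int, page_size: int = 8192) -> int:
    """b_R: pages needed to store a relation, assuming tuples don't span pages
    and ignoring page header overhead."""
    tuples_per_page = page_size // tuple_size
    return math.ceil(num_tuples / tuples_per_page)

# 500,000 tuples of 200 bytes: 40 tuples fit per 8 KB page
print(pages_needed(500_000, 200))  # 12500
```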
The total cost of sort-merge join decomposes into three phases:
Total Cost = Sort R + Sort S + Merge
Phase 1: Sort R (if not already sorted)
Using external merge sort:
Cost_Sort_R = 2 × b_R × (1 + ⌈log_{M-1}(⌈b_R/M⌉)⌉)
Breakdown: the initial pass reads all b_R pages and writes ⌈b_R/M⌉ sorted runs (2 × b_R I/Os); each of the ⌈log_{M-1}(⌈b_R/M⌉)⌉ merge passes again reads and writes every page (another 2 × b_R I/Os per pass).
Phase 2: Sort S (if not already sorted)
Identical formula:
Cost_Sort_S = 2 × b_S × (1 + ⌈log_{M-1}(⌈b_S/M⌉)⌉)
Phase 3: Merge Phase
The merge reads each sorted relation exactly once:
Cost_Merge = b_R + b_S
No writes are counted if we're streaming output to the next pipeline stage. If output is materialized:
Cost_Merge_Materialized = b_R + b_S + b_Output
```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationStats:
    """Statistics for a relation."""
    num_tuples: int   # |R| or |S|
    num_pages: int    # b_R or b_S
    is_sorted: bool   # Already sorted on join key?
    tuple_size: int   # Average tuple size in bytes

def calculate_sort_cost(num_pages: int, buffer_pages: int,
                        is_sorted: bool = False) -> int:
    """
    Calculate I/O cost to sort a relation.
    Returns number of page I/Os required.
    """
    if is_sorted:
        return 0  # No sorting needed

    b = num_pages
    M = buffer_pages

    # Number of initial runs
    num_runs = math.ceil(b / M)

    if num_runs <= 1:
        # Fits in memory - single pass sort
        return 2 * b  # Read once, write once

    # Number of merge passes needed
    # Fan-in is (M-1) because one buffer is for output
    merge_passes = math.ceil(math.log(num_runs) / math.log(M - 1))

    # Total: initial pass + merge passes, each reading and writing all data
    return 2 * b * (1 + merge_passes)

def calculate_merge_cost(stats_R: RelationStats, stats_S: RelationStats,
                         materialize_output: bool = False,
                         output_pages: Optional[int] = None) -> int:
    """
    Calculate I/O cost for merge phase.

    Args:
        stats_R: Statistics for left relation
        stats_S: Statistics for right relation
        materialize_output: Whether to write output to disk
        output_pages: If materializing, number of output pages

    Returns:
        Number of page I/Os for merge phase
    """
    # Read both relations once
    cost = stats_R.num_pages + stats_S.num_pages

    if materialize_output:
        if output_pages is None:
            raise ValueError("Must provide output_pages when materializing")
        cost += output_pages

    return cost

def calculate_sort_merge_join_cost(
        stats_R: RelationStats,
        stats_S: RelationStats,
        buffer_pages: int,
        materialize_output: bool = False,
        output_pages: Optional[int] = None) -> dict:
    """
    Calculate complete sort-merge join cost.
    Returns breakdown of costs by phase.
    """
    sort_R_cost = calculate_sort_cost(
        stats_R.num_pages, buffer_pages, stats_R.is_sorted
    )
    sort_S_cost = calculate_sort_cost(
        stats_S.num_pages, buffer_pages, stats_S.is_sorted
    )
    merge_cost = calculate_merge_cost(
        stats_R, stats_S, materialize_output, output_pages
    )
    total_cost = sort_R_cost + sort_S_cost + merge_cost

    return {
        'sort_R': sort_R_cost,
        'sort_S': sort_S_cost,
        'merge': merge_cost,
        'total': total_cost,
        'breakdown': f"Sort R: {sort_R_cost} + Sort S: {sort_S_cost} "
                     f"+ Merge: {merge_cost} = {total_cost}"
    }

# Example calculations
if __name__ == "__main__":
    # Scenario: Join 100MB table with 500MB table, 10MB buffer
    # Assume 8KB pages
    PAGE_SIZE = 8 * 1024

    R = RelationStats(
        num_tuples=500_000,
        num_pages=100_000_000 // PAGE_SIZE,  # ~12,207 pages
        is_sorted=False,
        tuple_size=200
    )
    S = RelationStats(
        num_tuples=2_500_000,
        num_pages=500_000_000 // PAGE_SIZE,  # ~61,035 pages
        is_sorted=False,
        tuple_size=200
    )
    buffer_pages = 10_000_000 // PAGE_SIZE  # ~1,220 pages

    result = calculate_sort_merge_join_cost(R, S, buffer_pages)
    print(result['breakdown'])
```

For unsorted inputs, sorting typically accounts for 70-90% of the total cost. This is why pre-sorted input (from an index or previous operation) makes sort-merge join dramatically cheaper.
Let's work through several realistic scenarios to build intuition for sort-merge join costs:
Example 1: Small Tables, Moderate Buffer
```python
# Example 1: b_R = 1000, b_S = 500, M = 100

# Sort R:
# - Initial runs: ceil(1000/100) = 10 runs
# - Merge passes: ceil(log_99(10)) = ceil(0.502) = 1 pass
# - Cost: 2 × 1000 × (1 + 1) = 4,000 page I/Os

# Sort S:
# - Initial runs: ceil(500/100) = 5 runs
# - Merge passes: ceil(log_99(5)) = ceil(0.351) = 1 pass
# - Cost: 2 × 500 × (1 + 1) = 2,000 page I/Os

# Merge:
# - Cost: 1000 + 500 = 1,500 page I/Os

# TOTAL: 4,000 + 2,000 + 1,500 = 7,500 page I/Os
```

Example 2: Large Tables, Small Buffer
```python
# Example 2: b_R = 10,000, b_S = 50,000, M = 100

# Sort R:
# - Initial runs: ceil(10,000/100) = 100 runs
# - Merge passes: ceil(log_99(100)) = ceil(1.002) = 2 passes
# - Cost: 2 × 10,000 × (1 + 2) = 60,000 page I/Os

# Sort S:
# - Initial runs: ceil(50,000/100) = 500 runs
# - Merge passes: ceil(log_99(500)) = ceil(1.352) = 2 passes
# - Cost: 2 × 50,000 × (1 + 2) = 300,000 page I/Os

# Merge:
# - Cost: 10,000 + 50,000 = 60,000 page I/Os

# TOTAL: 60,000 + 300,000 + 60,000 = 420,000 page I/Os

# Note: Sorting costs 360,000 of 420,000 total (86%)!
```

Example 3: Pre-Sorted Input (Index Available)
Same tables as Example 2, but both are already sorted on join key:
```python
# Example 3: b_R = 10,000, b_S = 50,000, M = 100
# BOTH RELATIONS ALREADY SORTED

# Sort R: 0 (already sorted)
# Sort S: 0 (already sorted)

# Merge: 10,000 + 50,000 = 60,000 page I/Os

# TOTAL: 60,000 page I/Os

# SAVINGS: 420,000 - 60,000 = 360,000 page I/Os (85.7% reduction!)
```

| Scenario | Sort R | Sort S | Merge | Total | Notes |
|---|---|---|---|---|---|
| Example 1 (small) | 4,000 | 2,000 | 1,500 | 7,500 | 1 merge pass each |
| Example 2 (large) | 60,000 | 300,000 | 60,000 | 420,000 | 2 merge passes each |
| Example 3 (sorted) | 0 | 0 | 60,000 | 60,000 | 85% cheaper than Ex 2 |
In Example 2, with only 100 buffer pages for 50,000-page S, we need 500 initial runs and 2 merge passes. Doubling the buffer to 200 pages would reduce runs to 250, still needing 2 passes. But with 1,000 pages, we'd have 50 runs and only 1 pass—reducing S's sort cost from 300,000 to 200,000 I/Os.
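The buffer-size figures above can be checked directly against the sort-cost formula; a short sketch:

```python
import math

def sort_io_cost(b: int, M: int) -> tuple:
    """Return (initial runs, merge passes, total I/O) for external sort
    of b pages with M buffer pages."""
    if M >= b:
        return 1, 0, 2 * b  # whole relation fits in memory
    runs = math.ceil(b / M)
    passes = math.ceil(math.log(runs) / math.log(M - 1)) if runs > 1 else 0
    return runs, passes, 2 * b * (1 + passes)

# Sorting S from Example 2 (b_S = 50,000) with increasing buffer sizes
for M in (100, 200, 1_000):
    runs, passes, cost = sort_io_cost(50_000, M)
    print(f"M={M}: {runs} runs, {passes} passes, {cost:,} I/Os")
# M=100:  500 runs, 2 passes, 300,000 I/Os
# M=200:  250 runs, 2 passes, 300,000 I/Os
# M=1000:  50 runs, 1 pass,  200,000 I/Os
```

Doubling the buffer to 200 pages halves the run count but not the pass count, so the cost is unchanged; only at 1,000 pages does a merge pass disappear.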
To understand when sort-merge join excels, we must compare it with the alternatives:
Nested Loop Join (Block Nested Loop):
Cost_BNL = b_R + ⌈b_R / (M-2)⌉ × b_S
For each block of R (using M-2 buffers), scans all of S. Very expensive for large tables without indexes.
Hash Join (Simple Hash):
Cost_Hash = 3 × (b_R + b_S) [if both fit after partitioning]
Partition phase: read and write both relations once (2× each).
Probe phase: read both relations once (1× each).
With recursive partitioning for large tables, more passes needed.
Example comparison, with b_R = 1,000, b_S = 5,000, and M = 100:

| Algorithm | Cost Formula | Total I/O | Competitive When |
|---|---|---|---|
| Block Nested Loop | 1,000 + 11 × 5,000 | 56,000 | Small inner table, or index available |
| Hash Join | 3 × (1,000 + 5,000) | 18,000 | Unsorted equi-join, adequate memory |
| Sort-Merge (unsorted) | 2×1,000×2 + 2×5,000×2 + 6,000 | 30,000 | Sorted output needed, or many joins |
| Sort-Merge (sorted) | 0 + 0 + 6,000 | 6,000 | Inputs already sorted (index) |
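The table's totals can be reproduced from the three cost formulas; a sketch using b_R = 1,000, b_S = 5,000, and M = 100 (the figures implied by the table):

```python
import math

b_R, b_S, M = 1_000, 5_000, 100

def sort_cost(b, M):
    """External sort I/O: 2 × b × (1 + merge passes)."""
    if M >= b:
        return 2 * b
    runs = math.ceil(b / M)
    passes = math.ceil(math.log(runs) / math.log(M - 1)) if runs > 1 else 0
    return 2 * b * (1 + passes)

# Block nested loop: one scan of R, plus one scan of S per block-group of R
bnl = b_R + math.ceil(b_R / (M - 2)) * b_S

# Simple hash join: partition (read+write both), then probe (read both)
hash_join = 3 * (b_R + b_S)

# Sort-merge: sort both relations, then one merging scan of each
smj_unsorted = sort_cost(b_R, M) + sort_cost(b_S, M) + b_R + b_S
smj_sorted = b_R + b_S  # inputs already sorted, no sort cost

print(bnl, hash_join, smj_unsorted, smj_sorted)
# 56000 18000 30000 6000
```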
When Sort-Merge Join Wins:

- Inputs are already sorted on the join key (e.g., delivered by an index or a previous sort)
- The query requires sorted output, or later operators can reuse the sort order
- Memory is too constrained for hash join to avoid recursive partitioning
Modern query optimizers evaluate all these factors—input sizes, sort orders, available indexes, memory, and output requirements—to choose the best join algorithm. Understanding the cost model helps you predict and influence these decisions through query hints, index design, and schema optimization.
While I/O dominates for disk-based databases, CPU costs become significant for in-memory databases and fast SSDs. Let's analyze the CPU component of sort-merge join:
Sorting Phase CPU Cost:
In-memory sorting uses O(n log n) comparisons. The total sorting CPU cost is approximately:
CPU_Sort = |R| × log(|R|) × C_compare + Copy_overhead
Merge Phase CPU Cost:
Merging is linear in output size:
CPU_Merge = (|R| + |S|) × C_compare + |Output| × C_concat
```python
import math
from dataclasses import dataclass

@dataclass
class CPUCostParams:
    """CPU cost parameters (in CPU cycles)."""
    key_extract: int = 50      # Extract sort key from tuple
    key_compare: int = 100     # Compare two keys
    tuple_copy: int = 500      # Copy one tuple in memory
    tuple_concat: int = 300    # Concatenate two tuples for output
    heap_operation: int = 200  # Heap push/pop in external sort

def estimate_sort_cpu_cost(num_tuples: int, num_pages: int,
                           buffer_pages: int,
                           params: CPUCostParams) -> int:
    """
    Estimate CPU cycles for external merge sort.

    Includes:
    - In-memory sorting of each run (quicksort)
    - Heap operations for k-way merge
    """
    # Phase 1: In-memory sorting
    # Each run contains approximately (buffer_pages × tuples_per_page) tuples
    tuples_per_page = num_tuples / num_pages
    tuples_per_run = buffer_pages * tuples_per_page
    num_runs = math.ceil(num_pages / buffer_pages)

    # Quicksort: ~1.4 × n × log2(n) comparisons on average
    sort_comparisons = 1.4 * tuples_per_run * math.log2(max(tuples_per_run, 1))
    sort_cost_per_run = sort_comparisons * (
        params.key_extract * 2 + params.key_compare
    )
    total_sort_cost = num_runs * sort_cost_per_run

    # Phase 2: Merge (if multiple runs)
    if num_runs <= 1:
        return int(total_sort_cost)

    # K-way merge: each tuple does log2(K) heap operations
    # for each merge pass
    num_passes = math.ceil(math.log(num_runs) /
                           math.log(max(buffer_pages - 1, 2)))
    heap_cost = num_tuples * num_passes * (
        params.heap_operation * math.log2(min(num_runs, buffer_pages - 1))
        + params.key_compare
    )
    return int(total_sort_cost + heap_cost)

def estimate_merge_cpu_cost(num_R: int, num_S: int, output_tuples: int,
                            params: CPUCostParams) -> int:
    """
    Estimate CPU cycles for merge-join phase.

    Assumes unique keys (no cross-product). With duplicates,
    add |output| × C_concat for concatenation.
    """
    # Direct merge: ~(|R| + |S|) comparisons
    compare_cost = (num_R + num_S) * (
        params.key_extract * 2 + params.key_compare
    )
    # Output concatenation
    output_cost = output_tuples * params.tuple_concat
    return compare_cost + output_cost

# Example: Compare CPU costs
R_tuples = 1_000_000
S_tuples = 5_000_000
R_pages = 10_000
S_pages = 50_000
M = 1_000
output = 500_000  # Estimated join result size

params = CPUCostParams()

sort_R_cpu = estimate_sort_cpu_cost(R_tuples, R_pages, M, params)
sort_S_cpu = estimate_sort_cpu_cost(S_tuples, S_pages, M, params)
merge_cpu = estimate_merge_cpu_cost(R_tuples, S_tuples, output, params)

print(f"Sort R CPU: {sort_R_cpu:,} cycles ({sort_R_cpu/1e9:.2f} billion)")
print(f"Sort S CPU: {sort_S_cpu:,} cycles ({sort_S_cpu/1e9:.2f} billion)")
print(f"Merge CPU: {merge_cpu:,} cycles ({merge_cpu/1e9:.2f} billion)")
# At 3 GHz: cycles / 3e9 = seconds
```

When CPU Costs Matter:

- In-memory databases, where there is no disk I/O to dominate
- Very fast storage (e.g., NVMe SSDs), where I/O latency approaches CPU cost
- Pre-sorted inputs, where the cheap merge phase leaves CPU as the main expense
Modern CPUs offer SIMD instructions that compare multiple keys in parallel. Parallel sorting algorithms use multiple cores to sort chunks concurrently. These optimizations can reduce CPU time by 4-16× but are often hidden inside optimized library functions.
Buffer memory profoundly affects sort-merge join performance. Let's analyze how memory allocation impacts cost:
The Memory Sweet Spot:
External sort cost drops dramatically when memory crosses certain thresholds. Consider sorting a 10,000-page relation:
| Buffer Pages (M) | Initial Runs | Merge Passes | Total I/O | Reduction from M=50 |
|---|---|---|---|---|
| 50 | 200 | 2 | 60,000 | — |
| 100 | 100 | 2 | 60,000 | 0% |
| 200 | 50 | 1 | 40,000 | 33% |
| 500 | 20 | 1 | 40,000 | 33% |
| 1,000 | 10 | 1 | 40,000 | 33% |
| 5,000 | 2 | 1 | 40,000 | 33% |
| 10,000+ | 1 | 0 | 20,000 | 67% |
Observations:
Diminishing returns: Going from 50 to 200 buffer pages reduces cost by 33%. Going from 200 to 1,000 provides no additional benefit (same 1 pass).
Step function: Cost drops at thresholds where merge passes decrease. Between thresholds, extra memory provides minimal benefit.
In-memory threshold: When M ≥ b_R, we avoid merge passes entirely—the biggest single improvement.
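The step function above can be reproduced by sweeping M over the table's values (b = 10,000 pages, as in the table):

```python
import math

def sort_io(b: int, M: int) -> int:
    """Total I/O for external sort of b pages with M buffer pages."""
    if M >= b:
        return 2 * b  # fits in memory: one read pass, one write pass
    runs = math.ceil(b / M)
    passes = math.ceil(math.log(runs) / math.log(M - 1)) if runs > 1 else 0
    return 2 * b * (1 + passes)

b = 10_000
for M in (50, 100, 200, 500, 1_000, 5_000, 10_000):
    print(f"M={M:>6}: {sort_io(b, M):>7,} I/Os")
# Cost stays flat between thresholds and drops only when a merge pass
# is eliminated: 60,000 → 40,000 → 20,000.
```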
Memory Allocation Strategy:
Given a fixed memory budget for a complex query with multiple joins:
```python
import math

def optimal_memory_allocation(b_R: int, b_S: int, total_memory: int) -> dict:
    """
    Allocate buffer memory between two sorts to minimize total cost.

    This is a simplified heuristic. Production optimizers use more
    sophisticated cost-based allocation.
    """
    # Check if we can fit everything
    if total_memory >= b_R + b_S:
        return {
            'R_memory': b_R,
            'S_memory': b_S,
            'merge_memory': 0,
            'strategy': 'both fit in memory'
        }

    # Proportional allocation (simple but effective)
    ratio = b_R / (b_R + b_S)
    R_memory = int(total_memory * ratio)
    S_memory = total_memory - R_memory

    # Ensure minimum allocation for each
    min_allocation = max(3, total_memory // 20)  # At least 3 pages or 5%
    R_memory = max(R_memory, min_allocation)
    S_memory = max(S_memory, min_allocation)

    # Adjust to not exceed total
    if R_memory + S_memory > total_memory:
        excess = (R_memory + S_memory) - total_memory
        if R_memory > S_memory:
            R_memory -= excess
        else:
            S_memory -= excess

    # Calculate resulting merge passes for each allocation
    def passes(b, M):
        if M >= b:
            return 0
        runs = math.ceil(b / M)
        if runs <= 1:
            return 0
        return math.ceil(math.log(runs) / math.log(M - 1))

    return {
        'R_memory': R_memory,
        'S_memory': S_memory,
        'R_passes': passes(b_R, R_memory),
        'S_passes': passes(b_S, S_memory),
        'strategy': 'proportional allocation'
    }

# Example
result = optimal_memory_allocation(
    b_R=10_000, b_S=50_000, total_memory=5_000
)
print(f"R memory: {result['R_memory']} pages ({result['R_passes']} passes)")
print(f"S memory: {result['S_memory']} pages ({result['S_passes']} passes)")
```

In a multi-query environment, multiple operations compete for buffer memory. The optimizer's memory estimates may not match runtime reality. Adaptive execution engines monitor actual memory usage and may switch strategies mid-query if needed.
A solid understanding of sort-merge join costs empowers you to predict performance, design efficient schemas, and understand optimizer decisions. The essential formulas:

- Sort cost per relation: 2 × b × (1 + ⌈log_{M-1}(⌈b/M⌉)⌉) page I/Os
- Merge cost: b_R + b_S page I/Os (plus b_Output if the result is materialized)
- Pre-sorted inputs eliminate the sort terms entirely, leaving only b_R + b_S
Looking Ahead:
With cost analysis complete, we'll examine the blocking nature of sort-merge join—a key characteristic that affects query pipelining and execution strategy. Understanding this property is crucial for designing efficient query execution plans.
You now have a comprehensive understanding of sort-merge join cost analysis—from I/O formulas to CPU considerations, from concrete examples to memory optimization strategies. This quantitative foundation is essential for query optimization and performance engineering.