Sorting is perhaps the most fundamental algorithm in computer science—a problem so well-studied that most developers consider it solved. Call Array.sort(), and the job is done. But this comfortable assumption shatters the moment your data exceeds available memory.
Imagine you need to sort 100 GB of transaction records on a machine with 8 GB of RAM. Suddenly, Array.sort() throws an out-of-memory exception, and you're confronted with a challenge that has occupied some of the brightest minds in computing: external sorting.
External sorting is the foundational algorithm that enables sort-merge join to operate on datasets of arbitrary size. Without it, database systems could only join tables that fit entirely in memory—a limitation that would render them useless for most real-world applications. Understanding external sorting is therefore essential to understanding how databases actually work at scale.
By the end of this page, you will understand the complete theory and practice of external sorting: why memory limitations necessitate fundamentally different algorithms, how external merge sort works at each stage, the mathematics governing its I/O cost, and the engineering optimizations that make it performant in production database systems. This knowledge forms the essential foundation for understanding sort-merge join.
To appreciate external sorting, we must first understand the fundamental limitation it addresses: the memory wall. This isn't merely a resource constraint—it's a fundamental architectural property of computer systems that shapes how we process large datasets.
The Memory Hierarchy:
Modern computer systems exhibit a profound asymmetry between storage tiers:
| Storage Tier | Capacity | Latency | Bandwidth | Cost/GB |
|---|---|---|---|---|
| CPU Registers | ~1 KB | < 1 ns | TB/s | N/A |
| L1 Cache | 32-64 KB | ~1 ns | ~500 GB/s | N/A |
| L2 Cache | 256 KB - 1 MB | ~4 ns | ~200 GB/s | N/A |
| L3 Cache | 8-64 MB | ~12 ns | ~100 GB/s | N/A |
| Main Memory (RAM) | 8-512 GB | ~100 ns | ~50 GB/s | ~$3-5 |
| NVMe SSD | 1-8 TB | ~100 µs | ~7 GB/s | ~$0.10 |
| HDD | 4-20 TB | ~10 ms | ~200 MB/s | ~$0.02 |
Notice the orders-of-magnitude gaps between RAM and disk: an HDD random access at ~10 ms is roughly 100,000× slower than a ~100 ns RAM access, and even an NVMe SSD at ~100 µs is about 1,000× slower. Sequential bandwidth shows a similar, if smaller, disparity.
When your dataset exceeds memory, you're forced to access slower storage. An algorithm that was O(n log n) in memory becomes dominated by I/O costs that dwarf computation. The number of disk accesses—not CPU cycles—becomes the primary cost metric.
For external algorithms, I/O cost is everything. A single random disk read (seek + rotational delay + transfer) takes 10 milliseconds—time in which a modern CPU could execute 40 million instructions. Database algorithms are designed to minimize disk I/O above all else.
Why In-Memory Sorting Fails:
Consider what happens when you try to sort 100 GB of data with 8 GB of RAM using quicksort: the working set no longer fits in memory, the operating system starts paging to disk, and quicksort's scattered partitioning accesses become a stream of random page faults, each costing a disk seek of roughly 10 ms.
This catastrophic performance isn't because quicksort is a bad algorithm—it's brilliant for in-memory data. It's because quicksort's random access pattern is fundamentally incompatible with disk characteristics. External sorting algorithms are designed from the ground up to work with disk architecture, not against it.
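To make that intuition concrete, here is a rough back-of-the-envelope sketch using the HDD figures from the table above (~10 ms per random access, ~200 MB/s sequential). The "one page fault per page" assumption for the thrashing quicksort is deliberately pessimistic and purely illustrative:

```python
# Back-of-envelope: sorting 100 GB with only 8 GB of RAM (illustrative assumptions).
DATA_BYTES = 100 * 2**30
PAGE_BYTES = 8 * 1024
SEEK_MS = 10          # assumed HDD random access time
SEQ_MB_PER_S = 200    # assumed HDD sequential throughput

pages = DATA_BYTES // PAGE_BYTES

# Quicksort on a thrashing virtual-memory system: assume ~one random page fault per page
random_hours = pages * SEEK_MS / 1000 / 3600

# External merge sort: read and write all data twice (run generation + one merge pass)
sequential_hours = 4 * (DATA_BYTES / 2**20) / SEQ_MB_PER_S / 3600

print(f"random-access estimate:   ~{random_hours:.0f} hours")    # roughly a day and a half
print(f"sequential-pass estimate: ~{sequential_hours:.1f} hours")  # well under an hour
```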
External merge sort transforms the sorting problem into a two-phase process specifically designed to exploit sequential disk access patterns:
Phase 1: Run Generation (Sort Phase)
Divide the data into memory-sized chunks, sort each chunk in memory, and write the sorted chunks (called runs) back to disk.
Phase 2: Merge Phase
Repeatedly merge multiple sorted runs into larger sorted runs until a single fully-sorted output remains.
The key insight is that both phases primarily use sequential I/O, which is dramatically faster than random I/O. Let's examine each phase in detail.
```python
import os
from typing import List

PAGE_SIZE = 8 * 1024  # assumed 8 KB pages, matching the cost analysis later on this page


def external_merge_sort(input_file: str, output_file: str, memory_size: int) -> None:
    """
    External merge sort for files larger than available memory.

    Args:
        input_file: Path to unsorted input file
        output_file: Path for sorted output
        memory_size: Available memory in bytes for sorting
    """
    # Phase 1: Generate sorted runs
    run_files = generate_sorted_runs(input_file, memory_size)

    # Phase 2: Merge runs until a single sorted output remains
    while len(run_files) > 1:
        run_files = merge_runs(run_files, memory_size)

    # Final run becomes the output
    os.rename(run_files[0], output_file)


def generate_sorted_runs(input_file: str, memory_size: int) -> List[str]:
    """
    Phase 1: Read chunks, sort in memory, write sorted runs.

    Each run contains approximately (memory_size) bytes of sorted records.
    Uses any efficient in-memory sort (quicksort, heapsort, etc.)

    allocate_buffer, read_records_into_buffer, and write_records_to_file
    are I/O helpers assumed to exist elsewhere.
    """
    runs = []
    chunk_buffer = allocate_buffer(memory_size)

    with open(input_file, 'rb') as f:
        run_number = 0
        while True:
            # Read a memory-sized chunk
            records = read_records_into_buffer(f, chunk_buffer)
            if not records:
                break

            # Sort the chunk in memory using any O(n log n) algorithm
            records.sort(key=lambda r: r.sort_key)

            # Write the sorted run to a temporary file
            run_file = f"run_{run_number}.tmp"
            write_records_to_file(records, run_file)
            runs.append(run_file)
            run_number += 1

    return runs


def merge_runs(run_files: List[str], memory_size: int) -> List[str]:
    """
    Merge multiple sorted runs into fewer, larger sorted runs.

    With B buffer pages, we can merge (B-1) runs at once,
    keeping 1 page for the output buffer.
    """
    # Calculate how many runs we can merge at once:
    # reserve one buffer for output, the rest for input runs
    buffer_pages = memory_size // PAGE_SIZE
    merge_fan_in = buffer_pages - 1

    new_runs = []

    # Process runs in groups of (merge_fan_in)
    for i in range(0, len(run_files), merge_fan_in):
        runs_to_merge = run_files[i:i + merge_fan_in]
        merged_run = multi_way_merge(runs_to_merge)  # K-way merge, as sketched later on this page
        new_runs.append(merged_run)

        # Clean up the runs that were just merged
        for run in runs_to_merge:
            os.remove(run)

    return new_runs
```

Both phases read and write data sequentially. During run generation, we read the input file sequentially and write runs sequentially. During merging, we read each run sequentially and write the output sequentially. This transforms a problem dominated by random disk seeks into one dominated by sequential throughput—a speedup of 100× or more.
The run generation phase is more nuanced than simply reading chunks and sorting them. Sophisticated database systems employ advanced techniques to generate longer initial runs, which reduces the number of merge passes required.
Basic Run Generation:
With M bytes of memory and N bytes of data, basic run generation produces ⌈N/M⌉ runs, each of size M. For 100 GB of data with 1 GB of memory: 100 runs.
Replacement Selection: 2× Longer Runs:
Replacement selection is an ingenious algorithm that produces runs approximately twice the memory size on average, cutting the number of runs in half:
```python
from typing import List


def replacement_selection(input_file: str, memory_size: int) -> List[str]:
    """
    Replacement selection produces runs ~2× memory size on average.

    Uses a priority queue (min-heap) to always output the smallest record
    that can extend the current run.

    MinHeap, read_records, read_one_record, write_record, new_run_file,
    and finalize_run are helpers assumed to exist elsewhere.
    """
    heap = MinHeap(capacity=memory_size)
    runs = []
    current_run_file = new_run_file()
    last_output_key = None

    with open(input_file, 'rb') as f:
        # Initialize the heap with the first memory-sized chunk
        initial_records = read_records(f, memory_size)
        for record in initial_records:
            heap.insert(record, current_run=True)

        while not heap.is_empty():
            # Find the smallest record that can extend the current run
            # (its key must be >= the last output key)
            candidate = heap.get_min_for_current_run(last_output_key)

            if candidate is None:
                # No record can extend the current run - start a new run
                finalize_run(current_run_file)
                runs.append(current_run_file)
                current_run_file = new_run_file()
                last_output_key = None
                heap.reset_run_markers()
                continue

            # Output the candidate to the current run
            write_record(current_run_file, candidate)
            last_output_key = candidate.sort_key
            heap.remove(candidate)

            # Replace it with the next record from input (if available)
            next_record = read_one_record(f)
            if next_record:
                # Mark whether it can join the current run
                can_join_current = next_record.sort_key >= last_output_key
                heap.insert(next_record, current_run=can_join_current)

    if current_run_file:
        finalize_run(current_run_file)
        runs.append(current_run_file)

    return runs
```

Why Replacement Selection Produces Longer Runs:
The key insight is that when a new record arrives that's larger than the last output, we can include it in the current run even if smaller records are waiting. Only when a record arrives that's smaller than the last output do we mark it for the next run.
For uniformly random input, approximately half of incoming records can extend the current run. This cascades: longer runs mean more opportunity for subsequent records to join. Mathematical analysis shows the expected run length is 2M (twice memory size) for random data.
For nearly-sorted input, runs can be dramatically longer—potentially the entire file if data is mostly sorted already.
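To see the 2× effect concretely, here is a minimal in-memory sketch of replacement selection built on Python's heapq. Lists of numbers stand in for disk runs, and the names replacement_selection_runs and memory_capacity are illustrative, not part of any database API:

```python
import heapq
import random
from itertools import islice


def replacement_selection_runs(records, memory_capacity):
    """Generate sorted runs via replacement selection, using heapq.

    Heap entries are (run_number, key): a record too small to extend the
    current run is tagged with the next run number, so it sinks below every
    record still eligible for the current run without breaking heap order.
    """
    it = iter(records)
    heap = [(0, key) for key in islice(it, memory_capacity)]
    heapq.heapify(heap)

    runs, current, current_run = [], [], 0
    while heap:
        run, key = heapq.heappop(heap)
        if run != current_run:            # nothing left that can extend this run
            runs.append(current)
            current, current_run = [], run
        current.append(key)               # emit the smallest eligible key
        nxt = next(it, None)
        if nxt is not None:
            # A new record joins the current run only if it is >= the last output
            heapq.heappush(heap, (current_run if nxt >= key else current_run + 1, nxt))
    if current:
        runs.append(current)
    return runs


random.seed(42)
data = [random.random() for _ in range(50_000)]
runs = replacement_selection_runs(data, memory_capacity=500)
print(len(runs), [len(r) for r in runs[:5]])
# Expect roughly 50_000 / (2 * 500) ≈ 50 runs, i.e. average run length ≈ 2× memory capacity
```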
Replacement selection maintains a heap, whose cache-unfriendly access pattern and per-record bookkeeping make run generation noticeably more CPU-intensive than simply quicksorting a full memory buffer. For CPU-bound scenarios (fast SSDs), this overhead may outweigh the I/O savings from fewer runs. Modern systems often use simple chunk sorting with very fast SSDs, reserving replacement selection for slower storage.
Practical Run Generation Considerations:
The merge phase is where external sort's elegance becomes apparent. Given multiple sorted runs, we need to produce a single sorted output. The multi-way merge algorithm accomplishes this with minimal I/O:
Two-Way Merge (Binary Merge):
The simplest approach merges two runs at a time, just like in standard merge sort. With 100 runs, this requires merging 100 → 50 → 25 → 13 → 7 → 4 → 2 → 1 runs, i.e. ⌈log₂ 100⌉ = 7 merge passes.
Total: 7 passes, each reading and writing all data. For 100 GB, that means roughly 700 GB read and 700 GB written during the merge phase alone.
K-Way Merge:
With K input buffers plus one output buffer, we can merge K runs simultaneously:
```python
import heapq
from dataclasses import dataclass
from typing import Any, List

RECORD_SIZE = 128  # assumed fixed record size for this sketch


@dataclass
class HeapEntry:
    """Entry in the merge heap with comparison support."""
    key: Any          # Sort key
    record: bytes     # Full record
    run_index: int    # Which run this came from

    def __lt__(self, other):
        return self.key < other.key


def k_way_merge(sorted_runs: List[str], output_file: str, buffer_pages: int) -> int:
    """
    Merge K sorted runs using a min-heap.

    Args:
        sorted_runs: List of sorted run file paths
        output_file: Path for merged output
        buffer_pages: Available buffer pages (K+1 needed for K runs)

    Returns:
        Number of records written to the output.

    Time: O(N log K) where N = total records
    I/O:  2N pages (read all + write all)

    BufferedWriter and extract_sort_key are helpers assumed elsewhere.
    """
    K = len(sorted_runs)
    if K > buffer_pages - 1:
        raise ValueError(f"Cannot merge {K} runs with {buffer_pages} buffers")

    # Open all input runs
    run_readers = [BufferedReader(run_file) for run_file in sorted_runs]
    output_writer = BufferedWriter(output_file)

    # Initialize the min-heap with the first record from each run
    heap = []
    for i, reader in enumerate(run_readers):
        record = reader.read_next()
        if record:
            entry = HeapEntry(key=extract_sort_key(record), record=record, run_index=i)
            heapq.heappush(heap, entry)

    # Merge until all runs are exhausted
    records_merged = 0
    while heap:
        # Extract the minimum record
        min_entry = heapq.heappop(heap)
        output_writer.write(min_entry.record)
        records_merged += 1

        # Fetch the next record from the same run
        next_record = run_readers[min_entry.run_index].read_next()
        if next_record:
            next_entry = HeapEntry(
                key=extract_sort_key(next_record),
                record=next_record,
                run_index=min_entry.run_index
            )
            heapq.heappush(heap, next_entry)

    # Flush and close
    output_writer.flush()
    for reader in run_readers:
        reader.close()

    return records_merged


class BufferedReader:
    """Buffered sequential reader for sorted runs."""

    def __init__(self, file_path: str, buffer_size: int = 64 * 1024):
        self.file = open(file_path, 'rb')
        self.buffer_size = buffer_size
        self.buffer = b''
        self.position = 0
        self._fill_buffer()

    def _fill_buffer(self):
        """Read the next chunk from disk into the buffer."""
        self.buffer = self.file.read(self.buffer_size)
        self.position = 0

    def read_next(self) -> bytes:
        """Read the next record from the buffer, refilling as needed."""
        if self.position >= len(self.buffer):
            self._fill_buffer()
            if not self.buffer:
                return None  # EOF
        # Assume fixed-size records for simplicity;
        # real implementations handle variable-length records
        record = self.buffer[self.position:self.position + RECORD_SIZE]
        self.position += RECORD_SIZE
        return record if len(record) == RECORD_SIZE else None

    def close(self):
        self.file.close()
```

K-Way Merge Performance:
With K-way merge, the number of passes is ⌈log_K(R)⌉ where R is the number of initial runs.
For 100 runs with various K values: K = 2 requires ⌈log₂ 100⌉ = 7 passes, K = 10 requires 2 passes, and K = 100 requires just a single pass.
With 1 GB of memory and 8 KB pages, we have 131,072 buffer pages. Using 131,071 for input and 1 for output, we can merge 131,071 runs in a single pass!
This is why external merge sort scales so well: even terabytes of data typically require only 2-3 passes when sufficient buffer memory is available.
Each heap operation (insert, extract-min) is O(log K). With K runs and N total records, the merge takes O(N log K) time. Since K is bounded by the number of buffer pages, log₂ K stays under about 17 even for very large buffer pools, so the heap overhead is trivial compared to I/O costs.
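For quick in-memory experimentation, note that Python's standard library already provides exactly this streaming K-way merge as heapq.merge:

```python
import heapq

# Three small sorted "runs"; heapq.merge lazily yields a single sorted stream,
# keeping only one element per run in memory at a time.
runs = [[1, 4, 9, 12], [2, 3, 10], [5, 6, 7, 8, 11]]
merged = list(heapq.merge(*runs))
print(merged)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```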
Precise cost analysis is essential for query optimization. Let's derive the I/O cost of external merge sort with careful attention to all parameters:
Definitions: N = number of data pages to sort, M = number of buffer pages available in memory, R = number of initial sorted runs, P = number of merge passes.
Phase 1 Cost (Run Generation):
We read all N pages once and write all N pages once: 2N page I/Os.
Number of runs generated: R = ⌈N/M⌉ (basic) or R ≈ ⌈N/(2M)⌉ (with replacement selection)
Phase 2 Cost (Merge Phase):
Each merge pass reads all N pages and writes all N pages: 2N page I/Os per pass.
Number of passes needed: P = ⌈log_{M-1}(R)⌉, since each pass merges up to M−1 runs at a time (one buffer page is reserved for output).
Total Phase 2: 2N × P page I/Os
Total External Sort Cost:
Total I/O = 2N × (1 + P) = 2N × (1 + ⌈log_{M-1}(⌈N/M⌉)⌉) page I/Os
| Data Size | Buffer Memory | Pages (N) | Buffers (M) | Runs (R) | Passes (P) | Total I/O |
|---|---|---|---|---|---|---|
| 1 GB | 64 MB | 131,072 | 8,192 | 16 | 1 | 524,288 (4 GB) |
| 10 GB | 64 MB | 1,310,720 | 8,192 | 160 | 1 | 5,242,880 (40 GB) |
| 100 GB | 64 MB | 13,107,200 | 8,192 | 1,600 | 1 | 52,428,800 (400 GB) |
| 1 TB | 64 MB | 134,217,728 | 8,192 | 16,384 | 2 | 805,306,368 (6 TB) |
| 1 TB | 1 GB | 134,217,728 | 131,072 | 1,024 | 1 | 536,870,912 (4 TB) |
Notice how increasing buffer memory from 64 MB to 1 GB when sorting 1 TB reduces passes from 2 to 1, cutting I/O nearly in half. This is why database systems aggressively request memory for sorting operations—the I/O savings are massive.
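As a sanity check on the numbers above, here is a small calculator for the cost formula. It assumes 8 KB pages, as in the table; external_sort_io is an illustrative helper name, not a database API:

```python
import math

PAGE_SIZE = 8 * 1024        # assumed 8 KB pages, as in the table above
GB, TB = 2**30, 2**40


def external_sort_io(data_bytes, buffer_bytes, replacement_selection=False):
    """Return (initial runs R, merge passes P, total page I/Os) for external merge sort."""
    N = math.ceil(data_bytes / PAGE_SIZE)      # data pages
    M = buffer_bytes // PAGE_SIZE              # buffer pages
    run_size = 2 * M if replacement_selection else M
    R = math.ceil(N / run_size)                # initial sorted runs
    P = math.ceil(math.log(R, M - 1)) if R > 1 else 0   # merge passes with fan-in M-1
    return R, P, 2 * N * (1 + P)               # read + write in run generation and every pass


for label, data, mem in [("1 GB / 64 MB", 1 * GB, 64 * 2**20),
                         ("1 TB / 64 MB", 1 * TB, 64 * 2**20),
                         ("1 TB / 1 GB",  1 * TB, 1 * GB)]:
    R, P, io = external_sort_io(data, mem)
    print(f"{label}: runs={R:,}  passes={P}  total page I/O={io:,}")
# Matches the table: 524,288 / 805,306,368 / 536,870,912 page I/Os respectively
```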
Optimized Cost Formula:
With replacement selection (2× run length) and optimal buffer allocation:
Total I/O = 2N × (1 + ⌈log_{M-1}(⌈N/(2M)⌉)⌉)
This typically saves an entire merge pass—halving the merge-phase I/O when the pass count drops from two to one.
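Continuing with the external_sort_io sketch above, a hypothetical 768 GB sort with 64 MB of sort memory shows the saved pass:

```python
# Hypothetical configuration where replacement selection removes a merge pass
R1, P1, io1 = external_sort_io(768 * GB, 64 * 2**20)
R2, P2, io2 = external_sort_io(768 * GB, 64 * 2**20, replacement_selection=True)
print(P1, P2)                              # 2 passes vs 1: longer runs fit within one merge fan-in
print(f"{io1:,} vs {io2:,} page I/Os")     # one-third fewer total page I/Os in this case
```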
Real-World Considerations:
Asynchronous I/O: Modern systems overlap reads and writes with computation, reducing wall-clock time below the sum of I/O operations.
SSD Random Reads: On SSDs, random reads are only ~10-50× slower than sequential (vs 1000× on HDD). Some algorithms trade fewer sequential passes for more random I/O on SSD.
Compression: Compressed runs reduce I/O volume. At 2:1 compression, effective throughput doubles.
Parallel I/O: Striping data across D disks provides roughly D× aggregate throughput for sorting operations.
The classic external merge sort algorithm has evolved significantly to exploit modern hardware capabilities. Here are the key variants used in production database systems:
```python
from typing import Any, Iterator, List, Tuple


class TournamentTree:
    """
    Loser tree for efficient K-way merge.

    Each internal node stores the "loser" of the comparison played at that
    node; the overall winner is kept at tree[0]. Replacing the winner only
    requires replaying one leaf-to-root path, which reduces comparisons per
    output element from ~2×log₂(K) (binary heap) to log₂(K).
    """

    def __init__(self, sources: List[Iterator]):
        self.K = len(sources)
        self.sources = sources
        # tree[1..K-1] hold the losers of internal matches; tree[0] holds the
        # overall winner. Leaves occupy positions K..2K-1, one per source.
        self.tree = [None] * (2 * self.K)
        self.current_values = [None] * self.K

        # Pull the first value from each source
        for i in range(self.K):
            self.current_values[i] = next(self.sources[i], None)
        self._build_tree()

    def _build_tree(self):
        """Build the initial tournament bottom-up, tracking winners separately."""
        winners = [None] * (2 * self.K)
        for i in range(self.K):
            winners[self.K + i] = i          # leaf i represents source i
        for i in range(self.K - 1, 0, -1):   # internal nodes, bottom-up
            winner, loser = self._compete(winners[2 * i], winners[2 * i + 1])
            self.tree[i] = loser             # loser stays at this node
            winners[i] = winner              # winner propagates toward the root
        self.tree[0] = winners[1]            # overall winner

    def _compete(self, idx1: int, idx2: int) -> Tuple[int, int]:
        """Compare two source indices, return (winner, loser); exhausted sources always lose."""
        val1 = self.current_values[idx1]
        val2 = self.current_values[idx2]
        if val1 is None:
            return (idx2, idx1)
        if val2 is None:
            return (idx1, idx2)
        if val1 <= val2:
            return (idx1, idx2)
        return (idx2, idx1)

    def _replay(self, source_idx: int):
        """Replay the tournament along the path from one leaf to the root - O(log K)."""
        node = (self.K + source_idx) // 2    # parent of this source's leaf
        winner = source_idx
        while node >= 1:
            winner, loser = self._compete(winner, self.tree[node])
            self.tree[node] = loser
            node //= 2
        self.tree[0] = winner

    def pop_min(self) -> Any:
        """
        Remove and return the minimum element, or None when all sources are exhausted.
        Only the affected leaf-to-root path is replayed - O(log K).
        """
        winner_idx = self.tree[0]
        if self.current_values[winner_idx] is None:
            return None                      # all sources exhausted

        result = self.current_values[winner_idx]
        # Advance the winning source and replay its path
        self.current_values[winner_idx] = next(self.sources[winner_idx], None)
        self._replay(winner_idx)
        return result
```

In a binary heap, extracting the minimum requires comparing the replacement with both children at each level—2×log₂(K) comparisons. A tournament tree only needs log₂(K) comparisons because each path from leaf to root is predetermined by the tournament structure.
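A quick usage sketch for the loser-tree class above, merging three small in-memory runs (plain iterators stand in for buffered run readers):

```python
# Merge three sorted streams with the tournament tree sketched above
runs = [iter([1, 4, 9, 12]), iter([2, 3, 10]), iter([5, 6, 7, 8, 11])]
tree = TournamentTree(runs)

merged = []
while True:
    value = tree.pop_min()
    if value is None:        # all sources exhausted
        break
    merged.append(value)

print(merged)   # [1, 2, 3, ..., 12]
```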
External sorting is the foundation upon which sort-merge join stands. Let's consolidate the essential knowledge: external merge sort works in two phases, run generation followed by merging; both phases rely on sequential rather than random I/O; the total cost is 2N × (1 + P) page I/Os, where the number of merge passes P drops rapidly as buffer memory grows; and optimizations such as replacement selection, high merge fan-in, and tournament trees reduce both the number of passes and the CPU cost per record.
Looking Ahead:
With external sorting mastered, we're ready to explore the merge phase of sort-merge join—where two sorted relations are combined to produce join results. The next page examines how the merge algorithm efficiently identifies matching tuples while maintaining the streaming, sequential access pattern that makes this join method so powerful.
You now understand external sorting—the algorithm enabling database systems to sort datasets of virtually unlimited size. This knowledge is essential for understanding why sort-merge join scales so effectively and when it's the optimal join strategy.