Sorting is perhaps the most fundamental algorithm in computer science—a problem so well-studied that most developers consider it solved. Call Array.sort(), and the job is done. But this comfortable assumption shatters the moment your data exceeds available memory.
Imagine you need to sort 100 GB of transaction records on a machine with 8 GB of RAM. Suddenly, Array.sort() throws an out-of-memory exception, and you're confronted with a challenge that has occupied some of the brightest minds in computing: external sorting.
External sorting is the foundational algorithm that enables sort-merge join to operate on datasets of arbitrary size. Without it, database systems could only join tables that fit entirely in memory—a limitation that would render them useless for most real-world applications. Understanding external sorting is therefore essential to understanding how databases actually work at scale.
By the end of this page, you will understand the complete theory and practice of external sorting: why memory limitations necessitate fundamentally different algorithms, how external merge sort works at each stage, the mathematics governing its I/O cost, and the engineering optimizations that make it performant in production database systems. This knowledge forms the essential foundation for understanding sort-merge join.
To appreciate external sorting, we must first understand the fundamental limitation it addresses: the memory wall. This isn't merely a resource constraint—it's a fundamental architectural property of computer systems that shapes how we process large datasets.
The Memory Hierarchy:
Modern computer systems exhibit a profound asymmetry between storage tiers:
| Storage Tier | Capacity | Latency | Bandwidth | Cost/GB |
|---|---|---|---|---|
| CPU Registers | ~1 KB | < 1 ns | TB/s | N/A |
| L1 Cache | 32-64 KB | ~1 ns | ~500 GB/s | N/A |
| L2 Cache | 256 KB - 1 MB | ~4 ns | ~200 GB/s | N/A |
| L3 Cache | 8-64 MB | ~12 ns | ~100 GB/s | N/A |
| Main Memory (RAM) | 8-512 GB | ~100 ns | ~50 GB/s | ~$3-5 |
| NVMe SSD | 1-8 TB | ~100 µs | ~7 GB/s | ~$0.10 |
| HDD | 4-20 TB | ~10 ms | ~200 MB/s | ~$0.02 |
Notice the orders-of-magnitude gaps between RAM and disk: an HDD random access at ~10 ms is roughly 100,000× slower than a ~100 ns RAM access, and even an NVMe SSD at ~100 µs is about 1,000× slower. Sequential bandwidth shows a similar, if smaller, disparity.
When your dataset exceeds memory, you're forced to access slower storage. An algorithm that was O(n log n) in memory becomes dominated by I/O costs that dwarf computation. The number of disk accesses—not CPU cycles—becomes the primary cost metric.
For external algorithms, I/O cost is everything. A single random disk read (seek + rotational delay + transfer) takes 10 milliseconds—time in which a modern CPU could execute 40 million instructions. Database algorithms are designed to minimize disk I/O above all else.
Why In-Memory Sorting Fails:
Consider what happens when you try to sort 100 GB of data with 8 GB of RAM using quicksort: the working set no longer fits in memory, the operating system starts paging to disk, and quicksort's scattered partitioning accesses become a stream of random page faults, each costing a disk seek of roughly 10 ms.
This catastrophic performance isn't because quicksort is a bad algorithm—it's brilliant for in-memory data. It's because quicksort's random access pattern is fundamentally incompatible with disk characteristics. External sorting algorithms are designed from the ground up to work with disk architecture, not against it.
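To make that intuition concrete, here is a rough back-of-the-envelope sketch using the HDD figures from the table above (~10 ms per random access, ~200 MB/s sequential). The "one page fault per page" assumption for the thrashing quicksort is deliberately pessimistic and purely illustrative:

```python
# Back-of-envelope: sorting 100 GB with only 8 GB of RAM (illustrative assumptions).
DATA_BYTES = 100 * 2**30
PAGE_BYTES = 8 * 1024
SEEK_MS = 10          # assumed HDD random access time
SEQ_MB_PER_S = 200    # assumed HDD sequential throughput

pages = DATA_BYTES // PAGE_BYTES

# Quicksort on a thrashing virtual-memory system: assume ~one random page fault per page
random_hours = pages * SEEK_MS / 1000 / 3600

# External merge sort: read and write all data twice (run generation + one merge pass)
sequential_hours = 4 * (DATA_BYTES / 2**20) / SEQ_MB_PER_S / 3600

print(f"random-access estimate:   ~{random_hours:.0f} hours")    # roughly a day and a half
print(f"sequential-pass estimate: ~{sequential_hours:.1f} hours")  # well under an hour
```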
External merge sort transforms the sorting problem into a two-phase process specifically designed to exploit sequential disk access patterns:
Phase 1: Run Generation (Sort Phase)
Divide the data into memory-sized chunks, sort each chunk in memory, and write the sorted chunks (called runs) back to disk.
Phase 2: Merge Phase
Repeatedly merge multiple sorted runs into larger sorted runs until a single fully-sorted output remains.
The key insight is that both phases primarily use sequential I/O, which is dramatically faster than random I/O. Let's examine each phase in detail.
```python
import os
from typing import List

PAGE_SIZE = 8 * 1024  # assumed 8 KB pages, matching the cost analysis later on this page


def external_merge_sort(input_file: str, output_file: str, memory_size: int) -> None:
    """
    External merge sort for files larger than available memory.

    Args:
        input_file: Path to unsorted input file
        output_file: Path for sorted output
        memory_size: Available memory in bytes for sorting
    """
    # Phase 1: Generate sorted runs
    run_files = generate_sorted_runs(input_file, memory_size)

    # Phase 2: Merge runs until a single sorted output remains
    while len(run_files) > 1:
        run_files = merge_runs(run_files, memory_size)

    # Final run becomes the output
    os.rename(run_files[0], output_file)


def generate_sorted_runs(input_file: str, memory_size: int) -> List[str]:
    """
    Phase 1: Read chunks, sort in memory, write sorted runs.

    Each run contains approximately (memory_size) bytes of sorted records.
    Uses any efficient in-memory sort (quicksort, heapsort, etc.)

    allocate_buffer, read_records_into_buffer, and write_records_to_file
    are I/O helpers assumed to exist elsewhere.
    """
    runs = []
    chunk_buffer = allocate_buffer(memory_size)

    with open(input_file, 'rb') as f:
        run_number = 0
        while True:
            # Read a memory-sized chunk
            records = read_records_into_buffer(f, chunk_buffer)
            if not records:
                break

            # Sort the chunk in memory using any O(n log n) algorithm
            records.sort(key=lambda r: r.sort_key)

            # Write the sorted run to a temporary file
            run_file = f"run_{run_number}.tmp"
            write_records_to_file(records, run_file)
            runs.append(run_file)
            run_number += 1

    return runs


def merge_runs(run_files: List[str], memory_size: int) -> List[str]:
    """
    Merge multiple sorted runs into fewer, larger sorted runs.

    With B buffer pages, we can merge (B-1) runs at once,
    keeping 1 page for the output buffer.
    """
    # Calculate how many runs we can merge at once:
    # reserve one buffer for output, the rest for input runs
    buffer_pages = memory_size // PAGE_SIZE
    merge_fan_in = buffer_pages - 1

    new_runs = []

    # Process runs in groups of (merge_fan_in)
    for i in range(0, len(run_files), merge_fan_in):
        runs_to_merge = run_files[i:i + merge_fan_in]
        merged_run = multi_way_merge(runs_to_merge)  # K-way merge, as sketched later on this page
        new_runs.append(merged_run)

        # Clean up the runs that were just merged
        for run in runs_to_merge:
            os.remove(run)

    return new_runs
```

Both phases read and write data sequentially. During run generation, we read the input file sequentially and write runs sequentially. During merging, we read each run sequentially and write the output sequentially. This transforms a problem dominated by random disk seeks into one dominated by sequential throughput—a speedup of 100× or more.
The run generation phase is more nuanced than simply reading chunks and sorting them. Sophisticated database systems employ advanced techniques to generate longer initial runs, which reduces the number of merge passes required.
Basic Run Generation:
With M bytes of memory and N bytes of data, basic run generation produces ⌈N/M⌉ runs, each of size M. For 100 GB of data with 1 GB of memory: 100 runs.
Replacement Selection: 2× Longer Runs:
Replacement selection is an ingenious algorithm that produces runs approximately twice the memory size on average, cutting the number of runs in half:
```python
from typing import List


def replacement_selection(input_file: str, memory_size: int) -> List[str]:
    """
    Replacement selection produces runs ~2× memory size on average.

    Uses a priority queue (min-heap) to always output the smallest record
    that can extend the current run.

    MinHeap, read_records, read_one_record, write_record, new_run_file,
    and finalize_run are helpers assumed to exist elsewhere.
    """
    heap = MinHeap(capacity=memory_size)
    runs = []
    current_run_file = new_run_file()
    last_output_key = None

    with open(input_file, 'rb') as f:
        # Initialize the heap with the first memory-sized chunk
        initial_records = read_records(f, memory_size)
        for record in initial_records:
            heap.insert(record, current_run=True)

        while not heap.is_empty():
            # Find the smallest record that can extend the current run
            # (its key must be >= the last output key)
            candidate = heap.get_min_for_current_run(last_output_key)

            if candidate is None:
                # No record can extend the current run - start a new run
                finalize_run(current_run_file)
                runs.append(current_run_file)
                current_run_file = new_run_file()
                last_output_key = None
                heap.reset_run_markers()
                continue

            # Output the candidate to the current run
            write_record(current_run_file, candidate)
            last_output_key = candidate.sort_key
            heap.remove(candidate)

            # Replace it with the next record from input (if available)
            next_record = read_one_record(f)
            if next_record:
                # Mark whether it can join the current run
                can_join_current = next_record.sort_key >= last_output_key
                heap.insert(next_record, current_run=can_join_current)

    if current_run_file:
        finalize_run(current_run_file)
        runs.append(current_run_file)

    return runs
```

Why Replacement Selection Produces Longer Runs:
The key insight is that when a new record arrives that's larger than the last output, we can include it in the current run even if smaller records are waiting. Only when a record arrives that's smaller than the last output do we mark it for the next run.
For uniformly random input, approximately half of incoming records can extend the current run. This cascades: longer runs mean more opportunity for subsequent records to join. Mathematical analysis shows the expected run length is 2M (twice memory size) for random data.
For nearly-sorted input, runs can be dramatically longer—potentially the entire file if data is mostly sorted already.
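To see the 2× effect concretely, here is a minimal in-memory sketch of replacement selection built on Python's heapq. Lists of numbers stand in for disk runs, and the names replacement_selection_runs and memory_capacity are illustrative, not part of any database API:

```python
import heapq
import random
from itertools import islice


def replacement_selection_runs(records, memory_capacity):
    """Generate sorted runs via replacement selection, using heapq.

    Heap entries are (run_number, key): a record too small to extend the
    current run is tagged with the next run number, so it sinks below every
    record still eligible for the current run without breaking heap order.
    """
    it = iter(records)
    heap = [(0, key) for key in islice(it, memory_capacity)]
    heapq.heapify(heap)

    runs, current, current_run = [], [], 0
    while heap:
        run, key = heapq.heappop(heap)
        if run != current_run:            # nothing left that can extend this run
            runs.append(current)
            current, current_run = [], run
        current.append(key)               # emit the smallest eligible key
        nxt = next(it, None)
        if nxt is not None:
            # A new record joins the current run only if it is >= the last output
            heapq.heappush(heap, (current_run if nxt >= key else current_run + 1, nxt))
    if current:
        runs.append(current)
    return runs


random.seed(42)
data = [random.random() for _ in range(50_000)]
runs = replacement_selection_runs(data, memory_capacity=500)
print(len(runs), [len(r) for r in runs[:5]])
# Expect roughly 50_000 / (2 * 500) ≈ 50 runs, i.e. average run length ≈ 2× memory capacity
```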
Replacement selection maintains a heap, whose cache-unfriendly access pattern and per-record bookkeeping make run generation noticeably more CPU-intensive than simply quicksorting a full memory buffer. For CPU-bound scenarios (fast SSDs), this overhead may outweigh the I/O savings from fewer runs. Modern systems often use simple chunk sorting with very fast SSDs, reserving replacement selection for slower storage.
Practical Run Generation Considerations:
The merge phase is where external sort's elegance becomes apparent. Given multiple sorted runs, we need to produce a single sorted output. The multi-way merge algorithm accomplishes this with minimal I/O:
Two-Way Merge (Binary Merge):
The simplest approach merges two runs at a time, just like in standard merge sort. With 100 runs, this requires merging 100 → 50 → 25 → 13 → 7 → 4 → 2 → 1 runs, i.e. ⌈log₂ 100⌉ = 7 merge passes.
Total: 7 passes, each reading and writing all data. For 100 GB, that means roughly 700 GB read and 700 GB written during the merge phase alone.
K-Way Merge:
With K input buffers plus one output buffer, we can merge K runs simultaneously:
```python
import heapq
from dataclasses import dataclass
from typing import Any, List

RECORD_SIZE = 128  # assumed fixed record size for this sketch


@dataclass
class HeapEntry:
    """Entry in the merge heap with comparison support."""
    key: Any          # Sort key
    record: bytes     # Full record
    run_index: int    # Which run this came from

    def __lt__(self, other):
        return self.key < other.key


def k_way_merge(sorted_runs: List[str], output_file: str, buffer_pages: int) -> int:
    """
    Merge K sorted runs using a min-heap.

    Args:
        sorted_runs: List of sorted run file paths
        output_file: Path for merged output
        buffer_pages: Available buffer pages (K+1 needed for K runs)

    Returns:
        Number of records written to the output.

    Time: O(N log K) where N = total records
    I/O:  2N pages (read all + write all)

    BufferedWriter and extract_sort_key are helpers assumed elsewhere.
    """
    K = len(sorted_runs)
    if K > buffer_pages - 1:
        raise ValueError(f"Cannot merge {K} runs with {buffer_pages} buffers")

    # Open all input runs
    run_readers = [BufferedReader(run_file) for run_file in sorted_runs]
    output_writer = BufferedWriter(output_file)

    # Initialize the min-heap with the first record from each run
    heap = []
    for i, reader in enumerate(run_readers):
        record = reader.read_next()
        if record:
            entry = HeapEntry(key=extract_sort_key(record), record=record, run_index=i)
            heapq.heappush(heap, entry)

    # Merge until all runs are exhausted
    records_merged = 0
    while heap:
        # Extract the minimum record
        min_entry = heapq.heappop(heap)
        output_writer.write(min_entry.record)
        records_merged += 1

        # Fetch the next record from the same run
        next_record = run_readers[min_entry.run_index].read_next()
        if next_record:
            next_entry = HeapEntry(
                key=extract_sort_key(next_record),
                record=next_record,
                run_index=min_entry.run_index
            )
            heapq.heappush(heap, next_entry)

    # Flush and close
    output_writer.flush()
    for reader in run_readers:
        reader.close()

    return records_merged


class BufferedReader:
    """Buffered sequential reader for sorted runs."""

    def __init__(self, file_path: str, buffer_size: int = 64 * 1024):
        self.file = open(file_path, 'rb')
        self.buffer_size = buffer_size
        self.buffer = b''
        self.position = 0
        self._fill_buffer()

    def _fill_buffer(self):
        """Read the next chunk from disk into the buffer."""
        self.buffer = self.file.read(self.buffer_size)
        self.position = 0

    def read_next(self) -> bytes:
        """Read the next record from the buffer, refilling as needed."""
        if self.position >= len(self.buffer):
            self._fill_buffer()
            if not self.buffer:
                return None  # EOF
        # Assume fixed-size records for simplicity;
        # real implementations handle variable-length records
        record = self.buffer[self.position:self.position + RECORD_SIZE]
        self.position += RECORD_SIZE
        return record if len(record) == RECORD_SIZE else None

    def close(self):
        self.file.close()
```

K-Way Merge Performance:
With K-way merge, the number of passes is ⌈log_K(R)⌉ where R is the number of initial runs.
For 100 runs with various K values: K = 2 requires ⌈log₂ 100⌉ = 7 passes, K = 10 requires 2 passes, and K = 100 requires just a single pass.
With 1 GB of memory and 8 KB pages, we have 131,072 buffer pages. Using 131,071 for input and 1 for output, we can merge 131,071 runs in a single pass!
This is why external merge sort scales so well: even terabytes of data typically require only 2-3 passes when sufficient buffer memory is available.
Each heap operation (insert, extract-min) is O(log K). With K runs and N total records, the merge takes O(N log K) time. Since K is bounded by the number of buffer pages, log₂ K stays under about 17 even for very large buffer pools, so the heap overhead is trivial compared to I/O costs.
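For quick in-memory experimentation, note that Python's standard library already provides exactly this streaming K-way merge as heapq.merge:

```python
import heapq

# Three small sorted "runs"; heapq.merge lazily yields a single sorted stream,
# keeping only one element per run in memory at a time.
runs = [[1, 4, 9, 12], [2, 3, 10], [5, 6, 7, 8, 11]]
merged = list(heapq.merge(*runs))
print(merged)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```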
Precise cost analysis is essential for query optimization. Let's derive the I/O cost of external merge sort with careful attention to all parameters:
Definitions: N = number of data pages to sort, M = number of buffer pages available in memory, R = number of initial sorted runs, P = number of merge passes.
Phase 1 Cost (Run Generation):
We read all N pages once and write all N pages once: 2N page I/Os.
Number of runs generated: R = ⌈N/M⌉ (basic) or R ≈ ⌈N/(2M)⌉ (with replacement selection)
Phase 2 Cost (Merge Phase):
Each merge pass reads all N pages and writes all N pages: 2N page I/Os per pass.
Number of passes needed: P = ⌈log_{M-1}(R)⌉, since each pass merges up to M−1 runs at a time (one buffer page is reserved for output).
Total Phase 2: 2N × P page I/Os
Total External Sort Cost:
Total I/O = 2N × (1 + P) = 2N × (1 + ⌈log_{M-1}(⌈N/M⌉)⌉) page I/Os
| Data Size | Buffer Memory | Pages (N) | Buffers (M) | Runs (R) | Passes (P) | Total I/O |
|---|---|---|---|---|---|---|
| 1 GB | 64 MB | 131,072 | 8,192 | 16 | 1 | 524,288 (4 GB) |
| 10 GB | 64 MB | 1,310,720 | 8,192 | 160 | 1 | 5,242,880 (40 GB) |
| 100 GB | 64 MB | 13,107,200 | 8,192 | 1,600 | 1 | 52,428,800 (400 GB) |
| 1 TB | 64 MB | 134,217,728 | 8,192 | 16,384 | 2 | 805,306,368 (6 TB) |
| 1 TB | 1 GB | 134,217,728 | 131,072 | 1,024 | 1 | 536,870,912 (4 TB) |
Notice how increasing buffer memory from 64 MB to 1 GB when sorting 1 TB reduces passes from 2 to 1, cutting I/O nearly in half. This is why database systems aggressively request memory for sorting operations—the I/O savings are massive.
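As a sanity check on the numbers above, here is a small calculator for the cost formula. It assumes 8 KB pages, as in the table; external_sort_io is an illustrative helper name, not a database API:

```python
import math

PAGE_SIZE = 8 * 1024        # assumed 8 KB pages, as in the table above
GB, TB = 2**30, 2**40


def external_sort_io(data_bytes, buffer_bytes, replacement_selection=False):
    """Return (initial runs R, merge passes P, total page I/Os) for external merge sort."""
    N = math.ceil(data_bytes / PAGE_SIZE)      # data pages
    M = buffer_bytes // PAGE_SIZE              # buffer pages
    run_size = 2 * M if replacement_selection else M
    R = math.ceil(N / run_size)                # initial sorted runs
    P = math.ceil(math.log(R, M - 1)) if R > 1 else 0   # merge passes with fan-in M-1
    return R, P, 2 * N * (1 + P)               # read + write in run generation and every pass


for label, data, mem in [("1 GB / 64 MB", 1 * GB, 64 * 2**20),
                         ("1 TB / 64 MB", 1 * TB, 64 * 2**20),
                         ("1 TB / 1 GB",  1 * TB, 1 * GB)]:
    R, P, io = external_sort_io(data, mem)
    print(f"{label}: runs={R:,}  passes={P}  total page I/O={io:,}")
# Matches the table: 524,288 / 805,306,368 / 536,870,912 page I/Os respectively
```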
Optimized Cost Formula:
With replacement selection (2× run length) and optimal buffer allocation:
Total I/O = 2N × (1 + ⌈log_{M-1}(⌈N/(2M)⌉)⌉)
This typically saves an entire merge pass—halving the merge-phase I/O when the pass count drops from two to one.
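Continuing with the external_sort_io sketch above, a hypothetical 768 GB sort with 64 MB of sort memory shows the saved pass:

```python
# Hypothetical configuration where replacement selection removes a merge pass
R1, P1, io1 = external_sort_io(768 * GB, 64 * 2**20)
R2, P2, io2 = external_sort_io(768 * GB, 64 * 2**20, replacement_selection=True)
print(P1, P2)                              # 2 passes vs 1: longer runs fit within one merge fan-in
print(f"{io1:,} vs {io2:,} page I/Os")     # one-third fewer total page I/Os in this case
```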
Real-World Considerations:
Asynchronous I/O: Modern systems overlap reads and writes with computation, reducing wall-clock time below the sum of I/O operations.
SSD Random Reads: On SSDs, random reads are only ~10-50× slower than sequential (vs 1000× on HDD). Some algorithms trade fewer sequential passes for more random I/O on SSD.
Compression: Compressed runs reduce I/O volume. At 2:1 compression, effective throughput doubles.
Parallel I/O: Striping data across D disks provides roughly D× aggregate throughput for sorting operations.
The classic external merge sort algorithm has evolved significantly to exploit modern hardware capabilities. Here are the key variants used in production database systems:
```python
from typing import Any, Iterator, List, Tuple


class TournamentTree:
    """
    Loser tree for efficient K-way merge.

    Each internal node stores the "loser" of the comparison played at that
    node; the overall winner is kept at tree[0]. Replacing the winner only
    requires replaying one leaf-to-root path, which reduces comparisons per
    output element from ~2×log₂(K) (binary heap) to log₂(K).
    """

    def __init__(self, sources: List[Iterator]):
        self.K = len(sources)
        self.sources = sources
        # tree[1..K-1] hold the losers of internal matches; tree[0] holds the
        # overall winner. Leaves occupy positions K..2K-1, one per source.
        self.tree = [None] * (2 * self.K)
        self.current_values = [None] * self.K

        # Pull the first value from each source
        for i in range(self.K):
            self.current_values[i] = next(self.sources[i], None)
        self._build_tree()

    def _build_tree(self):
        """Build the initial tournament bottom-up, tracking winners separately."""
        winners = [None] * (2 * self.K)
        for i in range(self.K):
            winners[self.K + i] = i          # leaf i represents source i
        for i in range(self.K - 1, 0, -1):   # internal nodes, bottom-up
            winner, loser = self._compete(winners[2 * i], winners[2 * i + 1])
            self.tree[i] = loser             # loser stays at this node
            winners[i] = winner              # winner propagates toward the root
        self.tree[0] = winners[1]            # overall winner

    def _compete(self, idx1: int, idx2: int) -> Tuple[int, int]:
        """Compare two source indices, return (winner, loser); exhausted sources always lose."""
        val1 = self.current_values[idx1]
        val2 = self.current_values[idx2]
        if val1 is None:
            return (idx2, idx1)
        if val2 is None:
            return (idx1, idx2)
        if val1 <= val2:
            return (idx1, idx2)
        return (idx2, idx1)

    def _replay(self, source_idx: int):
        """Replay the tournament along the path from one leaf to the root - O(log K)."""
        node = (self.K + source_idx) // 2    # parent of this source's leaf
        winner = source_idx
        while node >= 1:
            winner, loser = self._compete(winner, self.tree[node])
            self.tree[node] = loser
            node //= 2
        self.tree[0] = winner

    def pop_min(self) -> Any:
        """
        Remove and return the minimum element, or None when all sources are exhausted.
        Only the affected leaf-to-root path is replayed - O(log K).
        """
        winner_idx = self.tree[0]
        if self.current_values[winner_idx] is None:
            return None                      # all sources exhausted

        result = self.current_values[winner_idx]
        # Advance the winning source and replay its path
        self.current_values[winner_idx] = next(self.sources[winner_idx], None)
        self._replay(winner_idx)
        return result
```

In a binary heap, extracting the minimum requires comparing the replacement with both children at each level—2×log₂(K) comparisons. A tournament tree only needs log₂(K) comparisons because each path from leaf to root is predetermined by the tournament structure.
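A quick usage sketch for the loser-tree class above, merging three small in-memory runs (plain iterators stand in for buffered run readers):

```python
# Merge three sorted streams with the tournament tree sketched above
runs = [iter([1, 4, 9, 12]), iter([2, 3, 10]), iter([5, 6, 7, 8, 11])]
tree = TournamentTree(runs)

merged = []
while True:
    value = tree.pop_min()
    if value is None:        # all sources exhausted
        break
    merged.append(value)

print(merged)   # [1, 2, 3, ..., 12]
```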
External sorting is the foundation upon which sort-merge join stands. Let's consolidate the essential knowledge: external merge sort works in two phases, run generation followed by merging; both phases rely on sequential rather than random I/O; the total cost is 2N × (1 + P) page I/Os, where the number of merge passes P drops rapidly as buffer memory grows; and optimizations such as replacement selection, high merge fan-in, and tournament trees reduce both the number of passes and the CPU cost per record.
Looking Ahead:
With external sorting mastered, we're ready to explore the merge phase of sort-merge join—where two sorted relations are combined to produce join results. The next page examines how the merge algorithm efficiently identifies matching tuples while maintaining the streaming, sequential access pattern that makes this join method so powerful.
You now understand external sorting—the algorithm enabling database systems to sort datasets of virtually unlimited size. This knowledge is essential for understanding why sort-merge join scales so effectively and when it's the optimal join strategy.