Sorting is arguably the most fundamental operation in database systems. It appears everywhere: ORDER BY clauses, merge joins, sort-based aggregation, duplicate elimination, index construction, and B-tree maintenance. Understanding sorting in the database context is essential because database sorting has unique characteristics that distinguish it from general-purpose sorting.
The Database Sorting Challenge:
Unlike in-memory algorithm courses where we sort arrays of integers, database sorting faces distinct challenges: datasets far larger than memory, variable-length multi-column keys, wide tuples that are expensive to move, and I/O costs that dwarf CPU comparison costs.
By the end of this page, you will understand how databases implement sorting at scale. You'll master external merge sort, the workhorse algorithm for disk-based sorting. You'll learn about replacement selection, run generation optimization, and modern techniques for cache-efficient sorting. You'll understand when different strategies excel and how to analyze sorting costs.
When data fits in memory, databases use efficient in-memory sorting algorithms. The choice of algorithm depends on data characteristics and hardware.
Common In-Memory Algorithms:
Quicksort: The default choice for many systems
Merge Sort: Used when stability is required
Radix Sort: For integer or fixed-width keys
```
DATABASE_QUICKSORT(tuples, key_extractor, comparator):
    // Optimized quicksort for database tuples
    if len(tuples) <= INSERTION_THRESHOLD:   // ~16-32 elements
        insertion_sort(tuples, key_extractor, comparator)
        return

    // Median-of-three pivot selection (avoid worst case)
    pivot = median_of_three(
        key_extractor(tuples[0]),
        key_extractor(tuples[len/2]),
        key_extractor(tuples[len-1])
    )

    // Three-way partition (handle duplicates efficiently)
    (lt, eq, gt) = three_way_partition(tuples, pivot, key_extractor)

    // Recurse on smaller partition first (limits stack depth)
    if len(lt) < len(gt):
        DATABASE_QUICKSORT(lt, key_extractor, comparator)
        DATABASE_QUICKSORT(gt, key_extractor, comparator)
    else:
        DATABASE_QUICKSORT(gt, key_extractor, comparator)
        DATABASE_QUICKSORT(lt, key_extractor, comparator)
    // Equal elements are already in final position

// Key optimization: Extract and cache keys
CACHE_AWARE_SORT(tuples, key_columns):
    // Extract keys once, avoid repeated extraction
    keys = [(extract_key(t, key_columns), i) for i, t in enumerate(tuples)]

    // Sort keys (smaller than full tuples, better cache use)
    sort(keys)

    // Reorder tuples according to sorted key positions
    return [tuples[k[1]] for k in keys]
```

Database-Specific Optimizations:
Key Normalization: Transform complex keys (variable-length strings, multiple columns) into a fixed-length binary format that supports direct byte comparison. This enables a single memcmp-style comparison per pair, avoids re-interpreting types and collations on every comparison, and opens the door to radix-style techniques (see the sketch after this list).
Pointer Sorting: For wide tuples, sort an array of pointers (or record IDs) rather than the tuples themselves, so swaps move a few bytes instead of entire rows; the payload is touched only once, when the sorted output is materialized.
Prefix Sorting: Store a short key prefix alongside each pointer, so most comparisons are resolved from the cache-resident prefix and only ties require dereferencing the full tuple.
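A minimal Python sketch of key normalization combined with pointer/prefix-style sorting. It is illustrative only: real engines work on raw pages rather than Python objects, and they handle collations and ties on truncated prefixes, which this sketch ignores.

```python
import struct

def normalize_key(status: str, amount: int) -> bytes:
    """Encode a (VARCHAR, signed INT) key so that byte order == key order."""
    # Fixed-width, NUL-padded string segment keeps comparisons purely byte-wise.
    s = status.encode("utf-8")[:8].ljust(8, b"\x00")
    # Flip the sign bit so signed integers sort correctly as unsigned bytes.
    a = struct.pack(">Q", (amount + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)
    return s + a

rows = [
    {"status": "shipped", "amount": -5, "payload": "..." * 100},
    {"status": "new",     "amount": 42, "payload": "..." * 100},
    {"status": "new",     "amount": -7, "payload": "..." * 100},
]

# Pointer/prefix idea: sort small (normalized key, row index) pairs,
# then permute the wide rows once at the end.
keyed = [(normalize_key(r["status"], r["amount"]), i) for i, r in enumerate(rows)]
keyed.sort()                                  # compares bytes only; rows never move
sorted_rows = [rows[i] for _, i in keyed]

print([(r["status"], r["amount"]) for r in sorted_rows])
# [('new', -7), ('new', 42), ('shipped', -5)]
```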
On modern hardware, cache misses often dominate sorting cost. A cache-oblivious merge sort or cache-aware quicksort can outperform theoretically faster algorithms. The key insight: keeping the working set in L2/L3 cache (megabytes) yields a 10-100x speedup over access patterns that constantly miss the cache and fall through to main memory.
When data exceeds available memory, external merge sort is the dominant algorithm. It's elegant, predictable, and optimized for disk I/O patterns.
High-Level Algorithm:
External merge sort proceeds in two main phases:
Phase 1: Run Generation (Create Sorted Runs)
Phase 2: Merge Runs
| Parameter | Symbol | Description |
|---|---|---|
| Input size (pages) | N | Total pages of unsorted input data |
| Memory (pages) | M | Available buffer pages for sorting |
| Initial runs | ⌈N/M⌉ | Number of sorted runs after Phase 1 |
| Merge order | M-1 | Runs merged simultaneously (1 output buffer) |
| Merge passes | ⌈log_{M-1}(N/M)⌉ | Number of merge phases required |
```
EXTERNAL_MERGE_SORT(input_file, memory_pages):
    // Phase 1: Generate initial sorted runs
    runs = []
    while not end_of(input_file):
        // Read M pages into memory buffer
        buffer = read_pages(input_file, memory_pages)
        // Sort in memory using quicksort/mergesort
        in_memory_sort(buffer)
        // Write sorted run to temporary file
        run_file = create_temp_file()
        write_pages(run_file, buffer)
        runs.append(run_file)

    // Phase 2: Merge runs until one remains
    merge_order = memory_pages - 1   // Reserve 1 page for output
    while len(runs) > 1:
        new_runs = []
        // Process runs in groups of (merge_order)
        for i in range(0, len(runs), merge_order):
            runs_to_merge = runs[i : i + merge_order]
            merged = merge_runs(runs_to_merge, memory_pages)
            new_runs.append(merged)
            delete_files(runs_to_merge)
        runs = new_runs

    return runs[0]   // Single fully sorted file

MERGE_RUNS(run_files, memory_pages):
    // Open input buffers for each run (1 page each)
    input_buffers = [open_buffer(f, 1 page) for f in run_files]
    output_buffer = allocate_output_buffer(1 page)
    output_file = create_temp_file()

    // Use min-heap to track smallest element from each run
    heap = MinHeap()
    for i, buf in enumerate(input_buffers):
        if not buf.empty():
            heap.insert((buf.peek(), i))

    while not heap.empty():
        (min_tuple, run_idx) = heap.extract_min()
        output_buffer.append(min_tuple)
        if output_buffer.full():
            write_pages(output_file, output_buffer)
            output_buffer.clear()

        input_buffers[run_idx].advance()
        if not input_buffers[run_idx].empty():
            heap.insert((input_buffers[run_idx].peek(), run_idx))
        elif input_buffers[run_idx].has_more_pages():
            input_buffers[run_idx].load_next_page()
            heap.insert((input_buffers[run_idx].peek(), run_idx))

    // Write remaining output
    if not output_buffer.empty():
        write_pages(output_file, output_buffer)

    return output_file
```

External merge sort is ideal for disk-based sorting because:
Sequential I/O: Both run generation and merging access data sequentially—disks excel at sequential access
Predictable I/O: Every page is read and written a fixed number of times, enabling accurate cost estimation
Scalable: Works for any data size; more merge passes handle larger data
Parallelizable: Runs can be generated and merged in parallel
Understanding the I/O cost of external merge sort is crucial for query optimization. The cost depends on input size, memory, and data distribution.
Phase 1 (Run Generation): Read all N input pages and write them back out as sorted runs, for 2N page I/Os.
Phase 2 (Merging): Each merge pass reads and writes every page once (2N page I/Os per pass), and ⌈log_{M-1}(N/M)⌉ passes are required.
Total I/O Cost:
Total = 2N × (1 + ⌈log_{M-1}(N/M)⌉)
= 2N × (1 + number_of_merge_passes)
| N (pages) | M (pages) | Initial Runs | Merge Passes | Total I/O |
|---|---|---|---|---|
| 1,000 | 100 | 10 | ⌈log₉₉(10)⌉ = 1 | 2×1000×2 = 4,000 |
| 10,000 | 100 | 100 | ⌈log₉₉(100)⌉ = 2 | 2×10000×3 = 60,000 |
| 100,000 | 100 | 1,000 | ⌈log₉₉(1000)⌉ = 2 | 2×100000×3 = 600,000 |
| 1,000,000 | 1,000 | 1,000 | ⌈log₉₉₉(1000)⌉ = 2 | 2×1000000×3 = 6,000,000 |
| 1,000,000 | 100 | 10,000 | ⌈log₉₉(10000)⌉ = 3 | 2×1000000×4 = 8,000,000 |
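The arithmetic in this table is easy to script. Below is a small helper (an illustrative sketch applying the formulas above, not any engine's actual cost model):

```python
import math

def sort_io_cost(n_pages: int, mem_pages: int) -> dict:
    """External merge sort I/O: 2N pages per pass, counting run generation as the first pass."""
    initial_runs = math.ceil(n_pages / mem_pages)
    merge_order = mem_pages - 1                      # one page reserved for output
    merge_passes = 0 if initial_runs <= 1 else math.ceil(
        math.log(initial_runs, merge_order))
    total_io = 2 * n_pages * (1 + merge_passes)
    return {"runs": initial_runs, "passes": merge_passes, "total_io": total_io}

for n, m in [(1_000, 100), (10_000, 100), (100_000, 100),
             (1_000_000, 1_000), (1_000_000, 100)]:
    print(n, m, sort_io_cost(n, m))
```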
Key Observations:
Memory matters dramatically: Doubling memory from M to 2M can eliminate an entire merge pass, saving 2N page I/Os on large datasets.
The logarithm is forgiving: Even for huge datasets, merge passes remain few. Sorting 1 TB with 1 GB of memory produces only about 1,000 memory-sized runs, and a merge order in the hundreds or thousands collapses them in one or two passes.
Base-case optimization helps: If data only slightly exceeds memory, Phase 1 produces a handful of runs and a single merge pass finishes the job, so total cost approaches 4N (and if N ≤ M, no merge is needed at all: just 2N).
Read:Write ratio: External merge sort has 1:1 read:write ratio. SSDs with asymmetric read/write performance may prefer read-heavy algorithms.
Sort I/O cost directly impacts query optimizer decisions:
Underestimating sort cost leads to query plans that thrash disk; overestimating leads to suboptimal index choices.
The basic external merge sort generates runs of exactly M pages. Replacement selection is a technique that can produce runs of 2M pages on average for random data—potentially eliminating a merge pass entirely.
The Idea:
Instead of filling memory, sorting, and writing, we use a priority queue (min-heap) in memory:
Tuples that can extend the current run do so immediately; tuples that would break sort order are "saved" for the next run.
```
REPLACEMENT_SELECTION(input_file, heap_capacity):
    runs = []
    heap = MinHeap(capacity=heap_capacity)
    current_run = []
    next_run_buffer = []
    last_written = -INFINITY

    // Initial fill
    while not heap.full() and not end_of(input_file):
        heap.insert(read_next_tuple(input_file))

    while not heap.empty() or len(next_run_buffer) > 0:
        // If heap exhausted but next_run_buffer has tuples
        if heap.empty():
            // Start new run with buffered tuples
            flush(current_run)
            runs.append(current_run)
            current_run = []
            last_written = -INFINITY
            for tuple in next_run_buffer:
                heap.insert(tuple)
            next_run_buffer = []

        // Extract minimum from heap
        min_tuple = heap.extract_min()

        if min_tuple >= last_written:
            // Can extend current run
            current_run.append(min_tuple)
            last_written = min_tuple

            // Replacement: read next input tuple
            if not end_of(input_file):
                new_tuple = read_next_tuple(input_file)
                if new_tuple >= last_written:
                    heap.insert(new_tuple)             // Can go in current run
                else:
                    next_run_buffer.append(new_tuple)  // Save for next run
        else:
            // This shouldn't happen with proper implementation
            next_run_buffer.append(min_tuple)

    // Flush final run
    if len(current_run) > 0:
        flush(current_run)
        runs.append(current_run)

    return runs

-- For random input, expected run length = 2 × heap_capacity!
```

Why 2M on Average?
For uniformly random input, approximately half the new tuples read will be larger than the last written tuple and can extend the current run. This cascade effect produces runs averaging 2M tuples—double the basic algorithm.
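To see the 2× effect empirically, here is a small self-contained simulation (illustrative only, with plain Python lists standing in for buffered I/O): it runs replacement selection with a 1,000-element heap over random and reverse-sorted inputs and reports average run length.

```python
import heapq, random

def run_lengths(values, capacity):
    """Replacement selection: return the length of each run produced."""
    it = iter(values)
    heap = [next(it) for _ in range(capacity)]    # assumes len(values) >= capacity
    heapq.heapify(heap)
    frozen, lengths, current = [], [], 0
    while heap:
        smallest = heapq.heappop(heap)
        current += 1                               # "write" smallest to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)          # can still join the current run
            else:
                frozen.append(nxt)                 # must wait for the next run
        if not heap:                               # current run ends
            lengths.append(current)
            current = 0
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return lengths

random.seed(0)
cap = 1_000
rand_runs = run_lengths([random.random() for _ in range(200_000)], cap)
rev_runs  = run_lengths(list(range(200_000, 0, -1)), cap)
print("random input: avg run ≈", sum(rand_runs) / len(rand_runs))   # ≈ 2 * cap
print("reverse sorted: avg run =", sum(rev_runs) / len(rev_runs))   # == cap (worst case)
```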
When Replacement Selection Excels:
Partially sorted input: If input has runs of ascending values, replacement selection produces extremely long runs—potentially the entire file in one run!
Large memory: More heap capacity means longer average runs.
When It Underperforms:
Reverse sorted input: Worst case—each new tuple is smaller than all heap elements, producing runs of exactly M.
Modern SSD storage: The CPU overhead of heap operations may not be worth it when I/O is fast.
Replacement selection is the algorithmic ancestor of LSM (Log-Structured Merge) tree compaction. In LSM trees, memtables (in-memory sorted structures) are periodically flushed as runs, then merged. The same principles apply: longer runs mean fewer merges, and nearly-sorted input produces very long runs.
The merge phase offers numerous optimization opportunities beyond the basic algorithm.
1. Double Buffering (Prefetching):
While processing one buffer page, prefetch the next:
For each input run:
Page A: currently being read by merge logic
Page B: being prefetched from disk in background
When A exhausted, swap A and B, start prefetching new B
This hides I/O latency by overlapping disk and CPU operations. Requires 2× memory for input buffers but dramatically improves throughput.
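A toy Python sketch of the double-buffering pattern, using a background thread and a hypothetical read_page callback in place of real asynchronous disk I/O:

```python
import threading

class DoubleBufferedRun:
    """Iterate a run's pages while the next page is prefetched in the background."""

    def __init__(self, read_page):
        self._read_page = read_page       # read_page(page_no) -> list of tuples, or None at EOF
        self._next_page_no = 0
        self._pending = self._start_prefetch()

    def _start_prefetch(self):
        result = {}
        page_no = self._next_page_no
        self._next_page_no += 1
        thread = threading.Thread(target=lambda: result.update(page=self._read_page(page_no)))
        thread.start()                    # I/O proceeds while the caller keeps merging
        return thread, result

    def pages(self):
        while True:
            thread, result = self._pending
            thread.join()                 # waits only if the prefetch is still in flight
            page = result.get("page")
            if page is None:
                return                    # run exhausted
            self._pending = self._start_prefetch()   # overlap next read with processing
            yield page

# Usage sketch: a fake "disk" serving three pages of two tuples each.
fake_run = [[1, 4], [6, 7], [9, 12]]
run = DoubleBufferedRun(lambda i: fake_run[i] if i < len(fake_run) else None)
for page in run.pages():
    print("processing page", page)        # merge logic would consume tuples here
```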
2. Read-Ahead for Sequential Runs:
Operating systems and disk controllers optimize sequential reads. Reading larger chunks (e.g., 64 pages instead of 1) reduces seek overhead and leverages disk prefetching.
3. Cascade Merge (Polyphase Merge):
For very large sorts with many initial runs, clever scheduling of merges can reduce total I/O. Instead of merging all runs at once (which limits merge order to M-1), cascade strategies merge subsets strategically.
```
-- Optimization: Forecasting merge with varying run lengths
-- When runs have different lengths, prioritize merging shorter runs first

OPTIMIZED_MERGE_SCHEDULE(runs):
    // Sort runs by length (shortest first)
    sorted_runs = sort(runs, key=lambda r: r.length)

    while len(sorted_runs) > 1:
        // Take the (merge_order) shortest runs
        batch = sorted_runs[:merge_order]
        sorted_runs = sorted_runs[merge_order:]
        merged = merge(batch)
        // Insert merged run in sorted position
        insert_sorted(sorted_runs, merged)

    return sorted_runs[0]

-- Optimization: Tournament tree for merge
-- Instead of heap (O(log k) per element for k-way merge),
-- use tournament tree (same complexity but better cache behavior)

TOURNAMENT_MERGE(runs, num_runs):
    // Build tournament tree with (num_runs) leaves
    tree = build_tournament_tree(num_runs)

    // Initialize: load one element from each run into leaves
    for i in range(num_runs):
        tree.leaves[i] = runs[i].next()

    // Build initial tournament (O(num_runs))
    tree.rebuild_tournament()

    while not tree.empty():
        // Winner is at root
        winner = tree.root.value
        winner_source = tree.root.source
        output(winner)

        // Load replacement from winner's run
        if runs[winner_source].has_more():
            tree.leaves[winner_source] = runs[winner_source].next()
        else:
            tree.leaves[winner_source] = INFINITY   // Run exhausted

        // Replay tournament from leaf to root (O(log k))
        tree.replay(winner_source)
```

4. Avoiding Final Write:
If the sorted output is consumed by a streaming operator (like merge join), we can avoid writing the final sorted run:
Pipeline: Sort → Merge Join
Optimization:
- Perform sort, creating runs on disk
- During final merge pass, pipe directly to merge join
- No disk write for final sorted output
- Saves N page writes
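A minimal Python illustration of the idea: expose the final merge as a generator and let a toy merge join pull from it directly, so the sorted stream is never written to disk. Here heapq.merge stands in for the buffered k-way merge, and the join logic is deliberately simplified.

```python
import heapq

def final_merge(sorted_runs):
    """Yield tuples in global order from sorted runs; nothing is written to disk."""
    # heapq.merge performs the k-way merge lazily, one element at a time.
    yield from heapq.merge(*sorted_runs)

def merge_join(left_sorted, right_sorted):
    """Consume the sort output directly (toy join on equal keys, no duplicates)."""
    right = iter(right_sorted)
    r = next(right, None)
    for l in left_sorted:
        while r is not None and r < l:
            r = next(right, None)
        if r is not None and r == l:
            yield (l, r)

runs = [[1, 4, 9], [2, 6, 8], [3, 5, 7]]          # sorted runs from Phase 1
joined = list(merge_join(final_merge(runs), [2, 4, 7, 10]))
print(joined)   # [(2, 2), (4, 4), (7, 7)]
```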
5. External Quick Sort:
An alternative to merge sort for external sorting: sample the input to choose pivots, partition the file on disk into key ranges around those pivots, and recurse on each partition until it fits in memory.
It can be faster when memory is very limited (fewer passes), but its I/O is less sequential and less predictable than merge sort's; a toy sketch follows.
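A toy sketch of the partition-then-recurse idea, with in-memory lists standing in for on-disk partition files and a sampled median as the pivot:

```python
import random

def external_partition_sort(records, memory_limit):
    """Distribution-style external sort: partition around a sampled pivot,
    recurse until a partition fits in 'memory', then sort it directly."""
    if len(records) <= memory_limit:
        return sorted(records)                      # small enough: one in-memory sort
    sample = random.sample(records, min(64, len(records)))
    pivot = sorted(sample)[len(sample) // 2]        # median of sample as pivot
    less    = [r for r in records if r < pivot]     # in a real engine: written to a temp file
    equal   = [r for r in records if r == pivot]
    greater = [r for r in records if r > pivot]
    return (external_partition_sort(less, memory_limit)
            + equal
            + external_partition_sort(greater, memory_limit))

print(external_partition_sort([5, 3, 9, 1, 9, 2, 8, 7], memory_limit=3))
# [1, 2, 3, 5, 7, 8, 9, 9]
```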
In cloud environments, network-attached storage has different I/O characteristics than local disks: per-request latency is much higher, throughput favors large sequential transfers, and each request may carry a monetary cost.
Cloud-optimized databases tune external sort for these realities, typically by issuing fewer, larger I/Os and prefetching more aggressively.
Modern systems exploit parallelism at multiple levels to accelerate sorting.
Intra-Operator Parallelism:
Parallelize within a single sort operation:
1. Parallel Run Generation:
2. Parallel Merge:
```
PARALLEL_EXTERNAL_SORT(input, num_threads, memory_per_thread):
    // Phase 1: Parallel run generation
    partitions = split_input(input, num_threads)
    runs = []

    parallel_for i in range(num_threads):
        local_runs = EXTERNAL_MERGE_SORT(partitions[i], memory_per_thread)
        runs.extend(local_runs)   // Thread-safe append

    barrier()   // Wait for all threads

    // Phase 2: Parallel merge tree
    while len(runs) > num_threads:
        merge_batches = partition_runs(runs, num_threads)
        runs = []
        parallel_for i in range(num_threads):
            merged = merge_runs(merge_batches[i])
            runs.append(merged)
        barrier()

    // Final merge (single-threaded or with range partitioning)
    return final_merge(runs)

-- Alternative: Range-partitioned parallel sort
RANGE_PARALLEL_SORT(input, num_threads):
    // Step 1: Sample input to find partition boundaries
    sample = random_sample(input, 1000)
    boundaries = select_percentiles(sort(sample), num_threads)

    // Step 2: Partition data by key ranges
    // All tuples go to appropriate partition based on key
    for each tuple t in input:
        partition = find_partition(t.key, boundaries)
        send(t, partition)

    // Step 3: Each thread sorts its partition
    parallel_for i in range(num_threads):
        sorted_parts[i] = sort(partition[i])

    // Step 4: Concatenate (already globally sorted by ranges)
    return concatenate(sorted_parts)
```

Distributed Sorting:
In distributed systems, sorting involves network shuffling: data is range-partitioned across nodes (with boundaries typically chosen from a sample) so that each node sorts a disjoint key range, after which the per-node results can simply be concatenated.
Challenges: choosing balanced partition boundaries under data skew, the network cost of the shuffle itself, and straggler nodes that delay the entire sort.
Modern GPUs can sort billions of records per second. GPU radix sort and bitonic sort achieve massive parallelism by assigning thousands of threads to independent buckets or comparator stages.
The challenge is PCIe transfer overhead—data must move to GPU memory and back. For very large sorts, GPU is used for in-memory phases while disk I/O remains CPU-managed.
Many queries don't need a fully sorted result—they need only the top K elements (ORDER BY with LIMIT). This enables significant optimization.
Heap-Based Top-K:
For finding the K smallest (or largest) elements:
TOP_K(input, K):
heap = MaxHeap(capacity=K) // For finding K smallest
for each tuple t in input:
if heap.size() < K:
heap.insert(t)
elif t < heap.peek(): // t smaller than largest in heap
heap.replace_max(t) // remove max, insert t
return heap.contents_sorted()
Complexity: O(n log K) instead of O(n log n) for full sort. When K << n, this is dramatically faster.
Memory: O(K) instead of O(n). For LIMIT 10 on a billion rows, we need space for only 10 tuples!
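The same Top-K idea in runnable Python. Python's heapq is a min-heap, so negating values turns it into the bounded max-heap the pseudocode above assumes; heapq.nsmallest is the library shortcut for the same result.

```python
import heapq

def top_k_smallest(values, k):
    """Keep the K smallest elements seen so far using a bounded max-heap."""
    heap = []                                    # stores negated values => acts as a max-heap
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:                       # v smaller than the current K-th smallest
            heapq.heapreplace(heap, -v)          # pop max, push v: O(log K)
    return sorted(-x for x in heap)

data = [42, 7, 99, 3, 18, 3, 57, 1]
print(top_k_smallest(data, 3))                   # [1, 3, 3]
print(heapq.nsmallest(3, data))                  # library equivalent: [1, 3, 3]
```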
```sql
-- Query pattern: Top-K optimization target
SELECT customer_id, total_purchases
FROM customers
ORDER BY total_purchases DESC
LIMIT 100;

-- Without optimization: Sort all 10M customers, then take 100
-- Cost: O(10M log 10M) comparisons, potentially spill to disk

-- With Top-K optimization: Maintain heap of 100 largest
-- Cost: O(10M log 100) comparisons, O(100) memory
-- Speedup: ~4x fewer comparisons, no disk I/O

-- Index-based Top-K (even better):
-- If index exists on total_purchases DESC:
--   Plan: Index scan (backwards from max), stop after 100 rows
--   Cost: O(100) - just read the index leaf pages!

-- Combining with filter:
SELECT customer_id, total_purchases
FROM customers
WHERE region = 'US'
ORDER BY total_purchases DESC
LIMIT 100;

-- If index on (region, total_purchases DESC):
--   Scan from (US, MAX) and stop after 100
--   Near-optimal execution
```

External Top-K:
For very large K that doesn't fit in memory, hybrid approaches work:
Two-pass algorithm:
Priority queue with spillover:
ORDER BY with OFFSET:
SELECT * FROM table ORDER BY col LIMIT 20 OFFSET 1000;
Unfortunately, OFFSET doesn't optimize well—we must find the top 1020 elements to discard 1000. For deep pagination, consider keyset pagination (WHERE col > last_seen_value) which can use indexes.
A common performance anti-pattern: using LIMIT/OFFSET for pagination. Page 100 with 20 items/page requires finding and discarding 1,980 rows. Page 1000 discards 19,980 rows. Performance degrades linearly with page number.
Solution: Keyset pagination using WHERE clause on the sort column:
WHERE created_at < :last_seen_timestamp ORDER BY created_at DESC LIMIT 20
Each page then has roughly constant cost regardless of depth, provided an index exists on the sort column.
Sorting is the foundational primitive underlying numerous database operations. Let's consolidate the key concepts:
What's Next:
Our exploration of physical operators concludes with pipelining—the technique that allows operators to work together efficiently, passing tuples without materializing intermediate results. Understanding pipelining reveals how query execution engines achieve their remarkable performance.
You now understand how sorting works at database scale, from cache-efficient in-memory algorithms to external merge sort that handles petabytes. This knowledge applies to ORDER BY, merge joins, sorted aggregation, index creation, and countless internal operations. Next, we'll explore pipelining—the execution model that makes all these operators work together efficiently.