Sorting is arguably the most fundamental operation in database systems. It appears everywhere: ORDER BY clauses, merge joins, sort-based aggregation, duplicate elimination, index construction, and B-tree maintenance. Understanding sorting in the database context is essential because database sorting has unique characteristics that distinguish it from general-purpose sorting.
The Database Sorting Challenge:
Unlike in-memory algorithm courses where we sort arrays of integers, database sorting faces distinct challenges: datasets far larger than memory, variable-length multi-column keys, wide tuples that are expensive to move, and I/O costs that dwarf CPU comparison costs.
By the end of this page, you will understand how databases implement sorting at scale. You'll master external merge sort, the workhorse algorithm for disk-based sorting. You'll learn about replacement selection, run generation optimization, and modern techniques for cache-efficient sorting. You'll understand when different strategies excel and how to analyze sorting costs.
When data fits in memory, databases use efficient in-memory sorting algorithms. The choice of algorithm depends on data characteristics and hardware.
Common In-Memory Algorithms:
Quicksort: The default choice for many systems
Merge Sort: Used when stability is required
Radix Sort: For integer or fixed-width keys
```
DATABASE_QUICKSORT(tuples, key_extractor, comparator):
    // Optimized quicksort for database tuples
    if len(tuples) <= INSERTION_THRESHOLD:   // ~16-32 elements
        insertion_sort(tuples, key_extractor, comparator)
        return

    // Median-of-three pivot selection (avoid worst case)
    pivot = median_of_three(
        key_extractor(tuples[0]),
        key_extractor(tuples[len/2]),
        key_extractor(tuples[len-1])
    )

    // Three-way partition (handle duplicates efficiently)
    (lt, eq, gt) = three_way_partition(tuples, pivot, key_extractor)

    // Recurse on smaller partition first (limits stack depth)
    if len(lt) < len(gt):
        DATABASE_QUICKSORT(lt, key_extractor, comparator)
        DATABASE_QUICKSORT(gt, key_extractor, comparator)
    else:
        DATABASE_QUICKSORT(gt, key_extractor, comparator)
        DATABASE_QUICKSORT(lt, key_extractor, comparator)
    // Equal elements are already in final position

// Key optimization: Extract and cache keys
CACHE_AWARE_SORT(tuples, key_columns):
    // Extract keys once, avoid repeated extraction
    keys = [(extract_key(t, key_columns), i) for i, t in enumerate(tuples)]

    // Sort keys (smaller than full tuples, better cache use)
    sort(keys)

    // Reorder tuples according to sorted key positions
    return [tuples[k[1]] for k in keys]
```

Database-Specific Optimizations:
Key Normalization: Transform complex keys (variable-length strings, multiple columns) into a fixed-length binary format that supports direct byte comparison. This enables a single memcmp-style comparison per pair, avoids re-interpreting types and collations on every comparison, and opens the door to radix-style techniques (see the sketch after this list).
Pointer Sorting: For wide tuples, sort an array of pointers (or record IDs) rather than the tuples themselves, so swaps move a few bytes instead of entire rows; the payload is touched only once, when the sorted output is materialized.
Prefix Sorting: Store a short key prefix alongside each pointer, so most comparisons are resolved from the cache-resident prefix and only ties require dereferencing the full tuple.
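A minimal Python sketch of key normalization combined with pointer/prefix-style sorting. It is illustrative only: real engines work on raw pages rather than Python objects, and they handle collations and ties on truncated prefixes, which this sketch ignores.

```python
import struct

def normalize_key(status: str, amount: int) -> bytes:
    """Encode a (VARCHAR, signed INT) key so that byte order == key order."""
    # Fixed-width, NUL-padded string segment keeps comparisons purely byte-wise.
    s = status.encode("utf-8")[:8].ljust(8, b"\x00")
    # Flip the sign bit so signed integers sort correctly as unsigned bytes.
    a = struct.pack(">Q", (amount + (1 << 63)) & 0xFFFFFFFFFFFFFFFF)
    return s + a

rows = [
    {"status": "shipped", "amount": -5, "payload": "..." * 100},
    {"status": "new",     "amount": 42, "payload": "..." * 100},
    {"status": "new",     "amount": -7, "payload": "..." * 100},
]

# Pointer/prefix idea: sort small (normalized key, row index) pairs,
# then permute the wide rows once at the end.
keyed = [(normalize_key(r["status"], r["amount"]), i) for i, r in enumerate(rows)]
keyed.sort()                                  # compares bytes only; rows never move
sorted_rows = [rows[i] for _, i in keyed]

print([(r["status"], r["amount"]) for r in sorted_rows])
# [('new', -7), ('new', 42), ('shipped', -5)]
```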
On modern hardware, cache misses often dominate sorting cost. A cache-oblivious merge sort or cache-aware quicksort can outperform theoretically faster algorithms. The key insight: keeping the working set in L2/L3 cache (megabytes) yields a 10-100x speedup over access patterns that constantly miss the cache and fall through to main memory.
When data exceeds available memory, external merge sort is the dominant algorithm. It's elegant, predictable, and optimized for disk I/O patterns.
High-Level Algorithm:
External merge sort proceeds in two main phases:
Phase 1: Run Generation (Create Sorted Runs)
Phase 2: Merge Runs
| Parameter | Symbol | Description |
|---|---|---|
| Input size (pages) | N | Total pages of unsorted input data |
| Memory (pages) | M | Available buffer pages for sorting |
| Initial runs | ⌈N/M⌉ | Number of sorted runs after Phase 1 |
| Merge order | M-1 | Runs merged simultaneously (1 output buffer) |
| Merge passes | ⌈log_{M-1}(N/M)⌉ | Number of merge phases required |
```
EXTERNAL_MERGE_SORT(input_file, memory_pages):
    // Phase 1: Generate initial sorted runs
    runs = []
    while not end_of(input_file):
        // Read M pages into memory buffer
        buffer = read_pages(input_file, memory_pages)
        // Sort in memory using quicksort/mergesort
        in_memory_sort(buffer)
        // Write sorted run to temporary file
        run_file = create_temp_file()
        write_pages(run_file, buffer)
        runs.append(run_file)

    // Phase 2: Merge runs until one remains
    merge_order = memory_pages - 1   // Reserve 1 page for output
    while len(runs) > 1:
        new_runs = []
        // Process runs in groups of (merge_order)
        for i in range(0, len(runs), merge_order):
            runs_to_merge = runs[i : i + merge_order]
            merged = merge_runs(runs_to_merge, memory_pages)
            new_runs.append(merged)
            delete_files(runs_to_merge)
        runs = new_runs

    return runs[0]   // Single fully sorted file

MERGE_RUNS(run_files, memory_pages):
    // Open input buffers for each run (1 page each)
    input_buffers = [open_buffer(f, 1 page) for f in run_files]
    output_buffer = allocate_output_buffer(1 page)
    output_file = create_temp_file()

    // Use min-heap to track smallest element from each run
    heap = MinHeap()
    for i, buf in enumerate(input_buffers):
        if not buf.empty():
            heap.insert((buf.peek(), i))

    while not heap.empty():
        (min_tuple, run_idx) = heap.extract_min()
        output_buffer.append(min_tuple)
        if output_buffer.full():
            write_pages(output_file, output_buffer)
            output_buffer.clear()

        input_buffers[run_idx].advance()
        if not input_buffers[run_idx].empty():
            heap.insert((input_buffers[run_idx].peek(), run_idx))
        elif input_buffers[run_idx].has_more_pages():
            input_buffers[run_idx].load_next_page()
            heap.insert((input_buffers[run_idx].peek(), run_idx))

    // Write remaining output
    if not output_buffer.empty():
        write_pages(output_file, output_buffer)

    return output_file
```

External merge sort is ideal for disk-based sorting because:
Sequential I/O: Both run generation and merging access data sequentially—disks excel at sequential access
Predictable I/O: Every page is read and written a fixed number of times, enabling accurate cost estimation
Scalable: Works for any data size; more merge passes handle larger data
Parallelizable: Runs can be generated and merged in parallel
Understanding the I/O cost of external merge sort is crucial for query optimization. The cost depends on input size, memory, and data distribution.
Phase 1 (Run Generation): Read all N input pages and write them back out as sorted runs, for 2N page I/Os.
Phase 2 (Merging): Each merge pass reads and writes every page once (2N page I/Os per pass), and ⌈log_{M-1}(N/M)⌉ passes are required.
Total I/O Cost:
Total = 2N × (1 + ⌈log_{M-1}(N/M)⌉)
= 2N × (1 + number_of_merge_passes)
| N (pages) | M (pages) | Initial Runs | Merge Passes | Total I/O |
|---|---|---|---|---|
| 1,000 | 100 | 10 | ⌈log₉₉(10)⌉ = 1 | 2×1000×2 = 4,000 |
| 10,000 | 100 | 100 | ⌈log₉₉(100)⌉ = 2 | 2×10000×3 = 60,000 |
| 100,000 | 100 | 1,000 | ⌈log₉₉(1000)⌉ = 2 | 2×100000×3 = 600,000 |
| 1,000,000 | 1,000 | 1,000 | ⌈log₉₉₉(1000)⌉ = 2 | 2×1000000×3 = 6,000,000 |
| 1,000,000 | 100 | 10,000 | ⌈log₉₉(10000)⌉ = 3 | 2×1000000×4 = 8,000,000 |
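The arithmetic in this table is easy to script. Below is a small helper (an illustrative sketch applying the formulas above, not any engine's actual cost model):

```python
import math

def sort_io_cost(n_pages: int, mem_pages: int) -> dict:
    """External merge sort I/O: 2N pages per pass, counting run generation as the first pass."""
    initial_runs = math.ceil(n_pages / mem_pages)
    merge_order = mem_pages - 1                      # one page reserved for output
    merge_passes = 0 if initial_runs <= 1 else math.ceil(
        math.log(initial_runs, merge_order))
    total_io = 2 * n_pages * (1 + merge_passes)
    return {"runs": initial_runs, "passes": merge_passes, "total_io": total_io}

for n, m in [(1_000, 100), (10_000, 100), (100_000, 100),
             (1_000_000, 1_000), (1_000_000, 100)]:
    print(n, m, sort_io_cost(n, m))
```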
Key Observations:
Memory matters dramatically: Doubling memory from M to 2M can eliminate an entire merge pass, saving 2N page I/Os on large datasets.
The logarithm is forgiving: Even for huge datasets, merge passes remain few. Sorting 1 TB with 1 GB of memory produces only about 1,000 memory-sized runs, and a merge order in the hundreds or thousands collapses them in one or two passes.
Base-case optimization helps: If data only slightly exceeds memory, Phase 1 produces a handful of runs and a single merge pass finishes the job, so total cost approaches 4N (and if N ≤ M, no merge is needed at all: just 2N).
Read:Write ratio: External merge sort has 1:1 read:write ratio. SSDs with asymmetric read/write performance may prefer read-heavy algorithms.
Sort I/O cost directly impacts query optimizer decisions:
Underestimating sort cost leads to query plans that thrash disk; overestimating leads to suboptimal index choices.
The basic external merge sort generates runs of exactly M pages. Replacement selection is a technique that can produce runs of 2M pages on average for random data—potentially eliminating a merge pass entirely.
The Idea:
Instead of filling memory, sorting, and writing, we use a priority queue (min-heap) in memory:
Tuples that can extend the current run do so immediately; tuples that would break sort order are "saved" for the next run.
```
REPLACEMENT_SELECTION(input_file, heap_capacity):
    runs = []
    heap = MinHeap(capacity=heap_capacity)
    current_run = []
    next_run_buffer = []
    last_written = -INFINITY

    // Initial fill
    while not heap.full() and not end_of(input_file):
        heap.insert(read_next_tuple(input_file))

    while not heap.empty() or len(next_run_buffer) > 0:
        // If heap exhausted but next_run_buffer has tuples
        if heap.empty():
            // Start new run with buffered tuples
            flush(current_run)
            runs.append(current_run)
            current_run = []
            last_written = -INFINITY
            for tuple in next_run_buffer:
                heap.insert(tuple)
            next_run_buffer = []

        // Extract minimum from heap
        min_tuple = heap.extract_min()

        if min_tuple >= last_written:
            // Can extend current run
            current_run.append(min_tuple)
            last_written = min_tuple

            // Replacement: read next input tuple
            if not end_of(input_file):
                new_tuple = read_next_tuple(input_file)
                if new_tuple >= last_written:
                    heap.insert(new_tuple)             // Can go in current run
                else:
                    next_run_buffer.append(new_tuple)  // Save for next run
        else:
            // This shouldn't happen with proper implementation
            next_run_buffer.append(min_tuple)

    // Flush final run
    if len(current_run) > 0:
        flush(current_run)
        runs.append(current_run)

    return runs

-- For random input, expected run length = 2 × heap_capacity!
```

Why 2M on Average?
For uniformly random input, approximately half the new tuples read will be larger than the last written tuple and can extend the current run. This cascade effect produces runs averaging 2M tuples—double the basic algorithm.
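To see the 2× effect empirically, here is a small self-contained simulation (illustrative only, with plain Python lists standing in for buffered I/O): it runs replacement selection with a 1,000-element heap over random and reverse-sorted inputs and reports average run length.

```python
import heapq, random

def run_lengths(values, capacity):
    """Replacement selection: return the length of each run produced."""
    it = iter(values)
    heap = [next(it) for _ in range(capacity)]    # assumes len(values) >= capacity
    heapq.heapify(heap)
    frozen, lengths, current = [], [], 0
    while heap:
        smallest = heapq.heappop(heap)
        current += 1                               # "write" smallest to the current run
        nxt = next(it, None)
        if nxt is not None:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)          # can still join the current run
            else:
                frozen.append(nxt)                 # must wait for the next run
        if not heap:                               # current run ends
            lengths.append(current)
            current = 0
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return lengths

random.seed(0)
cap = 1_000
rand_runs = run_lengths([random.random() for _ in range(200_000)], cap)
rev_runs  = run_lengths(list(range(200_000, 0, -1)), cap)
print("random input: avg run ≈", sum(rand_runs) / len(rand_runs))   # ≈ 2 * cap
print("reverse sorted: avg run =", sum(rev_runs) / len(rev_runs))   # == cap (worst case)
```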
When Replacement Selection Excels:
Partially sorted input: If input has runs of ascending values, replacement selection produces extremely long runs—potentially the entire file in one run!
Large memory: More heap capacity means longer average runs.
When It Underperforms:
Reverse sorted input: Worst case—each new tuple is smaller than all heap elements, producing runs of exactly M.
Modern SSD storage: The CPU overhead of heap operations may not be worth it when I/O is fast.
Replacement selection is the algorithmic ancestor of LSM (Log-Structured Merge) tree compaction. In LSM trees, memtables (in-memory sorted structures) are periodically flushed as runs, then merged. The same principles apply: longer runs mean fewer merges, and nearly-sorted input produces very long runs.
The merge phase offers numerous optimization opportunities beyond the basic algorithm.
1. Double Buffering (Prefetching):
While processing one buffer page, prefetch the next:
For each input run:
Page A: currently being read by merge logic
Page B: being prefetched from disk in background
When A exhausted, swap A and B, start prefetching new B
This hides I/O latency by overlapping disk and CPU operations. Requires 2× memory for input buffers but dramatically improves throughput.
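A toy Python sketch of the double-buffering pattern, using a background thread and a hypothetical read_page callback in place of real asynchronous disk I/O:

```python
import threading

class DoubleBufferedRun:
    """Iterate a run's pages while the next page is prefetched in the background."""

    def __init__(self, read_page):
        self._read_page = read_page       # read_page(page_no) -> list of tuples, or None at EOF
        self._next_page_no = 0
        self._pending = self._start_prefetch()

    def _start_prefetch(self):
        result = {}
        page_no = self._next_page_no
        self._next_page_no += 1
        thread = threading.Thread(target=lambda: result.update(page=self._read_page(page_no)))
        thread.start()                    # I/O proceeds while the caller keeps merging
        return thread, result

    def pages(self):
        while True:
            thread, result = self._pending
            thread.join()                 # waits only if the prefetch is still in flight
            page = result.get("page")
            if page is None:
                return                    # run exhausted
            self._pending = self._start_prefetch()   # overlap next read with processing
            yield page

# Usage sketch: a fake "disk" serving three pages of two tuples each.
fake_run = [[1, 4], [6, 7], [9, 12]]
run = DoubleBufferedRun(lambda i: fake_run[i] if i < len(fake_run) else None)
for page in run.pages():
    print("processing page", page)        # merge logic would consume tuples here
```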
2. Read-Ahead for Sequential Runs:
Operating systems and disk controllers optimize sequential reads. Reading larger chunks (e.g., 64 pages instead of 1) reduces seek overhead and leverages disk prefetching.
3. Cascade Merge (Polyphase Merge):
For very large sorts with many initial runs, clever scheduling of merges can reduce total I/O. Instead of merging all runs at once (which limits merge order to M-1), cascade strategies merge subsets strategically.
```
-- Optimization: Forecasting merge with varying run lengths
-- When runs have different lengths, prioritize merging shorter runs first

OPTIMIZED_MERGE_SCHEDULE(runs):
    // Sort runs by length (shortest first)
    sorted_runs = sort(runs, key=lambda r: r.length)

    while len(sorted_runs) > 1:
        // Take the (merge_order) shortest runs
        batch = sorted_runs[:merge_order]
        sorted_runs = sorted_runs[merge_order:]
        merged = merge(batch)
        // Insert merged run in sorted position
        insert_sorted(sorted_runs, merged)

    return sorted_runs[0]

-- Optimization: Tournament tree for merge
-- Instead of heap (O(log k) per element for k-way merge),
-- use tournament tree (same complexity but better cache behavior)

TOURNAMENT_MERGE(runs, num_runs):
    // Build tournament tree with (num_runs) leaves
    tree = build_tournament_tree(num_runs)

    // Initialize: load one element from each run into leaves
    for i in range(num_runs):
        tree.leaves[i] = runs[i].next()

    // Build initial tournament (O(num_runs))
    tree.rebuild_tournament()

    while not tree.empty():
        // Winner is at root
        winner = tree.root.value
        winner_source = tree.root.source
        output(winner)

        // Load replacement from winner's run
        if runs[winner_source].has_more():
            tree.leaves[winner_source] = runs[winner_source].next()
        else:
            tree.leaves[winner_source] = INFINITY   // Run exhausted

        // Replay tournament from leaf to root (O(log k))
        tree.replay(winner_source)
```

4. Avoiding Final Write:
If the sorted output is consumed by a streaming operator (like merge join), we can avoid writing the final sorted run:
Pipeline: Sort → Merge Join
Optimization:
- Perform sort, creating runs on disk
- During final merge pass, pipe directly to merge join
- No disk write for final sorted output
- Saves N page writes
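A minimal Python illustration of the idea: expose the final merge as a generator and let a toy merge join pull from it directly, so the sorted stream is never written to disk. Here heapq.merge stands in for the buffered k-way merge, and the join logic is deliberately simplified.

```python
import heapq

def final_merge(sorted_runs):
    """Yield tuples in global order from sorted runs; nothing is written to disk."""
    # heapq.merge performs the k-way merge lazily, one element at a time.
    yield from heapq.merge(*sorted_runs)

def merge_join(left_sorted, right_sorted):
    """Consume the sort output directly (toy join on equal keys, no duplicates)."""
    right = iter(right_sorted)
    r = next(right, None)
    for l in left_sorted:
        while r is not None and r < l:
            r = next(right, None)
        if r is not None and r == l:
            yield (l, r)

runs = [[1, 4, 9], [2, 6, 8], [3, 5, 7]]          # sorted runs from Phase 1
joined = list(merge_join(final_merge(runs), [2, 4, 7, 10]))
print(joined)   # [(2, 2), (4, 4), (7, 7)]
```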
5. External Quick Sort:
An alternative to merge sort for external sorting: sample the input to choose pivots, partition the file on disk into key ranges around those pivots, and recurse on each partition until it fits in memory.
It can be faster when memory is very limited (fewer passes), but its I/O is less sequential and less predictable than merge sort's; a toy sketch follows.
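A toy sketch of the partition-then-recurse idea, with in-memory lists standing in for on-disk partition files and a sampled median as the pivot:

```python
import random

def external_partition_sort(records, memory_limit):
    """Distribution-style external sort: partition around a sampled pivot,
    recurse until a partition fits in 'memory', then sort it directly."""
    if len(records) <= memory_limit:
        return sorted(records)                      # small enough: one in-memory sort
    sample = random.sample(records, min(64, len(records)))
    pivot = sorted(sample)[len(sample) // 2]        # median of sample as pivot
    less    = [r for r in records if r < pivot]     # in a real engine: written to a temp file
    equal   = [r for r in records if r == pivot]
    greater = [r for r in records if r > pivot]
    return (external_partition_sort(less, memory_limit)
            + equal
            + external_partition_sort(greater, memory_limit))

print(external_partition_sort([5, 3, 9, 1, 9, 2, 8, 7], memory_limit=3))
# [1, 2, 3, 5, 7, 8, 9, 9]
```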
In cloud environments, network-attached storage has different I/O characteristics than local disks: per-request latency is much higher, throughput favors large sequential transfers, and each request may carry a monetary cost.
Cloud-optimized databases tune external sort for these realities, typically by issuing fewer, larger I/Os and prefetching more aggressively.
Modern systems exploit parallelism at multiple levels to accelerate sorting.
Intra-Operator Parallelism:
Parallelize within a single sort operation:
1. Parallel Run Generation:
2. Parallel Merge:
```
PARALLEL_EXTERNAL_SORT(input, num_threads, memory_per_thread):
    // Phase 1: Parallel run generation
    partitions = split_input(input, num_threads)
    runs = []

    parallel_for i in range(num_threads):
        local_runs = EXTERNAL_MERGE_SORT(partitions[i], memory_per_thread)
        runs.extend(local_runs)   // Thread-safe append

    barrier()   // Wait for all threads

    // Phase 2: Parallel merge tree
    while len(runs) > num_threads:
        merge_batches = partition_runs(runs, num_threads)
        runs = []
        parallel_for i in range(num_threads):
            merged = merge_runs(merge_batches[i])
            runs.append(merged)
        barrier()

    // Final merge (single-threaded or with range partitioning)
    return final_merge(runs)

-- Alternative: Range-partitioned parallel sort
RANGE_PARALLEL_SORT(input, num_threads):
    // Step 1: Sample input to find partition boundaries
    sample = random_sample(input, 1000)
    boundaries = select_percentiles(sort(sample), num_threads)

    // Step 2: Partition data by key ranges
    // All tuples go to appropriate partition based on key
    for each tuple t in input:
        partition = find_partition(t.key, boundaries)
        send(t, partition)

    // Step 3: Each thread sorts its partition
    parallel_for i in range(num_threads):
        sorted_parts[i] = sort(partition[i])

    // Step 4: Concatenate (already globally sorted by ranges)
    return concatenate(sorted_parts)
```

Distributed Sorting:
In distributed systems, sorting involves network shuffling: data is range-partitioned across nodes (with boundaries typically chosen from a sample) so that each node sorts a disjoint key range, after which the per-node results can simply be concatenated.
Challenges: choosing balanced partition boundaries under data skew, the network cost of the shuffle itself, and straggler nodes that delay the entire sort.
Modern GPUs can sort billions of records per second. GPU radix sort and bitonic sort achieve massive parallelism by assigning thousands of threads to independent buckets or comparator stages.
The challenge is PCIe transfer overhead—data must move to GPU memory and back. For very large sorts, GPU is used for in-memory phases while disk I/O remains CPU-managed.
Many queries don't need a fully sorted result—they need only the top K elements (ORDER BY with LIMIT). This enables significant optimization.
Heap-Based Top-K:
For finding the K smallest (or largest) elements:
TOP_K(input, K):
heap = MaxHeap(capacity=K) // For finding K smallest
for each tuple t in input:
if heap.size() < K:
heap.insert(t)
elif t < heap.peek(): // t smaller than largest in heap
heap.replace_max(t) // remove max, insert t
return heap.contents_sorted()
Complexity: O(n log K) instead of O(n log n) for full sort. When K << n, this is dramatically faster.
Memory: O(K) instead of O(n). For LIMIT 10 on a billion rows, we need space for only 10 tuples!
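The same Top-K idea in runnable Python. Python's heapq is a min-heap, so negating values turns it into the bounded max-heap the pseudocode above assumes; heapq.nsmallest is the library shortcut for the same result.

```python
import heapq

def top_k_smallest(values, k):
    """Keep the K smallest elements seen so far using a bounded max-heap."""
    heap = []                                    # stores negated values => acts as a max-heap
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:                       # v smaller than the current K-th smallest
            heapq.heapreplace(heap, -v)          # pop max, push v: O(log K)
    return sorted(-x for x in heap)

data = [42, 7, 99, 3, 18, 3, 57, 1]
print(top_k_smallest(data, 3))                   # [1, 3, 3]
print(heapq.nsmallest(3, data))                  # library equivalent: [1, 3, 3]
```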
```sql
-- Query pattern: Top-K optimization target
SELECT customer_id, total_purchases
FROM customers
ORDER BY total_purchases DESC
LIMIT 100;

-- Without optimization: Sort all 10M customers, then take 100
-- Cost: O(10M log 10M) comparisons, potentially spill to disk

-- With Top-K optimization: Maintain heap of 100 largest
-- Cost: O(10M log 100) comparisons, O(100) memory
-- Speedup: ~4x fewer comparisons, no disk I/O

-- Index-based Top-K (even better):
-- If index exists on total_purchases DESC:
--   Plan: Index scan (backwards from max), stop after 100 rows
--   Cost: O(100) - just read the index leaf pages!

-- Combining with filter:
SELECT customer_id, total_purchases
FROM customers
WHERE region = 'US'
ORDER BY total_purchases DESC
LIMIT 100;

-- If index on (region, total_purchases DESC):
--   Scan from (US, MAX) and stop after 100
--   Near-optimal execution
```

External Top-K:
For very large K that doesn't fit in memory, hybrid approaches work:
Two-pass algorithm:
Priority queue with spillover:
ORDER BY with OFFSET:
SELECT * FROM table ORDER BY col LIMIT 20 OFFSET 1000;
Unfortunately, OFFSET doesn't optimize well—we must find the top 1020 elements to discard 1000. For deep pagination, consider keyset pagination (WHERE col > last_seen_value) which can use indexes.
A common performance anti-pattern: using LIMIT/OFFSET for pagination. Page 100 with 20 items/page requires finding and discarding 1,980 rows. Page 1000 discards 19,980 rows. Performance degrades linearly with page number.
Solution: Keyset pagination using WHERE clause on the sort column:
WHERE created_at < :last_seen_timestamp ORDER BY created_at DESC LIMIT 20
Each page then has roughly constant cost regardless of depth, provided an index exists on the sort column.
Sorting is the foundational primitive underlying numerous database operations. Let's consolidate the key concepts:
What's Next:
Our exploration of physical operators concludes with pipelining—the technique that allows operators to work together efficiently, passing tuples without materializing intermediate results. Understanding pipelining reveals how query execution engines achieve their remarkable performance.
You now understand how sorting works at database scale, from cache-efficient in-memory algorithms to external merge sort that handles petabytes. This knowledge applies to ORDER BY, merge joins, sorted aggregation, index creation, and countless internal operations. Next, we'll explore pipelining—the execution model that makes all these operators work together efficiently.