Radix sort's time complexity, O(d × (n + k)), looks deceptively simple. Three variables—d, n, and k—combine to determine performance. But what do these variables mean in practice? When does this complexity beat O(n log n)? When does it disappoint?
Understanding radix sort's complexity requires moving beyond mechanical formula manipulation into genuine insight about when and why this algorithm excels. By the end of this page, you'll be able to confidently predict radix sort's performance for any given scenario and make informed decisions about algorithm selection.
Specifically, you will understand: (1) the precise meaning of each variable in O(d × (n + k)), (2) how d, n, and k relate to input characteristics, (3) how O(d(n+k)) compares with O(n log n), (4) when radix sort provides genuine asymptotic advantages, (5) hidden constants and practical performance considerations, and (6) space complexity trade-offs.
Let's carefully examine each component of radix sort's time complexity:
n — Number of elements
This is the most familiar variable: the size of the input array. Every sorting algorithm has n in its complexity because we must at least read all n elements.
d — Number of digit positions (passes)
This represents how many passes radix sort makes through the data. For integers, d is the number of digits in the largest value when written in the chosen base: 3 passes for values up to 999 in base 10, or 4 passes for 32-bit values in base 256.
k — Radix (base) / Alphabet size
The number of distinct values each digit can take: 10 for decimal digits, 2 for binary digits, 256 when processing one byte at a time.
The k appears because each counting-sort pass spends O(k) time processing the count array, in addition to the O(n) work on the elements themselves.
| Variable | Meaning | Typical Values | How It's Determined |
|---|---|---|---|
| n | Array size | 10¹ to 10⁹ | Input size—given |
| d | Number of digits/passes | 1 to 64 | ⌈log_k(max_value + 1)⌉ |
| k | Radix/base | 2 to 65536 | Algorithm design choice |
Why O(d × (n + k)) and not O(d × n × k)?
Each pass of counting sort does the following: an O(n) scan to tally how many elements have each digit value, an O(k) prefix-sum over the count array, and an O(n) scan to place elements into the output array.
Total per pass: O(n + k), not O(n × k)
The k comes from the accumulation step, which is independent of n. We don't multiply n by k because we're not doing k operations per element—we're doing k operations total for the count array processing.
With d passes, total time is:
T(n, d, k) = d × O(n + k) = O(d × (n + k))
The (n + k) is additive because n and k contribute independently to each pass. If n >> k (many elements, small radix), the n term dominates. If k >> n (few elements, large radix), the k term dominates. In most practical cases, n >> k, so O(d(n+k)) ≈ O(dn).
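To make the additive structure concrete, here is a minimal sketch of one counting-sort pass with the O(n) and O(k) phases marked. The helper `digit_of` (a function that extracts the digit being sorted on) is a hypothetical parameter introduced just for this illustration.

```python
def counting_sort_pass(arr, k, digit_of):
    """One stable counting-sort pass on digit values in the range [0, k)."""
    n = len(arr)
    count = [0] * k
    output = [0] * n

    # O(n): tally each element's current digit
    for x in arr:
        count[digit_of(x)] += 1

    # O(k): prefix sums over the count array -- independent of n
    for i in range(1, k):
        count[i] += count[i - 1]

    # O(n): place elements right-to-left to preserve stability
    for x in reversed(arr):
        d = digit_of(x)
        count[d] -= 1
        output[count[d]] = x

    return output  # total work per pass: O(n + k), not O(n * k)

# Example: one pass on the ones digit in base 10
print(counting_sort_pass([170, 45, 75, 90, 802, 24, 2, 66], 10, lambda x: x % 10))
```

The two O(n) scans and the single O(k) scan never nest, which is exactly why the terms add rather than multiply.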
The number of passes d isn't arbitrary—it's determined by the maximum value M in the array and the chosen radix k:
d = ⌈log_k(M + 1)⌉
This formula captures how many base-k digits are needed to represent M. Let's see this in action:
| Maximum Value M | Base k | d = ⌈log_k(M+1)⌉ | Explanation |
|---|---|---|---|
| 999 | 10 | 3 | 3 decimal digits: 0-999 |
| 1,000,000 | 10 | 7 | 7 decimal digits: 0-9,999,999 |
| 255 (2⁸-1) | 256 | 1 | 1 byte |
| 65,535 (2¹⁶-1) | 256 | 2 | 2 bytes |
| 4,294,967,295 (2³²-1) | 256 | 4 | 4 bytes (32-bit) |
| 2⁶⁴-1 | 256 | 8 | 8 bytes (64-bit) |
| 100 | 2 | 7 | 7 binary digits: 1100100 |
The trade-off between d and k:
Increasing k decreases d (fewer passes) but increases the per-pass overhead: the count array grows to size k, adding O(k) work per pass and, beyond a point, no longer fitting in cache.
Mathematical relationship:
For a fixed maximum value M:
d(k) = ⌈log_k(M + 1)⌉ = ⌈ln(M + 1) / ln(k)⌉
This is a decreasing function of k. As k → M, d → 1, but the count array becomes size M, which might be unacceptable.
```python
import math

def analyze_radix_choice(max_value, n):
    """
    Analyze time complexity for different radix choices.

    Shows how the d × (n + k) expression varies with radix k
    for a given maximum value and array size.
    """
    print(f"Analysis for max_value = {max_value:,}, n = {n:,}")
    print("=" * 70)
    print(f"{'Radix k':<12} {'Digits d':<12} {'d × (n+k)':<20} {'Cache-friendly?'}")
    print("-" * 70)

    for k in [2, 4, 10, 16, 64, 256, 1024, 4096, 65536]:
        d = math.ceil(math.log(max_value + 1, k)) if max_value > 0 else 1
        complexity = d * (n + k)

        # Rough cache assessment: L1 cache is typically 32-64 KB
        # Count array of k integers = 4k or 8k bytes
        cache_friendly = "Yes" if k * 8 < 32768 else "No"

        print(f"{k:<12} {d:<12} {complexity:<20,} {cache_friendly}")
    print()

# Example: Sorting 10 million 32-bit integers
analyze_radix_choice(2**32 - 1, 10_000_000)

# Example: Sorting 1 million values up to 1000
analyze_radix_choice(1000, 1_000_000)

# Example: Sorting 100 values up to 10 million
analyze_radix_choice(10_000_000, 100)
```

For most applications, base 256 (processing one byte at a time) hits the sweet spot:
• 4 passes for 32-bit integers (d = 4)
• Count array of 256 integers ≈ 1-2 KB (fits in L1 cache)
• Bit manipulation (shift/mask) instead of division/modulo
This explains why base 256 is the standard choice in optimized radix sort implementations.
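As a small illustration (the specific values below are made up for the example), extracting a base-256 digit needs only a shift and a mask, whereas a general base-k digit requires integer division and modulo:

```python
x = 0xDEADBEEF  # arbitrary example value

# Base 256: digit i is simply byte i, extracted with a shift and a mask (cheap)
i = 2
byte_digit = (x >> (8 * i)) & 0xFF      # 0xAD

# General base k: digit i requires integer division and modulo (slower)
k = 1000
general_digit = (x // k**i) % k

print(hex(byte_digit), general_digit)
```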
The central question: When does O(d(n+k)) beat O(n log n)?
Setting up the comparison:
For radix sort to be asymptotically faster than merge sort or quicksort, we need:
d × (n + k) < c × n log n
where c is a constant factor.
Assuming k << n (small radix relative to array size), this simplifies to:
d × n < c × n log n
d < c × log n
This tells us something profound: Radix sort wins when the number of digits d is less than log n (the logarithm of the array size).
Case analysis:
Case 1: Fixed-width integers (e.g., 32-bit)
With base 256, d = 4 for any 32-bit integer.
Radix sort: O(4n) = O(n)
Merge sort: O(n log n)
For n > 2⁴ = 16, radix sort's 4 passes beat the log₂(n) levels of comparisons a merge sort performs. For n = 1 million: 4 passes vs. roughly 20 levels (log₂(10⁶) ≈ 20).
Verdict: Radix sort wins decisively for large arrays of fixed-width integers.
Case 2: Variable-precision integers
If integers can grow arbitrarily large with n (e.g., n numbers where the maximum is n^c for some constant c):
d = O(log(n^c)) = O(c log n) = O(log n)
Radix sort: O(log n × n) = O(n log n)
Verdict: No asymptotic advantage—tied with comparison sorts.
Case 3: Integers bounded by n
If maximum value M ≤ n:
d = O(log_k(n))
With k = n^ε for small ε: d = O(1/ε) = O(1)
Radix sort: O(n)
Verdict: Radix sort can achieve true O(n) with careful radix selection.
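A quick sketch of this effect, under the assumption that values are bounded by n (the helper and the specific ε values below are illustrative, not from the original text): as k grows toward n, d collapses to a small constant while each pass stays O(n + k) = O(n).

```python
import math

def passes_needed(max_value, k):
    """Number of base-k digits needed to represent max_value."""
    return max(1, math.ceil(math.log(max_value + 1, k)))

n = 1_000_000
M = n  # hypothetical scenario: values bounded by n

for eps in (0.25, 0.5, 1.0):
    k = int(n ** eps)          # radix chosen as n^eps
    d = passes_needed(M, k)    # stays a small constant (on the order of 1/eps)
    total = d * (n + k)        # d × (n + k) remains O(n)
    print(f"eps={eps}: k={k:,}, d={d}, d*(n+k)={total:,}")
```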
| Scenario | Radix Sort Complexity | Merge Sort Complexity | Winner |
|---|---|---|---|
| 32-bit integers, n = 10⁶ | O(4n) ≈ 4 million ops | O(n log n) ≈ 20 million ops | Radix (5× faster) |
| 64-bit integers, n = 10⁶ | O(8n) ≈ 8 million ops | O(n log n) ≈ 20 million ops | Radix (2.5× faster) |
| 32-bit integers, n = 100 | O(4n) ≈ 400 ops | O(n log n) ≈ 700 ops | Radix (slight edge) |
| Arbitrary precision, digits ∝ log n | O(n log n) | O(n log n) | Tie (constants matter) |
| Small values (max = 1000), n = 10⁶ | O(2n) ≈ 2 million ops | O(n log n) ≈ 20 million ops | Radix (10× faster) |
The comparison assumes d is independent of n. If the maximum value is bounded by a fixed word size (e.g., sorting n random 64-bit integers, where max ≈ 2⁶⁴ no matter how large n gets), d is fixed and radix sort wins. But if the maximum grows with n (say max = n², as when sorting squared values), then d = O(log n) and the asymptotic advantage disappears.
Asymptotic analysis hides constant factors that significantly impact real-world performance. Let's examine these constants for radix sort:
Constant factors in radix sort:
Per-pass overhead: Each counting sort pass involves two full scans of the data (one to count, one to place) plus a scan of the count array, roughly 2n + k memory operations per pass.
Memory access patterns: The counting scan reads elements sequentially, but the placement step writes to scattered positions in the output array, which can cause cache misses for large n.
Constant factors in comparison sorts:
Merge sort: About n log₂ n comparisons plus a comparable number of element moves; its merging passes access memory sequentially, which is cache-friendly.
Quicksort: Excellent cache locality from in-place partitioning and very cheap per-element work, which is why tuned quicksorts are hard to beat in practice despite performing O(n log n) comparisons.
Practical comparison:
For 32-bit integers on modern CPUs, base-256 radix sort performs 4 passes of roughly 2n memory operations each, about 8n operations in total.
For n = 10⁶, that is roughly 8 million radix operations versus n log₂ n ≈ 20 million comparisons for a comparison sort.
But each radix 'operation' involves multiple memory accesses, while each quicksort comparison is a single integer comparison. The winner depends on cache behavior, memory bandwidth, array size, and the quality of each implementation.
```python
import random
import time

def radix_sort_256(arr):
    """Radix sort with base 256."""
    if len(arr) <= 1:
        return arr[:]

    RADIX = 256
    MASK = 255
    n = len(arr)
    output = [0] * n
    current = arr[:]

    for shift in range(0, 32, 8):
        count = [0] * RADIX
        for num in current:
            count[(num >> shift) & MASK] += 1
        for i in range(1, RADIX):
            count[i] += count[i-1]
        for i in range(n-1, -1, -1):
            digit = (current[i] >> shift) & MASK
            count[digit] -= 1
            output[count[digit]] = current[i]
        current, output = output, current

    return current

def benchmark(sizes):
    """Benchmark radix sort vs built-in sort."""
    print(f"{'Size n':<12} {'Radix Sort (s)':<18} {'Built-in Sort (s)':<18} {'Ratio'}")
    print("-" * 60)

    for n in sizes:
        # Generate random 32-bit integers
        data = [random.randint(0, 2**32 - 1) for _ in range(n)]

        # Benchmark radix sort
        data_copy = data[:]
        start = time.perf_counter()
        radix_sort_256(data_copy)
        radix_time = time.perf_counter() - start

        # Benchmark built-in sort (Timsort)
        data_copy = data[:]
        start = time.perf_counter()
        data_copy.sort()
        builtin_time = time.perf_counter() - start

        ratio = radix_time / builtin_time if builtin_time > 0 else float('inf')
        print(f"{n:<12} {radix_time:<18.6f} {builtin_time:<18.6f} {ratio:.2f}x")

# Run benchmark
# Note: In practice, radix sort often wins for large arrays of integers
# but loses to optimized library sorts for smaller arrays due to overhead
benchmark([1000, 10000, 100000, 1000000])
```

Real-world performance varies significantly across platforms:
• CPU architecture: Newer CPUs with better branch prediction favor comparison sorts; high memory bandwidth favors radix sort
• Language/runtime: Low-level implementations (C, Rust) see bigger radix sort wins; interpreted languages have higher overhead
• Data characteristics: Partially sorted data favors adaptive algorithms (Timsort); random data is neutral
Radix sort's space complexity is an important consideration, especially for memory-constrained environments.
Space breakdown for LSD radix sort: an output array of n elements (needed for stable placement) plus a count array of k entries that is reused on every pass.
Total auxiliary space: O(n + k)
Since k is typically much smaller than n (e.g., k = 256), this simplifies to O(n) auxiliary space—the same as merge sort.
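As a rough back-of-the-envelope illustration (the 8-bytes-per-slot figure is an assumption, not a measurement), here is how that auxiliary space breaks down for one million elements with base 256:

```python
n = 1_000_000       # number of elements
k = 256             # radix: size of the count array
bytes_per_slot = 8  # assumed cost of one array slot (one machine word)

output_array = n * bytes_per_slot  # O(n): output buffer for stable placement
count_array = k * bytes_per_slot   # O(k): count array, reused each pass

print(f"output array: {output_array / 1e6:.1f} MB")  # ~8 MB
print(f"count array:  {count_array / 1e3:.1f} KB")   # ~2 KB
# The O(n) output buffer dominates; the O(k) count array is negligible.
```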
| Algorithm | Auxiliary Space | In-Place? | Notes |
|---|---|---|---|
| LSD Radix Sort | O(n + k) | No | Needs output array for stability |
| MSD Radix Sort | O(n + k + d) or O(n + k) | No/Partial | Recursion depth d; can be in-place with care |
| Merge Sort | O(n) | No | Needs temporary array for merging |
| Quicksort | O(log n) | Yes | Only recursion stack space |
| Heap Sort | O(1) | Yes | Truly in-place |
| Counting Sort | O(n + k) | No | Same as radix—it's the subroutine! |
Can radix sort be made in-place?
Technically, yes, but with significant complexity and performance costs:
American Flag Sort (in-place MSD radix sort): permutes elements directly into their buckets within the original array, avoiding the O(n) output buffer, at the cost of stability and more complex index bookkeeping.
In-place counting sort is possible but loses stability or requires multiple passes.
Practical recommendation:
For most applications, accept the O(n) space cost. If memory is critical and in-place sorting is essential, use heapsort or a carefully tuned quicksort instead.
For large arrays, memory bandwidth often matters more than total memory usage. Radix sort's sequential reads in the count phase are efficient, but the random writes in the place phase can thrash the cache. This is why some implementations use multi-pass counting (count all digits first, then place all at once) or careful memory prefetching.
Unlike quicksort (with its O(n²) worst case) or insertion sort (with its O(n) best case), radix sort's complexity is remarkably consistent.
Best case: O(d × (n + k))
Occurs when... well, always! The algorithm performs the same operations regardless of input order. There's no data-dependent branching that skips work.
Average case: O(d × (n + k))
Same as best case. The distribution of values doesn't affect the number of operations—every element is processed at every digit position.
Worst case: O(d × (n + k))
Same again! There's no adversarial input that causes degraded performance. This predictability is a significant advantage in real-time systems.
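One way to see this consistency is to instrument a base-256 LSD radix sort with an operation counter and feed it differently ordered inputs; the counts come out identical. The sketch below is for illustration and reuses the structure of the implementation shown earlier:

```python
import random

def radix_sort_256_counted(arr):
    """Base-256 LSD radix sort that also counts its basic operations."""
    ops = 0
    n = len(arr)
    output = [0] * n
    current = arr[:]
    for shift in range(0, 32, 8):
        count = [0] * 256
        for num in current:                      # counting phase: n ops
            count[(num >> shift) & 0xFF] += 1
            ops += 1
        for i in range(1, 256):                  # prefix sums: k ops
            count[i] += count[i - 1]
            ops += 1
        for i in range(n - 1, -1, -1):           # placement phase: n ops
            digit = (current[i] >> shift) & 0xFF
            count[digit] -= 1
            output[count[digit]] = current[i]
            ops += 1
        current, output = output, current
    return current, ops

data = [random.randint(0, 2**32 - 1) for _ in range(10_000)]
for name, arr in [("random", data),
                  ("sorted", sorted(data)),
                  ("reversed", sorted(data, reverse=True))]:
    _, ops = radix_sort_256_counted(arr)
    print(f"{name:<9} operations: {ops:,}")      # identical for all three orderings
```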
| Algorithm | Best Case | Average Case | Worst Case | Predictable? |
|---|---|---|---|---|
| Radix Sort | O(d(n+k)) | O(d(n+k)) | O(d(n+k)) | ✓ Yes |
| Merge Sort | O(n log n) | O(n log n) | O(n log n) | ✓ Yes |
| Heapsort | O(n log n) | O(n log n) | O(n log n) | ✓ Yes |
| Quicksort | O(n log n) | O(n log n) | O(n²) | ✗ No |
| Insertion Sort | O(n) | O(n²) | O(n²) | ✗ No |
| Timsort | O(n) | O(n log n) | O(n log n) | Partially |
However, while the asymptotic operation count is identical in every case, real-world performance can still vary based on:
Value distribution: If most values share common prefixes (like phone numbers in the same area code), some passes may find most elements in one bucket. This doesn't reduce operations but may affect cache behavior.
Memory alignment: How data aligns with cache lines affects actual speed.
Branch prediction: Though minimal, there are some branches (loop conditions, etc.) that can be affected by data patterns.
The key insight: Radix sort's predictable performance makes it ideal for real-time and latency-sensitive systems, where a guaranteed worst case matters more than a fast average case.
LSD radix sort (with stable counting sort) maintains stability in all cases. There's no input that causes it to become unstable. This deterministic behavior—both in complexity and correctness properties—makes radix sort highly reliable.
Amortized analysis:
Radix sort doesn't have an amortized analysis story like dynamic arrays or splay trees: each sort operation takes the same time, and that consistency carries over when radix sort is invoked repeatedly as a component of a larger system.
Parallel complexity:
Radix sort has interesting parallelization properties:
LSD Radix Sort (parallel):
With P processors, parallel time: O(d × (n/P + k + log P))
The log P factor comes from combining counts across processors.
```python
# Conceptual parallel radix sort (pseudocode-style Python)
# Real implementation would use multiprocessing or threading

def parallel_radix_sort(arr, num_threads=4):
    """
    Parallel LSD radix sort conceptual structure.

    Key idea: Each thread counts its portion, then we combine counts.
    """
    n = len(arr)
    chunk_size = n // num_threads
    RADIX = 256

    for shift in range(0, 32, 8):
        # PARALLEL PHASE 1: Each thread counts its chunk
        local_counts = []
        for t in range(num_threads):
            start = t * chunk_size
            end = start + chunk_size if t < num_threads - 1 else n
            count = [0] * RADIX
            for i in range(start, end):
                digit = (arr[i] >> shift) & 0xFF
                count[digit] += 1
            local_counts.append(count)

        # SEQUENTIAL PHASE: Combine counts (O(k × P) work)
        global_count = [0] * RADIX
        for count in local_counts:
            for i in range(RADIX):
                global_count[i] += count[i]

        # Compute prefix sums
        for i in range(1, RADIX):
            global_count[i] += global_count[i - 1]

        # PARALLEL PHASE 2: Place elements
        # This requires careful coordination to avoid race conditions
        # One approach: compute starting offsets for each thread's elements
        # per bucket, then each thread writes to non-overlapping regions
        # ... (complex coordination logic here)

    return arr

# Time complexity analysis:
# - Sequential: O(d × (n + k))
# - Parallel with P threads: O(d × (n/P + k + log P))
#
# Speedup is limited by:
# 1. The sequential accumulate phase (Amdahl's law)
# 2. Memory bandwidth (all threads compete for bus)
# 3. Synchronization overhead for the place phase
```

MSD Radix Sort (parallel):
MSD has better parallelization potential: after the first pass splits the input into k buckets, each bucket can be sorted independently by a different processor with no further coordination.
With P processors and an average bucket size of n/k, the buckets can be distributed across processors so that the remaining passes proceed with essentially no cross-processor communication.
This makes MSD radix sort attractive for parallel systems, especially with good data distribution.
Radix sort is exceptionally well-suited for GPU acceleration. Modern GPU radix sort implementations can sort billions of integers per second, far exceeding CPU performance. The regular, predictable memory access patterns (with careful design) map well to GPU architectures. Libraries like CUB (CUDA) provide highly optimized GPU radix sort.
We've thoroughly analyzed radix sort's complexity, moving from mechanical formula manipulation to genuine understanding of when and why this algorithm excels.
What's next:
With a solid understanding of radix sort's complexity, we're ready to explore when radix sort excels—the specific scenarios where this algorithm is the clear best choice, and equally importantly, when to use something else.
You now have a deep understanding of radix sort's O(d × (n + k)) complexity, how it compares to O(n log n) algorithms, and the factors that influence real-world performance. Next, we'll synthesize this knowledge into practical guidelines for when to choose radix sort.