Every time you ask a database "How many customers do we have?" or "What's the average order value by region?", you invoke one of the most powerful yet computationally demanding categories of operators: aggregation operators. These operators transform potentially billions of individual rows into summarized insights—counts, sums, averages, minimums, maximums, and statistical measures that drive business intelligence across the modern enterprise.
Aggregation is fundamentally different from relational selection or projection. While selection filters rows and projection removes columns, aggregation collapses entire groups of tuples into single summary values. This collapse operation introduces unique challenges: How do we efficiently compute aggregates over massive datasets? How do we handle grouping when groups don't fit in memory? How do we pipeline aggregate results when the aggregate itself requires seeing all inputs before producing output?
By the end of this page, you will understand the physical implementation strategies for aggregation operators, including hash-based and sort-based aggregation. You'll learn how to analyze the I/O and CPU costs of different approaches, understand the distinction between scalar and grouped aggregates, and recognize the tradeoffs between memory consumption and disk utilization in aggregate processing.
Before diving into physical implementations, we must establish a precise understanding of what aggregation means in relational algebra and SQL. Aggregation operators perform computations across sets of rows to produce summary results.
The SQL Aggregation Model:
In SQL, aggregation involves two conceptually distinct phases:
Grouping Phase: Rows are partitioned into groups based on the GROUP BY clause. Rows with identical values in the grouping columns belong to the same group.
Aggregation Phase: Within each group, aggregate functions compute summary values over the group members.
When no GROUP BY clause is present, the entire input relation forms a single group—this is called scalar aggregation. When GROUP BY is present, we perform grouped aggregation.
| Function | Description | State Requirements | Algebraic Type |
|---|---|---|---|
| COUNT(*) | Counts total rows in group | Single integer counter | Distributive |
| COUNT(column) | Counts non-null values | Single integer counter | Distributive |
| SUM(column) | Adds all values in group | Single numeric accumulator | Distributive |
| AVG(column) | Average of values in group | Sum and count (two values) | Algebraic |
| MIN(column) | Minimum value in group | Single value (current min) | Distributive |
| MAX(column) | Maximum value in group | Single value (current max) | Distributive |
| VARIANCE() | Statistical variance | Sum, sum of squares, count | Algebraic |
| STDDEV() | Standard deviation | Same as variance | Algebraic |
| MEDIAN() | Median value in group | All values (cannot stream) | Holistic |
| MODE() | Most frequent value | Frequency counts per value | Holistic |
Understanding the algebraic classification is crucial for physical implementation:
Distributive: Can be computed in parallel by partitioning data, then combining partial results (COUNT, SUM, MIN, MAX).
Algebraic: Derived from fixed-size intermediate state that can be partitioned and merged (AVG = SUM/COUNT).
Holistic: Require access to all values—cannot be computed from fixed-size partial state (MEDIAN, MODE). These are the most expensive to implement.
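To make the classification concrete, here is a small illustrative Python sketch (not taken from any particular engine) showing how partial results from two data partitions combine for a distributive, an algebraic, and a holistic aggregate:

```python
# Partial states computed on two partitions of the data are merged.
# AVG works because its intermediate state (sum, count) is fixed-size,
# while MEDIAN has no fixed-size partial state and needs all values.

partition_a = [10, 20, 30]
partition_b = [40, 50]

# Distributive: SUM -- partial results combine with the same operator
sum_a, sum_b = sum(partition_a), sum(partition_b)
total_sum = sum_a + sum_b                       # 150

# Algebraic: AVG -- merge fixed-size (sum, count) states, finalize at the end
state_a = (sum(partition_a), len(partition_a))  # (60, 3)
state_b = (sum(partition_b), len(partition_b))  # (90, 2)
merged = (state_a[0] + state_b[0], state_a[1] + state_b[1])
avg = merged[0] / merged[1]                     # 30.0

# Holistic: MEDIAN -- no fixed-size partial state; all values must be gathered
all_values = sorted(partition_a + partition_b)
median = all_values[len(all_values) // 2]       # 30
```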
The Physical Challenge:
From a physical operator perspective, aggregation presents several challenges:
Memory Pressure: When there are many distinct groups, the aggregation state may not fit in memory.
Blocking Nature: Most aggregate implementations must see all input tuples before producing any output—this blocks pipelining.
Data Movement: Grouping requires bringing together tuples with matching group keys, potentially from different disk pages.
State Accumulation: Each group maintains running state (counters, sums) that must be updated efficiently.
Partitioning Complexity: For distributed or parallel aggregation, we must correctly partition work while maintaining correctness.
Hash-based aggregation is the most common strategy in modern database systems due to its excellent average-case performance and natural fit with main-memory processing. The fundamental idea is elegant: use a hash table where keys are the grouping columns and values are the aggregation state (running sums, counts, etc.).
Basic Algorithm (In-Memory):
HASH_AGGREGATE(input_relation, group_keys, aggregate_functions):
hash_table = empty hash table
for each tuple t in input_relation:
key = extract group_keys from t
if key exists in hash_table:
entry = hash_table[key]
update entry's aggregate state with t's values
else:
create new entry with initial aggregate state
hash_table[key] = entry
for each entry in hash_table:
output (entry.key, finalize(entry.aggregate_state))
This algorithm scans the input exactly once, making it O(n) for n input tuples, assuming hash table operations are O(1) amortized.
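As a complement to the pseudocode, here is a minimal runnable Python sketch of the same idea; the dictionary stands in for the engine's specialized hash table, and the region/amount columns are made up for illustration:

```python
# Illustrative version of HASH_AGGREGATE: compute COUNT(*), SUM(amount),
# and MAX(amount) grouped by region in a single pass over the input.
def hash_aggregate(rows):
    table = {}  # group key -> [count, sum, max]
    for region, amount in rows:
        state = table.get(region)
        if state is None:
            table[region] = [1, amount, amount]      # initialize state
        else:
            state[0] += 1                            # COUNT(*)
            state[1] += amount                       # SUM(amount)
            state[2] = max(state[2], amount)         # MAX(amount)
    # Finalize: emit one result tuple per group
    return [(k, c, s, m) for k, (c, s, m) in table.items()]

rows = [("west", 10.0), ("east", 25.0), ("west", 5.0), ("east", 40.0)]
print(hash_aggregate(rows))
# [('west', 2, 15.0, 10.0), ('east', 2, 65.0, 40.0)]
```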
-- Example: Hash aggregation for sales analysis
-- The optimizer chooses hash aggregation when:
--   1. Estimated number of groups is moderate
--   2. Sufficient memory is available
--   3. Input is not already sorted by group keys

SELECT
    region_id,
    product_category,
    COUNT(*) AS total_orders,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value,
    MAX(order_amount) AS largest_order
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY region_id, product_category;

-- Physical execution:
--   1. Scan orders table with filter predicate
--   2. For each qualifying row, compute hash(region_id, product_category)
--   3. Look up or create hash bucket for this group
--   4. Update running aggregates: count++, sum += amount,
--      avg_state.sum += amount, avg_state.count++, max = MAX(max, amount)
--   5. After all rows processed, finalize and output each bucket

Memory Layout and Hash Table Design:
Efficient hash aggregation requires careful attention to memory layout. Modern implementations typically use:
Open Addressing with Linear Probing: Better cache locality than chaining, important when hash table fits in L2/L3 cache.
Pre-allocated Buckets: Each bucket contains space for all aggregate state, avoiding pointer chasing.
Key Normalization: Fixed-size keys enable faster comparison and hashing.
Aggregation State Compaction: Store only the minimum state needed (e.g., for AVG, store sum and count, not all values).
| Field | Size (bytes) | Description |
|---|---|---|
| Key Hash | 8 | Pre-computed hash for fast comparison |
| Group Key(s) | Variable | The actual grouping column values |
| COUNT state | 8 | Running count for COUNT(*) |
| SUM state | 8-16 | Running sum (integer or decimal) |
| AVG state | 16 | Sum + count for average computation |
| MIN/MAX state | Variable | Current minimum/maximum values |
| Flags | 1-4 | Null handling, overflow detection |
Think of each hash bucket as a "mini database" containing all the information needed to compute final aggregate values for one group. The design goal is to minimize bucket size (more groups fit in memory) while enabling O(1) updates. This is why holistic aggregates like MEDIAN are expensive—they can't be represented with fixed-size state.
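As a rough illustration of state compaction (this is not any engine's actual layout), a per-group state object might look like the following in Python, with a fixed set of fields and O(1) updates no matter how many rows the group has absorbed:

```python
# Compact per-group aggregation state, loosely mirroring the bucket layout
# in the table above. __slots__ keeps the object small and fixed-size.
class GroupState:
    __slots__ = ("key_hash", "key", "count", "total", "min_v", "max_v")

    def __init__(self, key_hash, key, value):
        self.key_hash = key_hash    # pre-computed hash for fast comparisons
        self.key = key              # grouping column value(s)
        self.count = 1              # COUNT(*) and the count half of AVG
        self.total = value          # SUM and the sum half of AVG
        self.min_v = value          # MIN
        self.max_v = value          # MAX

    def accumulate(self, value):
        self.count += 1
        self.total += value
        if value < self.min_v:
            self.min_v = value
        if value > self.max_v:
            self.max_v = value

    def finalize(self):
        return (self.key, self.count, self.total,
                self.total / self.count, self.min_v, self.max_v)
```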
When the number of distinct groups exceeds available memory, hash-based aggregation must spill to disk. This is similar to the partitioning approach used in grace hash join, adapted for aggregation.
Partition-Based External Aggregation:
The algorithm proceeds in two phases:
Phase 1: Partition
Phase 2: Aggregate Each Partition
The key insight is that if the input contains G distinct groups, each group's state occupies B bytes, and M bytes of memory are available (enough to hold M/B groups at a time), then we need roughly ⌈G × B / M⌉ partitions so that each partition's groups fit in memory.
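For example, with purely illustrative numbers, the partition count works out as follows:

```python
import math

# Worked example of the partition-count estimate with made-up numbers:
# G = 50 million distinct groups, B = 64 bytes of state per group,
# M = 512 MB of memory available for the aggregation hash table.
G = 50_000_000
B = 64
M = 512 * 1024 * 1024

partitions = math.ceil(G * B / M)   # ceil(3.2 GB / 512 MB) = 6
print(partitions)                   # 6
```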
EXTERNAL_HASH_AGGREGATE(input, group_keys, aggregates, memory_budget):
    // Phase 1: Partition the input based on group keys
    num_partitions = estimate_partitions(input.cardinality, memory_budget)
    partition_files = create_temp_files(num_partitions)
    for each tuple t in input:
        partition_id = hash_partition(group_keys(t), num_partitions)
        write t to partition_files[partition_id]
    flush_all_partitions()

    // Phase 2: Aggregate each partition in memory
    output_buffer = []
    for i = 0 to num_partitions - 1:
        hash_table = empty hash table
        for each tuple t in partition_files[i]:
            key = group_keys(t)
            if key in hash_table:
                update_aggregates(hash_table[key], t)
            else:
                hash_table[key] = initialize_aggregates(t)
        // Output aggregated results for this partition
        for each entry in hash_table:
            output_buffer.append(finalize_result(entry))
        clear(hash_table)
        delete_temp_file(partition_files[i])
    return output_buffer

I/O Cost Analysis:
Let's analyze the I/O cost of external hash aggregation:
Phase 1 (Partitioning): Read all N input pages and write roughly N pages of partition files, for about 2N page I/Os.
Phase 2 (Aggregation): Read each partition file back exactly once, another N page I/Os; results are emitted directly from the in-memory hash tables.
Total I/O Cost: 3N page operations
This is significantly more expensive than in-memory aggregation (just N reads), but unavoidable when groups don't fit in memory.
If a single partition still doesn't fit in memory (pathological data skew or very limited memory), we must recursively partition that partition. Each additional partitioning level adds 2N I/O cost. In extreme cases with severe skew, this can degrade to O(N × log(G/M)) I/O complexity. Modern optimizers detect this risk and may choose sort-based aggregation instead.
Sort-based aggregation takes a fundamentally different approach: instead of hashing to group tuples, we sort the input by the grouping columns, then scan sequentially to aggregate consecutive tuples belonging to the same group.
Algorithm:
SORT_AGGREGATE(input_relation, group_keys, aggregate_functions):
// Phase 1: Sort by group keys
sorted_input = external_sort(input_relation, group_keys)
// Phase 2: Sequential scan with aggregation
current_key = null
current_state = null
for each tuple t in sorted_input:
key = extract group_keys from t
if key ≠ current_key:
if current_key ≠ null:
output (current_key, finalize(current_state))
current_key = key
current_state = initialize aggregate state
update current_state with t's values
// Output last group
if current_key ≠ null:
output (current_key, finalize(current_state))
The elegance of this approach is that after sorting, aggregation is trivially streaming—we only need to track state for one group at a time, regardless of how many total groups exist.
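Here is a minimal runnable Python sketch of the scan phase; Python's sorted and itertools.groupby stand in for the external sort and the detection of group boundaries:

```python
from itertools import groupby
from operator import itemgetter

# Scan phase of SORT_AGGREGATE: once rows are sorted by the group key,
# consecutive rows with the same key form a group, so we only ever hold
# aggregation state for a single group at a time.
def sort_aggregate(rows):
    rows = sorted(rows, key=itemgetter(0))        # stands in for external sort
    for key, group in groupby(rows, key=itemgetter(0)):
        count = total = 0
        for _, amount in group:                   # streaming accumulation
            count += 1
            total += amount
        yield (key, count, total, total / count)  # finalize this group

rows = [("east", 40.0), ("west", 10.0), ("east", 25.0), ("west", 5.0)]
print(list(sort_aggregate(rows)))
# [('east', 2, 65.0, 32.5), ('west', 2, 15.0, 7.5)]
```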
I/O Cost Analysis:
The total cost depends on the sorting phase:
External Sort Cost: Typically 2N × (1 + ⌈logₘ₋₁(N/M)⌉), where N is the number of input pages, M is the number of in-memory buffer pages, and each sort or merge pass reads and writes all N pages once; a worked example follows below.
Aggregation Scan: N pages (one sequential scan of sorted data)
Total: O(N log N) I/O for sorting plus O(N) for aggregation
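To make the formula concrete, here is a small worked calculation; N = 1,000,000 input pages and M = 100 buffer pages are purely illustrative values:

```python
import math

# Plugging illustrative numbers into the external sort cost formula above:
N = 1_000_000   # input pages
M = 100         # in-memory buffer pages

merge_passes = math.ceil(math.log(N / M, M - 1))   # ceil(log_99(10000)) = 3
sort_io = 2 * N * (1 + merge_passes)               # 8,000,000 page I/Os
aggregation_scan = N                               # 1,000,000 page I/Os
print(sort_io, aggregation_scan)
```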
This looks worse than hash aggregation's N page I/Os in the in-memory case and roughly 3N in the spilling case. So why would we ever use sort-based aggregation?
Sort-based aggregation becomes advantageous when:
Input is already sorted by group keys (e.g., from an index scan or prior sort)—cost drops to just O(N)
Query requires sorted output (ORDER BY matching GROUP BY)—sorting cost is amortized
Extremely high group cardinality where hash table won't fit—sort has more predictable spilling behavior
Computing holistic aggregates like MEDIAN that require all values sorted
Limited memory with many groups—sort's recursive merging is more memory-efficient than recursive hash partitioning
| Factor | Favors Hash | Favors Sort |
|---|---|---|
| Input order | Unordered input | Pre-sorted input by group keys |
| Group cardinality | Moderate (fits in memory) | Very high or unknown |
| Output requirements | Any order acceptable | Sorted output needed |
| Aggregate types | Distributive/algebraic only | Holistic aggregates required |
| Memory availability | Ample memory | Constrained memory |
| Data skew | Uniform distribution | Highly skewed (hotspot groups) |
| Parallelization | Partition-parallel | Merge-parallel |
In distributed and parallel database systems, aggregation presents unique challenges: data is spread across multiple nodes, and network transfer is expensive. The solution is multi-phase aggregation that minimizes data movement.
Two-Phase Aggregation Pattern:
This pattern exploits the distributive and algebraic properties of aggregates: each node first aggregates its local rows into partial states, those partial states are shuffled (repartitioned) by group key, and a final aggregation merges them into the global result, as the conceptual trace below shows.
-- Conceptual execution of distributed aggregation
-- Query: SELECT region, SUM(sales), AVG(sales)
--        FROM orders GROUP BY region

-- PHASE 1: Local aggregation at each node
-- Node 1 (has orders for USA, Canada):
--   {USA: (sum=100000, count=500), Canada: (sum=50000, count=200)}
-- Node 2 (has orders for USA, Mexico):
--   {USA: (sum=80000, count=400), Mexico: (sum=30000, count=150)}
-- Node 3 (has orders for Canada, Mexico):
--   {Canada: (sum=60000, count=250), Mexico: (sum=25000, count=100)}

-- PHASE 2: Shuffle by group key (region)
-- Coordinator for USA receives: (100000, 500), (80000, 400)
-- Coordinator for Canada receives: (50000, 200), (60000, 250)
-- Coordinator for Mexico receives: (30000, 150), (25000, 100)

-- PHASE 3: Final aggregation
-- USA:    SUM=180000, AVG=180000/900 = 200.00
-- Canada: SUM=110000, AVG=110000/450 = 244.44
-- Mexico: SUM=55000,  AVG=55000/250  = 220.00
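A small Python sketch of the final-aggregation step, merging the fixed-size (sum, count) partial states from the trace above (the numbers match the example; the structure is illustrative):

```python
# Each node ships a fixed-size (sum, count) partial state per region; the
# coordinator merges the states and finalizes SUM and AVG.
partials = {
    "USA":    [(100000, 500), (80000, 400)],
    "Canada": [(50000, 200), (60000, 250)],
    "Mexico": [(30000, 150), (25000, 100)],
}

for region, states in partials.items():
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    print(region, total, round(total / count, 2))
# USA 180000 200.0
# Canada 110000 244.44
# Mexico 55000 220.0
```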
Data Reduction Benefits:
The power of partial aggregation lies in data reduction. Consider aggregating 1 billion rows across 100 nodes with 1,000 distinct groups: without local pre-aggregation, all 1 billion rows must be shuffled across the network; with it, each node sends at most 1,000 partial states, or roughly 100,000 rows in total.
The reduction factor can be 10,000x or more, dramatically reducing network I/O, which is often the bottleneck in distributed queries.
Push-Down Optimization:
Smart query optimizers push partial aggregation as close to the data source as possible, even into storage engines or remote data sources. This is particularly important in distributed and federated settings, where shipping raw rows across the network is the dominant cost.
Partial aggregation can actually hurt performance when group cardinality is very high relative to tuple count. If you have 10 million rows with 9 million distinct groups, partial aggregation only reduces to 9 million partial results—nearly no benefit, plus the overhead of creating partial state. Smart optimizers detect this via statistics and skip partial aggregation when reduction is expected to be minimal.
Production database systems employ numerous optimizations beyond the basic algorithms. Understanding these details illuminates how modern query engines achieve exceptional aggregation performance.
Aggregate Function State Management:
Each aggregate function defines three operations: Initialize (create a fresh, empty aggregation state), Accumulate (fold one input value into the state), and Finalize (turn the accumulated state into the output value).
For distributive and algebraic aggregates, we also need a Merge operation that combines two partial states; this is what makes partial, parallel, and distributed aggregation possible.
These operations are typically implemented as specialized code per aggregate type, often JIT-compiled for maximum efficiency.
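A sketch of this contract in Python, using AVG as the running example; real engines implement it in native, often JIT-compiled code, so the class here is purely illustrative. The table below lists the concrete operations for common aggregates:

```python
# The aggregate-function contract: initialize / accumulate / merge / finalize.
class AvgAggregate:
    def initialize(self):
        return [0.0, 0]                      # (sum, count) state

    def accumulate(self, state, value):
        state[0] += value                    # fold one input into the state
        state[1] += 1

    def merge(self, s1, s2):
        return [s1[0] + s2[0], s1[1] + s2[1]]   # combine two partial states

    def finalize(self, state):
        # SQL semantics: AVG over an empty group is NULL, not zero
        return state[0] / state[1] if state[1] else None
```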
| Aggregate | Initialize | Accumulate(v) | Merge(s1, s2) | Finalize |
|---|---|---|---|---|
| COUNT(*) | count = 0 | count++ | count1 + count2 | return count |
| SUM | sum = 0 | sum += v | sum1 + sum2 | return sum |
| MIN | min = +∞ | min = MIN(min, v) | MIN(min1, min2) | return min |
| MAX | max = -∞ | max = MAX(max, v) | MAX(max1, max2) | return max |
| AVG | (sum=0, cnt=0) | sum+=v; cnt++ | (sum1+sum2, cnt1+cnt2) | return sum/cnt |
| VARIANCE | (n=0, mean=0, M2=0) | Welford update | Parallel Welford merge | return M2/(n-1) |
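For the VARIANCE row, a sketch of Welford's single-pass update and the parallel merge might look like this (illustrative Python; the finalize step matches the M2/(n-1) sample variance in the table):

```python
# Welford update and parallel merge for numerically stable one-pass variance.
def welford_accumulate(state, x):
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return (n, mean, m2)

def welford_merge(a, b):
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)

def welford_finalize(state):
    n, _, m2 = state
    return m2 / (n - 1) if n > 1 else None   # sample variance, as in the table

state = (0, 0.0, 0.0)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    state = welford_accumulate(state, x)
print(welford_finalize(state))   # 4.5714... (sample variance of the list)
```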
NULL Handling:
SQL semantics for NULLs in aggregates are precise but often overlooked: aggregate functions other than COUNT(*) ignore NULL inputs, COUNT(column) counts only non-NULL values, and SUM, AVG, MIN, and MAX return NULL rather than zero when every value in the group is NULL, while COUNT returns 0.
Physical implementations must track NULL state separately, adding a NULL bitmap or flag to the aggregate state.
DISTINCT Aggregates:
Aggregates like COUNT(DISTINCT column) or SUM(DISTINCT column) require deduplication before aggregation. This is typically implemented by maintaining a per-group set of values seen so far (usually a hash set) and accumulating only the first occurrence of each value, or by sorting on the group key and the aggregated column so that duplicates become adjacent and can be skipped.
This significantly increases memory consumption because we must store all distinct values, not just aggregate state.
-- DISTINCT aggregation requires per-group deduplication
SELECT
    department_id,
    COUNT(DISTINCT employee_id) AS unique_employees,
    SUM(DISTINCT salary_grade) AS unique_grade_sum
FROM employees
GROUP BY department_id;

-- Physical execution maintains per-group hash sets:
--
-- department_id=1:
--   employee_id_set: {101, 102, 103}   -- for COUNT(DISTINCT)
--   salary_grade_set: {5, 6, 7}        -- for SUM(DISTINCT)
--
-- Memory: O(distinct_values_per_group × number_of_groups)
-- Much higher than regular aggregation's O(number_of_groups)

For very large datasets, exact COUNT(DISTINCT) is prohibitively expensive. Many systems therefore offer approximate alternatives based on probabilistic data structures such as HyperLogLog, which estimates the number of distinct values from a small, fixed-size array of registers.
These trade exactness for dramatic memory savings—fitting trillion-row distinct counts in kilobytes.
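As a heavily simplified sketch of the HyperLogLog idea (omitting the bias and small-range corrections that real implementations apply), the following Python code estimates a distinct count from a fixed array of registers:

```python
import hashlib
import math

# Hash each value, use the top p bits to pick a register, and keep only the
# maximum "rank" (leading-zero count + 1) of the remaining bits. Memory stays
# at m small registers no matter how many distinct values flow through.
P = 10                      # 2^10 = 1024 registers
M = 1 << P
registers = [0] * M
ALPHA = 0.7213 / (1 + 1.079 / M)

def add(value):
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    j = h >> (64 - P)                        # register index from top p bits
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros in the rest, + 1
    registers[j] = max(registers[j], rank)

def estimate():
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(100_000):
    add(f"user-{i}")
print(round(estimate()))    # close to 100,000, from just 1,024 small registers
```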
Modern analytical database systems leverage SIMD (Single Instruction, Multiple Data) instructions to dramatically accelerate aggregation. Rather than processing one tuple at a time, vectorized aggregation operates on batches of tuples simultaneously.
Vectorized Hash Aggregation:
The key insight is to separate the expensive per-tuple hash table operations from the cheap arithmetic accumulation: compute the hashes for an entire batch of tuples, probe the hash table for the whole batch to obtain bucket addresses, insert any groups that were not found, and then apply the aggregate updates to all buckets with SIMD arithmetic.
Each step can be pipelined and vectorized, achieving 4-16x speedup over scalar code.
VECTORIZED_HASH_AGGREGATE(input_batches, group_keys, aggregates):
    hash_table = empty vectorized hash table
    BATCH_SIZE = 1024  // Process 1024 tuples at a time

    for each batch in input_batches:
        // Step 1: Compute hashes for entire batch using SIMD
        hashes[BATCH_SIZE] = simd_hash(batch.group_keys)

        // Step 2: Probe hash table, get bucket addresses
        //         Uses prefetching to hide memory latency
        bucket_addrs[BATCH_SIZE] = hash_table.batch_probe(hashes)

        // Step 3: Handle new groups (miss in hash table)
        for i where bucket_addrs[i] == NULL:
            bucket_addrs[i] = hash_table.insert(batch.group_keys[i])

        // Step 4: Vectorized aggregate update
        //         SIMD add for SUM, SIMD compare for MIN/MAX
        simd_accumulate(bucket_addrs, batch.aggregate_columns)

    return hash_table.finalize_all()
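The accumulate step (Step 4 above) can be illustrated with NumPy on columnar data: once each row's group has been resolved to a dense ID, entire columns are aggregated in a handful of vectorized calls. This is a sketch of the idea only, not how any particular engine implements its SIMD kernels:

```python
import numpy as np

# Columnar, batch-at-a-time accumulation: group IDs are the result of the
# hash-table probe; whole columns are then aggregated in vectorized calls
# instead of one row at a time.
region_ids = np.array([0, 1, 0, 1, 2, 0])              # resolved group IDs
amounts = np.array([10.0, 25.0, 5.0, 40.0, 7.5, 2.5])  # aggregate column

num_groups = 3
counts = np.bincount(region_ids, minlength=num_groups)
sums = np.bincount(region_ids, weights=amounts, minlength=num_groups)

maxes = np.full(num_groups, -np.inf)
np.maximum.at(maxes, region_ids, amounts)              # scatter-max per group

print(counts)   # [3 2 1]
print(sums)     # [17.5 65.   7.5]
print(maxes)    # [10.  40.   7.5]
```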
Column-Oriented Aggregation:
Column stores are particularly well-suited for aggregation because only the grouping columns and the aggregated columns need to be read from storage, the values of each column are stored contiguously and can be streamed straight into SIMD registers, and lightweight compression schemes such as dictionary and run-length encoding often let aggregates be computed over compressed data.
Late Materialization:
A powerful optimization called late materialization defers constructing result tuples until necessary: during aggregation the operator works only with group keys, row positions, and the columns referenced by aggregate functions, and complete result tuples are assembled only after each group's aggregate values have been finalized.
Top-tier analytical engines (ClickHouse, DuckDB, Velox) achieve aggregation rates of 100M+ rows per second through careful hardware optimization: cache-conscious hash tables, SIMD-friendly memory layouts, prefetching, and vectorized or compiled execution paths.
The difference between naive and optimized aggregation is often 10-50x.
Aggregation operators are fundamental to analytical query processing, transforming raw data into actionable insights. Let's consolidate the key concepts:
Aggregates fall into distributive, algebraic, and holistic classes, which determines how much per-group state is needed and whether partial results can be merged.
Hash-based aggregation scans the input once and keeps per-group state in a hash table, spilling to hash partitions (about 3N page I/Os) when the groups exceed memory.
Sort-based aggregation streams over input sorted on the group keys and needs state for only one group at a time; it wins when the input is pre-sorted, sorted output is required, or holistic aggregates are involved.
Distributed systems use partial (local) aggregation followed by a shuffle and final aggregation to minimize network transfer.
Production engines layer on vectorization, careful NULL and DISTINCT handling, and approximate algorithms to push throughput further.
What's Next:
Aggregation operators often work alongside duplicate elimination, which shares many implementation strategies. The next page explores how DISTINCT queries and duplicate handling are physically implemented, building on the hash and sort techniques we've learned here.
You now understand how aggregation operators are physically implemented in database systems. From in-memory hash aggregation to distributed partial aggregation, these techniques underpin every analytical query that summarizes data. Next, we'll explore duplicate elimination—a closely related operation with its own implementation nuances.