When users write SELECT DISTINCT ... in SQL, they're asking a deceptively simple question: "Give me each unique combination of values exactly once." Behind this simple request lies a fundamental computational challenge—how do we efficiently identify and remove duplicates from potentially billions of tuples?
Duplicate elimination is one of the most common operations in database systems, appearing not just in explicit DISTINCT queries but implicitly in UNION (without ALL), subquery optimization, and various internal operations. Understanding its physical implementation reveals deep connections to aggregation, sorting, and set theory.
By the end of this page, you will understand the physical strategies for duplicate elimination—hash-based and sort-based approaches. You'll learn when each strategy excels, how to analyze their costs, and how duplicate elimination integrates with query execution pipelines. You'll also explore optimizations like early elimination and projection pushdown.
Duplicate elimination removes tuples that have identical values across all projected columns. Conceptually, it transforms a multiset (bag) into a set—preserving one instance of each unique tuple and discarding all copies.
Formal Definition:
Given a relation R with tuples t₁, t₂, ..., tₙ, the duplicate elimination operation δ(R) produces a relation in which each distinct tuple of R appears exactly once: δ(R) = {t | t ∈ R}, with every tuple's multiplicity reduced to one.
SQL Syntax and Semantics:
-- Explicit duplicate elimination with DISTINCT
SELECT DISTINCT column1, column2, column3
FROM table_name;
-- Implicit duplicate elimination with UNION
SELECT a, b FROM table1
UNION
SELECT a, b FROM table2; -- DISTINCT applied automatically
-- Explicit preservation of duplicates
SELECT ALL column1, column2 -- ALL is default, usually omitted
FROM table_name;
Why Duplicates Arise:
Duplicates in query results typically come from several sources: projection that drops the distinguishing columns (two different rows can agree on the remaining columns), joins in which one row matches many partners, and unions of overlapping inputs.
The Computational Challenge:
For n input tuples, we must determine which are duplicates. A naive approach of comparing every tuple pair requires O(n²) comparisons—prohibitively expensive for large tables. Database systems use smarter strategies that reduce this to O(n) average case (hash-based) or O(n log n) worst case (sort-based).
| SQL Construct | Duplicate Behavior | Physical Implication |
|---|---|---|
| SELECT DISTINCT | Explicit elimination | Requires dedup operator |
| SELECT ALL | Preserves duplicates | No dedup needed |
| UNION | Implicit elimination | Union + dedup operator |
| UNION ALL | Preserves duplicates | Simple concatenation |
| INTERSECT | Implicit elimination | Set intersection |
| EXCEPT | Implicit elimination | Set difference |
| IN (subquery) | Result set unique | Subquery may need dedup |
| GROUP BY | Groups are unique | Aggregation provides dedup |
SQL operates on multisets (bags), not mathematical sets. This means duplicates are allowed and meaningful—they represent repeated occurrences of data. The DISTINCT keyword explicitly converts from multiset to set semantics. Understanding this distinction is crucial for reasoning about query semantics and optimization.
Hash-based duplicate elimination is the most common strategy in modern databases. It uses a hash table to track which tuples have already been seen, outputting each tuple only on its first occurrence.
Basic Algorithm:
HASH_DISTINCT(input_relation):
seen = empty hash set
for each tuple t in input_relation:
if t not in seen:
seen.add(t)
output t
// No finalization needed—output is produced incrementally
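The pseudocode translates almost line for line into a short Python generator. This is a sketch for illustration (names are my own, not any engine's code), showing the streaming behavior: each unique row is emitted the moment it is first seen.

```python
def hash_distinct(rows):
    """Yield each distinct row on its first occurrence (streaming, non-blocking)."""
    seen = set()
    for row in rows:
        if row not in seen:   # O(1) average-case membership test
            seen.add(row)
            yield row         # emit immediately; no need to see all input first

rows = [("Eng", "Dev"), ("Sales", "Mgr"), ("Eng", "Dev"), ("Eng", "Mgr")]
print(list(hash_distinct(rows)))
# → [('Eng', 'Dev'), ('Sales', 'Mgr'), ('Eng', 'Mgr')]
```

Because `hash_distinct` is a generator, a downstream consumer receives the first unique tuple before the input has been fully scanned, which is exactly the pipelining property discussed below.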
This algorithm is remarkably simple and efficient: each tuple is hashed once, the membership check is O(1) on average, and every unique tuple is emitted the moment it is first seen.
-- Query requiring duplicate elimination
SELECT DISTINCT department, job_title
FROM employees
WHERE hire_date > '2020-01-01';

-- Physical execution plan:
-- 1. Table Scan: employees (with filter predicate)
-- 2. Project: (department, job_title)
-- 3. Hash Distinct:
--    - Compute hash(department, job_title) for each row
--    - Check if hash+values exist in hash set
--    - If new: add to set and emit tuple
--    - If duplicate: discard

-- Hash Set Contents (conceptual):
-- { ("Engineering", "Developer"),
--   ("Sales", "Manager"),
--   ("Engineering", "Manager"),
--   ("Marketing", "Analyst"), ... }

Implementation Considerations:
1. Hashing Strategy:
The hash function must hash the entire tuple (all DISTINCT columns). For multi-column DISTINCT, we compute a composite hash:
hash(t) = combine(hash(col1), hash(col2), ..., hash(colN))
Common combination strategies include XOR with rotation, MurmurHash3's finalization, or CRC-based combination.
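A minimal sketch of such a combination in Python, using FNV-1a-style mixing (chosen here for brevity; production engines typically use stronger mixers such as MurmurHash3):

```python
def combine_hashes(*column_hashes):
    """Combine per-column hashes into one composite, order-sensitive hash."""
    h = 0x811C9DC5                            # FNV-1a offset basis (32-bit)
    for ch in column_hashes:
        h ^= ch & 0xFFFFFFFF                  # fold in the column's hash
        h = (h * 0x01000193) & 0xFFFFFFFF     # FNV prime keeps bits well mixed
    return h

def tuple_hash(t):
    """Composite hash over all DISTINCT columns of a tuple."""
    return combine_hashes(*(hash(col) for col in t))
```

Because the mix is order-sensitive, swapping two column values produces a different composite hash, which matters for multi-column DISTINCT.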
2. Hash Collision Handling:
Hash collisions must be handled carefully—two different tuples may hash to the same bucket. We must store the actual tuple values (or a secondary hash) to detect true duplicates vs collisions:
// On hash match, verify actual equality
if hash(t) == existing_hash:
if t == existing_tuple: // actual duplicate
discard t
else: // collision, not duplicate
insert t at next slot
3. Memory Layout:
For efficiency, hash sets typically store the full hash value alongside each tuple (so most non-matches are rejected without a field-by-field comparison) and use compact, cache-friendly bucket arrays rather than pointer-chased chains.
Hash-based DISTINCT has a crucial pipelining advantage: it can emit output tuples as soon as they're identified as new. The first occurrence of each unique tuple is immediately passed to downstream operators. This means the operator is non-blocking—it doesn't need to see all input before producing output. This property enables much lower query latency compared to blocking alternatives.
When the number of distinct tuples exceeds available memory, hash-based duplicate elimination must spill to disk. The strategy mirrors external hash aggregation: partition, then process each partition independently.
Partition-Based External DISTINCT:
Phase 1: Partition Input. Hash every input tuple on the DISTINCT columns and append it to one of k partition files on disk.
Phase 2: Deduplicate Each Partition. Read each partition back and run in-memory hash DISTINCT on it; with enough partitions, each one fits in memory.
Key Insight: Since we partition by hash value, all copies of any given tuple land in the same partition. Deduplicating each partition independently yields correct overall results.
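The two phases can be sketched in Python, with in-memory lists standing in for the temp files a real engine would write (an illustrative simplification, not production code):

```python
def external_hash_distinct(rows, num_partitions=4):
    """Partition by hash, then deduplicate each partition independently."""
    # Phase 1: route every tuple to its partition
    # (each partition would be a temp file on disk in a real engine)
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)

    # Phase 2: each partition holds ALL copies of its tuples, so a
    # per-partition in-memory hash set yields globally correct results
    result = []
    for part in partitions:
        seen = set()
        for row in part:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result
```

Note that correctness does not depend on partition sizes being balanced, only on all copies of a tuple hashing to the same partition.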
EXTERNAL_HASH_DISTINCT(input, memory_budget):
    // Estimate number of partitions needed
    estimated_distinct = estimate_cardinality(input)
    bytes_per_tuple = avg_tuple_size(input)
    num_partitions = ceiling(estimated_distinct * bytes_per_tuple / memory_budget)

    // Phase 1: Partition the input
    partition_files = create_temp_files(num_partitions)
    for each tuple t in input:
        partition_id = hash_partition(t) mod num_partitions
        write t to partition_files[partition_id]
    flush_all_partitions()

    // Phase 2: Deduplicate each partition
    for i = 0 to num_partitions - 1:
        seen = empty hash set
        for each tuple t in partition_files[i]:
            if t not in seen:
                seen.add(t)
                output t
        clear(seen)
        delete_temp_file(partition_files[i])

I/O Cost Analysis:
Let N be the number of input pages and D the number of output pages of distinct tuples:
Phase 1 (Partitioning): read all N input pages and write N pages of partition files, for 2N I/Os.
Phase 2 (Deduplication): read the N partitioned pages back and write D pages of output.
Total I/O: 3N + D page operations. For example, with N = 1,000,000 pages and D = 100,000 distinct pages, that is 3.1 million page I/Os.
Note that unlike aggregation (which computes summaries), duplicate elimination must output the full tuple data, so output size can be significant.
Data skew can cause some partitions to exceed memory even after the initial partitioning. For example, if 50% of tuples share one value in a DISTINCT column, half of all data ends up in one partition. The solution is recursive partitioning: partition the oversized partition again with a different hash function. Each recursion level adds one extra read pass and one extra write pass over the affected partition's pages.
Sort-based duplicate elimination uses a simple but elegant principle: after sorting, all duplicate tuples are adjacent, making elimination trivial.
Algorithm:
SORT_DISTINCT(input_relation):
sorted_input = external_sort(input_relation)
previous = null
for each tuple t in sorted_input:
if t ≠ previous:
output t
previous = t
// Only need to track ONE previous tuple—constant memory for dedup phase
The beauty of this approach is that after sorting, the deduplication scan requires only O(1) memory—we compare each tuple to its predecessor. All the complexity is front-loaded into the sorting phase.
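In Python the same idea looks like this, with the built-in `sorted` standing in for an external sort (a sketch for illustration only):

```python
def sort_distinct(rows):
    """Sort, then drop adjacent duplicates using O(1) extra state."""
    previous = object()            # sentinel that compares unequal to any row
    for row in sorted(rows):       # after sorting, all duplicates are adjacent
        if row != previous:
            yield row
            previous = row

print(list(sort_distinct([3, 1, 2, 3, 1])))
# → [1, 2, 3]
```

The deduplication loop itself touches only one saved tuple at a time; all the memory pressure lives inside the sort.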
I/O Cost Analysis:
The cost is dominated by the external sort: roughly 2N × (number of passes) page I/Os, or about 2N × (1 + ⌈log_{B−1}(N/B)⌉) for N input pages and B buffer pages.
For typical memory sizes and data volumes, a single merge pass suffices. The final merge can stream directly into the deduplication scan, giving about 3N + D page I/Os (read and write during run formation, read during the merge, write D output pages), the same order as hash-based DISTINCT. The choice therefore usually turns on other factors, such as ordering requirements.
When Sort-Based Wins:
-- Example where sort-based DISTINCT is optimal
-- Query needs both DISTINCT and ORDER BY on same columns

SELECT DISTINCT customer_region, customer_segment
FROM customers
WHERE active = true
ORDER BY customer_region, customer_segment;

-- Optimizer recognizes that:
-- 1. DISTINCT requires grouping by (region, segment)
-- 2. ORDER BY also requires sorting by (region, segment)
-- 3. A single sort satisfies BOTH requirements
--
-- Plan: Table Scan → Sort(region, segment) → Streaming Distinct → Output
--
-- The streaming DISTINCT after sort is pure overhead—just compare adjacent rows
-- Much cheaper than: Hash Distinct → Sort for ORDER BY

-- Index-based optimization:
-- If index exists on (customer_region, customer_segment) with filter:
-- Plan: Index Scan → Streaming Distinct → Output (no explicit sort!)

An optimization called run formation with duplicate elimination combines initial run creation with deduplication. During the in-memory sort phase of external sort, we can eliminate duplicates before writing runs. This reduces the data volume for merge phases, especially effective when there are many duplicates. The run becomes a set rather than a bag, reducing I/O for subsequent merges.
Query optimizers must choose between hash-based and sort-based duplicate elimination, weighing multiple factors including memory, data characteristics, and downstream requirements.
Key Decision Factors:
1. Available Memory vs Distinct Cardinality: if the distinct set fits in memory, hash wins with a single O(n) pass; once it must spill, sort-based processing degrades more predictably.
2. Input Order: input that already arrives sorted (or index-ordered) on the DISTINCT columns makes streaming sort-based elimination essentially free.
3. Output Ordering Requirements: if the query also requires ORDER BY on the same columns, one sort satisfies both DISTINCT and the ordering.
4. Parallelism: hash partitioning parallelizes naturally, since each worker can deduplicate a disjoint hash range independently.
| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| Small distinct set, fits in memory | Hash | O(n) scan, no I/O overhead |
| Large distinct set, limited memory | Sort | Predictable spilling behavior |
| Input sorted on DISTINCT keys | Sort-Streaming | Zero additional cost |
| ORDER BY same as DISTINCT | Sort | One sort serves both |
| High duplicate rate (>90%) | Hash with early output | Most tuples filtered early |
| Parallel execution needed | Hash with partitioning | Natural partition-parallel |
| Unknown cardinality | Hash with adaptive spill | Optimistic with fallback |
Cardinality Estimation Importance:
The optimizer's choice heavily depends on estimating how many distinct values exist. This is challenging because distinct counts do not compose simply across columns (column correlations matter), sampling-based estimates are notoriously unreliable for distinct-value counts, and multi-column DISTINCT has no single per-column statistic that answers it directly.
Statistics Used: per-column distinct-value counts (e.g., PostgreSQL's n_distinct), histograms, and increasingly HyperLogLog-style sketches maintained during statistics collection.
If the optimizer underestimates distinct count, hash DISTINCT may unexpectedly spill to disk, causing severe performance degradation. If it overestimates, it might choose more expensive sort-based DISTINCT unnecessarily. Modern systems use adaptive query processing to switch strategies mid-execution when estimates prove wrong.
Beyond basic algorithms, database systems employ sophisticated optimizations to improve duplicate elimination performance.
1. Projection Pushdown:
Before duplicate elimination, project only the columns involved in DISTINCT. This reduces tuple width, allowing more tuples in memory and less I/O:
-- Original
SELECT DISTINCT a, b FROM table WHERE ...;
-- Optimized plan:
-- Scan → Project(a, b) → DISTINCT
-- Not: Scan (all columns) → DISTINCT → Project
2. Predicate Pushdown:
Apply filters before DISTINCT to reduce input cardinality. Fewer input tuples means less work for deduplication:
-- Push filter before DISTINCT
SELECT DISTINCT customer_id FROM orders WHERE amount > 100;
-- Plan: Scan(orders WHERE amount>100) → Project(customer_id) → DISTINCT
-- Optimization: DISTINCT elimination when result is already unique

-- Example 1: Primary key guarantees uniqueness
SELECT DISTINCT customer_id  -- DISTINCT is redundant!
FROM customers
WHERE customer_id = 123;

-- Optimizer can remove DISTINCT: primary key constraint
-- guarantees at most one row

-- Example 2: Unique index also guarantees uniqueness
SELECT DISTINCT email  -- DISTINCT redundant if email is unique
FROM users
WHERE email = 'test@example.com';

-- Example 3: GROUP BY already produces unique output
SELECT DISTINCT region, SUM(sales)  -- DISTINCT redundant
FROM orders
GROUP BY region;  -- GROUP BY ensures unique regions

-- These are all valid DISTINCT elimination rewrites
-- that avoid the cost of the duplicate elimination operator entirely

3. Early Duplicate Elimination:
In multi-stage pipelines, eliminating duplicates early can dramatically reduce data flow:
Query: SELECT DISTINCT a FROM (SELECT a FROM t1 UNION ALL SELECT a FROM t2)
-- Naive: Full union, then single DISTINCT
-- Better: DISTINCT on each source, then DISTINCT on combined result
-- - Reduces data volume flowing through union
-- - Each partial DISTINCT may be smaller and fit in memory
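The effect of the rewrite is easy to see in a few lines of Python (a toy model of the plan, not engine code): deduplicating each source first shrinks what flows through the union stage.

```python
from itertools import chain

def distinct(rows):
    """Streaming duplicate elimination, as in hash-based DISTINCT."""
    seen = set()
    for r in rows:
        if r not in seen:
            seen.add(r)
            yield r

t1 = [1, 1, 2, 2, 3]          # 5 rows
t2 = [2, 3, 3, 4]             # 4 rows
# Naive: one DISTINCT over the full 9-row UNION ALL.
# Better: DISTINCT each source (3 + 3 rows), then DISTINCT the combination.
early = list(distinct(chain(distinct(t1), distinct(t2))))
print(early)
# → [1, 2, 3, 4]
```

The combined stream carries 6 rows instead of 9 here; with highly duplicated sources the savings can be far larger.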
4. Lazy (Late) Elimination:
Conversely, sometimes deferring DISTINCT is beneficial: a later join, GROUP BY, or semi-join may eliminate the duplicates anyway, or a downstream filter may shrink the stream so much that deduplicating the smaller result is cheaper than deduplicating early.
5. Approximate DISTINCT with HyperLogLog:
For analytical queries where exactness isn't required:
-- Syntax varies by system: Spark SQL and Snowflake expose approx_count_distinct;
-- PostgreSQL offers HyperLogLog via the postgresql-hll extension
SELECT approx_count_distinct(user_id)
FROM events
WHERE event_date > '2024-01-01';
-- Uses HyperLogLog: fixed kilobyte-scale memory, typically ~1-2% error
-- Can process billions of rows with kilobytes of state
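For intuition, here is a bare-bones HyperLogLog estimator in Python. It omits the small-range and large-range corrections that real implementations apply, so treat it purely as a sketch of the mechanism, not a production-quality sketch library:

```python
import hashlib

def hll_estimate(values, p=12):
    """Estimate distinct count with 2**p small registers (~4 KB of state)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                        # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)           # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1    # position of the first 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With p = 12 the standard error is about 1.04/√4096 ≈ 1.6%, yet the state stays a few kilobytes no matter how many rows are processed.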
A subtle optimization: SELECT DISTINCT a FROM t is equivalent to SELECT DISTINCT a FROM t WHERE EXISTS (SELECT 1 FROM t t2 WHERE t2.a = t.a). This algebraic equivalence sometimes enables join-based rewrites that can exploit indexes, particularly in correlated subquery contexts.
In distributed and parallel database systems, duplicate elimination requires coordinating across multiple nodes. The challenge is ensuring that duplicate tuples on different nodes are correctly identified and reduced to one.
Partition-Based Distributed DISTINCT:
The standard approach uses hash partitioning:
1. Hash-partition by DISTINCT columns: each tuple is sent to the node determined by hash(DISTINCT columns), so all copies of any tuple land on the same node.
2. Local DISTINCT on each node: each node performs in-memory or external DISTINCT on its partition.
3. Output union: results from all nodes form the complete answer; no further dedup is needed, since partitioning guarantees no cross-node duplicates.
DISTRIBUTED_DISTINCT(input_partitions, distinct_columns):
    // Step 1: Redistribute by hash of DISTINCT columns
    for each node N:
        for each local tuple t:
            target_node = hash(t[distinct_columns]) mod num_nodes
            send t to target_node
    barrier()  // Wait for all shuffling to complete

    // Step 2: Local duplicate elimination
    for each node N in parallel:
        local_result = HASH_DISTINCT(received_tuples)
        output local_result

    // No global coordination needed—each node has
    // complete data for its hash partitions

-- Example: 3-node cluster processing DISTINCT on customer_id
-- Node 1: hash(customer_id) mod 3 == 0
-- Node 2: hash(customer_id) mod 3 == 1
-- Node 3: hash(customer_id) mod 3 == 2
-- All rows for customer_id=12345 go to same node

Optimization: Partial Distinct Before Shuffle
Network transfer is typically the bottleneck in distributed systems. We can reduce data movement by performing local duplicate elimination before reshuffling:
1. Each node: Local DISTINCT on its original partition
2. Shuffle: Send locally-distinct tuples to hash-determined nodes
3. Each node: Final DISTINCT on received tuples
This adds CPU work but can dramatically reduce network I/O when local data has high duplication.
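The three steps can be simulated in Python, with lists of per-node rows standing in for actual nodes and the network (names and structure are illustrative only):

```python
def distributed_distinct(node_data, num_nodes):
    """Local dedup, hash shuffle, then final dedup on each receiving node."""
    # Steps 1-2: each node deduplicates locally, then "sends" each surviving
    # tuple to the node that owns its hash partition
    inboxes = [[] for _ in range(num_nodes)]
    for local_rows in node_data:
        for row in set(local_rows):                      # local DISTINCT pre-shuffle
            inboxes[hash(row) % num_nodes].append(row)   # shuffle to target node
    # Step 3: final DISTINCT per node; hash partitions are disjoint, so the
    # union of the per-node results is the complete global answer
    return [set(inbox) for inbox in inboxes]

parts = distributed_distinct([[1, 1, 2], [2, 3, 3], [1, 4]], num_nodes=2)
print(sorted(set().union(*parts)))
# → [1, 2, 3, 4]
```

In this toy run, the three source nodes hold 8 rows but shuffle only 7 after local dedup; with heavier duplication the network savings grow accordingly.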
Broadcast Optimization:
For small distinct sets, broadcasting may beat partitioning: each node computes its local distinct set and sends the (small) result to a single coordinator, which performs one final in-memory DISTINCT.
This avoids the overhead of hash partitioning for trivially small result sets.
A powerful distributed optimization uses Bloom filters for early duplicate detection: each node builds a compact Bloom filter over the keys it has already emitted and shares it with its peers. A tuple whose key misses every filter is definitely new; a hit flags a probable duplicate that is dropped from the shuffle and verified exactly at its target node.
This reduces network traffic at the cost of some false positives requiring verification.
Duplicate elimination is a fundamental operation that transforms multisets into sets. Let's consolidate the key concepts: hash-based DISTINCT streams unique tuples in a single O(n) pass and spills by partitioning; sort-based DISTINCT piggybacks on external sort, leaving only a constant-memory adjacent-compare scan; the optimizer chooses between them based on memory, cardinality estimates, input order, and ORDER BY requirements; and distributed DISTINCT hash-partitions the data so each node can deduplicate its share independently.
What's Next:
Duplicate elimination is one of several set-oriented operations. The next page explores set operations (UNION, INTERSECT, EXCEPT) and their physical implementations, building on the hashing and sorting techniques we've covered.
You now understand how duplicate elimination is physically implemented in database systems, from simple in-memory hash sets to sophisticated distributed algorithms. This knowledge applies to DISTINCT queries, UNION operations, and many internal database operations. Next, we'll explore the closely related set operations.