When users write SELECT DISTINCT ... in SQL, they're asking a deceptively simple question: "Give me each unique combination of values exactly once." Behind this simple request lies a fundamental computational challenge—how do we efficiently identify and remove duplicates from potentially billions of tuples?
Duplicate elimination is one of the most common operations in database systems, appearing not just in explicit DISTINCT queries but implicitly in UNION (without ALL), subquery optimization, and various internal operations. Understanding its physical implementation reveals deep connections to aggregation, sorting, and set theory.
By the end of this page, you will understand the physical strategies for duplicate elimination—hash-based and sort-based approaches. You'll learn when each strategy excels, how to analyze their costs, and how duplicate elimination integrates with query execution pipelines. You'll also explore optimizations like early elimination and projection pushdown.
Duplicate elimination removes tuples that have identical values across all projected columns. Conceptually, it transforms a multiset (bag) into a set—preserving one instance of each unique tuple and discarding all copies.
Formal Definition:
Given a relation R with tuples t₁, t₂, ..., tₙ, the duplicate elimination operation δ(R) produces a relation in which each distinct tuple of R appears exactly once: δ(R) = {t | t ∈ R}, with every tuple's multiplicity reduced to one.
SQL Syntax and Semantics:
-- Explicit duplicate elimination with DISTINCT
SELECT DISTINCT column1, column2, column3
FROM table_name;
-- Implicit duplicate elimination with UNION
SELECT a, b FROM table1
UNION
SELECT a, b FROM table2; -- DISTINCT applied automatically
-- Explicit preservation of duplicates
SELECT ALL column1, column2 -- ALL is default, usually omitted
FROM table_name;
Why Duplicates Arise:
Duplicates in query results typically come from several sources: projection that drops the distinguishing columns (two different rows can agree on the remaining columns), joins in which one row matches many partners, and unions of overlapping inputs.
The Computational Challenge:
For n input tuples, we must determine which are duplicates. A naive approach of comparing every tuple pair requires O(n²) comparisons—prohibitively expensive for large tables. Database systems use smarter strategies that reduce this to O(n) average case (hash-based) or O(n log n) worst case (sort-based).
| SQL Construct | Duplicate Behavior | Physical Implication |
|---|---|---|
| SELECT DISTINCT | Explicit elimination | Requires dedup operator |
| SELECT ALL | Preserves duplicates | No dedup needed |
| UNION | Implicit elimination | Union + dedup operator |
| UNION ALL | Preserves duplicates | Simple concatenation |
| INTERSECT | Implicit elimination | Set intersection |
| EXCEPT | Implicit elimination | Set difference |
| IN (subquery) | Result set unique | Subquery may need dedup |
| GROUP BY | Groups are unique | Aggregation provides dedup |
SQL operates on multisets (bags), not mathematical sets. This means duplicates are allowed and meaningful—they represent repeated occurrences of data. The DISTINCT keyword explicitly converts from multiset to set semantics. Understanding this distinction is crucial for reasoning about query semantics and optimization.
Hash-based duplicate elimination is the most common strategy in modern databases. It uses a hash table to track which tuples have already been seen, outputting each tuple only on its first occurrence.
Basic Algorithm:
HASH_DISTINCT(input_relation):
seen = empty hash set
for each tuple t in input_relation:
if t not in seen:
seen.add(t)
output t
// No finalization needed—output is produced incrementally
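The pseudocode translates almost line for line into a short Python generator. This is a sketch for illustration (names are my own, not any engine's code), showing the streaming behavior: each unique row is emitted the moment it is first seen.

```python
def hash_distinct(rows):
    """Yield each distinct row on its first occurrence (streaming, non-blocking)."""
    seen = set()
    for row in rows:
        if row not in seen:   # O(1) average-case membership test
            seen.add(row)
            yield row         # emit immediately; no need to see all input first

rows = [("Eng", "Dev"), ("Sales", "Mgr"), ("Eng", "Dev"), ("Eng", "Mgr")]
print(list(hash_distinct(rows)))
# → [('Eng', 'Dev'), ('Sales', 'Mgr'), ('Eng', 'Mgr')]
```

Because `hash_distinct` is a generator, a downstream consumer receives the first unique tuple before the input has been fully scanned, which is exactly the pipelining property discussed below.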
This algorithm is remarkably simple and efficient: each tuple is hashed once, the membership check is O(1) on average, and every unique tuple is emitted the moment it is first seen.
-- Query requiring duplicate elimination
SELECT DISTINCT department, job_title
FROM employees
WHERE hire_date > '2020-01-01';

-- Physical execution plan:
-- 1. Table Scan: employees (with filter predicate)
-- 2. Project: (department, job_title)
-- 3. Hash Distinct:
--    - Compute hash(department, job_title) for each row
--    - Check if hash+values exist in hash set
--    - If new: add to set and emit tuple
--    - If duplicate: discard

-- Hash Set Contents (conceptual):
-- { ("Engineering", "Developer"),
--   ("Sales", "Manager"),
--   ("Engineering", "Manager"),
--   ("Marketing", "Analyst"), ... }

Implementation Considerations:
1. Hashing Strategy:
The hash function must hash the entire tuple (all DISTINCT columns). For multi-column DISTINCT, we compute a composite hash:
hash(t) = combine(hash(col1), hash(col2), ..., hash(colN))
Common combination strategies include XOR with rotation, MurmurHash3's finalization, or CRC-based combination.
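A minimal sketch of such a combination in Python, using FNV-1a-style mixing (chosen here for brevity; production engines typically use stronger mixers such as MurmurHash3):

```python
def combine_hashes(*column_hashes):
    """Combine per-column hashes into one composite, order-sensitive hash."""
    h = 0x811C9DC5                            # FNV-1a offset basis (32-bit)
    for ch in column_hashes:
        h ^= ch & 0xFFFFFFFF                  # fold in the column's hash
        h = (h * 0x01000193) & 0xFFFFFFFF     # FNV prime keeps bits well mixed
    return h

def tuple_hash(t):
    """Composite hash over all DISTINCT columns of a tuple."""
    return combine_hashes(*(hash(col) for col in t))
```

Because the mix is order-sensitive, swapping two column values produces a different composite hash, which matters for multi-column DISTINCT.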
2. Hash Collision Handling:
Hash collisions must be handled carefully—two different tuples may hash to the same bucket. We must store the actual tuple values (or a secondary hash) to detect true duplicates vs collisions:
// On hash match, verify actual equality
if hash(t) == existing_hash:
if t == existing_tuple: // actual duplicate
discard t
else: // collision, not duplicate
insert t at next slot
3. Memory Layout:
For efficiency, hash sets typically store the full hash value alongside each tuple (so most non-matches are rejected without a field-by-field comparison) and use compact, cache-friendly bucket arrays rather than pointer-chased chains.
Hash-based DISTINCT has a crucial pipelining advantage: it can emit output tuples as soon as they're identified as new. The first occurrence of each unique tuple is immediately passed to downstream operators. This means the operator is non-blocking—it doesn't need to see all input before producing output. This property enables much lower query latency compared to blocking alternatives.
When the number of distinct tuples exceeds available memory, hash-based duplicate elimination must spill to disk. The strategy mirrors external hash aggregation: partition, then process each partition independently.
Partition-Based External DISTINCT:
Phase 1: Partition Input. Hash every input tuple on the DISTINCT columns and append it to one of k partition files on disk.
Phase 2: Deduplicate Each Partition. Read each partition back and run in-memory hash DISTINCT on it; with enough partitions, each one fits in memory.
Key Insight: Since we partition by hash value, all copies of any given tuple land in the same partition. Deduplicating each partition independently yields correct overall results.
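The two phases can be sketched in Python, with in-memory lists standing in for the temp files a real engine would write (an illustrative simplification, not production code):

```python
def external_hash_distinct(rows, num_partitions=4):
    """Partition by hash, then deduplicate each partition independently."""
    # Phase 1: route every tuple to its partition
    # (each partition would be a temp file on disk in a real engine)
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)

    # Phase 2: each partition holds ALL copies of its tuples, so a
    # per-partition in-memory hash set yields globally correct results
    result = []
    for part in partitions:
        seen = set()
        for row in part:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result
```

Note that correctness does not depend on partition sizes being balanced, only on all copies of a tuple hashing to the same partition.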
EXTERNAL_HASH_DISTINCT(input, memory_budget):
    // Estimate number of partitions needed
    estimated_distinct = estimate_cardinality(input)
    bytes_per_tuple = avg_tuple_size(input)
    num_partitions = ceiling(estimated_distinct * bytes_per_tuple / memory_budget)

    // Phase 1: Partition the input
    partition_files = create_temp_files(num_partitions)
    for each tuple t in input:
        partition_id = hash_partition(t) mod num_partitions
        write t to partition_files[partition_id]
    flush_all_partitions()

    // Phase 2: Deduplicate each partition
    for i = 0 to num_partitions - 1:
        seen = empty hash set
        for each tuple t in partition_files[i]:
            if t not in seen:
                seen.add(t)
                output t
        clear(seen)
        delete_temp_file(partition_files[i])

I/O Cost Analysis:
Let N be the number of input pages and D the number of output pages of distinct tuples:
Phase 1 (Partitioning): read all N input pages and write N pages of partition files, for 2N I/Os.
Phase 2 (Deduplication): read the N partitioned pages back and write D pages of output.
Total I/O: 3N + D page operations. For example, with N = 1,000,000 pages and D = 100,000 distinct pages, that is 3.1 million page I/Os.
Note that unlike aggregation (which computes summaries), duplicate elimination must output the full tuple data, so output size can be significant.
Data skew can cause some partitions to exceed memory even after the initial partitioning. For example, if 50% of tuples share one value in a DISTINCT column, half of all data ends up in one partition. The solution is recursive partitioning: partition the oversized partition again with a different hash function. Each recursion level adds one extra read pass and one extra write pass over the affected partition's pages.
Sort-based duplicate elimination uses a simple but elegant principle: after sorting, all duplicate tuples are adjacent, making elimination trivial.
Algorithm:
SORT_DISTINCT(input_relation):
sorted_input = external_sort(input_relation)
previous = null
for each tuple t in sorted_input:
if t ≠ previous:
output t
previous = t
// Only need to track ONE previous tuple—constant memory for dedup phase
The beauty of this approach is that after sorting, the deduplication scan requires only O(1) memory—we compare each tuple to its predecessor. All the complexity is front-loaded into the sorting phase.
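In Python the same idea looks like this, with the built-in `sorted` standing in for an external sort (a sketch for illustration only):

```python
def sort_distinct(rows):
    """Sort, then drop adjacent duplicates using O(1) extra state."""
    previous = object()            # sentinel that compares unequal to any row
    for row in sorted(rows):       # after sorting, all duplicates are adjacent
        if row != previous:
            yield row
            previous = row

print(list(sort_distinct([3, 1, 2, 3, 1])))
# → [1, 2, 3]
```

The deduplication loop itself touches only one saved tuple at a time; all the memory pressure lives inside the sort.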
I/O Cost Analysis:
The cost is dominated by the external sort: roughly 2N × (number of passes) page I/Os, or about 2N × (1 + ⌈log_{B−1}(N/B)⌉) for N input pages and B buffer pages.
For typical memory sizes and data volumes, a single merge pass suffices. The final merge can stream directly into the deduplication scan, giving about 3N + D page I/Os (read and write during run formation, read during the merge, write D output pages), the same order as hash-based DISTINCT. The choice therefore usually turns on other factors, such as ordering requirements.
When Sort-Based Wins:
-- Example where sort-based DISTINCT is optimal
-- Query needs both DISTINCT and ORDER BY on same columns

SELECT DISTINCT customer_region, customer_segment
FROM customers
WHERE active = true
ORDER BY customer_region, customer_segment;

-- Optimizer recognizes that:
-- 1. DISTINCT requires grouping by (region, segment)
-- 2. ORDER BY also requires sorting by (region, segment)
-- 3. A single sort satisfies BOTH requirements
--
-- Plan: Table Scan → Sort(region, segment) → Streaming Distinct → Output
--
-- The streaming DISTINCT after sort is pure overhead—just compare adjacent rows
-- Much cheaper than: Hash Distinct → Sort for ORDER BY

-- Index-based optimization:
-- If index exists on (customer_region, customer_segment) with filter:
-- Plan: Index Scan → Streaming Distinct → Output (no explicit sort!)

An optimization called run formation with duplicate elimination combines initial run creation with deduplication. During the in-memory sort phase of external sort, we can eliminate duplicates before writing runs. This reduces the data volume for merge phases, especially effective when there are many duplicates. The run becomes a set rather than a bag, reducing I/O for subsequent merges.
Query optimizers must choose between hash-based and sort-based duplicate elimination, weighing multiple factors including memory, data characteristics, and downstream requirements.
Key Decision Factors:
1. Available Memory vs Distinct Cardinality: if the distinct set fits in memory, hash wins with a single O(n) pass; once it must spill, sort-based processing degrades more predictably.
2. Input Order: input that already arrives sorted (or index-ordered) on the DISTINCT columns makes streaming sort-based elimination essentially free.
3. Output Ordering Requirements: if the query also requires ORDER BY on the same columns, one sort satisfies both DISTINCT and the ordering.
4. Parallelism: hash partitioning parallelizes naturally, since each worker can deduplicate a disjoint hash range independently.
| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| Small distinct set, fits in memory | Hash | O(n) scan, no I/O overhead |
| Large distinct set, limited memory | Sort | Predictable spilling behavior |
| Input sorted on DISTINCT keys | Sort-Streaming | Zero additional cost |
| ORDER BY same as DISTINCT | Sort | One sort serves both |
| High duplicate rate (>90%) | Hash with early output | Most tuples filtered early |
| Parallel execution needed | Hash with partitioning | Natural partition-parallel |
| Unknown cardinality | Hash with adaptive spill | Optimistic with fallback |
Cardinality Estimation Importance:
The optimizer's choice heavily depends on estimating how many distinct values exist. This is challenging because distinct counts do not compose simply across columns (column correlations matter), sampling-based estimates are notoriously unreliable for distinct-value counts, and multi-column DISTINCT has no single per-column statistic that answers it directly.
Statistics Used: per-column distinct-value counts (e.g., PostgreSQL's n_distinct), histograms, and increasingly HyperLogLog-style sketches maintained during statistics collection.
If the optimizer underestimates distinct count, hash DISTINCT may unexpectedly spill to disk, causing severe performance degradation. If it overestimates, it might choose more expensive sort-based DISTINCT unnecessarily. Modern systems use adaptive query processing to switch strategies mid-execution when estimates prove wrong.
Beyond basic algorithms, database systems employ sophisticated optimizations to improve duplicate elimination performance.
1. Projection Pushdown:
Before duplicate elimination, project only the columns involved in DISTINCT. This reduces tuple width, allowing more tuples in memory and less I/O:
-- Original
SELECT DISTINCT a, b FROM table WHERE ...;
-- Optimized plan:
-- Scan → Project(a, b) → DISTINCT
-- Not: Scan (all columns) → DISTINCT → Project
2. Predicate Pushdown:
Apply filters before DISTINCT to reduce input cardinality. Fewer input tuples means less work for deduplication:
-- Push filter before DISTINCT
SELECT DISTINCT customer_id FROM orders WHERE amount > 100;
-- Plan: Scan(orders WHERE amount>100) → Project(customer_id) → DISTINCT
-- Optimization: DISTINCT elimination when result is already unique

-- Example 1: Primary key guarantees uniqueness
SELECT DISTINCT customer_id  -- DISTINCT is redundant!
FROM customers
WHERE customer_id = 123;

-- Optimizer can remove DISTINCT: primary key constraint
-- guarantees at most one row

-- Example 2: Unique index also guarantees uniqueness
SELECT DISTINCT email  -- DISTINCT redundant if email is unique
FROM users
WHERE email = 'test@example.com';

-- Example 3: GROUP BY already produces unique output
SELECT DISTINCT region, SUM(sales)  -- DISTINCT redundant
FROM orders
GROUP BY region;  -- GROUP BY ensures unique regions

-- These are all valid DISTINCT elimination rewrites
-- that avoid the cost of the duplicate elimination operator entirely

3. Early Duplicate Elimination:
In multi-stage pipelines, eliminating duplicates early can dramatically reduce data flow:
Query: SELECT DISTINCT a FROM (SELECT a FROM t1 UNION ALL SELECT a FROM t2)
-- Naive: Full union, then single DISTINCT
-- Better: DISTINCT on each source, then DISTINCT on combined result
-- - Reduces data volume flowing through union
-- - Each partial DISTINCT may be smaller and fit in memory
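The effect of the rewrite is easy to see in a few lines of Python (a toy model of the plan, not engine code): deduplicating each source first shrinks what flows through the union stage.

```python
from itertools import chain

def distinct(rows):
    """Streaming duplicate elimination, as in hash-based DISTINCT."""
    seen = set()
    for r in rows:
        if r not in seen:
            seen.add(r)
            yield r

t1 = [1, 1, 2, 2, 3]          # 5 rows
t2 = [2, 3, 3, 4]             # 4 rows
# Naive: one DISTINCT over the full 9-row UNION ALL.
# Better: DISTINCT each source (3 + 3 rows), then DISTINCT the combination.
early = list(distinct(chain(distinct(t1), distinct(t2))))
print(early)
# → [1, 2, 3, 4]
```

The combined stream carries 6 rows instead of 9 here; with highly duplicated sources the savings can be far larger.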
4. Lazy (Late) Elimination:
Conversely, sometimes deferring DISTINCT is beneficial: a later join, GROUP BY, or semi-join may eliminate the duplicates anyway, or a downstream filter may shrink the stream so much that deduplicating the smaller result is cheaper than deduplicating early.
5. Approximate DISTINCT with HyperLogLog:
For analytical queries where exactness isn't required:
-- Syntax varies by system: Spark SQL and Snowflake expose approx_count_distinct;
-- PostgreSQL offers HyperLogLog via the postgresql-hll extension
SELECT approx_count_distinct(user_id)
FROM events
WHERE event_date > '2024-01-01';
-- Uses HyperLogLog: fixed kilobyte-scale memory, typically ~1-2% error
-- Can process billions of rows with kilobytes of state
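For intuition, here is a bare-bones HyperLogLog estimator in Python. It omits the small-range and large-range corrections that real implementations apply, so treat it purely as a sketch of the mechanism, not a production-quality sketch library:

```python
import hashlib

def hll_estimate(values, p=12):
    """Estimate distinct count with 2**p small registers (~4 KB of state)."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                        # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)           # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1    # position of the first 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With p = 12 the standard error is about 1.04/√4096 ≈ 1.6%, yet the state stays a few kilobytes no matter how many rows are processed.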
A subtle optimization: SELECT DISTINCT a FROM t is equivalent to SELECT DISTINCT a FROM t WHERE EXISTS (SELECT 1 FROM t t2 WHERE t2.a = t.a). This algebraic equivalence sometimes enables join-based rewrites that can exploit indexes, particularly in correlated subquery contexts.
In distributed and parallel database systems, duplicate elimination requires coordinating across multiple nodes. The challenge is ensuring that duplicate tuples on different nodes are correctly identified and reduced to one.
Partition-Based Distributed DISTINCT:
The standard approach uses hash partitioning:
1. Hash-partition by DISTINCT columns: each tuple is sent to the node determined by hash(DISTINCT columns), so all copies of any tuple land on the same node.
2. Local DISTINCT on each node: each node performs in-memory or external DISTINCT on its partition.
3. Output union: results from all nodes form the complete answer; no further dedup is needed, since partitioning guarantees no cross-node duplicates.
DISTRIBUTED_DISTINCT(input_partitions, distinct_columns):
    // Step 1: Redistribute by hash of DISTINCT columns
    for each node N:
        for each local tuple t:
            target_node = hash(t[distinct_columns]) mod num_nodes
            send t to target_node
    barrier()  // Wait for all shuffling to complete

    // Step 2: Local duplicate elimination
    for each node N in parallel:
        local_result = HASH_DISTINCT(received_tuples)
        output local_result

    // No global coordination needed—each node has
    // complete data for its hash partitions

-- Example: 3-node cluster processing DISTINCT on customer_id
-- Node 1: hash(customer_id) mod 3 == 0
-- Node 2: hash(customer_id) mod 3 == 1
-- Node 3: hash(customer_id) mod 3 == 2
-- All rows for customer_id=12345 go to same node

Optimization: Partial Distinct Before Shuffle
Network transfer is typically the bottleneck in distributed systems. We can reduce data movement by performing local duplicate elimination before reshuffling:
1. Each node: Local DISTINCT on its original partition
2. Shuffle: Send locally-distinct tuples to hash-determined nodes
3. Each node: Final DISTINCT on received tuples
This adds CPU work but can dramatically reduce network I/O when local data has high duplication.
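The three steps can be simulated in Python, with lists of per-node rows standing in for actual nodes and the network (names and structure are illustrative only):

```python
def distributed_distinct(node_data, num_nodes):
    """Local dedup, hash shuffle, then final dedup on each receiving node."""
    # Steps 1-2: each node deduplicates locally, then "sends" each surviving
    # tuple to the node that owns its hash partition
    inboxes = [[] for _ in range(num_nodes)]
    for local_rows in node_data:
        for row in set(local_rows):                      # local DISTINCT pre-shuffle
            inboxes[hash(row) % num_nodes].append(row)   # shuffle to target node
    # Step 3: final DISTINCT per node; hash partitions are disjoint, so the
    # union of the per-node results is the complete global answer
    return [set(inbox) for inbox in inboxes]

parts = distributed_distinct([[1, 1, 2], [2, 3, 3], [1, 4]], num_nodes=2)
print(sorted(set().union(*parts)))
# → [1, 2, 3, 4]
```

In this toy run, the three source nodes hold 8 rows but shuffle only 7 after local dedup; with heavier duplication the network savings grow accordingly.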
Broadcast Optimization:
For small distinct sets, broadcasting may beat partitioning: each node computes its local distinct set and sends the (small) result to a single coordinator, which performs one final in-memory DISTINCT.
This avoids the overhead of hash partitioning for trivially small result sets.
A powerful distributed optimization uses Bloom filters for early duplicate detection: each node builds a compact Bloom filter over the keys it has already emitted and shares it with its peers. A tuple whose key misses every filter is definitely new; a hit flags a probable duplicate that is dropped from the shuffle and verified exactly at its target node.
This reduces network traffic at the cost of some false positives requiring verification.
Duplicate elimination is a fundamental operation that transforms multisets into sets. Let's consolidate the key concepts: hash-based DISTINCT streams unique tuples in a single O(n) pass and spills by partitioning; sort-based DISTINCT piggybacks on external sort, leaving only a constant-memory adjacent-compare scan; the optimizer chooses between them based on memory, cardinality estimates, input order, and ORDER BY requirements; and distributed DISTINCT hash-partitions the data so each node can deduplicate its share independently.
What's Next:
Duplicate elimination is one of several set-oriented operations. The next page explores set operations (UNION, INTERSECT, EXCEPT) and their physical implementations, building on the hashing and sorting techniques we've covered.
You now understand how duplicate elimination is physically implemented in database systems, from simple in-memory hash sets to sophisticated distributed algorithms. This knowledge applies to DISTINCT queries, UNION operations, and many internal database operations. Next, we'll explore the closely related set operations.