You run a simple query to find all departments in your company and get the result: Engineering, Marketing, Engineering, HR, Engineering, Sales. Wait—why does Engineering appear three times? Is this a bug, or expected behavior?
The presence of duplicates in query results is one of the most misunderstood aspects of database systems. It sits at the intersection of theoretical foundations (where relations are sets with no duplicates) and practical implementation (where SQL tables are multisets that allow duplicates). Understanding when duplicates occur, when they matter, and how to handle them correctly is essential for database professionals.
Duplicate elimination isn't just about adding DISTINCT to make results "look right." It's about understanding data semantics, query correctness, and performance trade-offs. Mishandling duplicates can cause incorrect aggregations, inefficient queries, and subtle bugs that only manifest at scale.
By the end of this page, you will understand the theoretical foundations of set versus bag semantics, know when SQL queries produce duplicates and why, master the DISTINCT keyword and its alternatives, comprehend the performance implications of duplicate elimination, and develop strategies for handling duplicates correctly in complex queries.
The fundamental question of duplicates comes down to a choice in data model semantics: should relations behave like sets (no duplicates) or bags/multisets (duplicates allowed)?
Set Semantics (Relational Algebra):
In E.F. Codd's original relational model, a relation is a set of tuples, so the same tuple cannot appear twice:
Mathematical Foundation:
A relation R over schema S is defined as: $$R \subseteq D_1 \times D_2 \times \cdots \times D_n$$
As a subset of a Cartesian product, R is a set—and sets have no duplicates.
Bag Semantics (SQL):
SQL deliberately chose bag (multiset) semantics: a query result may contain the same row more than once, and duplicates are eliminated only on request (via DISTINCT, UNION, and related constructs).
SQL's bag semantics was a pragmatic choice: eliminating duplicates after every operation would force a costly sort or hash step, and aggregates such as SUM and AVG need duplicate values to compute correct totals.
This choice trades theoretical purity for practical efficiency.
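The contrast is easy to see in a scripting session. A minimal sketch using Python's built-in sqlite3 module and a made-up two-column Employee table (the data is illustrative):

```python
import sqlite3

# SQL tables are bags: inserting the same row twice is allowed
# (there is no key constraint here), unlike a mathematical set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ada", "Engineering"), ("Ada", "Engineering")],  # exact duplicate row
)

bag = conn.execute("SELECT department FROM Employee").fetchall()
print(bag)  # [('Engineering',), ('Engineering',)] -- duplicates kept

# DISTINCT recovers set semantics on demand
setlike = conn.execute("SELECT DISTINCT department FROM Employee").fetchall()
print(setlike)  # [('Engineering',)]
```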
Understanding when duplicates appear helps you write correct queries and avoid unnecessary DISTINCT clauses.
Source 1: Projection of Non-Key Columns
The most common source of duplicates is projecting columns that don't form a key:
-- Each employee has unique emp_id, but multiple employees share departments
SELECT department FROM Employee; -- Duplicates likely
SELECT emp_id FROM Employee; -- No duplicates (primary key)
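To make this concrete, here is a small sqlite3 sketch reproducing the opening example; the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (emp_id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(1, "Engineering"), (2, "Marketing"), (3, "Engineering"),
     (4, "HR"), (5, "Engineering"), (6, "Sales")],
)

# Projecting a non-key column keeps one row per employee -> duplicates
depts = [r[0] for r in conn.execute("SELECT department FROM Employee")]
print(depts.count("Engineering"))  # 3

# Projecting the primary key cannot produce duplicates
ids = [r[0] for r in conn.execute("SELECT emp_id FROM Employee")]
print(len(ids) == len(set(ids)))  # True
```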
Source 2: Joins That Multiply Rows
One-to-many and many-to-many joins can produce duplicate values:
-- One customer can have many orders
SELECT c.customer_name
FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- Customer names repeat for each order
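A quick sqlite3 sketch of the one-to-many case, with hypothetical Customer and Orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 1);  -- Acme has 3 orders
""")

# The one-to-many join repeats 'Acme' once per matching order;
# Globex, with no orders, drops out of the inner join entirely.
names = [r[0] for r in conn.execute(
    "SELECT c.customer_name FROM Customer c "
    "JOIN Orders o ON c.id = o.customer_id")]
print(names)  # ['Acme', 'Acme', 'Acme']
```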
Source 3: UNION ALL
UNION ALL preserves duplicates; UNION eliminates them:
-- UNION ALL keeps all rows from both queries
SELECT name FROM Active_Employees
UNION ALL
SELECT name FROM Former_Employees;
-- If someone appears in both tables, they appear twice
| Scenario | Duplicates? | Reason |
|---|---|---|
| SELECT * FROM table | No (if properly designed) | Primary key ensures unique rows |
| SELECT primary_key FROM table | No | Primary key values are unique |
| SELECT non_key_columns FROM table | Possible | Multiple rows may have same values |
| SELECT cols FROM t1 JOIN t2 | Likely | Join can multiply rows |
| UNION ALL | Preserved | Explicitly combines without dedup |
| UNION | Eliminated | Deduplication is built-in |
| GROUP BY results | No | Grouping creates unique key per group |
| Subquery results | Depends | Based on subquery structure |
If you're surprised by duplicates in your results, don't immediately add DISTINCT. Investigate why duplicates appeared:
• Missing join condition creating a Cartesian product?
• Join to the wrong table or wrong column?
• A fundamental data model misunderstanding?
DISTINCT can mask bugs. Understand the cause before treating the symptom.
The DISTINCT keyword is SQL's mechanism for converting bag results to set results. Understanding its precise semantics prevents common mistakes.
Basic Syntax:
SELECT DISTINCT column1, column2, ...
FROM table;
Key Semantic: DISTINCT Applies to Entire Row
DISTINCT eliminates duplicate rows, not duplicate values in one column:
-- WRONG mental model: "unique values in each column independently"
-- RIGHT mental model: "unique rows (tuples)"
-- One column: unique departments
SELECT DISTINCT department FROM Employee;
-- Two columns: something different
SELECT DISTINCT department, status FROM Employee;
-- Eliminates duplicate (department, status) pairs,
-- NOT duplicate departments and duplicate statuses separately.
-- A department can still repeat, paired with different statuses.
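The row-level behavior is easy to verify; a sketch with an invented (department, status) table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (department TEXT, status TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [
    ("Engineering", "active"),
    ("Engineering", "active"),    # duplicate pair -> collapsed
    ("Engineering", "on_leave"),  # same department, new pair -> kept
])

rows = conn.execute(
    "SELECT DISTINCT department, status FROM Employee").fetchall()
# Two rows survive: Engineering appears twice because the *pairs* differ
print(sorted(rows))  # [('Engineering', 'active'), ('Engineering', 'on_leave')]
```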
DISTINCT and NULL:
For DISTINCT purposes, NULL values are treated as not distinct from one another:
-- Table with NULLs:
SELECT manager_id FROM Employee; -- NULL, 1, NULL, 2, 1
SELECT DISTINCT manager_id FROM Employee; -- NULL, 1, 2
-- Multiple NULLs collapse to single NULL in DISTINCT output
This is a departure from NULL's usual behavior (where NULL ≠ NULL in comparisons).
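A short sqlite3 check of both behaviors, using the manager_id values from the comment above (NULL, 1, NULL, 2, 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (manager_id INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?)",
                 [(None,), (1,), (None,), (2,), (1,)])

distinct = conn.execute(
    "SELECT DISTINCT manager_id FROM Employee").fetchall()
print(len(distinct))        # 3: NULL, 1, 2 -- the two NULLs collapsed
print((None,) in distinct)  # True

# Yet in ordinary comparisons NULL = NULL is UNKNOWN, not true
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
print(null_eq)  # None
```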
ALL Keyword (Opposite of DISTINCT):
ALL is the default and explicit opposite of DISTINCT:
-- These are equivalent:
SELECT department FROM Employee;
SELECT ALL department FROM Employee;
ALL is rarely used explicitly but appears in some SQL standards discussions.
DISTINCT in Aggregate Functions:
DISTINCT can appear inside aggregate functions:
-- Count unique departments
SELECT COUNT(DISTINCT department) FROM Employee; -- Returns 3
-- Count all department occurrences
SELECT COUNT(department) FROM Employee; -- Returns 5
-- Sum unique salaries (unusual but valid)
SELECT SUM(DISTINCT salary) FROM Employee;
• COUNT(*) counts all rows, including rows with NULLs
• COUNT(column) counts non-NULL values in the column (duplicates included)
• COUNT(DISTINCT column) counts unique non-NULL values
For 100 rows where the column has 80 non-NULL values and 20 unique values:
• COUNT(*) = 100
• COUNT(column) = 80
• COUNT(DISTINCT column) = 20
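The three COUNT variants can be verified with a toy table engineered to contain both a NULL and duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (department TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?)", [
    ("Engineering",), ("Marketing",), ("Engineering",),
    (None,), ("Engineering",),
])

# 5 rows total; 4 non-NULL departments; 2 unique departments
star, col, dcol = conn.execute(
    "SELECT COUNT(*), COUNT(department), COUNT(DISTINCT department) "
    "FROM Employee").fetchone()
print(star, col, dcol)  # 5 4 2
```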
Duplicate elimination is not free. Understanding its cost helps you make informed decisions about when DISTINCT is truly necessary.
Algorithmic Cost of Deduplication:
| Algorithm | Time Complexity | Space Complexity | Output Order |
|---|---|---|---|
| Sort-based | O(n log n) | O(n) | Sorted |
| Hash-based | O(n) expected | O(d) where d=distinct | Unordered |
| Index-based | O(n) or O(d) | Minimal | Index order |
When Sort-Based Is Chosen: when the result must be ordered anyway (DISTINCT combined with ORDER BY), when the input arrives already sorted (e.g., from an index scan), or when memory is too limited to hold a hash table of distinct values.
When Hash-Based Is Chosen: when the distinct values fit in memory and no particular output order is required; this avoids the O(n log n) sort at the cost of unordered output.
| Query Type | Without DISTINCT | With DISTINCT | Overhead |
|---|---|---|---|
| 1M rows, 100 distinct values | ~100ms scan | ~200ms scan + hash | ~2x |
| 1M rows, 900K distinct values | ~100ms scan | ~1500ms scan + hash/sort | ~15x |
| Already unique (key column) | ~100ms scan | ~200ms (wasted work) | ~2x (unnecessary!) |
| With covering index | ~100ms index scan | ~120ms index scan + dedup | ~1.2x |
| Complex join, 100M intermediate | ~5s join | ~15s join + sort | ~3x |
Optimization Opportunities:
1. Eliminate Redundant DISTINCT:
If projecting columns that include a key, DISTINCT is unnecessary:
-- Unnecessary DISTINCT (emp_id is primary key)
SELECT DISTINCT emp_id, name, department FROM Employee;
-- Equivalent and faster:
SELECT emp_id, name, department FROM Employee;
Some optimizers detect and eliminate this, but not all.
2. Push DISTINCT Into Subquery:
When joining to a table where only distinct keys matter:
-- Suboptimal: dedup after join
SELECT DISTINCT c.name
FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- Better: dedup before join (if applicable)
SELECT c.name
FROM Customer c
WHERE c.id IN (SELECT DISTINCT customer_id FROM Orders);
-- Or use EXISTS for potentially better optimization
3. Use EXISTS Instead of DISTINCT:
When checking for existence of related rows:
-- May require DISTINCT depending on join
SELECT DISTINCT c.name FROM Customer c
JOIN Orders o ON c.id = o.customer_id
WHERE o.amount > 100;
-- EXISTS avoids join multiplication
SELECT c.name FROM Customer c
WHERE EXISTS (
SELECT 1 FROM Orders o
WHERE o.customer_id = c.id AND o.amount > 100
);
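Both formulations return the same customer set. A sketch with hypothetical data in which only one customer has an order over 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO Orders VALUES (10, 1, 250), (11, 1, 400), (12, 2, 50);
""")

# Join + DISTINCT: Acme matches twice, then duplicates are removed
with_distinct = conn.execute("""
    SELECT DISTINCT c.name FROM Customer c
    JOIN Orders o ON c.id = o.customer_id
    WHERE o.amount > 100""").fetchall()

# EXISTS: each customer is emitted at most once, no dedup step needed
with_exists = conn.execute("""
    SELECT c.name FROM Customer c
    WHERE EXISTS (SELECT 1 FROM Orders o
                  WHERE o.customer_id = c.id AND o.amount > 100)""").fetchall()

print(sorted(with_distinct) == sorted(with_exists))  # True: same rows
print(sorted(with_exists))  # [('Acme',)]
```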
Adding DISTINCT doesn't reduce I/O—the database must still read all source rows to check for duplicates. It adds CPU work and memory usage for deduplication. Only add DISTINCT when semantically necessary, not as a "safety net" or to fix query bugs.
Sometimes the goal of eliminating duplicates can be achieved more efficiently through query restructuring rather than brute-force DISTINCT.
Alternative 1: GROUP BY
GROUP BY inherently produces unique combinations:
-- Using DISTINCT
SELECT DISTINCT department FROM Employee;
-- Using GROUP BY (equivalent result)
SELECT department FROM Employee GROUP BY department;
When to prefer GROUP BY: when you also need per-group aggregates (counts, sums) alongside the unique values; for a plain column list, most optimizers compile GROUP BY and DISTINCT to the same plan.
Alternative 2: EXISTS Instead of JOIN
When testing for existence without needing data from joined table:
-- JOIN produces duplicates requiring DISTINCT
SELECT DISTINCT c.* FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- EXISTS naturally produces unique rows
SELECT c.* FROM Customer c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.id);
Alternative 3: Proper JOIN Conditions
Sometimes duplicates indicate a join problem:
-- Missing join condition creates Cartesian product
SELECT DISTINCT e.name, d.location
FROM Employee e, Department d; -- Every employee × every department!
-- Proper join prevents duplicates naturally
SELECT e.name, d.location
FROM Employee e
JOIN Department d ON e.dept_id = d.id;
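Counting rows makes the difference obvious. With 3 employees and 3 departments (invented data), the comma join yields 9 rows while the keyed join yields 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (name TEXT, dept_id INTEGER);
CREATE TABLE Department (id INTEGER PRIMARY KEY, location TEXT);
INSERT INTO Employee VALUES ('Ada', 1), ('Lin', 2), ('Sam', 1);
INSERT INTO Department VALUES (1, 'NYC'), (2, 'SF'), (3, 'Austin');
""")

# No join condition: every employee paired with every department
cartesian = conn.execute(
    "SELECT e.name, d.location FROM Employee e, Department d").fetchall()

# Proper join condition: one row per employee
proper = conn.execute(
    "SELECT e.name, d.location FROM Employee e "
    "JOIN Department d ON e.dept_id = d.id").fetchall()

print(len(cartesian), len(proper))  # 9 3
```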
Alternative 4: Window Functions with ROW_NUMBER
When you need one row per group with specific selection logic:
-- Get the most recent order per customer
-- DISTINCT can't express "which duplicate to keep"
WITH RankedOrders AS (
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM Orders
)
SELECT customer_id, order_date, amount
FROM RankedOrders
WHERE rn = 1;
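The CTE above runs as-is on SQLite 3.25+ (the version that added window functions, bundled with all recent Python builds); a sketch with invented order data:

```python
import sqlite3  # requires SQLite >= 3.25 for ROW_NUMBER() OVER (...)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (customer_id INTEGER, order_date TEXT, amount REAL);
INSERT INTO Orders VALUES
  (1, '2024-01-05', 100), (1, '2024-03-20', 250),
  (2, '2024-02-11', 75);
""")

# Keep only the most recent order per customer (rn = 1)
latest = conn.execute("""
    WITH RankedOrders AS (
      SELECT customer_id, order_date, amount,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY order_date DESC) AS rn
      FROM Orders)
    SELECT customer_id, order_date, amount
    FROM RankedOrders WHERE rn = 1
    ORDER BY customer_id""").fetchall()

print(latest)  # [(1, '2024-03-20', 250.0), (2, '2024-02-11', 75.0)]
```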
Alternative 5: UNION Instead of UNION ALL
When combining queries that might overlap:
-- UNION ALL preserves duplicates
SELECT name FROM Active_Employees
UNION ALL
SELECT name FROM Contractors; -- Same person in both?
-- UNION eliminates duplicates automatically
SELECT name FROM Active_Employees
UNION
SELECT name FROM Contractors; -- Each name appears once
• If checking existence: use EXISTS (stops at the first match)
• If you need one representative row per group: use window functions
• If combining sets that might overlap: use UNION
• If you need counts or aggregates per group: use GROUP BY
• If none of the above apply and duplicates are genuine: use DISTINCT
SQL's set operations (UNION, INTERSECT, EXCEPT) have both duplicate-eliminating and duplicate-preserving variants. Understanding these is essential for correct query construction.
UNION vs UNION ALL:
| Operation | Duplicates Within Input | Duplicates Between Inputs |
|---|---|---|
| UNION | Eliminated | Eliminated |
| UNION ALL | Preserved | Preserved |
-- UNION: eliminates all duplicates
SELECT name FROM Employees
UNION
SELECT name FROM Contractors;
-- Each unique name appears exactly once
-- UNION ALL: preserves all rows
SELECT name FROM Employees
UNION ALL
SELECT name FROM Contractors;
-- If "John" appears twice in Employees and once in Contractors,
-- result contains "John" three times
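A runnable version of the "John" scenario, assuming John appears twice among employees and once among contractors:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (name TEXT);
CREATE TABLE Contractors (name TEXT);
INSERT INTO Employees VALUES ('John'), ('John'), ('Mei');
INSERT INTO Contractors VALUES ('John');
""")

union_all = conn.execute(
    "SELECT name FROM Employees UNION ALL SELECT name FROM Contractors"
).fetchall()
union = conn.execute(
    "SELECT name FROM Employees UNION SELECT name FROM Contractors"
).fetchall()

print(len(union_all))  # 4 -- every input row kept ('John' three times)
print(sorted(union))   # [('John',), ('Mei',)] -- all duplicates eliminated
```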
Performance Implications:
| Operation | Time Complexity | Notes |
|---|---|---|
| UNION ALL | O(n + m) | Just append, no processing |
| UNION | O((n+m) log(n+m)) | Requires sort/hash for dedup |
| INTERSECT | O(n log n + m log m) | Sort both then merge |
| EXCEPT | O(n log n + m log m) | Sort both then exclude |
Using the Right Operation:
-- Use UNION ALL when:
-- 1. You know inputs are disjoint
SELECT id FROM Sales_2023 UNION ALL SELECT id FROM Sales_2024;
-- Different years, no overlap possible
-- 2. You're aggregating afterward
SELECT SUM(amount)
FROM (
SELECT amount FROM Sales_East
UNION ALL
SELECT amount FROM Sales_West
) combined;
-- Duplicates are intentional - each sale should count
-- Use UNION when:
-- 1. You need unique elements
SELECT DISTINCT customer_id FROM Sales_2023
UNION
SELECT DISTINCT customer_id FROM Sales_2024;
-- Customer in both years should appear once
-- 2. Inputs might overlap
SELECT email FROM VIP_Customers
UNION
SELECT email FROM Newsletter_Subscribers;
-- Same email shouldn't appear twice
Unlike SELECT (which preserves duplicates by default), UNION, INTERSECT, and EXCEPT eliminate duplicates by default! This is because they're defined as set operations from relational algebra. Use the ALL variants when you need multiset behavior.
The guidelines that follow from this understanding are consistent throughout: investigate the cause of duplicates before suppressing them, prefer restructuring (EXISTS, GROUP BY, proper join conditions) over reflexive DISTINCT, and reserve explicit deduplication for cases where duplicates are a genuine property of the data.
Duplicate handling is a nuanced topic that bridges theoretical foundations and practical implementation. Mastery requires understanding both perspectives.
What's Next:
With Selection, Projection, and duplicate handling mastered individually, we now bring them together. The next page explores how to combine Selection and Projection operations effectively—understanding operation ordering, query optimization opportunities, and constructing complex queries from these fundamental building blocks.
You now understand duplicate handling deeply—from theoretical set semantics through SQL's practical bag semantics to optimization strategies. This nuanced understanding helps you write correct, efficient queries and avoid common pitfalls. Next, we'll combine Selection and Projection into powerful composite queries.