When you execute a SQL query, the database engine processes your request and returns a result set—a collection of rows that satisfy your specified conditions. However, one of the most common surprises for SQL practitioners is discovering that result sets often contain duplicate rows: multiple rows with identical values across all selected columns.
This phenomenon isn't a bug or database malfunction. It's a fundamental consequence of how relational databases work and how queries are structured. Understanding why duplicates occur—and how to systematically eliminate them—is an essential skill for anyone working with SQL.
Consider a simple scenario: you're querying an orders table to find all the cities where your company has customers. If you have 10,000 orders and only 50 unique cities, a naive query returns 10,000 rows, with massive repetition. This isn't what you want, and it can cause significant problems in reporting, application logic, and data transfer.
By the end of this page, you will understand: (1) Why duplicate rows appear in SQL result sets, (2) The difference between SQL semantics and set theory, (3) How database operations like projection and joins create duplicates, (4) The impact of duplicates on correctness and performance, and (5) Foundational strategies for duplicate elimination beyond DISTINCT.
To understand why duplicates occur, we must first recognize a fundamental distinction between mathematical set theory and SQL's actual behavior.
Set Theory vs. Multisets:
In pure mathematics, a set is a collection of distinct elements. By definition, sets cannot contain duplicates—{A, B, C} and {A, A, B, C} are identical sets because duplicates are automatically eliminated.
However, SQL does not operate on true mathematical sets. SQL operates on multisets (also called bags)—collections that can contain duplicate elements. The multiset {A, A, B, C} has four elements, and the two A's are counted separately.
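The distinction can be sketched in a few lines of Python: a `set` models set semantics, while `collections.Counter` serves as a stand-in for a multiset (this is an illustration, not anything SQL-specific):

```python
from collections import Counter

# Set semantics: duplicates collapse automatically
assert {"A", "A", "B", "C"} == {"A", "B", "C"}
assert len({"A", "A", "B", "C"}) == 3

# Multiset (bag) semantics: duplicates are preserved and counted
bag = Counter(["A", "A", "B", "C"])
assert sum(bag.values()) == 4   # four elements, like a plain SELECT
assert len(bag) == 3            # three unique values, like SELECT DISTINCT
```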
This distinction is critical because the relational model, as originally defined by E.F. Codd, was based on set semantics. But practical SQL implementations adopted multiset semantics for compelling reasons.
When SQL was standardized in the 1980s, the decision to use multiset semantics was controversial. Purists argued it violated relational theory. Pragmatists countered that automatic duplicate elimination was too expensive for large datasets on the hardware of that era. The pragmatists won, and multiset semantics remain the SQL default today.
| Aspect | Set Semantics | Multiset Semantics (SQL Default) |
|---|---|---|
| Duplicate handling | Automatically eliminated | Preserved unless DISTINCT specified |
| Element count | {A, A, B} has 2 elements | {A, A, B} has 3 elements |
| Performance | Requires deduplication overhead | No overhead for duplicate removal |
| Use case | Mathematical operations | Real-world data processing |
| SQL implementation | Requires explicit DISTINCT | Default SELECT behavior |
Understanding the specific operations that create duplicates helps you predict when they'll occur and handle them appropriately. Let's examine the primary sources of duplicate rows in SQL result sets.
The most common source of duplicates is projection—selecting a subset of columns from a table. When you select columns that don't include the primary key or a unique column combination, rows that were distinct in the full table may become identical in the projection.
Example Scenario:
Consider an employees table with columns: employee_id (PK), first_name, department, salary, hire_date. Suppose multiple employees share the same department.
```sql
-- Full table has no duplicates (employee_id is unique)
SELECT employee_id, first_name, department
FROM employees;
-- Result: 1000 rows, all unique

-- Projection onto the department column creates duplicates
SELECT department
FROM employees;
-- Result: 1000 rows, but only 10 unique departments
-- Each department appears ~100 times on average

-- Example output:
-- department
-- -----------
-- Engineering
-- Engineering
-- Engineering
-- Sales
-- Engineering
-- Marketing
-- Sales
-- ...
```

When you join tables, each matching pair of rows from the joined tables produces one result row. If a row in one table matches multiple rows in another, the result contains multiple copies of that row's data.
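The projection effect can be reproduced end to end with an in-memory SQLite database (a minimal sketch using Python's standard sqlite3 module and a made-up five-row table):

```python
import sqlite3

# Hypothetical employees table, scaled down to five rows for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany(
    "INSERT INTO employees (employee_id, department) VALUES (?, ?)",
    [(1, "Engineering"), (2, "Engineering"), (3, "Sales"), (4, "Marketing"), (5, "Sales")],
)

# Full rows are unique (the primary key guarantees it) ...
rows = conn.execute("SELECT employee_id, department FROM employees").fetchall()
print(len(rows))  # 5

# ... but projecting onto department alone produces duplicates
depts = [r[0] for r in conn.execute("SELECT department FROM employees")]
print(depts)  # ['Engineering', 'Engineering', 'Sales', 'Marketing', 'Sales']

distinct = sorted(r[0] for r in conn.execute("SELECT DISTINCT department FROM employees"))
print(distinct)  # ['Engineering', 'Marketing', 'Sales']
```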
One-to-Many Join Expansion:
```sql
-- Customers table: 100 customers
-- Orders table: 5,000 orders (avg 50 orders per customer)

-- Join produces customer data repeated for each order
SELECT c.customer_name, c.city
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Result: 5,000 rows
-- Each customer_name appears as many times as they have orders
-- Customer "John Smith" with 75 orders appears 75 times

-- Many-to-many join (extreme multiplication)
SELECT p.product_name, c.category_name
FROM products p
JOIN product_categories pc ON p.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id;
-- If products average 3 categories each, and the base table
-- has 1,000 products, the result has ~3,000 rows with significant
-- duplication in both product_name and category_name columns
```

The UNION ALL operator concatenates result sets without removing duplicates. If the same data exists in multiple source tables, or the same row satisfies multiple subqueries, duplicates accumulate.
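One-to-many join expansion is easy to demonstrate with a throwaway SQLite session (a sketch with invented tables and data, using only the Python standard library):

```python
import sqlite3

# Hypothetical tables: two customers, one with three orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'John Smith'), (2, 'Ann Lee')")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)", [(1,), (1,), (1,), (2,)])

# The join repeats each customer's data once per matching order
rows = conn.execute(
    "SELECT c.name FROM customers c JOIN orders o ON c.customer_id = o.customer_id"
).fetchall()
print(len(rows))  # 4 result rows: one per order, not one per customer

names = [r[0] for r in rows]
print(names.count("John Smith"))  # 3: repeated once per order
```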
```sql
-- Combining data from regional tables
SELECT customer_name, email FROM customers_north
UNION ALL
SELECT customer_name, email FROM customers_south
UNION ALL
SELECT customer_name, email FROM customers_archived;

-- If the same customer exists in multiple regions (a data quality issue)
-- or appears in both active and archived tables, they appear multiple times

-- Result might show:
-- customer_name  | email
-- ---------------|------------------------
-- Alice Johnson  | alice@example.com
-- Alice Johnson  | alice@example.com   -- duplicate from a different source
-- Bob Smith      | bob@example.com
```

Subqueries, especially in the FROM clause, can produce duplicates when the subquery itself returns non-unique rows, or when correlation with outer query rows causes repetition.
```sql
-- Subquery in FROM clause with duplicates
SELECT sq.category
FROM (
  SELECT category, product_name
  FROM products
  WHERE price > 100
) sq;
-- The subquery returns many products per category
-- Selecting only category from it creates duplicates

-- Lateral join creating row multiplication
SELECT o.order_id,
       (SELECT product_name FROM order_items oi
        WHERE oi.order_id = o.order_id LIMIT 1) as sample_product
FROM orders o
CROSS JOIN LATERAL (
  SELECT * FROM order_items oi
  WHERE oi.order_id = o.order_id
) items;
-- Each order appears once per order item
```

Duplicates aren't always obvious. A query might return 50,000 rows that you assume are unique because you never checked. Always examine result counts and consider whether your column selection could produce duplicates—especially when building reports or feeding data to downstream systems.
Duplicate rows aren't merely aesthetic annoyances—they can cause serious problems in data processing, analysis, and application behavior. Understanding these impacts helps you recognize when duplicate elimination is essential versus optional.
Duplicates are especially dangerous when they feed aggregate functions:

- COUNT(*) counts all rows, duplicates included. If you want to count unique customers but have duplicated customer rows, your count will be inflated—potentially by orders of magnitude.
- SUM() adds duplicate values multiple times. If a $1,000 transaction appears three times due to join multiplication, your revenue report shows $3,000 instead of $1,000.
- AVG() weights duplicated values more heavily. If high-value items are duplicated more often (e.g., premium customers with more orders), averages shift unexpectedly.
```sql
-- WRONG: Counting with duplicates
SELECT COUNT(c.customer_id) as customer_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- If 500 customers placed 5,000 orders in 2024,
-- this returns 5,000 (counting each order),
-- not 500 (the actual customer count)
```
```sql
-- CORRECT: Counting unique values
SELECT COUNT(DISTINCT c.customer_id) as customer_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Returns 500 (the actual number of
-- unique customers who ordered in 2024)

-- Alternative: Subquery approach
SELECT COUNT(*) FROM (
  SELECT DISTINCT c.customer_id
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  WHERE o.order_date >= '2024-01-01'
) unique_customers;
```

A financial services company once reported $2.4 billion in quarterly revenue instead of $800 million because a reporting query joined transactions with a lookup table that had duplicate entries. Each transaction was tripled in the result set, and SUM() faithfully added them all. The error wasn't caught until external auditors questioned the numbers—a compliance nightmare and massive reputational damage.
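The gap between COUNT and COUNT(DISTINCT ...) is easy to verify with a throwaway in-memory SQLite table (a sketch with made-up data; Python standard library only):

```python
import sqlite3

# Hypothetical data: 2 customers sharing 5 orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(1,), (1,), (1,), (2,), (2,)])

# COUNT over the join column counts every row, duplicates included
total = conn.execute("SELECT COUNT(customer_id) FROM orders").fetchone()[0]
print(total)   # 5 -- one per order

# COUNT(DISTINCT ...) counts each customer once
unique = conn.execute("SELECT COUNT(DISTINCT customer_id) FROM orders").fetchone()[0]
print(unique)  # 2 -- the actual customer count
```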
Before diving into the DISTINCT keyword (covered in detail on the next page), let's survey the full landscape of duplicate elimination strategies. Understanding these alternatives helps you choose the most appropriate approach for each situation.
The most direct approach is the DISTINCT keyword, which eliminates duplicate rows from the result set. We'll explore this thoroughly in the next page, but here's the basic form:
```sql
-- Basic DISTINCT usage
SELECT DISTINCT department
FROM employees;
-- Returns each unique department exactly once,
-- regardless of how many employees are in each department

-- DISTINCT on multiple columns
SELECT DISTINCT city, state
FROM customers;
-- Returns each unique (city, state) combination once
-- Note: (New York, NY) and (New York, CA) are different
```

When you need aggregates along with deduplication, GROUP BY naturally collapses duplicates while allowing you to compute summaries.
```sql
-- GROUP BY eliminates duplicates for the grouped columns
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
-- Each department appears exactly once,
-- plus you get aggregate information

-- GROUP BY without aggregates (equivalent to DISTINCT)
SELECT department
FROM employees
GROUP BY department;
-- Functionally identical to SELECT DISTINCT department
-- Some databases optimize them identically
```

Sometimes the best way to eliminate duplicates is to restructure the query so duplicates never arise. Subqueries can help isolate the deduplication step.
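That GROUP BY without aggregates and DISTINCT return the same unique values can be checked directly (a sketch using an invented employees table in SQLite):

```python
import sqlite3

# Hypothetical employees table with repeated departments
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany("INSERT INTO employees (department) VALUES (?)",
                 [("Engineering",), ("Engineering",), ("Sales",), ("Marketing",)])

grouped = sorted(r[0] for r in conn.execute(
    "SELECT department FROM employees GROUP BY department"))
distinct = sorted(r[0] for r in conn.execute(
    "SELECT DISTINCT department FROM employees"))
print(grouped)              # ['Engineering', 'Marketing', 'Sales']
assert grouped == distinct  # same unique departments either way

# GROUP BY additionally supports per-group aggregates
counts = dict(conn.execute(
    "SELECT department, COUNT(*) FROM employees GROUP BY department"))
print(counts["Engineering"])  # 2
```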
```sql
-- PROBLEMATIC: Customer data duplicated per order
SELECT c.customer_name, c.city, o.order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Customer appears once per order

-- BETTER: Aggregate first, then join
SELECT c.customer_name, c.city,
       order_agg.total_orders, order_agg.total_spent
FROM customers c
JOIN (
  SELECT customer_id,
         COUNT(*) as total_orders,
         SUM(order_total) as total_spent
  FROM orders
  GROUP BY customer_id
) order_agg ON c.customer_id = order_agg.customer_id;
-- Each customer appears exactly once,
-- with aggregated order data attached
```

When you only need to filter on the existence of related rows (not retrieve their data), EXISTS avoids the row multiplication that joins cause.
```sql
-- PROBLEMATIC: Join creates duplicates
SELECT c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Customer with 10 orders in 2024 appears 10 times

-- BETTER: EXISTS for filtering without duplication
SELECT c.customer_name, c.email
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- Each customer appears exactly once
-- EXISTS returns TRUE/FALSE, not rows,
-- so no DISTINCT is needed—often more efficient
```

The UNION operator (without ALL) automatically eliminates duplicates across the combined result sets. Use it when deduplication is desired.
```sql
-- UNION ALL preserves all rows, including duplicates
SELECT customer_id, email FROM customers_active
UNION ALL
SELECT customer_id, email FROM customers_archived;
-- If a customer exists in both tables, they appear twice

-- UNION eliminates duplicates automatically
SELECT customer_id, email FROM customers_active
UNION
SELECT customer_id, email FROM customers_archived;
-- Each unique (customer_id, email) appears once
-- Equivalent to UNION ALL followed by DISTINCT

-- Note: UNION has performance overhead for deduplication
-- Use UNION ALL when duplicates are impossible or acceptable
```

The best duplicate handling often prevents duplicates from arising rather than eliminating them afterward. Using EXISTS instead of JOIN, restructuring queries with subqueries, or selecting appropriate columns can eliminate the need for DISTINCT entirely—often with better performance.
Duplicate elimination is not free. Understanding the computational cost helps you make informed decisions about when and how to deduplicate.
Database engines use two primary algorithms for duplicate elimination: sort-based deduplication, which sorts the rows and then discards adjacent equal rows, and hash-based deduplication, which builds a hash table of rows already seen and keeps only the first occurrence of each.
| Factor | Sort-Based | Hash-Based |
|---|---|---|
| Time Complexity | O(n log n) | O(n) average |
| Memory Usage | Proportional to total rows | Proportional to unique rows |
| Disk I/O | May spill large sorts to disk | May spill hash table to disk |
| Result Order | Produces sorted output | Arbitrary order |
| Best When | Result also needs ORDER BY | Many duplicates, few unique values |
| Database Choice | Optimizer decides based on statistics | Optimizer decides based on statistics |
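The two strategies can be sketched in a few lines of Python (a simplified model for intuition, not how any particular engine implements them):

```python
def dedup_hash(rows):
    """Hash-based: one pass, memory proportional to unique rows,
    output in first-seen (arbitrary) order."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def dedup_sort(rows):
    """Sort-based: O(n log n) sort, then skip adjacent equal rows;
    output comes back sorted as a side effect."""
    out = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out

rows = [("Sales",), ("Engineering",), ("Sales",), ("Engineering",), ("HR",)]
print(dedup_hash(rows))  # [('Sales',), ('Engineering',), ('HR',)]
print(dedup_sort(rows))  # [('Engineering',), ('HR',), ('Sales',)]
```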
DISTINCT becomes particularly expensive in these scenarios:
```sql
-- EXPENSIVE: DISTINCT on a wide row set
SELECT DISTINCT customer_name, email, address, city, state, zip,
       phone, created_at, updated_at, notes
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Must hash/compare all 10 columns for each of potentially millions of rows

-- MORE EFFICIENT: DISTINCT on minimal columns, then join for details
SELECT c.*
FROM customers c
WHERE c.customer_id IN (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= '2024-01-01'
);
-- DISTINCT operates on a single integer column (fast),
-- then full details are retrieved only for unique customers

-- MOST EFFICIENT: Use EXISTS (no DISTINCT needed)
SELECT c.*
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- No deduplication at all—EXISTS short-circuits on the first match
```

While DISTINCT has overhead, it's often negligible compared to other query costs (disk I/O, network transfer). Don't avoid DISTINCT prematurely—use EXPLAIN to analyze actual query plans and identify real bottlenecks. Readability and correctness often outweigh micro-optimizations.
Not all duplicates are problems. Understanding when to preserve duplicates is as important as knowing how to eliminate them.
```sql
-- CORRECT: Preserving duplicates for aggregation
SELECT product_category,
       SUM(sale_amount) as total_sales,   -- Each sale matters
       COUNT(*) as transaction_count,     -- Count includes repeats
       AVG(sale_amount) as avg_sale       -- Average weighted by frequency
FROM sales
GROUP BY product_category;

-- CORRECT: Preserving log frequency information
SELECT error_message,
       COUNT(*) as occurrence_count,
       MIN(logged_at) as first_occurrence,
       MAX(logged_at) as last_occurrence
FROM error_logs
WHERE logged_at >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY error_message
ORDER BY occurrence_count DESC;
-- The COUNT(*) reveals that 'Connection timeout' occurred
-- 5,000 times vs 'Invalid input' occurring 3 times—
-- critical information that DISTINCT would destroy
```

Before adding DISTINCT, ask: "Do these duplicates represent the same entity (should be merged) or distinct events (should be counted separately)?" The answer determines whether DISTINCT is correct or would corrupt your results.
Effective duplicate management combines proper query design, data modeling, and conscious decision-making. Here are professional practices to adopt:
Verify uniqueness assumptions in your source data before relying on them, with a query of the form SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > 1.
```sql
-- Validate potential duplicates in source data
SELECT customer_id, COUNT(*) as occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
-- Should return no rows if customer_id is truly unique

-- Document DISTINCT decisions with comments
-- Note: DISTINCT required because the join with orders creates row multiplication
-- Each customer should appear once regardless of order count
SELECT DISTINCT c.customer_id, c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed';

-- Alternative: Document why DISTINCT is NOT used
-- Note: Intentionally preserving duplicates for SUM accuracy
-- Each order contributes to the total even if amounts match
SELECT SUM(order_total) as revenue
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
```

We've now established a comprehensive foundation for understanding duplicate rows in SQL result sets: SQL operates on multisets rather than true sets; projections, joins, UNION ALL, and subqueries all create duplicates; and duplicate handling should be a conscious decision rather than an afterthought.
What's Next:
With this foundational understanding of why duplicates occur and how to think about them, we're ready to dive deep into the DISTINCT clause—SQL's primary tool for explicit duplicate elimination. The next page covers DISTINCT syntax, semantics, variations, and real-world application patterns.
You now understand the fundamental nature of duplicate rows in SQL, why they occur, and the landscape of strategies for handling them. This conceptual foundation prepares you to use DISTINCT and related features effectively and appropriately.