When you execute a SQL query, the database engine processes your request and returns a result set—a collection of rows that satisfy your specified conditions. However, one of the most common surprises for SQL practitioners is discovering that result sets often contain duplicate rows: multiple rows with identical values across all selected columns.
This phenomenon isn't a bug or database malfunction. It's a fundamental consequence of how relational databases work and how queries are structured. Understanding why duplicates occur—and how to systematically eliminate them—is an essential skill for anyone working with SQL.
Consider a simple scenario: you're querying an orders table to find all the cities where your company has customers. If you have 10,000 orders and only 50 unique cities, a naive query returns 10,000 rows, with massive repetition. This isn't what you want, and it can cause significant problems in reporting, application logic, and data transfer.
By the end of this page, you will understand: (1) Why duplicate rows appear in SQL result sets, (2) The difference between SQL semantics and set theory, (3) How database operations like projection and joins create duplicates, (4) The impact of duplicates on correctness and performance, and (5) Foundational strategies for duplicate elimination beyond DISTINCT.
To understand why duplicates occur, we must first recognize a fundamental distinction between mathematical set theory and SQL's actual behavior.
Set Theory vs. Multisets:
In pure mathematics, a set is a collection of distinct elements. By definition, sets cannot contain duplicates—{A, B, C} and {A, A, B, C} are identical sets because duplicates are automatically eliminated.
However, SQL does not operate on true mathematical sets. SQL operates on multisets (also called bags)—collections that can contain duplicate elements. The multiset {A, A, B, C} has four elements, and the two A's are counted separately.
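The distinction can be sketched in a few lines of Python: a `set` models set semantics, while `collections.Counter` serves as a stand-in for a multiset (this is an illustration, not anything SQL-specific):

```python
from collections import Counter

# Set semantics: duplicates collapse automatically
assert {"A", "A", "B", "C"} == {"A", "B", "C"}
assert len({"A", "A", "B", "C"}) == 3

# Multiset (bag) semantics: duplicates are preserved and counted
bag = Counter(["A", "A", "B", "C"])
assert sum(bag.values()) == 4   # four elements, like a plain SELECT
assert len(bag) == 3            # three unique values, like SELECT DISTINCT
```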
This distinction is critical because the relational model, as originally defined by E.F. Codd, was based on set semantics. But practical SQL implementations adopted multiset semantics for compelling reasons.
When SQL was standardized in the 1980s, the decision to use multiset semantics was controversial. Purists argued it violated relational theory. Pragmatists countered that automatic duplicate elimination was too expensive for large datasets on the hardware of that era. The pragmatists won, and multiset semantics remain the SQL default today.
| Aspect | Set Semantics | Multiset Semantics (SQL Default) |
|---|---|---|
| Duplicate handling | Automatically eliminated | Preserved unless DISTINCT specified |
| Element count | {A, A, B} has 2 elements | {A, A, B} has 3 elements |
| Performance | Requires deduplication overhead | No overhead for duplicate removal |
| Use case | Mathematical operations | Real-world data processing |
| SQL implementation | Requires explicit DISTINCT | Default SELECT behavior |
Understanding the specific operations that create duplicates helps you predict when they'll occur and handle them appropriately. Let's examine the primary sources of duplicate rows in SQL result sets.
The most common source of duplicates is projection—selecting a subset of columns from a table. When you select columns that don't include the primary key or a unique column combination, rows that were distinct in the full table may become identical in the projection.
Example Scenario:
Consider an employees table with columns: employee_id (PK), first_name, department, salary, hire_date. Suppose multiple employees share the same department.
```sql
-- Full table has no duplicates (employee_id is unique)
SELECT employee_id, first_name, department
FROM employees;
-- Result: 1000 rows, all unique

-- Projection onto the department column creates duplicates
SELECT department
FROM employees;
-- Result: 1000 rows, but only 10 unique departments
-- Each department appears ~100 times on average

-- Example output:
-- department
-- -----------
-- Engineering
-- Engineering
-- Engineering
-- Sales
-- Engineering
-- Marketing
-- Sales
-- ...
```

When you join tables, each matching pair of rows from the joined tables produces one result row. If a row in one table matches multiple rows in another, the result contains multiple copies of that row's data.
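The projection effect can be reproduced end to end with an in-memory SQLite database (a minimal sketch using Python's standard sqlite3 module and a made-up five-row table):

```python
import sqlite3

# Hypothetical employees table, scaled down to five rows for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany(
    "INSERT INTO employees (employee_id, department) VALUES (?, ?)",
    [(1, "Engineering"), (2, "Engineering"), (3, "Sales"), (4, "Marketing"), (5, "Sales")],
)

# Full rows are unique (the primary key guarantees it) ...
rows = conn.execute("SELECT employee_id, department FROM employees").fetchall()
print(len(rows))  # 5

# ... but projecting onto department alone produces duplicates
depts = [r[0] for r in conn.execute("SELECT department FROM employees")]
print(depts)  # ['Engineering', 'Engineering', 'Sales', 'Marketing', 'Sales']

distinct = sorted(r[0] for r in conn.execute("SELECT DISTINCT department FROM employees"))
print(distinct)  # ['Engineering', 'Marketing', 'Sales']
```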
One-to-Many Join Expansion:
```sql
-- Customers table: 100 customers
-- Orders table: 5,000 orders (avg 50 orders per customer)

-- Join produces customer data repeated for each order
SELECT c.customer_name, c.city
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Result: 5,000 rows
-- Each customer_name appears as many times as they have orders
-- Customer "John Smith" with 75 orders appears 75 times

-- Many-to-many join (extreme multiplication)
SELECT p.product_name, c.category_name
FROM products p
JOIN product_categories pc ON p.product_id = pc.product_id
JOIN categories c ON pc.category_id = c.category_id;
-- If products average 3 categories each, and the base table
-- has 1,000 products, the result has ~3,000 rows with significant
-- duplication in both product_name and category_name columns
```

The UNION ALL operator concatenates result sets without removing duplicates. If the same data exists in multiple source tables, or the same row satisfies multiple subqueries, duplicates accumulate.
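One-to-many join expansion is easy to demonstrate with a throwaway SQLite session (a sketch with invented tables and data, using only the Python standard library):

```python
import sqlite3

# Hypothetical tables: two customers, one with three orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'John Smith'), (2, 'Ann Lee')")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)", [(1,), (1,), (1,), (2,)])

# The join repeats each customer's data once per matching order
rows = conn.execute(
    "SELECT c.name FROM customers c JOIN orders o ON c.customer_id = o.customer_id"
).fetchall()
print(len(rows))  # 4 result rows: one per order, not one per customer

names = [r[0] for r in rows]
print(names.count("John Smith"))  # 3: repeated once per order
```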
```sql
-- Combining data from regional tables
SELECT customer_name, email FROM customers_north
UNION ALL
SELECT customer_name, email FROM customers_south
UNION ALL
SELECT customer_name, email FROM customers_archived;

-- If the same customer exists in multiple regions (a data quality issue)
-- or appears in both active and archived tables, they appear multiple times

-- Result might show:
-- customer_name  | email
-- ---------------|------------------------
-- Alice Johnson  | alice@example.com
-- Alice Johnson  | alice@example.com   -- duplicate from a different source
-- Bob Smith      | bob@example.com
```

Subqueries, especially in the FROM clause, can produce duplicates when the subquery itself returns non-unique rows, or when correlation with outer query rows causes repetition.
```sql
-- Subquery in FROM clause with duplicates
SELECT sq.category
FROM (
  SELECT category, product_name
  FROM products
  WHERE price > 100
) sq;
-- The subquery returns many products per category
-- Selecting only category from it creates duplicates

-- Lateral join creating row multiplication
SELECT o.order_id,
       (SELECT product_name FROM order_items oi
        WHERE oi.order_id = o.order_id LIMIT 1) as sample_product
FROM orders o
CROSS JOIN LATERAL (
  SELECT * FROM order_items oi
  WHERE oi.order_id = o.order_id
) items;
-- Each order appears once per order item
```

Duplicates aren't always obvious. A query might return 50,000 rows that you assume are unique because you never checked. Always examine result counts and consider whether your column selection could produce duplicates—especially when building reports or feeding data to downstream systems.
Duplicate rows aren't merely aesthetic annoyances—they can cause serious problems in data processing, analysis, and application behavior. Understanding these impacts helps you recognize when duplicate elimination is essential versus optional.
Duplicates are especially dangerous when they feed aggregate functions:

- COUNT(*) counts all rows, duplicates included. If you want to count unique customers but have duplicated customer rows, your count will be inflated—potentially by orders of magnitude.
- SUM() adds duplicate values multiple times. If a $1,000 transaction appears three times due to join multiplication, your revenue report shows $3,000 instead of $1,000.
- AVG() weights duplicated values more heavily. If high-value items are duplicated more often (e.g., premium customers with more orders), averages shift unexpectedly.
```sql
-- WRONG: Counting with duplicates
SELECT COUNT(c.customer_id) as customer_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- If 500 customers placed 5,000 orders in 2024,
-- this returns 5,000 (counting each order),
-- not 500 (the actual customer count)
```
```sql
-- CORRECT: Counting unique values
SELECT COUNT(DISTINCT c.customer_id) as customer_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Returns 500 (the actual number of
-- unique customers who ordered in 2024)

-- Alternative: Subquery approach
SELECT COUNT(*) FROM (
  SELECT DISTINCT c.customer_id
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  WHERE o.order_date >= '2024-01-01'
) unique_customers;
```

A financial services company once reported $2.4 billion in quarterly revenue instead of $800 million because a reporting query joined transactions with a lookup table that had duplicate entries. Each transaction was tripled in the result set, and SUM() faithfully added them all. The error wasn't caught until external auditors questioned the numbers—a compliance nightmare and massive reputational damage.
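The gap between COUNT and COUNT(DISTINCT ...) is easy to verify with a throwaway in-memory SQLite table (a sketch with made-up data; Python standard library only):

```python
import sqlite3

# Hypothetical data: 2 customers sharing 5 orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(1,), (1,), (1,), (2,), (2,)])

# COUNT over the join column counts every row, duplicates included
total = conn.execute("SELECT COUNT(customer_id) FROM orders").fetchone()[0]
print(total)   # 5 -- one per order

# COUNT(DISTINCT ...) counts each customer once
unique = conn.execute("SELECT COUNT(DISTINCT customer_id) FROM orders").fetchone()[0]
print(unique)  # 2 -- the actual customer count
```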
Before diving into the DISTINCT keyword (covered in detail on the next page), let's survey the full landscape of duplicate elimination strategies. Understanding these alternatives helps you choose the most appropriate approach for each situation.
The most direct approach is the DISTINCT keyword, which eliminates duplicate rows from the result set. We'll explore this thoroughly in the next page, but here's the basic form:
```sql
-- Basic DISTINCT usage
SELECT DISTINCT department
FROM employees;
-- Returns each unique department exactly once,
-- regardless of how many employees are in each department

-- DISTINCT on multiple columns
SELECT DISTINCT city, state
FROM customers;
-- Returns each unique (city, state) combination once
-- Note: (New York, NY) and (New York, CA) are different
```

When you need aggregates along with deduplication, GROUP BY naturally collapses duplicates while allowing you to compute summaries.
```sql
-- GROUP BY eliminates duplicates for the grouped columns
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
-- Each department appears exactly once,
-- plus you get aggregate information

-- GROUP BY without aggregates (equivalent to DISTINCT)
SELECT department
FROM employees
GROUP BY department;
-- Functionally identical to SELECT DISTINCT department
-- Some databases optimize them identically
```

Sometimes the best way to eliminate duplicates is to restructure the query so duplicates never arise. Subqueries can help isolate the deduplication step.
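That GROUP BY without aggregates and DISTINCT return the same unique values can be checked directly (a sketch using an invented employees table in SQLite):

```python
import sqlite3

# Hypothetical employees table with repeated departments
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany("INSERT INTO employees (department) VALUES (?)",
                 [("Engineering",), ("Engineering",), ("Sales",), ("Marketing",)])

grouped = sorted(r[0] for r in conn.execute(
    "SELECT department FROM employees GROUP BY department"))
distinct = sorted(r[0] for r in conn.execute(
    "SELECT DISTINCT department FROM employees"))
print(grouped)              # ['Engineering', 'Marketing', 'Sales']
assert grouped == distinct  # same unique departments either way

# GROUP BY additionally supports per-group aggregates
counts = dict(conn.execute(
    "SELECT department, COUNT(*) FROM employees GROUP BY department"))
print(counts["Engineering"])  # 2
```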
```sql
-- PROBLEMATIC: Customer data duplicated per order
SELECT c.customer_name, c.city, o.order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Customer appears once per order

-- BETTER: Aggregate first, then join
SELECT c.customer_name, c.city,
       order_agg.total_orders, order_agg.total_spent
FROM customers c
JOIN (
  SELECT customer_id,
         COUNT(*) as total_orders,
         SUM(order_total) as total_spent
  FROM orders
  GROUP BY customer_id
) order_agg ON c.customer_id = order_agg.customer_id;
-- Each customer appears exactly once,
-- with aggregated order data attached
```

When you only need to filter on the existence of related rows (not retrieve their data), EXISTS avoids the row multiplication that joins cause.
```sql
-- PROBLEMATIC: Join creates duplicates
SELECT c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Customer with 10 orders in 2024 appears 10 times

-- BETTER: EXISTS for filtering without duplication
SELECT c.customer_name, c.email
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- Each customer appears exactly once
-- EXISTS returns TRUE/FALSE, not rows,
-- so no DISTINCT is needed—often more efficient
```

The UNION operator (without ALL) automatically eliminates duplicates across the combined result sets. Use it when deduplication is desired.
```sql
-- UNION ALL preserves all rows, including duplicates
SELECT customer_id, email FROM customers_active
UNION ALL
SELECT customer_id, email FROM customers_archived;
-- If a customer exists in both tables, they appear twice

-- UNION eliminates duplicates automatically
SELECT customer_id, email FROM customers_active
UNION
SELECT customer_id, email FROM customers_archived;
-- Each unique (customer_id, email) appears once
-- Equivalent to UNION ALL followed by DISTINCT

-- Note: UNION has performance overhead for deduplication
-- Use UNION ALL when duplicates are impossible or acceptable
```

The best duplicate handling often prevents duplicates from arising rather than eliminating them afterward. Using EXISTS instead of JOIN, restructuring queries with subqueries, or selecting appropriate columns can eliminate the need for DISTINCT entirely—often with better performance.
Duplicate elimination is not free. Understanding the computational cost helps you make informed decisions about when and how to deduplicate.
Database engines use two primary algorithms for duplicate elimination: sort-based deduplication, which sorts the rows and then discards adjacent equal rows, and hash-based deduplication, which builds a hash table of rows already seen and keeps only the first occurrence of each.
| Factor | Sort-Based | Hash-Based |
|---|---|---|
| Time Complexity | O(n log n) | O(n) average |
| Memory Usage | Proportional to total rows | Proportional to unique rows |
| Disk I/O | May spill large sorts to disk | May spill hash table to disk |
| Result Order | Produces sorted output | Arbitrary order |
| Best When | Result also needs ORDER BY | Many duplicates, few unique values |
| Database Choice | Optimizer decides based on statistics | Optimizer decides based on statistics |
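The two strategies can be sketched in a few lines of Python (a simplified model for intuition, not how any particular engine implements them):

```python
def dedup_hash(rows):
    """Hash-based: one pass, memory proportional to unique rows,
    output in first-seen (arbitrary) order."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def dedup_sort(rows):
    """Sort-based: O(n log n) sort, then skip adjacent equal rows;
    output comes back sorted as a side effect."""
    out = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out

rows = [("Sales",), ("Engineering",), ("Sales",), ("Engineering",), ("HR",)]
print(dedup_hash(rows))  # [('Sales',), ('Engineering',), ('HR',)]
print(dedup_sort(rows))  # [('Engineering',), ('HR',), ('Sales',)]
```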
DISTINCT becomes particularly expensive in these scenarios:
```sql
-- EXPENSIVE: DISTINCT on a wide row set
SELECT DISTINCT customer_name, email, address, city, state, zip,
       phone, created_at, updated_at, notes
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Must hash/compare all 10 columns for each of potentially millions of rows

-- MORE EFFICIENT: DISTINCT on minimal columns, then join for details
SELECT c.*
FROM customers c
WHERE c.customer_id IN (
  SELECT DISTINCT customer_id
  FROM orders
  WHERE order_date >= '2024-01-01'
);
-- DISTINCT operates on a single integer column (fast),
-- then full details are retrieved only for unique customers

-- MOST EFFICIENT: Use EXISTS (no DISTINCT needed)
SELECT c.*
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- No deduplication at all—EXISTS short-circuits on the first match
```

While DISTINCT has overhead, it's often negligible compared to other query costs (disk I/O, network transfer). Don't avoid DISTINCT prematurely—use EXPLAIN to analyze actual query plans and identify real bottlenecks. Readability and correctness often outweigh micro-optimizations.
Not all duplicates are problems. Understanding when to preserve duplicates is as important as knowing how to eliminate them.
```sql
-- CORRECT: Preserving duplicates for aggregation
SELECT product_category,
       SUM(sale_amount) as total_sales,   -- Each sale matters
       COUNT(*) as transaction_count,     -- Count includes repeats
       AVG(sale_amount) as avg_sale       -- Average weighted by frequency
FROM sales
GROUP BY product_category;

-- CORRECT: Preserving log frequency information
SELECT error_message,
       COUNT(*) as occurrence_count,
       MIN(logged_at) as first_occurrence,
       MAX(logged_at) as last_occurrence
FROM error_logs
WHERE logged_at >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY error_message
ORDER BY occurrence_count DESC;
-- The COUNT(*) reveals that 'Connection timeout' occurred
-- 5,000 times vs 'Invalid input' occurring 3 times—
-- critical information that DISTINCT would destroy
```

Before adding DISTINCT, ask: "Do these duplicates represent the same entity (should be merged) or distinct events (should be counted separately)?" The answer determines whether DISTINCT is correct or would corrupt your results.
Effective duplicate management combines proper query design, data modeling, and conscious decision-making. Here are professional practices to adopt:
Verify uniqueness assumptions in your source data before relying on them, with a query of the form SELECT column, COUNT(*) FROM table GROUP BY column HAVING COUNT(*) > 1.
```sql
-- Validate potential duplicates in source data
SELECT customer_id, COUNT(*) as occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
-- Should return no rows if customer_id is truly unique

-- Document DISTINCT decisions with comments
-- Note: DISTINCT required because the join with orders creates row multiplication
-- Each customer should appear once regardless of order count
SELECT DISTINCT c.customer_id, c.customer_name, c.email
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'completed';

-- Alternative: Document why DISTINCT is NOT used
-- Note: Intentionally preserving duplicates for SUM accuracy
-- Each order contributes to the total even if amounts match
SELECT SUM(order_total) as revenue
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
```

We've now established a comprehensive foundation for understanding duplicate rows in SQL result sets: SQL operates on multisets rather than true sets; projections, joins, UNION ALL, and subqueries all create duplicates; and duplicate handling should be a conscious decision rather than an afterthought.
What's Next:
With this foundational understanding of why duplicates occur and how to think about them, we're ready to dive deep into the DISTINCT clause—SQL's primary tool for explicit duplicate elimination. The next page covers DISTINCT syntax, semantics, variations, and real-world application patterns.
You now understand the fundamental nature of duplicate rows in SQL, why they occur, and the landscape of strategies for handling them. This conceptual foundation prepares you to use DISTINCT and related features effectively and appropriately.