You run a simple query to find all departments in your company and get the result: Engineering, Marketing, Engineering, HR, Engineering, Sales. Wait—why does Engineering appear three times? Is this a bug, or expected behavior?
The presence of duplicates in query results is one of the most misunderstood aspects of database systems. It sits at the intersection of theoretical foundations (where relations are sets with no duplicates) and practical implementation (where SQL tables are multisets that allow duplicates). Understanding when duplicates occur, when they matter, and how to handle them correctly is essential for database professionals.
Duplicate elimination isn't just about adding DISTINCT to make results "look right." It's about understanding data semantics, query correctness, and performance trade-offs. Mishandling duplicates can cause incorrect aggregations, inefficient queries, and subtle bugs that only manifest at scale.
By the end of this page, you will understand the theoretical foundations of set versus bag semantics, know when SQL queries produce duplicates and why, master the DISTINCT keyword and its alternatives, comprehend the performance implications of duplicate elimination, and develop strategies for handling duplicates correctly in complex queries.
The fundamental question of duplicates comes down to a choice in data model semantics: should relations behave like sets (no duplicates) or bags/multisets (duplicates allowed)?
Set Semantics (Relational Algebra):
In E.F. Codd's original relational model, a relation is a set of tuples, so the same tuple cannot appear twice:
Mathematical Foundation:
A relation R over schema S is defined as: $$R \subseteq D_1 \times D_2 \times \cdots \times D_n$$
As a subset of a Cartesian product, R is a set—and sets have no duplicates.
Bag Semantics (SQL):
SQL deliberately chose bag (multiset) semantics: a query result may contain the same row more than once, and duplicates are eliminated only on request (via DISTINCT, UNION, and related constructs).
SQL's bag semantics was a pragmatic choice: eliminating duplicates after every operation would force a costly sort or hash step, and aggregates such as SUM and AVG need duplicate values to compute correct totals.
This choice trades theoretical purity for practical efficiency.
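The contrast is easy to see in a scripting session. A minimal sketch using Python's built-in sqlite3 module and a made-up two-column Employee table (the data is illustrative):

```python
import sqlite3

# SQL tables are bags: inserting the same row twice is allowed
# (there is no key constraint here), unlike a mathematical set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ada", "Engineering"), ("Ada", "Engineering")],  # exact duplicate row
)

bag = conn.execute("SELECT department FROM Employee").fetchall()
print(bag)  # [('Engineering',), ('Engineering',)] -- duplicates kept

# DISTINCT recovers set semantics on demand
setlike = conn.execute("SELECT DISTINCT department FROM Employee").fetchall()
print(setlike)  # [('Engineering',)]
```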
Understanding when duplicates appear helps you write correct queries and avoid unnecessary DISTINCT clauses.
Source 1: Projection of Non-Key Columns
The most common source of duplicates is projecting columns that don't form a key:
-- Each employee has unique emp_id, but multiple employees share departments
SELECT department FROM Employee; -- Duplicates likely
SELECT emp_id FROM Employee; -- No duplicates (primary key)
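To make this concrete, here is a small sqlite3 sketch reproducing the opening example; the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (emp_id INTEGER PRIMARY KEY, department TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(1, "Engineering"), (2, "Marketing"), (3, "Engineering"),
     (4, "HR"), (5, "Engineering"), (6, "Sales")],
)

# Projecting a non-key column keeps one row per employee -> duplicates
depts = [r[0] for r in conn.execute("SELECT department FROM Employee")]
print(depts.count("Engineering"))  # 3

# Projecting the primary key cannot produce duplicates
ids = [r[0] for r in conn.execute("SELECT emp_id FROM Employee")]
print(len(ids) == len(set(ids)))  # True
```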
Source 2: Joins That Multiply Rows
One-to-many and many-to-many joins can produce duplicate values:
-- One customer can have many orders
SELECT c.customer_name
FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- Customer names repeat for each order
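A quick sqlite3 sketch of the one-to-many case, with hypothetical Customer and Orders tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 1);  -- Acme has 3 orders
""")

# The one-to-many join repeats 'Acme' once per matching order;
# Globex, with no orders, drops out of the inner join entirely.
names = [r[0] for r in conn.execute(
    "SELECT c.customer_name FROM Customer c "
    "JOIN Orders o ON c.id = o.customer_id")]
print(names)  # ['Acme', 'Acme', 'Acme']
```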
Source 3: UNION ALL
UNION ALL preserves duplicates; UNION eliminates them:
-- UNION ALL keeps all rows from both queries
SELECT name FROM Active_Employees
UNION ALL
SELECT name FROM Former_Employees;
-- If someone appears in both tables, they appear twice
| Scenario | Duplicates? | Reason |
|---|---|---|
| SELECT * FROM table | No (if properly designed) | Primary key ensures unique rows |
| SELECT primary_key FROM table | No | Primary key values are unique |
| SELECT non_key_columns FROM table | Possible | Multiple rows may have same values |
| SELECT cols FROM t1 JOIN t2 | Likely | Join can multiply rows |
| UNION ALL | Preserved | Explicitly combines without dedup |
| UNION | Eliminated | Deduplication is built-in |
| GROUP BY results | No | Grouping creates unique key per group |
| Subquery results | Depends | Based on subquery structure |
If you're surprised by duplicates in your results, don't immediately add DISTINCT. Investigate why duplicates appeared:
• Missing join condition creating a Cartesian product?
• Join to the wrong table or wrong column?
• A fundamental data model misunderstanding?
DISTINCT can mask bugs. Understand the cause before treating the symptom.
The DISTINCT keyword is SQL's mechanism for converting bag results to set results. Understanding its precise semantics prevents common mistakes.
Basic Syntax:
SELECT DISTINCT column1, column2, ...
FROM table;
Key Semantic: DISTINCT Applies to Entire Row
DISTINCT eliminates duplicate rows, not duplicate values in one column:
-- WRONG mental model: "unique values in each column independently"
-- RIGHT mental model: "unique rows (tuples)"
-- One column: unique departments
SELECT DISTINCT department FROM Employee;
-- Two columns: something different
SELECT DISTINCT department, status FROM Employee;
-- Eliminates duplicate (department, status) pairs,
-- NOT duplicate departments and duplicate statuses separately.
-- A department can still repeat, paired with different statuses.
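The row-level behavior is easy to verify; a sketch with an invented (department, status) table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (department TEXT, status TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [
    ("Engineering", "active"),
    ("Engineering", "active"),    # duplicate pair -> collapsed
    ("Engineering", "on_leave"),  # same department, new pair -> kept
])

rows = conn.execute(
    "SELECT DISTINCT department, status FROM Employee").fetchall()
# Two rows survive: Engineering appears twice because the *pairs* differ
print(sorted(rows))  # [('Engineering', 'active'), ('Engineering', 'on_leave')]
```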
DISTINCT and NULL:
For DISTINCT purposes, NULL values are treated as not distinct from one another:
-- Table with NULLs:
SELECT manager_id FROM Employee; -- NULL, 1, NULL, 2, 1
SELECT DISTINCT manager_id FROM Employee; -- NULL, 1, 2
-- Multiple NULLs collapse to single NULL in DISTINCT output
This is a departure from NULL's usual behavior (where NULL ≠ NULL in comparisons).
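A short sqlite3 check of both behaviors, using the manager_id values from the comment above (NULL, 1, NULL, 2, 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (manager_id INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?)",
                 [(None,), (1,), (None,), (2,), (1,)])

distinct = conn.execute(
    "SELECT DISTINCT manager_id FROM Employee").fetchall()
print(len(distinct))        # 3: NULL, 1, 2 -- the two NULLs collapsed
print((None,) in distinct)  # True

# Yet in ordinary comparisons NULL = NULL is UNKNOWN, not true
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
print(null_eq)  # None
```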
ALL Keyword (Opposite of DISTINCT):
ALL is the default and explicit opposite of DISTINCT:
-- These are equivalent:
SELECT department FROM Employee;
SELECT ALL department FROM Employee;
ALL is rarely used explicitly but appears in some SQL standards discussions.
DISTINCT in Aggregate Functions:
DISTINCT can appear inside aggregate functions:
-- Count unique departments
SELECT COUNT(DISTINCT department) FROM Employee; -- Returns 3
-- Count all department occurrences
SELECT COUNT(department) FROM Employee; -- Returns 5
-- Sum unique salaries (unusual but valid)
SELECT SUM(DISTINCT salary) FROM Employee;
• COUNT(*) counts all rows, including rows with NULLs
• COUNT(column) counts non-NULL values in the column (duplicates included)
• COUNT(DISTINCT column) counts unique non-NULL values
For 100 rows where the column has 80 non-NULL values and 20 unique values:
• COUNT(*) = 100
• COUNT(column) = 80
• COUNT(DISTINCT column) = 20
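The three COUNT variants can be verified with a toy table engineered to contain both a NULL and duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (department TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?)", [
    ("Engineering",), ("Marketing",), ("Engineering",),
    (None,), ("Engineering",),
])

# 5 rows total; 4 non-NULL departments; 2 unique departments
star, col, dcol = conn.execute(
    "SELECT COUNT(*), COUNT(department), COUNT(DISTINCT department) "
    "FROM Employee").fetchone()
print(star, col, dcol)  # 5 4 2
```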
Duplicate elimination is not free. Understanding its cost helps you make informed decisions about when DISTINCT is truly necessary.
Algorithmic Cost of Deduplication:
| Algorithm | Time Complexity | Space Complexity | Output Order |
|---|---|---|---|
| Sort-based | O(n log n) | O(n) | Sorted |
| Hash-based | O(n) expected | O(d) where d=distinct | Unordered |
| Index-based | O(n) or O(d) | Minimal | Index order |
When Sort-Based Is Chosen: when the result must be ordered anyway (DISTINCT combined with ORDER BY), when the input arrives already sorted (e.g., from an index scan), or when memory is too limited to hold a hash table of distinct values.
When Hash-Based Is Chosen: when the distinct values fit in memory and no particular output order is required; this avoids the O(n log n) sort at the cost of unordered output.
| Query Type | Without DISTINCT | With DISTINCT | Overhead |
|---|---|---|---|
| 1M rows, 100 distinct values | ~100ms scan | ~200ms scan + hash | ~2x |
| 1M rows, 900K distinct values | ~100ms scan | ~1500ms scan + hash/sort | ~15x |
| Already unique (key column) | ~100ms scan | ~200ms (wasted work) | ~2x (unnecessary!) |
| With covering index | ~100ms index scan | ~120ms index scan + dedup | ~1.2x |
| Complex join, 100M intermediate | ~5s join | ~15s join + sort | ~3x |
Optimization Opportunities:
1. Eliminate Redundant DISTINCT:
If projecting columns that include a key, DISTINCT is unnecessary:
-- Unnecessary DISTINCT (emp_id is primary key)
SELECT DISTINCT emp_id, name, department FROM Employee;
-- Equivalent and faster:
SELECT emp_id, name, department FROM Employee;
Some optimizers detect and eliminate this, but not all.
2. Push DISTINCT Into Subquery:
When joining to a table where only distinct keys matter:
-- Suboptimal: dedup after join
SELECT DISTINCT c.name
FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- Better: dedup before join (if applicable)
SELECT c.name
FROM Customer c
WHERE c.id IN (SELECT DISTINCT customer_id FROM Orders);
-- Or use EXISTS for potentially better optimization
3. Use EXISTS Instead of DISTINCT:
When checking for existence of related rows:
-- May require DISTINCT depending on join
SELECT DISTINCT c.name FROM Customer c
JOIN Orders o ON c.id = o.customer_id
WHERE o.amount > 100;
-- EXISTS avoids join multiplication
SELECT c.name FROM Customer c
WHERE EXISTS (
SELECT 1 FROM Orders o
WHERE o.customer_id = c.id AND o.amount > 100
);
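Both formulations return the same customer set. A sketch with hypothetical data in which only one customer has an order over 100:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO Customer VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO Orders VALUES (10, 1, 250), (11, 1, 400), (12, 2, 50);
""")

# Join + DISTINCT: Acme matches twice, then duplicates are removed
with_distinct = conn.execute("""
    SELECT DISTINCT c.name FROM Customer c
    JOIN Orders o ON c.id = o.customer_id
    WHERE o.amount > 100""").fetchall()

# EXISTS: each customer is emitted at most once, no dedup step needed
with_exists = conn.execute("""
    SELECT c.name FROM Customer c
    WHERE EXISTS (SELECT 1 FROM Orders o
                  WHERE o.customer_id = c.id AND o.amount > 100)""").fetchall()

print(sorted(with_distinct) == sorted(with_exists))  # True: same rows
print(sorted(with_exists))  # [('Acme',)]
```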
Adding DISTINCT doesn't reduce I/O—the database must still read all source rows to check for duplicates. It adds CPU work and memory usage for deduplication. Only add DISTINCT when semantically necessary, not as a "safety net" or to fix query bugs.
Sometimes the goal of eliminating duplicates can be achieved more efficiently through query restructuring rather than brute-force DISTINCT.
Alternative 1: GROUP BY
GROUP BY inherently produces unique combinations:
-- Using DISTINCT
SELECT DISTINCT department FROM Employee;
-- Using GROUP BY (equivalent result)
SELECT department FROM Employee GROUP BY department;
When to prefer GROUP BY: when you also need per-group aggregates (counts, sums) alongside the unique values; for a plain column list, most optimizers compile GROUP BY and DISTINCT to the same plan.
Alternative 2: EXISTS Instead of JOIN
When testing for existence without needing data from joined table:
-- JOIN produces duplicates requiring DISTINCT
SELECT DISTINCT c.* FROM Customer c
JOIN Orders o ON c.id = o.customer_id;
-- EXISTS naturally produces unique rows
SELECT c.* FROM Customer c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.id);
Alternative 3: Proper JOIN Conditions
Sometimes duplicates indicate a join problem:
-- Missing join condition creates Cartesian product
SELECT DISTINCT e.name, d.location
FROM Employee e, Department d; -- Every employee × every department!
-- Proper join prevents duplicates naturally
SELECT e.name, d.location
FROM Employee e
JOIN Department d ON e.dept_id = d.id;
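Counting rows makes the difference obvious. With 3 employees and 3 departments (invented data), the comma join yields 9 rows while the keyed join yields 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (name TEXT, dept_id INTEGER);
CREATE TABLE Department (id INTEGER PRIMARY KEY, location TEXT);
INSERT INTO Employee VALUES ('Ada', 1), ('Lin', 2), ('Sam', 1);
INSERT INTO Department VALUES (1, 'NYC'), (2, 'SF'), (3, 'Austin');
""")

# No join condition: every employee paired with every department
cartesian = conn.execute(
    "SELECT e.name, d.location FROM Employee e, Department d").fetchall()

# Proper join condition: one row per employee
proper = conn.execute(
    "SELECT e.name, d.location FROM Employee e "
    "JOIN Department d ON e.dept_id = d.id").fetchall()

print(len(cartesian), len(proper))  # 9 3
```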
Alternative 4: Window Functions with ROW_NUMBER
When you need one row per group with specific selection logic:
-- Get the most recent order per customer
-- DISTINCT can't express "which duplicate to keep"
WITH RankedOrders AS (
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM Orders
)
SELECT customer_id, order_date, amount
FROM RankedOrders
WHERE rn = 1;
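The CTE above runs as-is on SQLite 3.25+ (the version that added window functions, bundled with all recent Python builds); a sketch with invented order data:

```python
import sqlite3  # requires SQLite >= 3.25 for ROW_NUMBER() OVER (...)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (customer_id INTEGER, order_date TEXT, amount REAL);
INSERT INTO Orders VALUES
  (1, '2024-01-05', 100), (1, '2024-03-20', 250),
  (2, '2024-02-11', 75);
""")

# Keep only the most recent order per customer (rn = 1)
latest = conn.execute("""
    WITH RankedOrders AS (
      SELECT customer_id, order_date, amount,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY order_date DESC) AS rn
      FROM Orders)
    SELECT customer_id, order_date, amount
    FROM RankedOrders WHERE rn = 1
    ORDER BY customer_id""").fetchall()

print(latest)  # [(1, '2024-03-20', 250.0), (2, '2024-02-11', 75.0)]
```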
Alternative 5: UNION Instead of UNION ALL
When combining queries that might overlap:
-- UNION ALL preserves duplicates
SELECT name FROM Active_Employees
UNION ALL
SELECT name FROM Contractors; -- Same person in both?
-- UNION eliminates duplicates automatically
SELECT name FROM Active_Employees
UNION
SELECT name FROM Contractors; -- Each name appears once
• If checking existence: use EXISTS (stops at the first match)
• If you need one representative row per group: use window functions
• If combining sets that might overlap: use UNION
• If you need counts or aggregates per group: use GROUP BY
• If none of the above apply and duplicates are genuine: use DISTINCT
SQL's set operations (UNION, INTERSECT, EXCEPT) have both duplicate-eliminating and duplicate-preserving variants. Understanding these is essential for correct query construction.
UNION vs UNION ALL:
| Operation | Duplicates Within Input | Duplicates Between Inputs |
|---|---|---|
| UNION | Eliminated | Eliminated |
| UNION ALL | Preserved | Preserved |
-- UNION: eliminates all duplicates
SELECT name FROM Employees
UNION
SELECT name FROM Contractors;
-- Each unique name appears exactly once
-- UNION ALL: preserves all rows
SELECT name FROM Employees
UNION ALL
SELECT name FROM Contractors;
-- If "John" appears twice in Employees and once in Contractors,
-- result contains "John" three times
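A runnable version of the "John" scenario, assuming John appears twice among employees and once among contractors:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (name TEXT);
CREATE TABLE Contractors (name TEXT);
INSERT INTO Employees VALUES ('John'), ('John'), ('Mei');
INSERT INTO Contractors VALUES ('John');
""")

union_all = conn.execute(
    "SELECT name FROM Employees UNION ALL SELECT name FROM Contractors"
).fetchall()
union = conn.execute(
    "SELECT name FROM Employees UNION SELECT name FROM Contractors"
).fetchall()

print(len(union_all))  # 4 -- every input row kept ('John' three times)
print(sorted(union))   # [('John',), ('Mei',)] -- all duplicates eliminated
```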
Performance Implications:
| Operation | Time Complexity | Notes |
|---|---|---|
| UNION ALL | O(n + m) | Just append, no processing |
| UNION | O((n+m) log(n+m)) | Requires sort/hash for dedup |
| INTERSECT | O(n log n + m log m) | Sort both then merge |
| EXCEPT | O(n log n + m log m) | Sort both then exclude |
Using the Right Operation:
-- Use UNION ALL when:
-- 1. You know inputs are disjoint
SELECT id FROM Sales_2023 UNION ALL SELECT id FROM Sales_2024;
-- Different years, no overlap possible
-- 2. You're aggregating afterward
SELECT SUM(amount)
FROM (
SELECT amount FROM Sales_East
UNION ALL
SELECT amount FROM Sales_West
) combined;
-- Duplicates are intentional - each sale should count
-- Use UNION when:
-- 1. You need unique elements
SELECT DISTINCT customer_id FROM Sales_2023
UNION
SELECT DISTINCT customer_id FROM Sales_2024;
-- Customer in both years should appear once
-- 2. Inputs might overlap
SELECT email FROM VIP_Customers
UNION
SELECT email FROM Newsletter_Subscribers;
-- Same email shouldn't appear twice
Unlike SELECT (which preserves duplicates by default), UNION, INTERSECT, and EXCEPT eliminate duplicates by default! This is because they're defined as set operations from relational algebra. Use the ALL variants when you need multiset behavior.
The guidelines that follow from this understanding are consistent throughout: investigate the cause of duplicates before suppressing them, prefer restructuring (EXISTS, GROUP BY, proper join conditions) over reflexive DISTINCT, and reserve explicit deduplication for cases where duplicates are a genuine property of the data.
Duplicate handling is a nuanced topic that bridges theoretical foundations and practical implementation. Mastery requires understanding both perspectives.
What's Next:
With Selection, Projection, and duplicate handling mastered individually, we now bring them together. The next page explores how to combine Selection and Projection operations effectively—understanding operation ordering, query optimization opportunities, and constructing complex queries from these fundamental building blocks.
You now understand duplicate handling deeply—from theoretical set semantics through SQL's practical bag semantics to optimization strategies. This nuanced understanding helps you write correct, efficient queries and avoid common pitfalls. Next, we'll combine Selection and Projection into powerful composite queries.