DISTINCT and LIMIT: Controlling Result Sets

Level: Beginner

Duration: 60 mins

Topic: DISTINCT and LIMIT

DISTINCT Clause

The DISTINCT Keyword: SQL's Deduplication Tool

The DISTINCT keyword is SQL's primary mechanism for transforming multiset (bag) results into true set results—eliminating duplicate rows so that each unique combination of values appears exactly once. While conceptually simple, DISTINCT has nuanced behavior that experienced SQL developers must master.

This page provides exhaustive coverage of DISTINCT: its placement in queries, how it determines row uniqueness, its interaction with other clauses, performance characteristics, and common patterns and pitfalls. By the end, you'll wield DISTINCT with precision and confidence.

What You Will Learn

By the end of this page, you will understand: (1) DISTINCT syntax and placement rules, (2) How DISTINCT determines row equality including NULL handling, (3) DISTINCT with single vs. multiple columns, (4) DISTINCT ON (PostgreSQL extension), (5) Interaction with ORDER BY and other clauses, (6) Performance optimization techniques, and (7) Common mistakes and how to avoid them.

Basic DISTINCT Syntax

DISTINCT is placed immediately after the SELECT keyword and before the column list. It affects the entire row—not individual columns. This distinction is crucial and commonly misunderstood.

distinct_syntax.sql
-- Basic syntax
SELECT DISTINCT column1, column2, ...
FROM table_name;
 
-- DISTINCT eliminates rows where ALL listed columns match
-- It does NOT operate on individual columns independently
 
-- Example: Unique departments
SELECT DISTINCT department
FROM employees;
 
-- Example: Unique (city, state) combinations
SELECT DISTINCT city, state
FROM customers;
 
-- The above returns unique pairs, not unique cities AND unique states
-- (New York, NY) and (New York, CA) are DIFFERENT rows
-- (New York, NY) appearing twice becomes one row

Common Misconception

DISTINCT does NOT mean 'make each column unique independently.' The query SELECT DISTINCT city, state does not give you unique cities AND unique states—it gives you unique (city, state) pairs. This is one of the most common misunderstandings about DISTINCT.
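The difference is easy to check on a small table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the customers rows are made-up illustration data, not taken from any real schema.

```python
import sqlite3

# In-memory database with hypothetical sample rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("New York", "NY"), ("New York", "NY"), ("New York", "CA"),
     ("Los Angeles", "CA"), ("Los Angeles", "CA")],
)

# DISTINCT over two columns keeps unique (city, state) pairs
pairs = conn.execute(
    "SELECT DISTINCT city, state FROM customers ORDER BY city, state"
).fetchall()

# DISTINCT over one column keeps unique cities only
cities = conn.execute(
    "SELECT DISTINCT city FROM customers ORDER BY city"
).fetchall()

print(pairs)   # three pairs: 'New York' appears once per state
print(cities)  # two cities
```

'New York' shows up twice in the pair result because it is paired with two different states, which is exactly the row-level behavior described above.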

1.1 SELECT ALL: The Implicit Default

SQL actually has an explicit keyword for the default behavior: SELECT ALL. This explicitly requests multiset semantics (preserving duplicates). In practice it is rarely written because it is the default, but it exists for completeness.

select_all.sql

-- These are equivalent:
SELECT department FROM employees;
SELECT ALL department FROM employees;
 
-- Both return all rows including duplicates
-- SELECT ALL is valid SQL but rarely written, since it is the default
 
-- DISTINCT explicitly changes the default behavior
SELECT DISTINCT department FROM employees;
-- Returns each unique department exactly once

1.2 Placement Rules

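As a row-level quantifier, DISTINCT must immediately follow SELECT. It cannot appear elsewhere in the select list or after column names. (DISTINCT inside an aggregate function, such as COUNT(DISTINCT column), is a separate feature covered in section 5.3.)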

distinct_placement.sql
-- CORRECT placement
SELECT DISTINCT department, city FROM employees;
 
-- INVALID placements (syntax errors)
SELECT department DISTINCT, city FROM employees;  -- Error
SELECT department, DISTINCT city FROM employees;  -- Error
DISTINCT SELECT department FROM employees;        -- Error
 
-- DISTINCT with expressions
SELECT DISTINCT UPPER(department) FROM employees;
-- Returns unique uppercase department names
 
-- DISTINCT with aliases
SELECT DISTINCT department AS dept, city AS location
FROM employees;
-- Aliases don't affect DISTINCT comparison; they rename output

How DISTINCT Determines Row Equality

Understanding exactly how DISTINCT determines whether two rows are 'equal' (and thus one should be eliminated) is essential for predictable results. The rules involve data type comparison semantics and special handling for NULL values.

2.1 Column-by-Column Comparison

Two rows are considered duplicates if every selected column has equal values between them. If any column differs, the rows are distinct.

DISTINCT Row Comparison Example
Row	city	state	Kept?
1	New York	NY	Yes (first occurrence)
2	New York	NY	No (duplicate of row 1)
3	New York	CA	Yes (state differs)
4	Los Angeles	CA	Yes (city differs from row 3)
5	Los Angeles	CA	No (duplicate of row 4)

2.2 NULL Handling in DISTINCT

DISTINCT treats NULL values as equal to each other for comparison purposes. This is a critical exception to the normal SQL rule that NULL is not equal to anything, including itself.

Normal SQL comparison: NULL = NULL evaluates to NULL (unknown), not TRUE. A WHERE clause keeps only rows whose condition is TRUE, so unknown behaves like false there.

DISTINCT comparison: Two NULLs in the same column position are treated as matching.

This behavior aligns with the intuition that 'unknown' values should be grouped together, and GROUP BY and UNION treat NULLs the same way, but it differs from how the = operator handles NULL.

distinct_null.sql
-- Sample data
-- employee_id | department | manager_id
-- 1           | Sales      | 100
-- 2           | Sales      | 100
-- 3           | NULL       | NULL
-- 4           | NULL       | NULL
-- 5           | Sales      | NULL
 
SELECT DISTINCT department, manager_id FROM employees;
 
-- Result:
-- department | manager_id
-- Sales      | 100        (rows 1,2 collapsed)
-- NULL       | NULL       (rows 3,4 collapsed - NULLs match!)
-- Sales      | NULL       (row 5 - different from above)
 
-- Compare with WHERE clause NULL behavior:
SELECT * FROM employees WHERE department = NULL;
-- Returns NOTHING (NULL = NULL is unknown, treated as false)
 
-- But DISTINCT groups NULLs together:
SELECT DISTINCT department FROM employees;
-- Returns: Sales, NULL (one NULL row for all NULL departments)

SQL Standard Specification

The SQL standard explicitly specifies that for DISTINCT, 'two values that are both null are considered to be not distinct.' This special-case handling ensures that DISTINCT produces intuitive results when NULL values are present, even though it contradicts the general NULL comparison rules.
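The NULL rules can be verified directly. This is a minimal sketch using Python's sqlite3 module and the same five sample rows as the listing above; SQLite is only an assumption made for convenience, and other engines should group NULLs the same way under DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Sales", 100), (2, "Sales", 100), (3, None, None),
     (4, None, None), (5, "Sales", None)],
)

# DISTINCT treats the two (NULL, NULL) rows as duplicates of each other
distinct_rows = conn.execute(
    "SELECT DISTINCT department, manager_id FROM employees"
).fetchall()

# '= NULL' never evaluates to TRUE, so no rows pass the filter
equals_null = conn.execute(
    "SELECT employee_id FROM employees WHERE department = NULL"
).fetchall()

# IS NULL is the correct way to test for NULL in a filter
is_null = conn.execute(
    "SELECT employee_id FROM employees WHERE department IS NULL ORDER BY employee_id"
).fetchall()

print(len(distinct_rows), equals_null, is_null)
```

Python represents SQL NULL as None, so the collapsed row appears as (None, None) in distinct_rows.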

2.3 Data Type Considerations

DISTINCT comparison follows the rules of each column's data type, which can produce surprising results with certain types.

distinct_datatypes.sql
-- String comparison (case sensitivity depends on collation)
SELECT DISTINCT name FROM users;
-- With case-insensitive collation: 'John' and 'john' are duplicates
-- With case-sensitive collation: 'John' and 'john' are distinct
 
-- Trailing spaces (varies by database)
-- In some databases: 'John' and 'John  ' are equal (trailing spaces ignored)
-- In others: they're different values
 
-- Numeric precision
SELECT DISTINCT price FROM products;
-- 10.00 and 10.000 are typically equal (same numeric value)
-- But 10.00 and 10.001 are different
 
-- Timestamp precision
SELECT DISTINCT created_at FROM orders;
-- Timestamps compared to full precision
-- '2024-01-15 10:30:00.000' vs '2024-01-15 10:30:00.001' are different
 
-- Mixing a truncated value with the full-precision column
-- Every selected column takes part in the comparison, so the finer-grained one decides
SELECT DISTINCT DATE(order_timestamp), order_timestamp FROM orders;
-- DATE(order_timestamp) alone would collapse to one row per day, but including
-- order_timestamp keeps every distinct timestamp as its own row
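The collation effect on string comparison can be sketched with SQLite, which lets a column declare its collation. The table names users_cs and users_ci are hypothetical, and BINARY and NOCASE are SQLite's built-in collations; other databases name and configure collations differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_cs (name TEXT);                 -- default BINARY collation (case-sensitive)
    CREATE TABLE users_ci (name TEXT COLLATE NOCASE);  -- case-insensitive collation
    INSERT INTO users_cs VALUES ('John'), ('john'), ('JOHN');
    INSERT INTO users_ci VALUES ('John'), ('john'), ('JOHN');
""")

# Same data, same query shape; only the column collation differs
cs_count = len(conn.execute("SELECT DISTINCT name FROM users_cs").fetchall())
ci_count = len(conn.execute("SELECT DISTINCT name FROM users_ci").fetchall())

print(cs_count, ci_count)
```

With the case-insensitive column, the three spellings collapse into a single row; which spelling is returned is not something to rely on.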

DISTINCT with Multiple Columns

When DISTINCT is applied to multiple columns, it eliminates rows where all specified columns match. This creates unique tuples or combinations, not unique individual values.

multi_column_distinct.sql
-- Sample orders table
-- order_id | customer_id | product_id | quantity
-- 1        | 100         | A          | 2
-- 2        | 100         | B          | 1
-- 3        | 100         | A          | 3
-- 4        | 101         | A          | 2
-- 5        | 101         | A          | 2
 
-- DISTINCT on single column
SELECT DISTINCT customer_id FROM orders;
-- Result: 100, 101 (2 rows)
 
SELECT DISTINCT product_id FROM orders;
-- Result: A, B (2 rows)
 
-- DISTINCT on two columns: unique combinations
SELECT DISTINCT customer_id, product_id FROM orders;
-- Result:
-- 100, A  (rows 1,3 collapsed)
-- 100, B  (row 2)
-- 101, A  (rows 4,5 collapsed)
-- Total: 3 rows
 
-- Note: Row 3 (customer 100, product A, qty 3) and 
-- row 1 (customer 100, product A, qty 2) are considered
-- duplicates because only customer_id and product_id are compared

3.1 Column Order Doesn't Affect Results

The order of columns in the SELECT DISTINCT list doesn't change which rows are considered duplicates—only the presentation order of columns in the output.

column_order.sql
-- These queries return the same rows (different column order)
SELECT DISTINCT customer_id, product_id FROM orders;
SELECT DISTINCT product_id, customer_id FROM orders;
 
-- Both return the same 3 unique combinations
-- Only the column presentation order differs
 
-- (100, A) and (A, 100) describe the same customer/product combination

3.2 All Selected Columns Participate

Every column in the SELECT list participates in DISTINCT comparison. You cannot make some columns distinct while preserving duplicates in others.

distinct_all_columns.sql
-- WRONG expectation:
-- "I want unique customers with 
-- their first order date"
 
SELECT DISTINCT 
    customer_id, 
    order_date
FROM orders;
 
-- This gives unique (customer, date) 
-- PAIRS, not unique customers!
 
-- If customer 100 ordered on Jan 1 
-- and Jan 2, both rows appear

distinct_correct.sql

-- CORRECT approach:
-- Use aggregation to get one 
-- row per customer
 
SELECT 
    customer_id, 
    MIN(order_date) as first_order
FROM orders
GROUP BY customer_id;
 
-- This gives exactly one row per 
-- customer with their first order
 
-- Or use DISTINCT ON in PostgreSQL
-- (covered in section 4)

The DISTINCT Trap

A common mistake is adding columns to a DISTINCT query and expecting the uniqueness to remain on the original column. Each column you add potentially increases the number of 'unique' rows because there are more ways for rows to differ. If you need one row per X with additional columns, use GROUP BY with aggregates or window functions.
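A small experiment makes the trap concrete. The sketch below uses sqlite3 with three hypothetical orders: adding order_date to the DISTINCT list produces one row per (customer, date) pair, while GROUP BY with MIN produces one row per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "2024-01-01"), (2, 100, "2024-01-02"), (3, 101, "2024-01-05")],
)

# DISTINCT over both columns: customer 100 appears twice (two different dates)
distinct_pairs = conn.execute(
    "SELECT DISTINCT customer_id, order_date FROM orders ORDER BY customer_id, order_date"
).fetchall()

# GROUP BY + aggregate: exactly one row per customer
first_orders = conn.execute(
    "SELECT customer_id, MIN(order_date) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(distinct_pairs)
print(first_orders)
```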

DISTINCT ON: PostgreSQL Extension

PostgreSQL extends standard SQL with DISTINCT ON (columns), which allows you to specify which columns determine uniqueness while still selecting additional columns. This is extremely useful, but it is not part of standard SQL and is not available in most other database systems.

distinct_on.sql
-- PostgreSQL DISTINCT ON syntax
SELECT DISTINCT ON (expression1, expression2, ...)
    column1, column2, column3, ...
FROM table_name
ORDER BY expression1, expression2, ..., other_columns;
 
-- Example: Get first order per customer
SELECT DISTINCT ON (customer_id)
    customer_id,
    order_id,
    order_date,
    total_amount
FROM orders
ORDER BY customer_id, order_date ASC;
 
-- This returns exactly ONE row per customer_id
-- The row kept is the FIRST one according to ORDER BY
-- So we get each customer's earliest order
 
-- Example: Get most recent login per user
SELECT DISTINCT ON (user_id)
    user_id,
    login_at,
    ip_address,
    device_type
FROM login_history
ORDER BY user_id, login_at DESC;
 
-- Returns most recent login for each user

ORDER BY Requirement

PostgreSQL does not strictly require ORDER BY with DISTINCT ON, but without it the row kept from each group is unpredictable, so in practice you should always include one. When ORDER BY is present, it must start with the DISTINCT ON expressions; the database uses this ordering to determine which row to keep from each group. Additional ORDER BY columns can follow to control which specific row is selected.

4.1 DISTINCT ON vs. Alternatives

DISTINCT ON solves the 'select additional columns' problem elegantly, but equivalent solutions exist in other databases:

distinct_on_alternatives.sql
-- PostgreSQL DISTINCT ON
SELECT DISTINCT ON (customer_id)
    customer_id, order_id, order_date
FROM orders
ORDER BY customer_id, order_date DESC;
 
-- Standard SQL equivalent using window function
SELECT customer_id, order_id, order_date
FROM (
    SELECT 
        customer_id, 
        order_id, 
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id 
            ORDER BY order_date DESC
        ) as rn
    FROM orders
) ranked
WHERE rn = 1;
 
-- Alternative using correlated subquery
SELECT o.customer_id, o.order_id, o.order_date
FROM orders o
WHERE o.order_date = (
    SELECT MAX(o2.order_date)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id
);
 
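-- Note: The correlated subquery may return multiple rows per customer
-- if there are ties; ROW_NUMBER returns exactly one row, but which tied
-- row it picks is arbitrary unless ORDER BY includes a tie-breaker (e.g. order_id)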
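To see how ties behave, here is a minimal sketch in sqlite3 where customer 100 has two orders on the same date (hypothetical data). It assumes a SQLite build with window function support (3.25 or later), and it adds order_id as a tie-breaker so the ROW_NUMBER result is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "2024-03-01"), (2, 100, "2024-03-01"), (3, 101, "2024-02-10")],
)

# Correlated subquery: both of customer 100's tied orders match MAX(order_date)
subquery_rows = conn.execute("""
    SELECT o.customer_id, o.order_id
    FROM orders o
    WHERE o.order_date = (
        SELECT MAX(o2.order_date) FROM orders o2
        WHERE o2.customer_id = o.customer_id
    )
    ORDER BY o.customer_id, o.order_id
""").fetchall()

# ROW_NUMBER with order_id as tie-breaker: one predictable row per customer
ranked_rows = conn.execute("""
    SELECT customer_id, order_id
    FROM (
        SELECT customer_id, order_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, order_id DESC
               ) AS rn
        FROM orders
    ) ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(subquery_rows)  # three rows: customer 100 appears twice
print(ranked_rows)    # two rows: one per customer
```

The same tie-breaker idea applies to DISTINCT ON in PostgreSQL: appending a unique column to ORDER BY makes the kept row predictable.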

DISTINCT ON Alternatives Comparison
Approach	Database Support	Handles Ties	Performance
DISTINCT ON	PostgreSQL (not standard SQL)	One row per group; arbitrary among ties unless ORDER BY breaks them	Usually good, especially with a matching index
ROW_NUMBER() window	Most modern databases	One row per group; arbitrary among ties unless ORDER BY breaks them	Usually good (requires a sort or a matching index)
Correlated subquery	All databases	May return multiple rows	Can be poor on large tables without a supporting index
GROUP BY + JOIN	All databases	Depends on join condition	Varies (can be optimized)

DISTINCT Interaction with Other SQL Clauses

DISTINCT interacts with other SQL clauses in specific ways. Understanding these interactions is essential for writing correct queries.

5.1 DISTINCT and ORDER BY

When using both DISTINCT and ORDER BY, the ORDER BY columns must be present in the SELECT list (in most databases). This is because DISTINCT is logically applied before ORDER BY, and you can only sort by columns that exist in the result.

distinct_order_by.sql
-- VALID: ORDER BY column is in SELECT list
SELECT DISTINCT city, state
FROM customers
ORDER BY city;
 
-- VALID: Multiple ORDER BY columns, all in SELECT
SELECT DISTINCT city, state
FROM customers
ORDER BY state, city;
 
-- INVALID in most databases: ORDER BY column not in SELECT
SELECT DISTINCT city
FROM customers
ORDER BY state;
-- Error: ORDER BY items must appear in the select list if DISTINCT is specified
 
-- Why? After DISTINCT eliminates rows, the database might have:
-- city
-- ----
-- Boston
-- New York
-- For 'Boston', there might have been original rows from MA and CT.
-- Which 'state' value should be used for sorting?
 
-- Solution: Include the ORDER BY column in SELECT
SELECT DISTINCT city, state
FROM customers
ORDER BY state;
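If you really do want one row per city but sorted by state, you have to say which state value represents each city. One possible approach, sketched here with sqlite3 and hypothetical rows, is to replace DISTINCT with GROUP BY and sort by an aggregate such as MIN(state). Choosing MIN is an assumption made for this example; MAX or another rule may suit your data better.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Boston", "MA"), ("Boston", "MA"), ("Albany", "NY"),
     ("Portland", "OR"), ("Portland", "ME")],
)

# One row per city, ordered by each city's alphabetically first state.
# The aggregate removes the ambiguity that makes ORDER BY state invalid with DISTINCT.
cities_by_state = conn.execute("""
    SELECT city
    FROM customers
    GROUP BY city
    ORDER BY MIN(state), city
""").fetchall()

print(cities_by_state)
```

Portland sorts under 'ME' here because MIN(state) picks the smaller of its two states.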

5.2 DISTINCT and WHERE

WHERE filtering occurs before DISTINCT deduplication. This means you filter rows first, then eliminate duplicates from the filtered set.

distinct_where.sql
-- Logical evaluation order: FROM -> WHERE -> SELECT (with DISTINCT)
SELECT DISTINCT department
FROM employees
WHERE salary > 50000;
 
-- Step 1: Read all employees
-- Step 2: Filter to those with salary > 50000
-- Step 3: Project to department column
-- Step 4: Eliminate duplicate departments
 
-- This gives unique departments that have at least one
-- employee earning over $50,000

5.3 DISTINCT and Aggregates

DISTINCT can be used inside aggregate functions to operate only on unique values. This is different from SELECT DISTINCT.

distinct_aggregate.sql
-- COUNT(column) counts non-NULL values (including duplicates)
SELECT COUNT(department) FROM employees;
-- Result: 1000 (if 1000 employees have a department)
 
-- COUNT(DISTINCT column) counts unique non-NULL values
SELECT COUNT(DISTINCT department) FROM employees;
-- Result: 10 (if there are 10 unique departments)
 
-- This works with other aggregates too
SELECT AVG(salary) FROM employees;           -- Average of all salaries
SELECT AVG(DISTINCT salary) FROM employees;  -- Average of unique salary values
-- If 5 people earn $50K and 3 earn $60K:
-- AVG(salary) = (5*50000 + 3*60000) / 8 = $53,750
-- AVG(DISTINCT salary) = (50000 + 60000) / 2 = $55,000
 
-- SUM(DISTINCT column) sums unique values
SELECT SUM(order_total) FROM orders;           -- All order totals
SELECT SUM(DISTINCT order_total) FROM orders;  -- Each unique total once
 
-- String aggregation with DISTINCT (PostgreSQL example)
SELECT STRING_AGG(DISTINCT department, ', ' ORDER BY department)
FROM employees;
-- Returns: "Engineering, Marketing, Sales" (not "Sales, Sales, Sales, Marketing...")

DISTINCT in Aggregates

DISTINCT inside aggregates is different from SELECT DISTINCT. Without GROUP BY, SELECT DISTINCT COUNT(*) is pointless: the query returns a single count, so there is nothing to deduplicate. SELECT COUNT(DISTINCT column) counts unique non-NULL values, which is a very common and useful operation.
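The salary arithmetic from the listing above (five people at $50K, three at $60K) can be reproduced in a few lines. This sketch uses sqlite3 with made-up rows; note that SQLite's AVG always returns a floating-point value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [(50000,)] * 5 + [(60000,)] * 3,
)

# Plain aggregates see all 8 rows; DISTINCT aggregates see only 2 unique values
count_all, count_unique, avg_all, avg_unique = conn.execute("""
    SELECT COUNT(salary),
           COUNT(DISTINCT salary),
           AVG(salary),
           AVG(DISTINCT salary)
    FROM employees
""").fetchone()

print(count_all, count_unique, avg_all, avg_unique)
```

AVG(DISTINCT salary) answers "what is the average of the salary levels that exist", which is rarely the same question as "what does the average employee earn".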

5.4 DISTINCT and GROUP BY

Using both DISTINCT and GROUP BY is usually redundant. GROUP BY already collapses rows to one per combination of the grouped columns, so DISTINCT on a select list that includes those columns has nothing left to remove and may only add computational overhead.

distinct_group_by.sql
-- REDUNDANT: DISTINCT with same columns as GROUP BY
SELECT DISTINCT department, COUNT(*) as emp_count
FROM employees
GROUP BY department;
 
-- EQUIVALENT and more efficient (no DISTINCT):
SELECT department, COUNT(*) as emp_count
FROM employees
GROUP BY department;
 
-- GROUP BY already produces one row per department
-- DISTINCT has nothing to deduplicate
 
-- However, DISTINCT can be meaningful with certain GROUP BY patterns:
SELECT DISTINCT dept_category
FROM (
    SELECT 
        department,
        CASE WHEN COUNT(*) > 100 THEN 'Large' ELSE 'Small' END as dept_category
    FROM employees
    GROUP BY department
) dept_sizes;
-- Multiple departments might be 'Large', DISTINCT collapses them

DISTINCT Performance Optimization

DISTINCT operations require significant computational resources. Understanding how to optimize them—or avoid them entirely—is essential for high-performance SQL.

6.1 Understanding the Cost

DISTINCT requires the database to:

1. Process all matching rows from the base query
2. Identify duplicates, typically by sorting the rows or by building a hash table rather than by comparing every pair of rows
3. Output only unique rows

The cost scales with both the number of input rows and the size of each row (number and width of columns).
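The two common deduplication strategies, hashing and sorting, can be sketched in plain Python. This is only a conceptual illustration under simplifying assumptions (rows are tuples without NULLs and everything fits in memory); real database engines are far more elaborate, and the function names are hypothetical.

```python
from itertools import groupby


def distinct_hash(rows):
    """Hash-based: remember every row seen so far; memory grows with unique rows."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def distinct_sort(rows):
    """Sort-based: after sorting, duplicates are adjacent and collapse in one pass."""
    return [key for key, _ in groupby(sorted(rows))]


rows = [("NY", 1), ("CA", 2), ("NY", 1), ("CA", 2), ("TX", 3)]
print(distinct_hash(rows))  # first-seen order
print(distinct_sort(rows))  # sorted order
```

Both functions return the same set of rows; they differ in output order and in where the cost goes (memory for the hash set versus time for the sort), which mirrors the trade-off a query planner weighs.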

DISTINCT Performance Factors
Factor	Impact	Mitigation
Total input rows	More rows = more comparisons	Add WHERE filters before DISTINCT
Number of columns	More columns = more comparison work	Select only needed columns
Column data types	Large strings/BLOBs are slow to hash	Avoid DISTINCT on TEXT/BLOB
Uniqueness ratio	Few duplicates = DISTINCT is wasted	Verify duplicates actually exist
Available memory	Hash tables spill to disk if too large	Ensure work_mem is adequate
Indexes	Index-only scans can provide sorted input	Create indexes on DISTINCT columns

6.2 Query Plan Analysis

Use EXPLAIN to understand how the database implements DISTINCT:

distinct_explain.sql
-- PostgreSQL EXPLAIN example
EXPLAIN (ANALYZE, BUFFERS) 
SELECT DISTINCT department FROM employees;
 
-- Possible plans:
 
-- 1. HashAggregate: Uses hash table for deduplication
--    HashAggregate  (rows=10) (actual rows=10)
--      -> Seq Scan on employees (rows=10000)
--    Good when: many duplicates, moderate unique values
 
-- 2. Sort + Unique: Sorts then removes adjacent duplicates
--    Unique  (rows=10)
--      -> Sort (rows=10000)
--        -> Seq Scan on employees
--    Good when: ORDER BY also needed, index available
 
-- 3. Index Only Scan (if covering index exists)
--    Unique
--      -> Index Only Scan using idx_department
--    Best performance: no table access needed
 
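-- Steer the PostgreSQL planner toward one strategy (for testing only)
-- These settings discourage a plan type rather than strictly forbidding it;
-- change one at a time and RESET it afterwards
SET enable_hashagg = off;  -- Favor sort-based DISTINCT
SET enable_sort = off;     -- Favor hash-based DISTINCT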

6.3 Index-Based Optimization

Indexes can dramatically speed up DISTINCT by providing pre-sorted or grouped access paths.

distinct_index.sql
-- Without index: Full table scan + sort/hash
SELECT DISTINCT department FROM employees;
-- Reads all 10,000 rows, then deduplicates
 
-- With index on department: Index-only scan possible
CREATE INDEX idx_employees_department ON employees(department);
SELECT DISTINCT department FROM employees;
-- Reads only index entries (much smaller), already sorted/grouped
 
-- For multi-column DISTINCT, index must cover all columns
CREATE INDEX idx_employees_dept_city 
ON employees(department, city);
 
SELECT DISTINCT department, city FROM employees;
-- Can use index-only scan for fast deduplication
 
-- Index column order matters for some query patterns
-- (department, city) index helps: DISTINCT department, city
-- (department, city) index helps: DISTINCT department
-- (department, city) index is much less useful for DISTINCT city:
-- its entries are not ordered by city, so a separate sort or hash step is still needed

6.4 Alternative Approaches for Better Performance

Sometimes restructuring the query eliminates the need for DISTINCT:

distinct_alternatives.sql
-- SLOW: DISTINCT after join multiplication
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Joins potentially millions of rows, then deduplicates
 
-- OFTEN FASTER: EXISTS avoids row multiplication
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- No join multiplication, no DISTINCT needed
 
-- OFTEN FASTER: Semi-join with IN (often equivalent to EXISTS)
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE c.customer_id IN (
    SELECT customer_id 
    FROM orders 
    WHERE order_date >= '2024-01-01'
);
-- IN only tests membership, so the subquery needs no DISTINCT
 
-- ALTERNATIVE: Use GROUP BY for aggregated information
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_id, c.customer_name;
-- Gets unique customers PLUS order counts (more useful)

Rethink the Query

If DISTINCT is slow, ask: 'Why are there duplicates?' Often the answer reveals a better query structure. EXISTS for filtering, GROUP BY for aggregation, and subqueries for isolation often outperform SELECT DISTINCT on joined multi-table queries.
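The following sketch shows where the duplicates come from in the join version and confirms that EXISTS returns the same customers without needing DISTINCT. It uses sqlite3 with a handful of hypothetical customers and orders; it demonstrates equivalence of results only and makes no performance measurement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES
        (10, 1, '2024-02-01'), (11, 1, '2024-05-20'), (12, 2, '2023-11-30');
""")

# Plain join: Ann appears once per matching order (row multiplication)
join_rows = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    ORDER BY c.customer_id
""").fetchall()

# Join + DISTINCT: duplicates removed after the fact
join_distinct_rows = conn.execute("""
    SELECT DISTINCT c.customer_id, c.customer_name
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
    ORDER BY c.customer_id
""").fetchall()

# EXISTS: each customer is tested once, so duplicates never arise
exists_rows = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers c
    WHERE EXISTS (
        SELECT 1 FROM orders o
        WHERE o.customer_id = c.customer_id
          AND o.order_date >= '2024-01-01'
    )
    ORDER BY c.customer_id
""").fetchall()

print(join_rows, join_distinct_rows, exists_rows)
```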

Common DISTINCT Mistakes and Pitfalls

Even experienced developers make DISTINCT-related errors. Understanding these common pitfalls helps you avoid them.

Common DISTINCT Antipatterns

• Using DISTINCT as a band-aid: Adding DISTINCT to 'fix' duplicate results without understanding why duplicates occur. This masks data model or query logic issues and may hide real bugs.
• DISTINCT on unique columns: Using DISTINCT on a single-table query whose select list includes the primary key is wasteful. The rows are already unique by definition.
• Expecting column-level deduplication: Misunderstanding that DISTINCT operates on rows, not individual columns. SELECT DISTINCT a, b gives unique (a,b) pairs, not unique a's AND unique b's.
• DISTINCT with ORDER BY conflict: Trying to ORDER BY columns not in the SELECT list when using DISTINCT. Most databases reject this because it's ambiguous.
• Avoiding DISTINCT without measuring: Skipping DISTINCT because of assumed performance problems. DISTINCT on indexed columns with few unique values is often very fast.
• Unnecessary DISTINCT in subqueries: Adding DISTINCT in subqueries used for EXISTS checks. EXISTS is satisfied by the first matching row, so DISTINCT doesn't help.

distinct_antipatterns.sql
-- ANTIPATTERN: DISTINCT as band-aid
-- Developer notices duplicate customers, adds DISTINCT "to fix it"
SELECT DISTINCT customer_name, email FROM customers;
-- But wait—why are there duplicates in the customer table?
-- This might indicate: data quality issues, missing unique constraint,
-- or incorrect query joining somewhere else
 
-- BETTER: Investigate the actual cause
SELECT customer_name, email, COUNT(*)
FROM customers
GROUP BY customer_name, email
HAVING COUNT(*) > 1;
-- See the actual duplicates and decide how to handle them
 
-- ANTIPATTERN: DISTINCT on primary key (wasteful)
SELECT DISTINCT customer_id, customer_name, email
FROM customers;
-- customer_id is PK, so rows are already unique
-- DISTINCT adds overhead for no benefit
 
-- CORRECT: Just select without DISTINCT
SELECT customer_id, customer_name, email FROM customers;
 
-- ANTIPATTERN: DISTINCT in EXISTS subquery
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT DISTINCT o.order_id  -- DISTINCT is pointless here!
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
-- EXISTS returns TRUE on first match; it doesn't matter if
-- there would be duplicates because only one row is checked
 
-- CORRECT: Simple EXISTS
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1  -- Convention: SELECT 1 in EXISTS
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

DISTINCT as a Code Smell

If you find yourself adding DISTINCT 'because the query returns duplicates,' treat it as a code smell. Investigate why duplicates are occurring. The cause might be: incorrect joins, missing GROUP BY, data quality issues, or schema design problems. DISTINCT should be intentional, not a fix for surprising behavior.

Real-World DISTINCT Usage Patterns

Let's examine practical scenarios where DISTINCT is the right tool, applied correctly.

distinct_patterns.sql
-- Pattern 1: Dropdown/Filter value lists
-- Get unique values for a UI filter component
SELECT DISTINCT category_name
FROM products
WHERE is_active = true
ORDER BY category_name;
-- Returns list of categories for a dropdown menu
 
-- Pattern 2: Distinct customers who purchased
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Each customer who ordered on or after 2024-01-01, listed once
-- (Alternative: use EXISTS for potentially better performance)
 
-- Pattern 3: Tag cloud / unique labels
SELECT DISTINCT tag
FROM article_tags
WHERE article_id IN (SELECT id FROM articles WHERE published = true);
-- All unique tags used on published articles
 
-- Pattern 4: Data quality check - find unique value counts
SELECT 
    COUNT(*) as total_rows,
    COUNT(DISTINCT email) as unique_emails,
    COUNT(DISTINCT phone) as unique_phones,
    COUNT(email) - COUNT(DISTINCT email) as duplicate_email_count
FROM customers;
-- COUNT(email) skips NULLs, so missing emails are not miscounted as duplicates
-- Reveals potential data quality issues
 
-- Pattern 5: Geographic coverage analysis
SELECT DISTINCT country, state, city
FROM customer_addresses
ORDER BY country, state, city;
-- All locations where we have customers
 
-- Pattern 6: Product availability by region
SELECT DISTINCT 
    p.product_name,
    w.region
FROM products p
JOIN inventory i ON p.product_id = i.product_id
JOIN warehouses w ON i.warehouse_id = w.warehouse_id
WHERE i.quantity > 0;
-- Products available in each region (once per region, not per warehouse)
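One detail worth checking in Pattern 4 is how NULLs affect the counts: COUNT(*) counts every row, while COUNT(email) and COUNT(DISTINCT email) both ignore NULL emails. The sketch below uses sqlite3 with hypothetical addresses to show how the choice of baseline changes a duplicate count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [("a@example.com",), ("a@example.com",), (None,), (None,), ("b@example.com",)],
)

total_rows, non_null, unique_emails = conn.execute("""
    SELECT COUNT(*), COUNT(email), COUNT(DISTINCT email)
    FROM customers
""").fetchone()

# Subtracting from COUNT(*) also counts the two NULL rows as "duplicates"
overstated = total_rows - unique_emails
# Subtracting from COUNT(email) counts only repeated non-NULL values
duplicates = non_null - unique_emails

print(total_rows, non_null, unique_emails, overstated, duplicates)
```

Here only one row is a true duplicate (the second a@example.com), yet the COUNT(*) baseline reports three.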

Summary: Mastering DISTINCT

The DISTINCT clause is a fundamental SQL tool for eliminating duplicate rows. Let's consolidate the key insights from this comprehensive exploration:

Key Takeaways

• DISTINCT eliminates duplicate rows: It works on entire rows, not individual columns. Two rows are duplicates only if ALL selected columns match.
• NULLs are treated as equal: For DISTINCT purposes, two NULL values in the same column are considered matching, unlike regular SQL comparison rules.
• DISTINCT operates on selected columns only: Adding columns to your SELECT DISTINCT can produce more 'unique' rows because there are more ways for rows to differ.
• DISTINCT ON (PostgreSQL-specific): Allows specifying which columns determine uniqueness while selecting additional columns. Use with ORDER BY to control which row is kept.
• ORDER BY columns must be in SELECT: When using DISTINCT with ORDER BY, the ORDER BY columns must appear in the SELECT list (in most databases).
• DISTINCT inside aggregates works differently: COUNT(DISTINCT column) counts unique values; this is different from SELECT DISTINCT which eliminates duplicate rows.
• Performance requires attention: DISTINCT has computational cost; use indexes, minimize column count, and consider alternatives like EXISTS or GROUP BY when appropriate.
• DISTINCT as a symptom: If you're adding DISTINCT to 'fix' unexpected duplicates, investigate the root cause. It may indicate data or query design issues.

What's Next:

With DISTINCT mastered, we'll explore the complementary challenge: limiting the number of rows returned. The next page covers LIMIT, TOP, and FETCH—syntax variations across database systems for restricting result set size.

Page Complete

You now have comprehensive knowledge of the DISTINCT clause—its syntax, semantics, performance characteristics, and best practices. You can apply DISTINCT intentionally and effectively, avoiding common pitfalls that trap less experienced developers.

2 / 5

Loading learning content...

Database Management SystemsDISTINCT and LIMIT

DISTINCT and LIMIT: Controlling Result Sets

LevelBeginner

Duration60 mins

TopicDISTINCT and LIMIT

2 / 5

DISTINCT Clause

The DISTINCT Keyword: SQL's Deduplication Tool

What You Will Learn

Basic DISTINCT Syntax

DISTINCT is placed immediately after the SELECT keyword and before the column list. It affects the entire row—not individual columns. This distinction is crucial and commonly misunderstood.

distinct_syntax.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Basic syntax
SELECT DISTINCT column1, column2, ...
FROM table_name;
 
-- DISTINCT eliminates rows where ALL listed columns match
-- It does NOT operate on individual columns independently
 
-- Example: Unique departments
SELECT DISTINCT department
FROM employees;
 
-- Example: Unique (city, state) combinations
SELECT DISTINCT city, state
FROM customers;
 
-- The above returns unique pairs, not unique cities AND unique states
-- (New York, NY) and (New York, CA) are DIFFERENT rows
-- (New York, NY) appearing twice becomes one row

Common Misconception

1.1 SELECT ALL: The Implicit Default

select_all.sql

-- These are equivalent:
SELECT department FROM employees;
SELECT ALL department FROM employees;
 
-- Both return all rows including duplicates
-- SELECT ALL is valid SQL but never used in practice
 
-- DISTINCT explicitly changes the default behavior
SELECT DISTINCT department FROM employees;
-- Returns each unique department exactly once

1.2 Placement Rules

DISTINCT must immediately follow SELECT. It cannot appear elsewhere in the select list or after column names.

distinct_placement.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- CORRECT placement
SELECT DISTINCT department, city FROM employees;
 
-- INVALID placements (syntax errors)
SELECT department DISTINCT, city FROM employees;  -- Error
SELECT department, DISTINCT city FROM employees;  -- Error
DISTINCT SELECT department FROM employees;        -- Error
 
-- DISTINCT with expressions
SELECT DISTINCT UPPER(department) FROM employees;
-- Returns unique uppercase department names
 
-- DISTINCT with aliases
SELECT DISTINCT department AS dept, city AS location
FROM employees;
-- Aliases don't affect DISTINCT comparison; they rename output

How DISTINCT Determines Row Equality

2.1 Column-by-Column Comparison

Two rows are considered duplicates if every selected column has equal values between them. If any column differs, the rows are distinct.

DISTINCT Row Comparison Example
Row	city	state	Kept?
1	New York	NY	Yes (first occurrence)
2	New York	NY	No (duplicate of row 1)
3	New York	CA	Yes (state differs)
4	Los Angeles	CA	Yes (city differs from row 3)
5	Los Angeles	CA	No (duplicate of row 4)

2.2 NULL Handling in DISTINCT

DISTINCT treats NULL values as equal to each other for comparison purposes. This is a critical exception to the normal SQL rule that NULL is not equal to anything, including itself.

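Normal SQL comparison: NULL = NULL evaluates to NULL (unknown). A WHERE clause keeps only rows whose condition is true, so unknown behaves like false there.

DISTINCT comparison: Two NULLs in the same column position are treated as matching.

This behavior matches the intuition that 'unknown' values should be grouped together. It differs from the = operator, but it is deliberate: the SQL standard defines duplicates as rows that are 'not distinct' from each other, and under that test two NULLs match. GROUP BY and UNION deduplicate the same way.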

distinct_null.sql
-- Sample data
-- employee_id | department | manager_id
-- 1           | Sales      | 100
-- 2           | Sales      | 100
-- 3           | NULL       | NULL
-- 4           | NULL       | NULL
-- 5           | Sales      | NULL
 
SELECT DISTINCT department, manager_id FROM employees;
 
-- Result:
-- department | manager_id
-- Sales      | 100        (rows 1,2 collapsed)
-- NULL       | NULL       (rows 3,4 collapsed - NULLs match!)
-- Sales      | NULL       (row 5 - different from above)
 
-- Compare with WHERE clause NULL behavior:
SELECT * FROM employees WHERE department = NULL;
-- Returns NOTHING (NULL = NULL is unknown, treated as false)
 
-- But DISTINCT groups NULLs together:
SELECT DISTINCT department FROM employees;
-- Returns: Sales, NULL (one NULL row for all NULL departments)

SQL Standard Specification
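
The SQL standard exposes the rule DISTINCT uses as a predicate of its own, IS [NOT] DISTINCT FROM. The sketch below uses PostgreSQL syntax and only literal values, so it needs no tables. Other engines spell the null-safe comparison differently (MySQL uses the <=> operator, for example), so treat the exact syntax as an assumption about your database.

```sql
-- PostgreSQL syntax; no tables required
SELECT
    (NULL = NULL)                    AS equals_result,        -- NULL (unknown)
    (NULL IS NOT DISTINCT FROM NULL) AS not_distinct_result,  -- true
    (1 IS NOT DISTINCT FROM NULL)    AS mixed_result;         -- false
```

The second column is the comparison DISTINCT effectively applies to each pair of column values when deciding whether two rows are duplicates.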

2.3 Data Type Considerations

DISTINCT comparison follows the rules of each column's data type, which can produce surprising results with certain types.

distinct_datatypes.sql
-- String comparison (case sensitivity depends on collation)
SELECT DISTINCT name FROM users;
-- With case-insensitive collation: 'John' and 'john' are duplicates
-- With case-sensitive collation: 'John' and 'john' are distinct
 
-- Trailing spaces (varies by database)
-- In some databases: 'John' and 'John  ' are equal (trailing spaces ignored)
-- In others: they're different values
 
-- Numeric precision
SELECT DISTINCT price FROM products;
-- 10.00 and 10.000 are typically equal (same numeric value)
-- But 10.00 and 10.001 are different
 
-- Timestamp precision
SELECT DISTINCT created_at FROM orders;
-- Timestamps compared to full precision
-- '2024-01-15 10:30:00.000' vs '2024-01-15 10:30:00.001' are different
 
-- Mixing a truncated date with the full timestamp
-- Selecting both columns defeats the per-day deduplication
SELECT DISTINCT DATE(order_timestamp), order_timestamp FROM orders;
-- DATE(order_timestamp) repeats within a day, but order_timestamp rarely repeats,
-- so almost every (date, timestamp) pair is unique; select only the date to get one row per day
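
When collation or trailing-space behavior makes the result unpredictable, one option is to normalize the values explicitly before deduplicating. The following is a minimal sketch with a hypothetical users_demo table. LOWER and TRIM are widely supported, but check their availability and exact behavior in your database.

```sql
-- Hypothetical sample data: three spellings of the same name plus one other
CREATE TABLE users_demo (name VARCHAR(50));
INSERT INTO users_demo (name) VALUES
    ('John'), ('john'), ('John  '), ('Mary');

-- Normalize case and surrounding whitespace, then deduplicate
SELECT DISTINCT LOWER(TRIM(name)) AS normalized_name
FROM users_demo
ORDER BY normalized_name;
-- john
-- mary
```

The trade-off is that the output shows the normalized form rather than any of the original spellings.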

DISTINCT with Multiple Columns

When DISTINCT is applied to multiple columns, it eliminates rows where all specified columns match. This creates unique tuples or combinations, not unique individual values.

multi_column_distinct.sql
-- Sample orders table
-- order_id | customer_id | product_id | quantity
-- 1        | 100         | A          | 2
-- 2        | 100         | B          | 1
-- 3        | 100         | A          | 3
-- 4        | 101         | A          | 2
-- 5        | 101         | A          | 2
 
-- DISTINCT on single column
SELECT DISTINCT customer_id FROM orders;
-- Result: 100, 101 (2 rows)
 
SELECT DISTINCT product_id FROM orders;
-- Result: A, B (2 rows)
 
-- DISTINCT on two columns: unique combinations
SELECT DISTINCT customer_id, product_id FROM orders;
-- Result:
-- 100, A  (rows 1,3 collapsed)
-- 100, B  (row 2)
-- 101, A  (rows 4,5 collapsed)
-- Total: 3 rows
 
-- Note: Row 3 (customer 100, product A, qty 3) and 
-- row 1 (customer 100, product A, qty 2) are considered
-- duplicates because only customer_id and product_id are compared
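
To see how the select list changes the number of unique rows, you can count the output of each DISTINCT query. The sketch below loads the sample rows above into a hypothetical orders_pairs_demo table. The derived-table aliases (pairs, triples) are also illustrative names.

```sql
-- Hypothetical table holding the five sample orders
CREATE TABLE orders_pairs_demo (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    product_id  CHAR(1),
    quantity    INT
);
INSERT INTO orders_pairs_demo (order_id, customer_id, product_id, quantity) VALUES
    (1, 100, 'A', 2),
    (2, 100, 'B', 1),
    (3, 100, 'A', 3),
    (4, 101, 'A', 2),
    (5, 101, 'A', 2);

-- Unique (customer_id, product_id) pairs
SELECT COUNT(*) AS unique_pairs
FROM (SELECT DISTINCT customer_id, product_id FROM orders_pairs_demo) AS pairs;
-- 3

-- Adding quantity gives rows one more way to differ
SELECT COUNT(*) AS unique_triples
FROM (SELECT DISTINCT customer_id, product_id, quantity FROM orders_pairs_demo) AS triples;
-- 4
```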

3.1 Column Order Doesn't Affect Results

The order of columns in the SELECT DISTINCT list doesn't change which rows are considered duplicates—only the presentation order of columns in the output.

column_order.sql
-- These queries return the same rows (different column order)
SELECT DISTINCT customer_id, product_id FROM orders;
SELECT DISTINCT product_id, customer_id FROM orders;
 
-- Both return the same 3 unique combinations
-- Only the column presentation order differs
 
-- (100, A) and (A, 100) describe the same customer/product combination

3.2 All Selected Columns Participate

Every column in the SELECT list participates in DISTINCT comparison. You cannot make some columns distinct while preserving duplicates in others.

distinct_all_columns.sql
-- WRONG expectation:
-- "I want unique customers with 
-- their first order date"
 
SELECT DISTINCT 
    customer_id, 
    order_date
FROM orders;
 
-- This gives unique (customer, date) 
-- PAIRS, not unique customers!
 
-- If customer 100 ordered on Jan 1 
-- and Jan 2, both rows appear

distinct_correct.sql

-- CORRECT approach:
-- Use aggregation to get one 
-- row per customer
 
SELECT 
    customer_id, 
    MIN(order_date) as first_order
FROM orders
GROUP BY customer_id;
 
-- This gives exactly one row per 
-- customer with their first order
 
-- Or use DISTINCT ON in PostgreSQL
-- (covered in section 4)

The DISTINCT Trap

DISTINCT ON: PostgreSQL Extension

distinct_on.sql
-- PostgreSQL DISTINCT ON syntax
SELECT DISTINCT ON (expression1, expression2, ...)
    column1, column2, column3, ...
FROM table_name
ORDER BY expression1, expression2, ..., other_columns;
 
-- Example: Get first order per customer
SELECT DISTINCT ON (customer_id)
    customer_id,
    order_id,
    order_date,
    total_amount
FROM orders
ORDER BY customer_id, order_date ASC;
 
-- This returns exactly ONE row per customer_id
-- The row kept is the FIRST one according to ORDER BY
-- So we get each customer's earliest order
 
-- Example: Get most recent login per user
SELECT DISTINCT ON (user_id)
    user_id,
    login_at,
    ip_address,
    device_type
FROM login_history
ORDER BY user_id, login_at DESC;
 
-- Returns most recent login for each user

ORDER BY Requirement
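
In PostgreSQL, the leading ORDER BY expressions must match the DISTINCT ON expressions. Anything listed after them decides which row in each group comes first and is therefore kept. The sketch below uses a hypothetical orders_first_demo table with two orders on the same date, and adds order_id as a tiebreaker so the kept row is predictable.

```sql
-- PostgreSQL only; hypothetical sample data
CREATE TABLE orders_first_demo (
    order_id     INT PRIMARY KEY,
    customer_id  INT,
    order_date   DATE,
    total_amount NUMERIC(10, 2)
);
INSERT INTO orders_first_demo (order_id, customer_id, order_date, total_amount) VALUES
    (1, 100, '2024-01-05', 50.00),
    (2, 100, '2024-01-05', 75.00),  -- same date as order 1: a tie
    (3, 100, '2024-02-01', 20.00),
    (4, 101, '2024-03-10', 99.00);

-- ORDER BY starts with customer_id (matching DISTINCT ON);
-- order_date picks the earliest order, order_id breaks the tie
SELECT DISTINCT ON (customer_id)
    customer_id, order_id, order_date, total_amount
FROM orders_first_demo
ORDER BY customer_id, order_date ASC, order_id ASC;
-- 100 | 1 | 2024-01-05 | 50.00
-- 101 | 4 | 2024-03-10 | 99.00

-- Rejected by PostgreSQL because ORDER BY does not start with customer_id:
-- SELECT DISTINCT ON (customer_id) customer_id, order_id
-- FROM orders_first_demo
-- ORDER BY order_date;
```

Without the order_id tiebreaker, either order 1 or order 2 could be returned for customer 100.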

4.1 DISTINCT ON vs. Alternatives

DISTINCT ON solves the 'select additional columns' problem elegantly, but equivalent solutions exist in other databases:

distinct_on_alternatives.sql
-- PostgreSQL DISTINCT ON
SELECT DISTINCT ON (customer_id)
    customer_id, order_id, order_date
FROM orders
ORDER BY customer_id, order_date DESC;
 
-- Standard SQL equivalent using window function
SELECT customer_id, order_id, order_date
FROM (
    SELECT 
        customer_id, 
        order_id, 
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id 
            ORDER BY order_date DESC
        ) as rn
    FROM orders
) ranked
WHERE rn = 1;
 
-- Alternative using correlated subquery
SELECT o.customer_id, o.order_id, o.order_date
FROM orders o
WHERE o.order_date = (
    SELECT MAX(o2.order_date)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id
);
 
-- Note: Correlated subquery approach may return multiple rows
-- if there are ties; ROW_NUMBER guarantees exactly one row
-- (which tied row it picks is arbitrary unless ORDER BY adds a tiebreaker such as order_id)

DISTINCT ON Alternatives Comparison
Approach	Database Support	Handles Ties	Performance
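DISTINCT ON	PostgreSQL and some PostgreSQL-compatible systems	Always one row per group; which tied row is arbitrary unless ORDER BY includes a tiebreaker	Often very good, especially with a matching index
ROW_NUMBER() window	All modern databases	Always one row per group; which tied row is arbitrary unless ORDER BY includes a tiebreaker	Good (requires a sort or an ordered index scan)
Correlated subquery	All databases	May return multiple rows	Often poor (the subquery may run once per outer row)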
GROUP BY + JOIN	All databases	Depends on join condition	Varies (can be optimized)
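
The GROUP BY + JOIN approach in the last row of the table can be sketched as follows. The orders_latest_demo table and the latest alias are hypothetical names chosen for illustration. The sample data includes a tie to show why this form can return more than one row per customer.

```sql
-- Hypothetical sample data; orders 2 and 3 share customer 100's latest date
CREATE TABLE orders_latest_demo (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    order_date  DATE
);
INSERT INTO orders_latest_demo (order_id, customer_id, order_date) VALUES
    (1, 100, '2024-01-05'),
    (2, 100, '2024-02-01'),
    (3, 100, '2024-02-01'),
    (4, 101, '2024-03-10');

-- Find each customer's latest date, then join back to fetch the full rows
SELECT o.customer_id, o.order_id, o.order_date
FROM orders_latest_demo o
JOIN (
    SELECT customer_id, MAX(order_date) AS latest_order_date
    FROM orders_latest_demo
    GROUP BY customer_id
) latest
  ON latest.customer_id = o.customer_id
 AND latest.latest_order_date = o.order_date
ORDER BY o.customer_id, o.order_id;
-- 100 | 2 | 2024-02-01
-- 100 | 3 | 2024-02-01
-- 101 | 4 | 2024-03-10
```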

DISTINCT Interaction with Other SQL Clauses

DISTINCT interacts with other SQL clauses in specific ways. Understanding these interactions is essential for writing correct queries.

5.1 DISTINCT and ORDER BY

distinct_order_by.sql
-- VALID: ORDER BY column is in SELECT list
SELECT DISTINCT city, state
FROM customers
ORDER BY city;
 
-- VALID: Multiple ORDER BY columns, all in SELECT
SELECT DISTINCT city, state
FROM customers
ORDER BY state, city;
 
-- INVALID in most databases (PostgreSQL, SQL Server, and Oracle all reject it): ORDER BY column not in SELECT
SELECT DISTINCT city
FROM customers
ORDER BY state;
-- Error: ORDER BY items must appear in the select list if DISTINCT is specified
 
-- Why? After DISTINCT eliminates rows, the database might have:
-- city
-- ----
-- Boston
-- New York
-- For 'Boston', there might have been original rows from MA and CT.
-- Which 'state' value should be used for sorting?
 
-- Solution: Include the ORDER BY column in SELECT
-- (the result is then unique (city, state) pairs, so a city can appear more than once)
SELECT DISTINCT city, state
FROM customers
ORDER BY state;
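
If you need one row per city and still want to sort by state, you have to say which state represents each city. One way is to switch from DISTINCT to GROUP BY and sort by an aggregate. This is a minimal sketch: customers_sort_demo is a hypothetical table, and using MIN(state) as the representative value is an arbitrary choice for illustration.

```sql
-- Hypothetical sample data: Portland appears in two states
CREATE TABLE customers_sort_demo (city VARCHAR(20), state CHAR(2));
INSERT INTO customers_sort_demo (city, state) VALUES
    ('Portland', 'OR'),
    ('Portland', 'ME'),
    ('Austin', 'TX'),
    ('Austin', 'TX'),
    ('Boston', 'MA');

-- One row per city, sorted by the alphabetically first state of each city
SELECT city
FROM customers_sort_demo
GROUP BY city
ORDER BY MIN(state), city;
-- Boston    (MA)
-- Portland  (ME)
-- Austin    (TX)
```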

5.2 DISTINCT and WHERE

WHERE filtering occurs before DISTINCT deduplication. This means you filter rows first, then eliminate duplicates from the filtered set.

distinct_where.sql
-- Logical execution order: FROM -> WHERE -> SELECT -> DISTINCT
SELECT DISTINCT department
FROM employees
WHERE salary > 50000;
 
-- Step 1: Read all employees
-- Step 2: Filter to those with salary > 50000
-- Step 3: Project to department column
-- Step 4: Eliminate duplicate departments
 
-- This gives unique departments that have at least one
-- employee earning over $50,000

5.3 DISTINCT and Aggregates

DISTINCT can be used inside aggregate functions to operate only on unique values. This is different from SELECT DISTINCT.

distinct_aggregate.sql
-- COUNT(column) counts non-NULL values (including duplicates)
SELECT COUNT(department) FROM employees;
-- Result: 1000 (if 1000 employees have a department)
 
-- COUNT(DISTINCT column) counts unique non-NULL values
SELECT COUNT(DISTINCT department) FROM employees;
-- Result: 10 (if there are 10 unique departments)
 
-- This works with other aggregates too
SELECT AVG(salary) FROM employees;           -- Average of all salaries
SELECT AVG(DISTINCT salary) FROM employees;  -- Average of unique salary values
-- If 5 people earn $50K and 3 earn $60K:
-- AVG(salary) = (5*50000 + 3*60000) / 8 = $53,750
-- AVG(DISTINCT salary) = (50000 + 60000) / 2 = $55,000
 
-- SUM(DISTINCT column) sums unique values
SELECT SUM(order_total) FROM orders;           -- All order totals
SELECT SUM(DISTINCT order_total) FROM orders;  -- Each unique total once
 
-- String aggregation with DISTINCT (PostgreSQL example)
SELECT STRING_AGG(DISTINCT department, ', ' ORDER BY department)
FROM employees;
-- Returns: "Engineering, Marketing, Sales" (not "Sales, Sales, Sales, Marketing...")

DISTINCT in Aggregates
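
An aggregate with DISTINCT is evaluated separately for each group, so it combines naturally with GROUP BY. The sketch below uses a hypothetical employees_agg_demo table to contrast COUNT(*), COUNT(column), and COUNT(DISTINCT column) within each department.

```sql
-- Hypothetical sample data; one Engineering salary is unknown (NULL)
CREATE TABLE employees_agg_demo (
    employee_id INT PRIMARY KEY,
    department  VARCHAR(20),
    salary      INT
);
INSERT INTO employees_agg_demo (employee_id, department, salary) VALUES
    (1, 'Sales', 50000),
    (2, 'Sales', 50000),
    (3, 'Sales', 60000),
    (4, 'Engineering', 80000),
    (5, 'Engineering', NULL);

SELECT
    department,
    COUNT(*)               AS employees,          -- all rows in the group
    COUNT(salary)          AS salaries_recorded,  -- non-NULL salaries
    COUNT(DISTINCT salary) AS distinct_salaries   -- unique non-NULL salaries
FROM employees_agg_demo
GROUP BY department
ORDER BY department;
-- Engineering | 2 | 1 | 1
-- Sales       | 3 | 3 | 2
```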

5.4 DISTINCT and GROUP BY

Using both DISTINCT and GROUP BY is usually redundant. GROUP BY already collapses rows by the grouped columns, so DISTINCT on those same columns has nothing left to remove and may only add overhead.

distinct_group_by.sql
-- REDUNDANT: DISTINCT with same columns as GROUP BY
SELECT DISTINCT department, COUNT(*) as emp_count
FROM employees
GROUP BY department;
 
-- EQUIVALENT and more efficient (no DISTINCT):
SELECT department, COUNT(*) as emp_count
FROM employees
GROUP BY department;
 
-- GROUP BY already produces one row per department
-- DISTINCT has nothing to deduplicate
 
-- However, DISTINCT can be meaningful with certain GROUP BY patterns:
SELECT DISTINCT dept_category
FROM (
    SELECT 
        department,
        CASE WHEN COUNT(*) > 100 THEN 'Large' ELSE 'Small' END as dept_category
    FROM employees
    GROUP BY department
) dept_sizes;
-- Multiple departments might be 'Large', DISTINCT collapses them
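
The overlap between the two clauses is easiest to see when no aggregate is involved: a plain GROUP BY over the selected columns returns the same rows as SELECT DISTINCT. A minimal sketch, using a hypothetical employees_grp_demo table:

```sql
-- Hypothetical sample data, including two rows with no department
CREATE TABLE employees_grp_demo (
    employee_id INT PRIMARY KEY,
    department  VARCHAR(20)
);
INSERT INTO employees_grp_demo (employee_id, department) VALUES
    (1, 'Sales'),
    (2, 'Sales'),
    (3, 'Engineering'),
    (4, NULL),
    (5, NULL);

-- Both queries return the same 3 rows: Sales, Engineering, NULL
-- (row order is not guaranteed without ORDER BY)
SELECT DISTINCT department FROM employees_grp_demo;
SELECT department FROM employees_grp_demo GROUP BY department;
```

Both forms also place the NULL departments into a single row, since GROUP BY follows the same NULL-matching rule as DISTINCT.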

DISTINCT Performance Optimization

DISTINCT can be expensive on large result sets. Understanding how to optimize it, or avoid it entirely, is essential for high-performance SQL.

6.1 Understanding the Cost

DISTINCT requires the database to:

1. Process all matching rows from the base query
2. Sort the rows or build a hash table over them so that duplicates end up together
3. Output only unique rows

The cost scales with both the number of input rows and the size of each row (number and width of columns).

DISTINCT Performance Factors
Factor	Impact	Mitigation
Total input rows	More rows = more comparisons	Add WHERE filters before DISTINCT
Number of columns	More columns = more comparison work	Select only needed columns
Column data types	Large strings/BLOBs are slow to hash	Avoid DISTINCT on TEXT/BLOB
Uniqueness ratio	Few duplicates = DISTINCT is wasted	Verify duplicates actually exist
Available memory	Hash tables and sorts spill to disk if too large	Give them enough memory (work_mem in PostgreSQL)
Indexes	Index-only scans can provide sorted input	Create indexes on DISTINCT columns
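
The 'Uniqueness ratio' row suggests checking whether duplicates exist before paying for DISTINCT. One way to do that is to compare the total row count with the count of distinct rows. This is a sketch under assumptions: employees_ratio_demo is a hypothetical table, and the SELECT without a FROM clause works in PostgreSQL, MySQL, SQLite, and SQL Server but may need a dummy table in other engines.

```sql
-- Hypothetical sample data with one duplicated (department, city) pair
CREATE TABLE employees_ratio_demo (department VARCHAR(20), city VARCHAR(20));
INSERT INTO employees_ratio_demo (department, city) VALUES
    ('Sales', 'Boston'),
    ('Sales', 'Boston'),
    ('Sales', 'Denver'),
    ('Engineering', 'Boston');

SELECT
    (SELECT COUNT(*) FROM employees_ratio_demo) AS total_rows,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT department, city
           FROM employees_ratio_demo) AS d)     AS distinct_rows;
-- total_rows = 4, distinct_rows = 3
```

If the two numbers are equal, DISTINCT removes nothing for that column list and can likely be dropped.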

6.2 Query Plan Analysis

Use EXPLAIN to understand how the database implements DISTINCT:

distinct_explain.sql
-- PostgreSQL EXPLAIN example
EXPLAIN (ANALYZE, BUFFERS) 
SELECT DISTINCT department FROM employees;
 
-- Possible plans:
 
-- 1. HashAggregate: Uses hash table for deduplication
--    HashAggregate  (rows=10) (actual rows=10)
--      -> Seq Scan on employees (rows=10000)
--    Good when: many duplicates, moderate unique values
 
-- 2. Sort + Unique: Sorts then removes adjacent duplicates
--    Unique  (rows=10)
--      -> Sort (rows=10000)
--        -> Seq Scan on employees
--    Good when: ORDER BY also needed, index available
 
-- 3. Index Only Scan (if covering index exists)
--    Unique
--      -> Index Only Scan using idx_department
--    Best performance: no table access needed
 
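-- Steer PostgreSQL toward a specific strategy (for testing only)
SET enable_hashagg = off;  -- Discourages hash aggregation, favoring sort-based DISTINCT
SET enable_sort = off;     -- Discourages explicit sorts, favoring hash-based DISTINCT
-- These settings penalize a plan type rather than forbid it; RESET them after testing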

6.3 Index-Based Optimization

Indexes can dramatically speed up DISTINCT by providing pre-sorted or grouped access paths.

distinct_index.sql
-- Without index: Full table scan + sort/hash
SELECT DISTINCT department FROM employees;
-- Reads all 10,000 rows, then deduplicates
 
-- With index on department: Index-only scan possible
CREATE INDEX idx_employees_department ON employees(department);
SELECT DISTINCT department FROM employees;
-- Reads only index entries (much smaller), already sorted/grouped
 
-- For multi-column DISTINCT, index must cover all columns
CREATE INDEX idx_employees_dept_city 
ON employees(department, city);
 
SELECT DISTINCT department, city FROM employees;
-- Can use index-only scan for fast deduplication
 
-- Index column order matters for some query patterns
-- (department, city) index helps: DISTINCT department, city
-- (department, city) index helps: DISTINCT department
-- (department, city) index gives no city-ordered input for: DISTINCT city
--   (the index can still be scanned, but a separate hash or sort step is needed)
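
Even with an index, a database may read every index entry to find a handful of distinct values. When a column has very few distinct values in a very large table, a recursive query can instead hop from one value to the next, a technique often called a loose index scan. The following is a PostgreSQL sketch of that idea, not a general recommendation. The table, index, and CTE names are hypothetical, and the query skips NULL departments.

```sql
-- PostgreSQL sketch; employees_scan_demo and its index are hypothetical
CREATE TABLE employees_scan_demo (
    employee_id INT PRIMARY KEY,
    department  VARCHAR(20)
);
INSERT INTO employees_scan_demo (employee_id, department) VALUES
    (1, 'Sales'), (2, 'Sales'), (3, 'Engineering'), (4, 'Marketing');
CREATE INDEX idx_scan_demo_department ON employees_scan_demo (department);

WITH RECURSIVE distinct_departments AS (
    -- Anchor: the smallest department (parentheses are required around ORDER BY/LIMIT)
    (SELECT department FROM employees_scan_demo ORDER BY department LIMIT 1)
    UNION ALL
    -- Step: look up the next department greater than the previous one
    SELECT (SELECT e.department
            FROM employees_scan_demo e
            WHERE e.department > d.department
            ORDER BY e.department
            LIMIT 1)
    FROM distinct_departments d
    WHERE d.department IS NOT NULL  -- stop once no larger value exists
)
SELECT department
FROM distinct_departments
WHERE department IS NOT NULL
ORDER BY department;
-- Engineering
-- Marketing
-- Sales
```

Each step is a single index lookup, so the work grows with the number of distinct values rather than the number of rows. On small tables the plain SELECT DISTINCT is simpler and just as fast, so measure before adopting this form.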

6.4 Alternative Approaches for Better Performance

Sometimes restructuring the query eliminates the need for DISTINCT:

distinct_alternatives.sql
-- SLOW: DISTINCT after join multiplication
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Joins potentially millions of rows, then deduplicates
 
-- FASTER: EXISTS avoids row multiplication
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);
-- No join multiplication, no DISTINCT needed
 
-- FASTER: Semi-join with IN (often equivalent to EXISTS)
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE c.customer_id IN (
    SELECT customer_id 
    FROM orders 
    WHERE order_date >= '2024-01-01'
);
-- IN only tests membership, so the subquery needs no DISTINCT
 
-- ALTERNATIVE: Use GROUP BY for aggregated information
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_id, c.customer_name;
-- Gets unique customers PLUS order counts (more useful)

Rethink the Query

Common DISTINCT Mistakes and Pitfalls

Even experienced developers make DISTINCT-related errors. Understanding these common pitfalls helps you avoid them.

Common DISTINCT Antipatterns

• Using DISTINCT as a band-aid: Adding DISTINCT to 'fix' duplicate results without understanding why duplicates occur. This masks data model or query logic issues and may hide real bugs.
• DISTINCT on unique columns: Using DISTINCT when selecting a single table's primary key columns is wasteful. The rows are already unique by definition.
• Expecting column-level deduplication: Misunderstanding that DISTINCT operates on rows, not individual columns. SELECT DISTINCT a, b gives unique (a, b) pairs, not unique a's AND unique b's.
• DISTINCT with ORDER BY conflict: Trying to ORDER BY columns not in the SELECT list when using DISTINCT. Most databases reject this because it's ambiguous.
• Avoiding DISTINCT without measuring: Steering clear of DISTINCT because of assumed performance costs. DISTINCT on indexed columns with few unique values is often very fast.
• DISTINCT in subqueries unnecessarily: Adding DISTINCT in subqueries used for EXISTS checks. EXISTS already returns TRUE on first match; DISTINCT doesn't help.

distinct_antipatterns.sql
-- ANTIPATTERN: DISTINCT as band-aid
-- Developer notices duplicate customers, adds DISTINCT "to fix it"
SELECT DISTINCT customer_name, email FROM customers;
-- But wait—why are there duplicates in the customer table?
-- This might indicate: data quality issues, missing unique constraint,
-- or incorrect query joining somewhere else
 
-- BETTER: Investigate the actual cause
SELECT customer_name, email, COUNT(*)
FROM customers
GROUP BY customer_name, email
HAVING COUNT(*) > 1;
-- See the actual duplicates and decide how to handle them
 
-- ANTIPATTERN: DISTINCT on primary key (wasteful)
SELECT DISTINCT customer_id, customer_name, email
FROM customers;
-- customer_id is PK, so rows are already unique
-- DISTINCT adds overhead for no benefit
 
-- CORRECT: Just select without DISTINCT
SELECT customer_id, customer_name, email FROM customers;
 
-- ANTIPATTERN: DISTINCT in EXISTS subquery
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT DISTINCT o.order_id  -- DISTINCT is pointless here!
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
-- EXISTS returns TRUE on the first matching row, so duplicates
-- in the subquery can never change the outcome
 
-- CORRECT: Simple EXISTS
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1  -- Convention: SELECT 1 in EXISTS
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

DISTINCT as a Code Smell
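
When the investigation shows that the duplicates are a data problem rather than a query problem, the durable fix is usually a constraint on the table, not DISTINCT in every query. The sketch below assumes that email should identify a customer, which is a business rule you would need to confirm. The table and index names are hypothetical.

```sql
-- Hypothetical table; assumes each email should belong to one customer
CREATE TABLE customers_uq_demo (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(50),
    email         VARCHAR(100)
);
INSERT INTO customers_uq_demo (customer_id, customer_name, email) VALUES
    (1, 'Ann Lee', 'ann@example.com'),
    (2, 'Bob Ray', 'bob@example.com');

-- Enforce uniqueness at the source
CREATE UNIQUE INDEX uq_customers_uq_demo_email
    ON customers_uq_demo (email);

-- A second row with the same email is now rejected by the database:
-- INSERT INTO customers_uq_demo (customer_id, customer_name, email)
-- VALUES (3, 'Ann Lee', 'ann@example.com');
```

On an existing table, the index can only be created after the current duplicates have been cleaned up, which is where the GROUP BY ... HAVING COUNT(*) > 1 query above comes in.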

Real-World DISTINCT Usage Patterns

Let's examine practical scenarios where DISTINCT is the right tool, applied correctly.

distinct_patterns.sql
-- Pattern 1: Dropdown/Filter value lists
-- Get unique values for a UI filter component
SELECT DISTINCT category_name
FROM products
WHERE is_active = true
ORDER BY category_name;
-- Returns list of categories for a dropdown menu
 
-- Pattern 2: Distinct customers who purchased
SELECT DISTINCT c.customer_id, c.customer_name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';
-- Each customer who ordered on or after 2024-01-01, listed once
-- (Alternative: use EXISTS for potentially better performance)
 
-- Pattern 3: Tag cloud / unique labels
SELECT DISTINCT tag
FROM article_tags
WHERE article_id IN (SELECT id FROM articles WHERE published = true);
-- All unique tags used on published articles
 
-- Pattern 4: Data quality check - find unique value counts
SELECT 
    COUNT(*) as total_rows,
    COUNT(DISTINCT email) as unique_emails,
    COUNT(DISTINCT phone) as unique_phones,
    -- COUNT(email) skips NULLs, so missing emails are not miscounted as duplicates
    COUNT(email) - COUNT(DISTINCT email) as duplicate_email_count
FROM customers;
-- Reveals potential data quality issues
 
-- Pattern 5: Geographic coverage analysis
SELECT DISTINCT country, state, city
FROM customer_addresses
ORDER BY country, state, city;
-- All locations where we have customers
 
-- Pattern 6: Product availability by region
SELECT DISTINCT 
    p.product_name,
    w.region
FROM products p
JOIN inventory i ON p.product_id = i.product_id
JOIN warehouses w ON i.warehouse_id = w.warehouse_id
WHERE i.quantity > 0;
-- Products available in each region (once per region, not per warehouse)

Summary: Mastering DISTINCT

The DISTINCT clause is a fundamental SQL tool for eliminating duplicate rows. Let's consolidate the key insights from this comprehensive exploration:

Key Takeaways

• DISTINCT eliminates duplicate rows: It works on entire rows, not individual columns. Two rows are duplicates only if ALL selected columns match.
• NULLs are treated as equal: For DISTINCT purposes, two NULL values in the same column are considered matching, unlike regular SQL comparison rules.
• DISTINCT operates on selected columns only: Adding columns to your SELECT DISTINCT can produce more 'unique' rows because there are more ways for rows to differ.
• DISTINCT ON (PostgreSQL-specific): Allows specifying which columns determine uniqueness while selecting additional columns. Use with ORDER BY to control which row is kept.
• ORDER BY columns must be in SELECT: When using DISTINCT with ORDER BY, the ORDER BY columns must appear in the SELECT list (in most databases).
• DISTINCT inside aggregates works differently: COUNT(DISTINCT column) counts unique values; this is different from SELECT DISTINCT which eliminates duplicate rows.
• Performance requires attention: DISTINCT has computational cost; use indexes, minimize column count, and consider alternatives like EXISTS or GROUP BY when appropriate.
• DISTINCT as a symptom: If you're adding DISTINCT to 'fix' unexpected duplicates, investigate the root cause. It may indicate data or query design issues.

