Loading learning content...
The DISTINCT keyword is SQL's primary mechanism for transforming multiset (bag) results into true set results—eliminating duplicate rows so that each unique combination of values appears exactly once. While conceptually simple, DISTINCT has nuanced behavior that experienced SQL developers must master.
This page provides exhaustive coverage of DISTINCT: its placement in queries, how it determines row uniqueness, its interaction with other clauses, performance characteristics, and common patterns and pitfalls. By the end, you'll wield DISTINCT with precision and confidence.
By the end of this page, you will understand: (1) DISTINCT syntax and placement rules, (2) How DISTINCT determines row equality including NULL handling, (3) DISTINCT with single vs. multiple columns, (4) DISTINCT ON (PostgreSQL extension), (5) Interaction with ORDER BY and other clauses, (6) Performance optimization techniques, and (7) Common mistakes and how to avoid them.
DISTINCT is placed immediately after the SELECT keyword and before the column list. It affects the entire row—not individual columns. This distinction is crucial and commonly misunderstood.
123456789101112131415161718
-- Basic syntaxSELECT DISTINCT column1, column2, ...FROM table_name; -- DISTINCT eliminates rows where ALL listed columns match-- It does NOT operate on individual columns independently -- Example: Unique departmentsSELECT DISTINCT departmentFROM employees; -- Example: Unique (city, state) combinationsSELECT DISTINCT city, stateFROM customers; -- The above returns unique pairs, not unique cities AND unique states-- (New York, NY) and (New York, CA) are DIFFERENT rows-- (New York, NY) appearing twice becomes one rowDISTINCT does NOT mean 'make each column unique independently.' The query SELECT DISTINCT city, state does not give you unique cities AND unique states—it gives you unique (city, state) pairs. This is one of the most common misunderstandings about DISTINCT.
SQL actually has an explicit keyword for the default behavior: SELECT ALL. This explicitly requests multiset semantics (preserving duplicates). In practice, no one writes it because it's the default, but it exists for completeness.
12345678910
-- These are equivalent:SELECT department FROM employees;SELECT ALL department FROM employees; -- Both return all rows including duplicates-- SELECT ALL is valid SQL but never used in practice -- DISTINCT explicitly changes the default behaviorSELECT DISTINCT department FROM employees;-- Returns each unique department exactly onceDISTINCT must immediately follow SELECT. It cannot appear elsewhere in the select list or after column names.
12345678910111213141516
-- CORRECT placementSELECT DISTINCT department, city FROM employees; -- INVALID placements (syntax errors)SELECT department DISTINCT, city FROM employees; -- ErrorSELECT department, DISTINCT city FROM employees; -- ErrorDISTINCT SELECT department FROM employees; -- Error -- DISTINCT with expressionsSELECT DISTINCT UPPER(department) FROM employees;-- Returns unique uppercase department names -- DISTINCT with aliasesSELECT DISTINCT department AS dept, city AS locationFROM employees;-- Aliases don't affect DISTINCT comparison; they rename outputUnderstanding exactly how DISTINCT determines whether two rows are 'equal' (and thus one should be eliminated) is essential for predictable results. The rules involve data type comparison semantics and special handling for NULL values.
Two rows are considered duplicates if every selected column has equal values between them. If any column differs, the rows are distinct.
| Row | city | state | Kept? |
|---|---|---|---|
| 1 | New York | NY | Yes (first occurrence) |
| 2 | New York | NY | No (duplicate of row 1) |
| 3 | New York | CA | Yes (state differs) |
| 4 | Los Angeles | CA | Yes (city differs from row 3) |
| 5 | Los Angeles | CA | No (duplicate of row 4) |
DISTINCT treats NULL values as equal to each other for comparison purposes. This is a critical exception to the normal SQL rule that NULL is not equal to anything, including itself.
Normal SQL comparison: NULL = NULL evaluates to NULL (unknown), which is treated as false.
DISTINCT comparison: Two NULLs in the same column position are treated as matching.
This behavior aligns with the intuition that 'unknown' values should be grouped together, but it's technically inconsistent with standard NULL semantics.
1234567891011121314151617181920212223
-- Sample data-- employee_id | department | manager_id-- 1 | Sales | 100-- 2 | Sales | 100-- 3 | NULL | NULL-- 4 | NULL | NULL-- 5 | Sales | NULL SELECT DISTINCT department, manager_id FROM employees; -- Result:-- department | manager_id-- Sales | 100 (rows 1,2 collapsed)-- NULL | NULL (rows 3,4 collapsed - NULLs match!)-- Sales | NULL (row 5 - different from above) -- Compare with WHERE clause NULL behavior:SELECT * FROM employees WHERE department = NULL;-- Returns NOTHING (NULL = NULL is unknown, treated as false) -- But DISTINCT groups NULLs together:SELECT DISTINCT department FROM employees;-- Returns: Sales, NULL (one NULL row for all NULL departments)The SQL standard explicitly specifies that for DISTINCT, 'two values that are both null are considered to be not distinct.' This special-case handling ensures that DISTINCT produces intuitive results when NULL values are present, even though it contradicts the general NULL comparison rules.
DISTINCT comparison follows the rules of each column's data type, which can produce surprising results with certain types.
1234567891011121314151617181920212223
-- String comparison (case sensitivity depends on collation)SELECT DISTINCT name FROM users;-- With case-insensitive collation: 'John' and 'john' are duplicates-- With case-sensitive collation: 'John' and 'john' are distinct -- Trailing spaces (varies by database)-- In some databases: 'John' and 'John ' are equal (trailing spaces ignored)-- In others: they're different values -- Numeric precisionSELECT DISTINCT price FROM products;-- 10.00 and 10.000 are typically equal (same numeric value)-- But 10.00 and 10.001 are different -- Timestamp precisionSELECT DISTINCT created_at FROM orders;-- Timestamps compared to full precision-- '2024-01-15 10:30:00.000' vs '2024-01-15 10:30:00.001' are different -- DATE vs DATETIME comparison-- Comparing different temporal types may produce unexpected resultsSELECT DISTINCT DATE(order_timestamp), order_timestamp FROM orders;-- DATE(order_timestamp) groups by day, but order_timestamp is unique per millisecondWhen DISTINCT is applied to multiple columns, it eliminates rows where all specified columns match. This creates unique tuples or combinations, not unique individual values.
1234567891011121314151617181920212223242526
-- Sample orders table-- order_id | customer_id | product_id | quantity-- 1 | 100 | A | 2-- 2 | 100 | B | 1-- 3 | 100 | A | 3-- 4 | 101 | A | 2-- 5 | 101 | A | 2 -- DISTINCT on single columnSELECT DISTINCT customer_id FROM orders;-- Result: 100, 101 (2 rows) SELECT DISTINCT product_id FROM orders;-- Result: A, B (2 rows) -- DISTINCT on two columns: unique combinationsSELECT DISTINCT customer_id, product_id FROM orders;-- Result:-- 100, A (rows 1,3 collapsed)-- 100, B (row 2)-- 101, A (rows 4,5 collapsed)-- Total: 3 rows -- Note: Row 3 (customer 100, product A, qty 3) and -- row 1 (customer 100, product A, qty 2) are considered-- duplicates because only customer_id and product_id are comparedThe order of columns in the SELECT DISTINCT list doesn't change which rows are considered duplicates—only the presentation order of columns in the output.
12345678
-- These queries return the same rows (different column order)SELECT DISTINCT customer_id, product_id FROM orders;SELECT DISTINCT product_id, customer_id FROM orders; -- Both return the same 3 unique combinations-- Only the column presentation order differs -- The pairs (100, A) and (A, 100) represent the same logical tupleEvery column in the SELECT list participates in DISTINCT comparison. You cannot make some columns distinct while preserving duplicates in others.
1234567891011121314
-- WRONG expectation:-- "I want unique customers with -- their first order date" SELECT DISTINCT customer_id, order_dateFROM orders; -- This gives unique (customer, date) -- PAIRS, not unique customers! -- If customer 100 ordered on Jan 1 -- and Jan 2, both rows appear123456789101112131415
-- CORRECT approach:-- Use aggregation to get one -- row per customer SELECT customer_id, MIN(order_date) as first_orderFROM ordersGROUP BY customer_id; -- This gives exactly one row per -- customer with their first order -- Or use DISTINCT ON in PostgreSQL-- (covered in section 4)A common mistake is adding columns to a DISTINCT query and expecting the uniqueness to remain on the original column. Each column you add potentially increases the number of 'unique' rows because there are more ways for rows to differ. If you need one row per X with additional columns, use GROUP BY with aggregates or window functions.
PostgreSQL extends standard SQL with DISTINCT ON (columns), which allows you to specify which columns determine uniqueness while still selecting additional columns. This is extraordinarily useful but not portable to other database systems.
1234567891011121314151617181920212223242526272829
-- PostgreSQL DISTINCT ON syntaxSELECT DISTINCT ON (expression1, expression2, ...) column1, column2, column3, ...FROM table_nameORDER BY expression1, expression2, ..., other_columns; -- Example: Get first order per customerSELECT DISTINCT ON (customer_id) customer_id, order_id, order_date, total_amountFROM ordersORDER BY customer_id, order_date ASC; -- This returns exactly ONE row per customer_id-- The row kept is the FIRST one according to ORDER BY-- So we get each customer's earliest order -- Example: Get most recent login per userSELECT DISTINCT ON (user_id) user_id, login_at, ip_address, device_typeFROM login_historyORDER BY user_id, login_at DESC; -- Returns most recent login for each userDISTINCT ON requires ORDER BY, and the ORDER BY must start with the same expressions as DISTINCT ON (in the same order). The database uses this ordering to determine which row to keep from each group. Additional ORDER BY columns can follow to control which specific row is selected.
DISTINCT ON solves the 'select additional columns' problem elegantly, but equivalent solutions exist in other databases:
1234567891011121314151617181920212223242526272829303132
-- PostgreSQL DISTINCT ONSELECT DISTINCT ON (customer_id) customer_id, order_id, order_dateFROM ordersORDER BY customer_id, order_date DESC; -- Standard SQL equivalent using window functionSELECT customer_id, order_id, order_dateFROM ( SELECT customer_id, order_id, order_date, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) as rn FROM orders) rankedWHERE rn = 1; -- Alternative using correlated subquerySELECT o.customer_id, o.order_id, o.order_dateFROM orders oWHERE o.order_date = ( SELECT MAX(o2.order_date) FROM orders o2 WHERE o2.customer_id = o.customer_id); -- Note: Correlated subquery approach may return multiple rows-- if there are ties; ROW_NUMBER guarantees exactly one row| Approach | Database Support | Handles Ties | Performance |
|---|---|---|---|
| DISTINCT ON | PostgreSQL only | Deterministic with ORDER BY | Excellent (optimized) |
| ROW_NUMBER() window | All modern databases | Deterministic | Good (requires sort) |
| Correlated subquery | All databases | May return multiple rows | Often poor (N+1 queries) |
| GROUP BY + JOIN | All databases | Depends on join condition | Varies (can be optimized) |
DISTINCT interacts with other SQL clauses in specific ways. Understanding these interactions is essential for writing correct queries.
When using both DISTINCT and ORDER BY, the ORDER BY columns must be present in the SELECT list (in most databases). This is because DISTINCT is logically applied before ORDER BY, and you can only sort by columns that exist in the result.
12345678910111213141516171819202122232425262728
-- VALID: ORDER BY column is in SELECT listSELECT DISTINCT city, stateFROM customersORDER BY city; -- VALID: Multiple ORDER BY columns, all in SELECTSELECT DISTINCT city, stateFROM customersORDER BY state, city; -- INVALID in most databases: ORDER BY column not in SELECTSELECT DISTINCT cityFROM customersORDER BY state;-- Error: ORDER BY items must appear in the select list if DISTINCT is specified -- Why? After DISTINCT eliminates rows, the database might have:-- city-- ------ Boston-- New York-- For 'Boston', there might have been original rows from MA and CT.-- Which 'state' value should be used for sorting? -- Solution: Include the ORDER BY column in SELECTSELECT DISTINCT city, stateFROM customersORDER BY state;WHERE filtering occurs before DISTINCT deduplication. This means you filter rows first, then eliminate duplicates from the filtered set.
123456789101112
-- Execution order: FROM -> WHERE -> SELECT (with DISTINCT)SELECT DISTINCT departmentFROM employeesWHERE salary > 50000; -- Step 1: Read all employees-- Step 2: Filter to those with salary > 50000-- Step 3: Project to department column-- Step 4: Eliminate duplicate departments -- This gives unique departments that have at least one-- employee earning over $50,000DISTINCT can be used inside aggregate functions to operate only on unique values. This is different from SELECT DISTINCT.
1234567891011121314151617181920212223
-- COUNT(column) counts non-NULL values (including duplicates)SELECT COUNT(department) FROM employees;-- Result: 1000 (if 1000 employees have a department) -- COUNT(DISTINCT column) counts unique non-NULL valuesSELECT COUNT(DISTINCT department) FROM employees;-- Result: 10 (if there are 10 unique departments) -- This works with other aggregates tooSELECT AVG(salary) FROM employees; -- Average of all salariesSELECT AVG(DISTINCT salary) FROM employees; -- Average of unique salary values-- If 5 people earn $50K and 3 earn $60K:-- AVG(salary) = (5*50000 + 3*60000) / 8 = $53,750-- AVG(DISTINCT salary) = (50000 + 60000) / 2 = $55,000 -- SUM(DISTINCT column) sums unique valuesSELECT SUM(order_total) FROM orders; -- All order totalsSELECT SUM(DISTINCT order_total) FROM orders; -- Each unique total once -- String aggregation with DISTINCT (PostgreSQL example)SELECT STRING_AGG(DISTINCT department, ', ' ORDER BY department)FROM employees;-- Returns: "Engineering, Marketing, Sales" (not "Sales, Sales, Sales, Marketing...")DISTINCT inside aggregates is different from SELECT DISTINCT. SELECT DISTINCT COUNT(*) is meaningless (there's only one count value to deduplicate). SELECT COUNT(DISTINCT column) counts unique values—a very common and useful operation.
Using both DISTINCT and GROUP BY is usually redundant. GROUP BY already collapses rows by the grouped columns, and DISTINCT on those same columns adds no value (with computational overhead).
1234567891011121314151617181920212223
-- REDUNDANT: DISTINCT with same columns as GROUP BYSELECT DISTINCT department, COUNT(*) as emp_countFROM employeesGROUP BY department; -- EQUIVALENT and more efficient (no DISTINCT):SELECT department, COUNT(*) as emp_countFROM employeesGROUP BY department; -- GROUP BY already produces one row per department-- DISTINCT has nothing to deduplicate -- However, DISTINCT can be meaningful with certain GROUP BY patterns:SELECT DISTINCT dept_categoryFROM ( SELECT department, CASE WHEN COUNT(*) > 100 THEN 'Large' ELSE 'Small' END as dept_category FROM employees GROUP BY department) dept_sizes;-- Multiple departments might be 'Large', DISTINCT collapses themDISTINCT operations require significant computational resources. Understanding how to optimize them—or avoid them entirely—is essential for high-performance SQL.
DISTINCT requires the database to:
The cost scales with both the number of input rows and the size of each row (number and width of columns).
| Factor | Impact | Mitigation |
|---|---|---|
| Total input rows | More rows = more comparisons | Add WHERE filters before DISTINCT |
| Number of columns | More columns = more comparison work | Select only needed columns |
| Column data types | Large strings/BLOBs are slow to hash | Avoid DISTINCT on TEXT/BLOB |
| Uniqueness ratio | Few duplicates = DISTINCT is wasted | Verify duplicates actually exist |
| Available memory | Hash tables spill to disk if too large | Ensure work_mem is adequate |
| Indexes | Index-only scans can provide sorted input | Create indexes on DISTINCT columns |
Use EXPLAIN to understand how the database implements DISTINCT:
12345678910111213141516171819202122232425
-- PostgreSQL EXPLAIN exampleEXPLAIN (ANALYZE, BUFFERS) SELECT DISTINCT department FROM employees; -- Possible plans: -- 1. HashAggregate: Uses hash table for deduplication-- HashAggregate (rows=10) (actual rows=10)-- -> Seq Scan on employees (rows=10000)-- Good when: many duplicates, moderate unique values -- 2. Sort + Unique: Sorts then removes adjacent duplicates-- Unique (rows=10)-- -> Sort (rows=10000)-- -> Seq Scan on employees-- Good when: ORDER BY also needed, index available -- 3. Index Only Scan (if covering index exists)-- Unique-- -> Index Only Scan using idx_department-- Best performance: no table access needed -- Force PostgreSQL to use specific strategy (for testing)SET enable_hashagg = off; -- Force sort-based DISTINCTSET enable_sort = off; -- Force hash-based DISTINCTIndexes can dramatically speed up DISTINCT by providing pre-sorted or grouped access paths.
1234567891011121314151617181920
-- Without index: Full table scan + sort/hashSELECT DISTINCT department FROM employees;-- Reads all 10,000 rows, then deduplicates -- With index on department: Index-only scan possibleCREATE INDEX idx_employees_department ON employees(department);SELECT DISTINCT department FROM employees;-- Reads only index entries (much smaller), already sorted/grouped -- For multi-column DISTINCT, index must cover all columnsCREATE INDEX idx_employees_dept_city ON employees(department, city); SELECT DISTINCT department, city FROM employees;-- Can use index-only scan for fast deduplication -- Index column order matters for some query patterns-- (department, city) index helps: DISTINCT department, city-- (department, city) index helps: DISTINCT department-- (department, city) index does NOT help: DISTINCT citySometimes restructuring the query eliminates the need for DISTINCT:
12345678910111213141516171819202122232425262728293031323334
-- SLOW: DISTINCT after join multiplicationSELECT DISTINCT c.customer_id, c.customer_nameFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= '2024-01-01';-- Joins potentially millions of rows, then deduplicates -- FASTER: EXISTS avoids row multiplicationSELECT c.customer_id, c.customer_nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date >= '2024-01-01');-- No join multiplication, no DISTINCT needed -- FASTER: Semi-join with IN (often equivalent to EXISTS)SELECT c.customer_id, c.customer_nameFROM customers cWHERE c.customer_id IN ( SELECT DISTINCT customer_id FROM orders WHERE order_date >= '2024-01-01');-- DISTINCT on single integer column is faster -- ALTERNATIVE: Use GROUP BY for aggregated informationSELECT c.customer_id, c.customer_name, COUNT(o.order_id) as order_countFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= '2024-01-01'GROUP BY c.customer_id, c.customer_name;-- Gets unique customers PLUS order counts (more useful)If DISTINCT is slow, ask: 'Why are there duplicates?' Often the answer reveals a better query structure. EXISTS for filtering, GROUP BY for aggregation, and subqueries for isolation often outperform SELECT DISTINCT on joined multi-table queries.
Even experienced developers make DISTINCT-related errors. Understanding these common pitfalls helps you avoid them.
SELECT DISTINCT a, b gives unique (a,b) pairs, not unique a's AND unique b's.123456789101112131415161718192021222324252627282930313233343536373839404142
-- ANTIPATTERN: DISTINCT as band-aid-- Developer notices duplicate customers, adds DISTINCT "to fix it"SELECT DISTINCT customer_name, email FROM customers;-- But wait—why are there duplicates in the customer table?-- This might indicate: data quality issues, missing unique constraint,-- or incorrect query joining somewhere else -- BETTER: Investigate the actual causeSELECT customer_name, email, COUNT(*)FROM customersGROUP BY customer_name, emailHAVING COUNT(*) > 1;-- See the actual duplicates and decide how to handle them -- ANTIPATTERN: DISTINCT on primary key (wasteful)SELECT DISTINCT customer_id, customer_name, emailFROM customers;-- customer_id is PK, so rows are already unique-- DISTINCT adds overhead for no benefit -- CORRECT: Just select without DISTINCTSELECT customer_id, customer_name, email FROM customers; -- ANTIPATTERN: DISTINCT in EXISTS subquerySELECT c.customer_nameFROM customers cWHERE EXISTS ( SELECT DISTINCT o.order_id -- DISTINCT is pointless here! FROM orders o WHERE o.customer_id = c.customer_id);-- EXISTS returns TRUE on first match; it doesn't matter if-- there would be duplicates because only one row is checked -- CORRECT: Simple EXISTSSELECT c.customer_nameFROM customers cWHERE EXISTS ( SELECT 1 -- Convention: SELECT 1 in EXISTS FROM orders o WHERE o.customer_id = c.customer_id);If you find yourself adding DISTINCT 'because the query returns duplicates,' treat it as a code smell. Investigate why duplicates are occurring. The cause might be: incorrect joins, missing GROUP BY, data quality issues, or schema design problems. DISTINCT should be intentional, not a fix for surprising behavior.
Let's examine practical scenarios where DISTINCT is the right tool, applied correctly.
12345678910111213141516171819202122232425262728293031323334353637383940414243444546
-- Pattern 1: Dropdown/Filter value lists-- Get unique values for a UI filter componentSELECT DISTINCT category_nameFROM productsWHERE is_active = trueORDER BY category_name;-- Returns list of categories for a dropdown menu -- Pattern 2: Distinct customers who purchasedSELECT DISTINCT c.customer_id, c.customer_nameFROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= '2024-01-01';-- Each customer who ordered in 2024, listed once-- (Alternative: use EXISTS for potentially better performance) -- Pattern 3: Tag cloud / unique labelsSELECT DISTINCT tagFROM article_tagsWHERE article_id IN (SELECT id FROM articles WHERE published = true);-- All unique tags used on published articles -- Pattern 4: Data quality check - find unique value countsSELECT COUNT(*) as total_rows, COUNT(DISTINCT email) as unique_emails, COUNT(DISTINCT phone) as unique_phones, COUNT(*) - COUNT(DISTINCT email) as duplicate_email_countFROM customers;-- Reveals potential data quality issues -- Pattern 5: Geographic coverage analysisSELECT DISTINCT country, state, cityFROM customer_addressesORDER BY country, state, city;-- All locations where we have customers -- Pattern 6: Product availability by regionSELECT DISTINCT p.product_name, w.regionFROM products pJOIN inventory i ON p.product_id = i.product_idJOIN warehouses w ON i.warehouse_id = w.warehouse_idWHERE i.quantity > 0;-- Products available in each region (once per region, not per warehouse)The DISTINCT clause is a fundamental SQL tool for eliminating duplicate rows. Let's consolidate the key insights from this comprehensive exploration:
What's Next:
With DISTINCT mastered, we'll explore the complementary challenge: limiting the number of rows returned. The next page covers LIMIT, TOP, and FETCH—syntax variations across database systems for restricting result set size.
You now have comprehensive knowledge of the DISTINCT clause—its syntax, semantics, performance characteristics, and best practices. You can apply DISTINCT intentionally and effectively, avoiding common pitfalls that trap less experienced developers.