Database Management Systems: SQL Aggregation & Grouping

GROUP BY: Categorizing and Summarizing Data

Level: Intermediate

Duration: 60 mins

Topic: SQL Aggregation & Grouping


Common Errors

Learning from Mistakes: GROUP BY Pitfalls

GROUP BY queries are a frequent source of bugs in production systems. The syntax is simple enough that developers feel confident writing complex grouping logic—but subtle errors can produce wrong results, degraded performance, or unexpected behavior.

The most insidious bugs don't cause errors—they produce plausible-looking wrong answers.

This page catalogs the most common GROUP BY mistakes, explains their root causes, and teaches you to recognize and fix them. Whether reviewing your own code or auditing others', these patterns will sharpen your SQL debugging skills.

What You Will Learn

By the end of this page, you will recognize the most common GROUP BY errors, understand why each error occurs and its consequences, know how to fix each type of error, and develop habits that prevent these mistakes in future code.

Error 1: Non-Aggregated Columns Missing from GROUP BY

The most common GROUP BY error: Including a column in SELECT that's neither in GROUP BY nor wrapped in an aggregate function.

-- ERROR: column 'product_category' must appear in GROUP BY clause
SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY salesperson;

Why it happens: Developers forget that grouping changes the query's semantics. They're thinking row-by-row but the query now operates group-by-group.

Consequence: In strict databases, this causes an error (good). In permissive databases, it returns arbitrary values (dangerous).

Solutions for Missing Column Error
Solution	When to Use	Example
Add to GROUP BY	When you want that column as a grouping dimension	`GROUP BY salesperson, product_category`
Wrap in aggregate	When you want a summary of that column	`MAX(product_category)`, `COUNT(DISTINCT product_category)`
Use ANY_VALUE()	When any value from the group is acceptable (MySQL 5.7+, PostgreSQL 16+)	`ANY_VALUE(product_category)`
Remove from SELECT	When you don't actually need that column	Simply delete it from SELECT

fix-missing-column.sql
-- Original error
SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY salesperson;  -- product_category missing!
 
-- Fix Option 1: Add to GROUP BY (more granular results)
SELECT salesperson, product_category, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category;
-- Now: one row per (salesperson, category) combination
 
-- Fix Option 2: Aggregate the column
SELECT 
    salesperson, 
    COUNT(DISTINCT product_category) AS categories_sold,
    SUM(amount) AS total
FROM sales
GROUP BY salesperson;
-- Now: one row per salesperson, showing how many categories
 
-- Fix Option 3: Remove if not needed
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson;
-- Simplest if you don't need product_category at all

Silent Failures Are Dangerous

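In MySQL with ONLY_FULL_GROUP_BY disabled, or in SQLite, this query runs without error but returns an arbitrary category value for each salesperson. You might not notice the bug until data analysis reveals nonsensical patterns. ONLY_FULL_GROUP_BY has been part of MySQL's default sql_mode since version 5.7.5; keep it enabled, and turn it on explicitly on older or reconfigured servers.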
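As a rough sketch, the mode can be inspected and switched on for a single MySQL connection as shown here. This is MySQL-specific, it changes only the current session, and it assumes the existing sql_mode value is not empty.

```sql
-- MySQL only: show the SQL modes active for this connection
SELECT @@SESSION.sql_mode;

-- Append ONLY_FULL_GROUP_BY for the current session
-- (assumes the existing sql_mode is non-empty)
SET SESSION sql_mode = CONCAT(@@SESSION.sql_mode, ',ONLY_FULL_GROUP_BY');

-- Confirm the change
SELECT @@SESSION.sql_mode;
```

With the mode active, a SELECT that lists a non-grouped, non-aggregated column is rejected instead of returning an arbitrary value.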

Error 2: WHERE vs. HAVING Confusion

Using WHERE with aggregates—or HAVING without reason.

-- ERROR: aggregate functions are not allowed in WHERE
SELECT salesperson, SUM(amount)
FROM sales
WHERE SUM(amount) > 1000    -- Can't use aggregate in WHERE!
GROUP BY salesperson;

Why it happens: Developers conflate 'filter rows' (WHERE) with 'filter groups' (HAVING). Both filter, but at different stages of query execution.

Clause	Filters	Uses Aggregates?	Executes
WHERE	Individual rows	❌ No	Before GROUP BY
HAVING	Entire groups	✓ Yes	After GROUP BY
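To make the row-versus-group distinction concrete, here is a small worked example. The sales_demo table and its four rows are hypothetical sample data, not part of the course schema.

```sql
-- Hypothetical sample data for illustration
DROP TABLE IF EXISTS sales_demo;
CREATE TABLE sales_demo (salesperson VARCHAR(20), amount INT);
INSERT INTO sales_demo (salesperson, amount) VALUES
    ('Alice', 60), ('Alice', 70), ('Bob', 150), ('Carol', 40);

-- WHERE tests each ROW: only Bob's single 150 sale passes the filter
SELECT salesperson, SUM(amount) AS total
FROM sales_demo
WHERE amount > 100
GROUP BY salesperson
ORDER BY salesperson;
-- Bob | 150

-- HAVING tests each GROUP's total: Alice (60 + 70 = 130) now qualifies too
SELECT salesperson, SUM(amount) AS total
FROM sales_demo
GROUP BY salesperson
HAVING SUM(amount) > 100
ORDER BY salesperson;
-- Alice | 130
-- Bob   | 150
```

The two queries answer different questions: "which people made a sale over 100" versus "which people sold over 100 in total".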

where-vs-having.sql
-- WRONG: Aggregate in WHERE
SELECT salesperson, SUM(amount) AS total
FROM sales
WHERE SUM(amount) > 1000
GROUP BY salesperson;
-- ERROR: aggregate functions not allowed in WHERE
 
-- CORRECT: Use HAVING for aggregate conditions
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
HAVING SUM(amount) > 1000;  -- Filters AFTER aggregation
-- Returns only salespeople with total sales > 1000
 
-- Both WHERE and HAVING in same query
SELECT salesperson, SUM(amount) AS total
FROM sales
WHERE product_category = 'Electronics'  -- Filter rows FIRST
GROUP BY salesperson
HAVING SUM(amount) > 500;               -- Filter groups AFTER
-- Process: filter to Electronics → group by person → filter groups > $500
 
-- INEFFICIENT: Using HAVING when WHERE would work
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
HAVING salesperson != 'Bob';  -- Works but inefficient!
 
-- BETTER: Use WHERE for non-aggregate conditions
SELECT salesperson, SUM(amount) AS total
FROM sales
WHERE salesperson != 'Bob'    -- Filter rows before grouping
GROUP BY salesperson;
-- Fewer rows to group = better performance

The Performance Rule

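Use WHERE to filter rows before grouping whenever the condition does not involve an aggregate. Use HAVING only when you must filter on aggregate values. Filtering early means fewer rows reach the grouping step. Some optimizers push non-aggregate HAVING conditions down automatically, but writing them in WHERE states the intent explicitly and does not depend on that behavior.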

Error 3: JOIN Row Multiplication

The silent killer: JOINs that multiply rows before aggregation, causing inflated sums and counts.

-- Orders table: each order has one total
-- Order_items table: each order has multiple items

SELECT customer_id, SUM(order_total) AS total_spent
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY customer_id;
-- BUG: order_total is summed once PER ITEM, not once per order!

Why it happens: The JOIN creates multiple rows per order (one per item), so order_total appears multiple times before grouping. SUM then adds the same total repeatedly.

Example of Row Multiplication
order_id	order_total	item_id	item_name
101	$100	1	Widget A
101	$100	2	Widget B
101	$100	3	Widget C
102	$50	4	Gadget X

After the JOIN, order 101's total of $100 appears 3 times (once per item). SUM(order_total) wrongly computes $100+$100+$100+$50 = $350 instead of the correct $100+$50 = $150.

fix-join-multiplication.sql
-- BUGGY: Order total counted once per item
SELECT customer_id, SUM(order_total) AS total_spent
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY customer_id;
-- Result: Inflated totals!
 
-- FIX Option 1: Aggregate at correct level first
SELECT 
    o.customer_id,
    SUM(o.order_total) AS total_spent,
    SUM(item_count) AS total_items
FROM orders o
JOIN (
    SELECT order_id, COUNT(*) AS item_count
    FROM order_items
    GROUP BY order_id
) oi ON o.order_id = oi.order_id
GROUP BY o.customer_id;
-- Pre-aggregate items, then join to orders
 
-- FIX Option 2 (fragile): Use DISTINCT inside aggregate
SELECT 
    customer_id, 
    SUM(DISTINCT order_total) AS total_spent,  -- Only correct if every order total is unique!
    COUNT(DISTINCT o.order_id) AS order_count  -- Safe: order_id is unique per order
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY customer_id;
-- WARNING: SUM(DISTINCT) raises no error but silently undercounts
-- when two of a customer's orders share the same total
 
-- FIX Option 3: Don't join if you don't need joined data
SELECT customer_id, SUM(order_total) AS total_spent
FROM orders
GROUP BY customer_id;
-- If you don't need item details, don't join to items!

Detection Technique

Compare aggregates before and after the JOIN. If SUM grows after joining, rows are being multiplied; if it shrinks, an inner join is dropping parent rows that have no match. Run: SELECT SUM(order_total) FROM orders; vs SELECT SUM(o.order_total) FROM orders o JOIN order_items oi ON ... . Different values = bug.
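The check can be tried end to end on the four-row example from the table above. This is a minimal sketch: the column types and the customer_id value of 1 are assumptions added so the script runs on its own.

```sql
-- Hypothetical setup reproducing the example rows (customer_id = 1 is assumed)
DROP TABLE IF EXISTS order_items;
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, order_total INT);
CREATE TABLE order_items (item_id INT PRIMARY KEY, order_id INT, item_name VARCHAR(20));
INSERT INTO orders (order_id, customer_id, order_total) VALUES
    (101, 1, 100), (102, 1, 50);
INSERT INTO order_items (item_id, order_id, item_name) VALUES
    (1, 101, 'Widget A'), (2, 101, 'Widget B'), (3, 101, 'Widget C'), (4, 102, 'Gadget X');

-- Baseline: total from the parent table alone
SELECT SUM(order_total) AS baseline FROM orders;
-- 150

-- Same aggregate after the JOIN: order 101 is counted three times
SELECT SUM(o.order_total) AS after_join
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- 350

-- Pinpoint which parent rows fan out, and by how much
SELECT o.order_id, COUNT(*) AS copies
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id
HAVING COUNT(*) > 1;
-- 101 | 3
```

The last query is useful on real data because it shows where the multiplication comes from, not just that it exists.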

Error 4: Wrong Grouping Granularity

Grouping at too high or too low a level for your analytical question.

Too low (too granular): Groups contain 1 row each; aggregates are trivial
Too high (too broad): Multiple categories collapse together; you lose detail

Why it happens: Developers add/remove grouping columns without fully considering the impact on result granularity.

granularity-errors.sql
-- Question: What is the average order value per customer?
 
-- TOO GRANULAR: Grouping by order_id gives one row per order
SELECT customer_id, order_id, AVG(order_total) AS avg
FROM orders
GROUP BY customer_id, order_id;
-- AVG of a single value = that value. Not useful!
-- Each customer appears multiple times.
 
-- CORRECT GRANULARITY: Group by customer only
SELECT customer_id, AVG(order_total) AS avg_order_value
FROM orders
GROUP BY customer_id;
-- Now: one row per customer, AVG across their orders.
 
-- TOO BROAD: Accidentally omitting needed dimensions
-- Question: Revenue by salesperson per month
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson;  -- Missing month!
-- This gives total across ALL time, not per month
 
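-- CORRECT: Include time dimension
-- (group by year AND month, otherwise January 2023 and January 2024 merge)
SELECT 
    salesperson, 
    EXTRACT(YEAR FROM sale_date) AS sale_year,
    EXTRACT(MONTH FROM sale_date) AS sale_month,
    SUM(amount) AS monthly_total
FROM sales
GROUP BY salesperson, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);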

Granularity Checklist

•What's the unit of analysis? (per customer, per day, per product...)
•How many result rows should you get? Estimate before running.
•Are aggregates meaningful? If AVG/SUM operates on 1 row, granularity is too fine.
•Are dimensions missing? If 'per month' is needed but month isn't in GROUP BY, granularity is too coarse.

Start From the Question

Before writing GROUP BY, state the question clearly: 'I want X per Y per Z.' X is your aggregate (SUM, COUNT), Y and Z are your GROUP BY columns. Mismatch between question and GROUP BY = wrong granularity.
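One way to act on the "estimate before running" item in the checklist is to count the expected groups separately and compare that with what the grouped query returns. The orders_demo table below is hypothetical sample data used only for this sketch.

```sql
-- Hypothetical sample data: 5 orders from 3 customers
DROP TABLE IF EXISTS orders_demo;
CREATE TABLE orders_demo (order_id INT PRIMARY KEY, customer_id INT, order_total INT);
INSERT INTO orders_demo (order_id, customer_id, order_total) VALUES
    (1, 10, 100), (2, 10, 50), (3, 20, 80), (4, 30, 20), (5, 30, 40);

-- Expected groups for "average order value PER CUSTOMER"
SELECT COUNT(DISTINCT customer_id) AS expected_groups
FROM orders_demo;
-- 3

-- Actual rows produced by the grouped query
SELECT COUNT(*) AS actual_groups
FROM (
    SELECT customer_id, AVG(order_total) AS avg_order_value
    FROM orders_demo
    GROUP BY customer_id
) g;
-- 3

-- Symptom of too-fine granularity: every group holds exactly one row
SELECT COUNT(*) AS single_row_groups
FROM (
    SELECT customer_id, order_id
    FROM orders_demo
    GROUP BY customer_id, order_id
    HAVING COUNT(*) = 1
) g;
-- 5 (as many groups as there are orders)
```

If the expected and actual counts disagree, the GROUP BY list does not match the stated question. Note that COUNT(DISTINCT ...) skips NULLs, so a NULL group would add one extra row to the grouped result.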

Error 5: Ignoring NULL Behavior

NULL values in grouping columns or aggregated columns can produce unexpected results.

NULLs in grouping columns: All NULLs form one group (which may or may not be meaningful)
NULLs in aggregated columns: Ignored by most aggregates (changing counts and averages)

null-errors.sql
-- Unexpected NULL group
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
-- If some rows have region = NULL:
-- Result includes a 'NULL' group, which might confuse users
 
-- Fix: Handle NULL explicitly
SELECT 
    COALESCE(region, 'Unknown') AS region, 
    SUM(amount) AS total
FROM sales
GROUP BY COALESCE(region, 'Unknown');
-- NULL becomes 'Unknown' - more readable
 
-- Or filter out NULLs if not wanted
SELECT region, SUM(amount) AS total
FROM sales
WHERE region IS NOT NULL
GROUP BY region;
 
-- NULL in aggregated columns: skewed averages
SELECT department, AVG(bonus) AS avg_bonus
FROM employees
GROUP BY department;
-- If 3 employees: bonus = 1000, 2000, NULL
-- AVG = (1000 + 2000) / 2 = 1500, NOT 1000 (NULL ignored, not counted)
 
-- If NULL should be 0:
SELECT department, AVG(COALESCE(bonus, 0)) AS avg_bonus
FROM employees
GROUP BY department;
-- Now NULL treated as 0: (1000 + 2000 + 0) / 3 = 1000

NULL Handling by Aggregate Function
Function	NULL Behavior	Example with [10, 20, NULL]
`COUNT(*)`	Counts all rows, including rows where the column is NULL	3
`COUNT(col)`	Counts non-NULL values only	2
`SUM(col)`	Ignores NULLs (returns NULL if every value is NULL)	30
`AVG(col)`	Ignores NULLs in both sum and count	15 (30/2, not 30/3)
`MIN/MAX(col)`	Ignores NULLs	10 / 20

AVG NULLs Can Distort Statistics

AVG ignores NULLs completely—they don't count toward the divisor. If NULL means 'zero bonus,' your average will be wrong unless you use COALESCE(bonus, 0). Always consider whether NULL should be treated as zero or excluded from analysis.
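The values in the table above can be reproduced with a three-row table. The null_demo table is a hypothetical stand-in for any nullable numeric column; the exact number formatting of AVG varies by database.

```sql
-- Hypothetical sample data: the values [10, 20, NULL]
DROP TABLE IF EXISTS null_demo;
CREATE TABLE null_demo (val INT);
INSERT INTO null_demo (val) VALUES (10), (20), (NULL);

SELECT
    COUNT(*)              AS count_rows,       -- 3: counts every row
    COUNT(val)            AS count_values,     -- 2: NULL is skipped
    SUM(val)              AS sum_val,          -- 30
    AVG(val)              AS avg_val,          -- 15 (30 / 2)
    MIN(val)              AS min_val,          -- 10
    MAX(val)              AS max_val,          -- 20
    AVG(COALESCE(val, 0)) AS avg_null_as_zero  -- 10 (30 / 3)
FROM null_demo;
```

Comparing COUNT(*) with COUNT(val) is a quick way to see how many NULLs an average is silently leaving out.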

Error 6: ORDER BY Misuse in Grouped Queries

Common ORDER BY mistakes in GROUP BY queries:

•Assuming results will be sorted by the GROUP BY columns when there is no ORDER BY (the order is not guaranteed)
•Ordering by a column that is neither in SELECT nor in GROUP BY (may fail in some databases)
•Confusing ORDER BY with GROUP BY (they do completely different things)

order-by-mistakes.sql
-- Mistake 1: Assuming order without ORDER BY
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson;
-- Result order is UNDEFINED. Might be alphabetical, might not be.
-- NEVER rely on implicit order!
 
-- Fix: Always specify ORDER BY if order matters
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
ORDER BY salesperson;  -- Explicit: alphabetical by name

-- OR, to rank by the aggregate instead:
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
ORDER BY total DESC;   -- Explicit: highest total first
 
-- Mistake 2: ORDER BY non-selected, non-grouped column
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
ORDER BY product_category;  -- This column isn't available!
-- May error or use arbitrary value depending on database
 
-- Fix: Order by what's in your result
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson
ORDER BY total DESC;  -- The aggregate (here via its alias) is valid in ORDER BY
 
-- Mistake 3: Thinking ORDER BY groups data
SELECT * FROM sales ORDER BY salesperson;
-- This SORTS rows, it doesn't GROUP them.
-- You still have individual rows, not aggregated summaries.

What ORDER BY can reference in grouped queries:

Expression	Can ORDER BY?	Notes
Grouping column	✓ Yes	ORDER BY salesperson
Aggregate result	✓ Yes	ORDER BY SUM(amount)
Alias from SELECT	✓ Yes	ORDER BY total (using the alias inside a larger expression is database-dependent)
Non-grouped column	❌ No	Not meaningful after grouping unless wrapped in an aggregate
Ordinal position	✓ Usually	ORDER BY 1, 2 (first, second SELECT column)

Best Practice: Explicit ORDER BY

Never assume GROUP BY implies any ordering. Always add an explicit ORDER BY clause if you need results in a specific order. This makes queries self-documenting and ensures consistent behavior across database versions.
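An explicit ORDER BY is only fully deterministic if ties are broken as well. The sketch below adds a second sort key for that purpose; the sales_order_demo table and its rows are hypothetical sample data.

```sql
-- Hypothetical sample data: Alice and Bob tie at 100
DROP TABLE IF EXISTS sales_order_demo;
CREATE TABLE sales_order_demo (salesperson VARCHAR(20), amount INT);
INSERT INTO sales_order_demo (salesperson, amount) VALUES
    ('Alice', 100), ('Bob', 60), ('Bob', 40), ('Carol', 30);

-- Sort by total, then by name so tied totals always come back in the same order
SELECT salesperson, SUM(amount) AS total
FROM sales_order_demo
GROUP BY salesperson
ORDER BY total DESC, salesperson ASC;
-- Alice | 100
-- Bob   | 100
-- Carol | 30
```

Without the second key, the database is free to return Alice and Bob in either order from one run to the next.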

Error 7: Performance Pitfalls

GROUP BY queries can become performance bottlenecks. Common issues include:

•Grouping by high-cardinality columns (millions of groups)
•Not filtering before grouping
•Using HAVING for non-aggregate conditions
•Complex expressions in GROUP BY that prevent index use

performance-issues.sql
-- SLOW: Grouping by high-cardinality column without filtering
SELECT user_id, COUNT(*) AS activity_count
FROM user_activity  -- 100 million rows
GROUP BY user_id;   -- 10 million unique users
-- Creates 10 million groups - very expensive!
 
-- FASTER: Filter first to reduce row count
-- (PostgreSQL interval syntax shown; in MySQL write CURRENT_DATE - INTERVAL 7 DAY)
SELECT user_id, COUNT(*) AS activity_count
FROM user_activity
WHERE activity_date >= CURRENT_DATE - INTERVAL '7 days'  -- Filter first!
GROUP BY user_id;
-- Now only grouping last week's data - much smaller
 
-- SLOW: Using HAVING for non-aggregate filter
SELECT department, SUM(salary) AS total
FROM employees
GROUP BY department
HAVING department != 'Inactive';  -- Filter AFTER grouping
 
-- FASTER: Use WHERE instead
SELECT department, SUM(salary) AS total
FROM employees
WHERE department != 'Inactive'    -- Filter BEFORE grouping
GROUP BY department;
-- Fewer rows to group = faster
 
-- SLOW: Expression in GROUP BY can prevent index use
-- (YEAR() is MySQL / SQL Server syntax; elsewhere use EXTRACT(YEAR FROM order_date))
SELECT YEAR(order_date) AS year, SUM(amount)
FROM orders
GROUP BY YEAR(order_date);  -- Typically can't use an index on order_date
 
-- FASTER (if you can restructure):
-- Option 1: Add a computed/generated column for year and index it
-- Option 2: If you only need one year, filter by date range instead of grouping
SELECT SUM(amount)
FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Can use index on order_date, but returns a single year's total
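Option 1 could look roughly like this in MySQL 5.7 or later. The table, column, and index names are hypothetical, the generated-column syntax differs in other databases, and whether the optimizer actually uses the index for grouping should be confirmed with EXPLAIN.

```sql
-- MySQL sketch: store the year as an indexed generated column
DROP TABLE IF EXISTS orders_by_year_demo;
CREATE TABLE orders_by_year_demo (
    order_id   INT PRIMARY KEY,
    order_date DATE NOT NULL,
    amount     DECIMAL(10, 2) NOT NULL,
    -- Computed from order_date on every insert/update and stored on disk
    order_year SMALLINT GENERATED ALWAYS AS (YEAR(order_date)) STORED,
    INDEX idx_order_year (order_year)
);

INSERT INTO orders_by_year_demo (order_id, order_date, amount) VALUES
    (1, '2023-05-01', 10.00),
    (2, '2024-01-15', 20.00),
    (3, '2024-07-04', 5.00);

-- Group by the plain indexed column instead of an expression
SELECT order_year, SUM(amount) AS yearly_total
FROM orders_by_year_demo
GROUP BY order_year
ORDER BY order_year;
-- 2023 | 10.00
-- 2024 | 25.00
```

Unlike the date-range query, this keeps one result row per year, so it answers the same question as the original GROUP BY YEAR(order_date).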

Performance Optimization Tips

•Filter early with WHERE — Reduce row count before grouping whenever possible.
•Use HAVING only for aggregate conditions — Non-aggregate filters belong in WHERE.
•Consider indexes — Indexes on GROUP BY columns can speed up grouping operations.
•Watch cardinality — Grouping by unique IDs creates as many groups as rows. Is that what you need?
•Profile queries — Use EXPLAIN to see how your database handles GROUP BY. Look for sorts, temp tables.
•Pre-aggregate for reporting — For repeated analytical queries, consider materialized views or summary tables.

The Temp Table Warning

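When GROUP BY produces more intermediate data than fits in memory, the database spills to disk (temporary tables or external sort files), which slows the query dramatically. In MySQL, 'Using temporary' or 'Using filesort' in the Extra column of EXPLAIN shows that a temporary table or an extra sort is needed; in PostgreSQL, EXPLAIN ANALYZE reports an 'external merge' sort method when a sort spills to disk. Consider adding WHERE filters or rethinking granularity.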
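A minimal way to inspect the plan is sketched below. The user_activity_demo table and index name are hypothetical, and the EXPLAIN output format differs by database, so no expected output is shown.

```sql
-- Hypothetical table mirroring the user_activity example
DROP TABLE IF EXISTS user_activity_demo;
CREATE TABLE user_activity_demo (
    user_id       INT NOT NULL,
    activity_date DATE NOT NULL
);

-- An index on the grouping column gives the planner a pre-sorted access path
CREATE INDEX idx_activity_user ON user_activity_demo (user_id);

-- Ask the database how it plans to execute the grouped query
EXPLAIN
SELECT user_id, COUNT(*) AS activity_count
FROM user_activity_demo
GROUP BY user_id;
```

On an empty demo table the plan is not representative; run the same EXPLAIN against production-sized data to see whether the grouping step sorts, hashes, or builds a temporary table.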

Summary: Avoiding GROUP BY Mistakes

Let's consolidate the common errors and their solutions:

Quick Reference: GROUP BY Error Fixes
Error	Symptom	Fix
Missing columns in GROUP BY	Error or arbitrary values	Add to GROUP BY or wrap in aggregate
WHERE vs HAVING confusion	Error using aggregate in WHERE	Use HAVING for aggregate conditions
JOIN multiplication	Inflated SUM/COUNT values	Aggregate before the JOIN, or drop the JOIN if its columns are not needed
Wrong granularity	Too many/few result rows	Adjust GROUP BY columns to match question
NULL mishandling	Unexpected NULL group or skewed AVG	Use COALESCE or filter NULLs
ORDER BY misuse	Unexpected order or errors	Explicit ORDER BY; use valid expressions
Performance issues	Slow queries	Filter early; use indexes; reduce cardinality

Prevention Habits

•Enable strict mode — Always use ONLY_FULL_GROUP_BY in MySQL. Errors are better than silent bugs.
•State the question before writing SQL — 'I want X per Y' prevents granularity errors.
•Estimate row count — Predict how many groups you expect. Validate after running.
•Check for NULLs — Look at grouping column nullability. Decide how to handle.
•Compare before/after JOINs — Verify aggregates don't change unexpectedly after adding JOINs.
•Filter WHERE, filter HAVING — Know the difference. One is for rows, one is for groups.

Module Complete:

You've now mastered the GROUP BY clause—from conceptual foundations through syntax, multiple columns, formal rules, and common errors. This knowledge enables you to:

•Write correct grouped queries for any analytical question
•Debug GROUP BY errors quickly and confidently
•Review others' code for grouping mistakes
•Optimize performance of aggregation queries

Next modules will build on this foundation with HAVING clause deep-dives, window functions, and advanced analytical SQL patterns.

Module Complete

Congratulations! You've completed Module 2: GROUP BY. You now have comprehensive knowledge of SQL data grouping—from foundational concepts to practical error prevention. Apply these skills to build robust analytical queries in any database environment.
