Group By - Learning Module

GROUP BY Clause

From Concept to Syntax: Writing GROUP BY Queries

In the previous page, we established the conceptual foundation: grouping partitions data by category and computes aggregates per partition. Now we translate that understanding into executable SQL using the GROUP BY clause.

The GROUP BY clause is deceptively simple in syntax but profoundly powerful in capability. A single clause added to a SELECT statement transforms how SQL processes the entire query—changing the result from individual rows to categorical summaries.

This page covers exactly how to write GROUP BY queries: the precise syntax, where it appears in the query structure, how it interacts with other clauses, and the execution mechanics that determine your results.

What You Will Learn

By the end of this page, you will be able to write syntactically correct GROUP BY queries, understand exactly where GROUP BY fits in query execution order, combine GROUP BY with aggregate functions effectively, and predict exactly how many rows your grouped query will return.

Basic GROUP BY Syntax

The GROUP BY clause appears after the FROM and WHERE clauses (if present) and specifies which column(s) define the grouping:

SELECT column1, aggregate_function(column2)
FROM table_name
WHERE condition           -- Optional: filter rows first
GROUP BY column1;         -- Group by this column

The fundamental rule: Every column in the SELECT list must either:

• Appear in the GROUP BY clause, OR
• Be wrapped in an aggregate function (COUNT, SUM, AVG, MIN, MAX, etc.)

Violating this rule causes an error in standards-compliant databases (though MySQL has historically been more permissive—a behavior that causes subtle bugs).

basic-group-by.sql
-- Sample data context:
-- Table: sales (transaction_id, salesperson, product_category, amount, sale_date)
 
-- Example 1: Count transactions per salesperson
SELECT salesperson, COUNT(*) AS transaction_count
FROM sales
GROUP BY salesperson;
 
-- Result:
-- +-------------+-------------------+
-- | salesperson | transaction_count |
-- +-------------+-------------------+
-- | Alice       | 3                 |
-- | Bob         | 3                 |
-- | Carol       | 2                 |
-- +-------------+-------------------+
 
-- Example 2: Total revenue per product category
SELECT product_category, SUM(amount) AS total_revenue
FROM sales
GROUP BY product_category;
 
-- Result:
-- +------------------+---------------+
-- | product_category | total_revenue |
-- +------------------+---------------+
-- | Electronics      | 2225          |
-- | Furniture        | 4050          |
-- | Office           | 180           |
-- +------------------+---------------+
 
-- Example 3: Average sale amount per salesperson
SELECT salesperson, 
       AVG(amount) AS avg_sale,
       MIN(amount) AS smallest_sale,
       MAX(amount) AS largest_sale
FROM sales
GROUP BY salesperson;

Anatomy of a GROUP BY query:

Component	Role	Example
SELECT (grouping column)	Identifies each group	`salesperson`
SELECT (aggregate)	Summarizes the group	`COUNT(*)`, `SUM(amount)`
FROM	Source table(s)	`sales`
GROUP BY	Specifies grouping	`GROUP BY salesperson`

The grouping column's value is the same for every row in the group (by definition), so it can appear directly in SELECT. The aggregate computes a single value from all rows in the group.

Reading a GROUP BY Query

Read GROUP BY queries as a sentence: 'Show me [aggregates] for each unique value of [grouping column(s)] from [table].' For example: 'Show me the count and sum of sales for each unique salesperson from the sales table.'
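
To try Examples 1 and 2 against real rows, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database (an assumption; any SQL database would do). The eight rows are reconstructed from the totals quoted in the examples, and the exact sale dates are assumed.

```python
import sqlite3

# Rows reconstructed from the totals shown in the examples; dates are assumed.
SALES = [
    ("Alice", "Electronics", 450, "2024-01-15"),
    ("Alice", "Electronics", 325, "2024-01-16"),
    ("Alice", "Furniture", 750, "2024-01-17"),
    ("Bob", "Furniture", 2100, "2024-01-16"),
    ("Bob", "Electronics", 560, "2024-01-17"),
    ("Bob", "Furniture", 1200, "2024-01-15"),
    ("Carol", "Electronics", 890, "2024-01-16"),
    ("Carol", "Office", 180, "2024-01-17"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (transaction_id INTEGER PRIMARY KEY, salesperson TEXT, "
    "product_category TEXT, amount INTEGER, sale_date TEXT)"
)
conn.executemany(
    "INSERT INTO sales (salesperson, product_category, amount, sale_date) "
    "VALUES (?, ?, ?, ?)",
    SALES,
)

# Example 1: one output row per salesperson.
per_person = conn.execute(
    "SELECT salesperson, COUNT(*) AS transaction_count "
    "FROM sales GROUP BY salesperson ORDER BY salesperson"
).fetchall()

# Example 2: one output row per product category.
per_category = conn.execute(
    "SELECT product_category, SUM(amount) AS total_revenue "
    "FROM sales GROUP BY product_category ORDER BY product_category"
).fetchall()

print(per_person)    # [('Alice', 3), ('Bob', 3), ('Carol', 2)]
print(per_category)  # [('Electronics', 2225), ('Furniture', 4050), ('Office', 180)]
```

The ORDER BY is added only to make the output order predictable; GROUP BY on its own does not guarantee any row order.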

Query Execution Order: Where GROUP BY Fits

Understanding execution order is essential for writing correct GROUP BY queries. SQL does not process clauses top-to-bottom as written; instead, it follows a specific logical sequence:

Logical Execution Order:

1. FROM      → Determine source table(s)
2. WHERE     → Filter individual rows
3. GROUP BY  → Partition remaining rows into groups
4. HAVING    → Filter groups (after aggregation)
5. SELECT    → Compute output expressions and aggregates
6. DISTINCT  → Remove duplicate rows (if specified)
7. ORDER BY  → Sort final results
8. LIMIT     → Restrict output row count

The critical insight: GROUP BY happens AFTER WHERE but BEFORE SELECT. This has important implications:

Execution Order Implications

• WHERE filters rows BEFORE grouping: Rows excluded by WHERE are never seen by GROUP BY. They cannot contribute to any group's aggregate.
• GROUP BY sees WHERE's results: If WHERE removes half the rows, GROUP BY partitions only the remaining half. Groups may have fewer rows (or not exist at all).
• SELECT aliases don't exist during GROUP BY: You cannot GROUP BY an alias defined in SELECT (in standard SQL). The alias is created after grouping.
• Aggregates are computed after grouping: WHERE runs before any groups exist, so it cannot reference aggregates. Use HAVING to filter on aggregate values.
• ORDER BY sees the grouped results: You can ORDER BY grouping columns or aggregates because those values exist at that point.
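
The WHERE-versus-HAVING point can be checked directly. The sketch below assumes SQLite (through Python's built-in sqlite3 module) as the database and uses a trimmed two-column version of the sales table. The threshold of 1200 is chosen purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Alice", 450), ("Alice", 325), ("Alice", 750),
        ("Bob", 2100), ("Bob", 560), ("Bob", 1200),
        ("Carol", 890), ("Carol", 180),
    ],
)

# WHERE runs before any groups exist, so an aggregate there is rejected.
try:
    conn.execute(
        "SELECT salesperson FROM sales "
        "WHERE SUM(amount) > 1200 GROUP BY salesperson"
    )
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# HAVING runs after grouping, so the same condition is valid there.
big_sellers = conn.execute(
    "SELECT salesperson, SUM(amount) AS total_sales FROM sales "
    "GROUP BY salesperson HAVING SUM(amount) > 1200 "
    "ORDER BY total_sales DESC"
).fetchall()

print(where_error)  # SQLite's error message for the misplaced aggregate
print(big_sellers)  # [('Bob', 3860), ('Alice', 1525)]
```

Carol's total of 1070 is below the threshold, so her group is dropped by HAVING even though none of her rows were removed by WHERE.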

execution-order-example.sql
-- Complete example showing execution order
 
SELECT salesperson, 
       SUM(amount) AS total_sales      -- Step 5: Compute aggregates
FROM sales                              -- Step 1: Source table
WHERE sale_date >= '2024-01-16'        -- Step 2: Filter rows first
GROUP BY salesperson                    -- Step 3: Group remaining rows
HAVING SUM(amount) > 500                -- Step 4: Filter groups by aggregate
ORDER BY total_sales DESC;              -- Step 7: Sort results
 
-- Execution trace with sample data:
-- Step 1: Start with 8 rows
-- Step 2: WHERE removes 2 rows (sale_date < '2024-01-16') → 6 rows remain
-- Step 3: GROUP BY partitions 6 rows:
--         Alice: 2 rows ($325 + $750 = $1075)
--         Bob: 2 rows ($2100 + $560 = $2660)
--         Carol: 2 rows ($890 + $180 = $1070)
-- Step 4: HAVING filters: all 3 groups have SUM > 500 → 3 groups remain
-- Step 5: SELECT computes output columns → 3 result rows
-- Step 7: ORDER BY sorts by total_sales descending
 
-- Final Result:
-- +-------------+-------------+
-- | salesperson | total_sales |
-- +-------------+-------------+
-- | Bob         | 2660        |
-- | Alice       | 1075        |
-- | Carol       | 1070        |
-- +-------------+-------------+

Common Mistake: Using Aliases in GROUP BY

This often fails: SELECT YEAR(sale_date) AS year, ... GROUP BY year. In strict SQL, aliases don't exist during GROUP BY. Solution: Repeat the expression: GROUP BY YEAR(sale_date). Some databases (MySQL, PostgreSQL) allow alias references as an extension, but this isn't portable.
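
Besides repeating the expression, another portable option is to name it once in a derived table and group by that name in the outer query. The sketch below assumes SQLite via Python's sqlite3 module, so strftime stands in for YEAR(), and it groups by day of month because the assumed sample dates all fall in one year. The dates and the sale_day name are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER, sale_date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Alice", 450, "2024-01-15"), ("Alice", 325, "2024-01-16"),
        ("Alice", 750, "2024-01-17"), ("Bob", 2100, "2024-01-16"),
        ("Bob", 560, "2024-01-17"), ("Bob", 1200, "2024-01-15"),
        ("Carol", 890, "2024-01-16"), ("Carol", 180, "2024-01-17"),
    ],
)

# Option 1: repeat the expression in GROUP BY (portable, but duplicated).
repeated = conn.execute(
    """
    SELECT strftime('%d', sale_date) AS sale_day, SUM(amount) AS total
    FROM sales
    GROUP BY strftime('%d', sale_date)
    ORDER BY 1
    """
).fetchall()

# Option 2: define the expression once in a derived table, then group by its name.
derived = conn.execute(
    """
    SELECT sale_day, SUM(amount) AS total
    FROM (SELECT strftime('%d', sale_date) AS sale_day, amount FROM sales) AS s
    GROUP BY sale_day
    ORDER BY sale_day
    """
).fetchall()

print(repeated)  # [('15', 1650), ('16', 3315), ('17', 1490)]
print(derived == repeated)  # True
```

In the derived-table form, sale_day is an ordinary column of the inner result by the time the outer GROUP BY runs, so no alias extension is needed.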

Combining GROUP BY with WHERE

WHERE and GROUP BY work together in a specific sequence: WHERE filters individual rows, then GROUP BY partitions what remains. This allows powerful analytical queries that answer questions like:

"What is the total revenue per salesperson for Electronics sales only?"
"How many orders per customer were placed in Q4?"
"What is the average rating per product for verified purchases?"

The key is that WHERE acts as a pre-filter—removing rows before any grouping or aggregation occurs.

where-then-group.sql
-- Analyze Electronics sales only
SELECT salesperson, 
       COUNT(*) AS electronics_count,
       SUM(amount) AS electronics_revenue
FROM sales
WHERE product_category = 'Electronics'   -- Pre-filter: only Electronics
GROUP BY salesperson;
 
-- Processing:
-- 1. FROM: All 8 rows
-- 2. WHERE: Keep only 'Electronics' → 4 rows remain
--    (Alice: $450, $325; Bob: $560; Carol: $890)
-- 3. GROUP BY: Partition by salesperson
--    Alice: 2 rows, Bob: 1 row, Carol: 1 row
-- 4. SELECT: Compute COUNT and SUM per group
 
-- Result:
-- +-------------+-------------------+---------------------+
-- | salesperson | electronics_count | electronics_revenue |
-- +-------------+-------------------+---------------------+
-- | Alice       | 2                 | 775                 |
-- | Bob         | 1                 | 560                 |
-- | Carol       | 1                 | 890                 |
-- +-------------+-------------------+---------------------+
 
-- Multiple WHERE conditions
SELECT product_category,
       COUNT(*) AS sales_count,
       AVG(amount) AS avg_amount
FROM sales
WHERE salesperson IN ('Alice', 'Bob')    -- Only these salespeople
  AND sale_date >= '2024-01-16'          -- Only recent sales
GROUP BY product_category;
 
-- Date-based filtering
SELECT salesperson,
       SUM(amount) AS jan_17_total
FROM sales
WHERE sale_date = '2024-01-17'
GROUP BY salesperson;

Impact of WHERE on group existence:

WHERE can completely remove a group if all its rows are filtered out. Consider:

SELECT salesperson, COUNT(*) 
FROM sales 
WHERE product_category = 'Office'
GROUP BY salesperson;

If only Carol sold Office products, Alice and Bob won't appear in results at all—they have no rows left after WHERE filtering. This is correct behavior, but it's important to understand: groups can disappear entirely based on WHERE conditions.
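
If every salesperson should still appear, with a zero where nothing matched, one common alternative is to move the condition out of WHERE and into a conditional aggregate. This is a sketch of that swap rather than the approach used in the query above. It assumes SQLite via Python's sqlite3 module and a trimmed version of the sales table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
        ("Alice", "Furniture", 750), ("Bob", "Furniture", 2100),
        ("Bob", "Electronics", 560), ("Bob", "Furniture", 1200),
        ("Carol", "Electronics", 890), ("Carol", "Office", 180),
    ],
)

# Filtering in WHERE: groups with no matching rows vanish.
filtered = conn.execute(
    "SELECT salesperson, COUNT(*) FROM sales "
    "WHERE product_category = 'Office' "
    "GROUP BY salesperson ORDER BY salesperson"
).fetchall()

# Conditional aggregate: every group survives; non-matching rows add 0.
conditional = conn.execute(
    "SELECT salesperson, "
    "       SUM(CASE WHEN product_category = 'Office' THEN 1 ELSE 0 END) "
    "FROM sales GROUP BY salesperson ORDER BY salesperson"
).fetchall()

print(filtered)     # [('Carol', 1)]
print(conditional)  # [('Alice', 0), ('Bob', 0), ('Carol', 1)]
```

Which form is right depends on the question: "who sold Office products?" suits WHERE, while "how many Office sales did each person make?" suits the conditional aggregate.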

Design Pattern: Filter Then Aggregate

When writing analytical queries, think in two phases: (1) What subset of data am I analyzing? (WHERE) (2) How do I want to summarize that subset? (GROUP BY + aggregates). This mental separation prevents confusion about what gets filtered when.

Aggregate Functions in Grouped Queries

When GROUP BY is present, aggregate functions operate on each group independently rather than the entire table. This changes their semantics significantly—each aggregate produces one value per group, not one value for the whole dataset.

The standard aggregate functions available in most SQL databases:

Aggregate Functions in Grouped Context
Function	Description	NULL Handling	Grouped Behavior
`COUNT(*)`	Count all rows	Counts NULLs	Rows in each group
`COUNT(column)`	Count non-NULL values	Excludes NULLs	Non-NULL values per group
`COUNT(DISTINCT col)`	Count unique non-NULL values	Excludes NULLs	Unique values per group
`SUM(column)`	Sum of values	Ignores NULLs	Sum within each group
`AVG(column)`	Average of values	Ignores NULLs	Average within each group
`MIN(column)`	Minimum value	Ignores NULLs	Min within each group
`MAX(column)`	Maximum value	Ignores NULLs	Max within each group

aggregates-with-grouping.sql
-- Multiple aggregates per group
SELECT salesperson,
       COUNT(*) AS total_sales,
       COUNT(DISTINCT product_category) AS categories_sold,
       SUM(amount) AS total_revenue,
       AVG(amount) AS avg_sale,
       MIN(amount) AS smallest,
       MAX(amount) AS largest,
       MAX(amount) - MIN(amount) AS sale_range
FROM sales
GROUP BY salesperson;
 
-- Result:
-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+
-- | salesperson | total_sales | categories_sold | total_revenue | avg_sale | smallest | largest | sale_range |
-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+
-- | Alice       | 3           | 2               | 1525          | 508.33   | 325      | 750     | 425        |
-- | Bob         | 3           | 2               | 3860          | 1286.67  | 560      | 2100    | 1540       |
-- | Carol       | 2           | 2               | 1070          | 535.00   | 180      | 890     | 710        |
-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+
 
-- Using aggregates with expressions
SELECT product_category,
       SUM(amount) AS revenue,
       SUM(amount) * 1.0 / COUNT(*) AS computed_avg,  -- Matches AVG(amount) when amount has no NULLs
       SUM(amount) * 1.0 / SUM(SUM(amount)) OVER () AS pct_of_total  -- Window function over the grouped rows
FROM sales
GROUP BY product_category;

Aggregates can appear in expressions:

Aggregate results are scalar values within each group, so they can be used in arithmetic, comparisons, and function calls:

SELECT salesperson,
       SUM(amount) AS revenue,
       ROUND(AVG(amount), 2) AS avg_rounded,           -- Aggregate in function
       SUM(amount) / COUNT(*) AS manual_avg,           -- Aggregates in expression
       CASE WHEN SUM(amount) > 2000 
            THEN 'High' ELSE 'Low' END AS performance  -- Aggregate in CASE
FROM sales
GROUP BY salesperson;
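
One detail worth checking when aggregates appear in arithmetic is integer division. The sketch below assumes SQLite via Python's sqlite3 module with an INTEGER amount column; other databases may divide differently. It also computes each salesperson's share of total revenue using a scalar subquery, a plainly different technique from a window function, chosen here only to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Alice", 450), ("Alice", 325), ("Alice", 750),
        ("Bob", 2100), ("Bob", 560), ("Bob", 1200),
        ("Carol", 890), ("Carol", 180),
    ],
)

rows = conn.execute(
    """
    SELECT salesperson,
           SUM(amount) / COUNT(*)                 AS int_avg,     -- integer division here
           ROUND(SUM(amount) * 1.0 / COUNT(*), 2) AS manual_avg,  -- forced to floating point
           ROUND(AVG(amount), 2)                  AS avg_sale,
           ROUND(100.0 * SUM(amount) / (SELECT SUM(amount) FROM sales), 1) AS pct_of_total
    FROM sales
    GROUP BY salesperson
    ORDER BY salesperson
    """
).fetchall()

for row in rows:
    print(row)
# int_avg is 508 for Alice and 1286 for Bob: the fractional part is truncated,
# while manual_avg and avg_sale agree with each other.
```

Multiplying by 1.0 before dividing is a simple way to keep the manual average in line with AVG when the column is an integer type.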

COUNT(*) vs COUNT(column)

COUNT(*) counts all rows in the group, including those with NULL values in any column. COUNT(column) counts only rows where that specific column is not NULL. For most grouping scenarios, COUNT(*) is correct unless you specifically need to count non-NULL values of a particular column.
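
A small sketch makes the difference visible. It assumes SQLite via Python's sqlite3 module and a hypothetical reviews table (not part of the sales data) in which some ratings are NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration: rating may be NULL (no rating given).
conn.execute("CREATE TABLE reviews (product TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("Lamp", 5), ("Lamp", None), ("Lamp", 3), ("Desk", None), ("Desk", 4)],
)

rows = conn.execute(
    """
    SELECT product,
           COUNT(*)      AS all_rows,       -- every row in the group
           COUNT(rating) AS rated_rows,     -- only rows with a non-NULL rating
           AVG(rating)   AS avg_rating      -- NULL ratings are ignored
    FROM reviews
    GROUP BY product
    ORDER BY product
    """
).fetchall()

print(rows)  # [('Desk', 2, 1, 4.0), ('Lamp', 3, 2, 4.0)]
```

Note that AVG divides by the number of non-NULL ratings, which is why Lamp averages 4.0 over two ratings rather than over three rows.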

The Grouping Column Constraint: Why It Exists

SQL enforces a strict rule for GROUP BY queries:

Every column in SELECT must either be in GROUP BY or inside an aggregate function.

This rule is not arbitrary—it ensures query results are deterministic and semantically meaningful. Let's understand why through an example of what would happen without this rule.

grouping-constraint.sql
-- INCORRECT: What value should 'amount' have for Alice's group?
SELECT salesperson, amount    -- ❌ ERROR in standard SQL
FROM sales
GROUP BY salesperson;
 
-- Alice's group contains amounts: $450, $325, $750
-- The database cannot arbitrarily pick one - no defined behavior
 
-- CORRECT: Aggregate the amount
SELECT salesperson, SUM(amount) AS total  -- ✓ Unambiguous
FROM sales
GROUP BY salesperson;
 
-- CORRECT: Include in GROUP BY
SELECT salesperson, amount               -- ✓ Each combination is a group
FROM sales
GROUP BY salesperson, amount;
-- Now each (salesperson, amount) pair forms its own group
 
-- CORRECT: Use ANY_VALUE() if you don't care which value
-- (Available in MySQL 5.7+, PostgreSQL 16+, and some other databases)
SELECT salesperson, ANY_VALUE(amount) AS sample_amount
FROM sales
GROUP BY salesperson;
-- Explicitly states: give me any arbitrary amount from the group

Understanding the three valid patterns:

Pattern	When to Use	Result
Column in GROUP BY	Column defines the grouping	Same value throughout group
Column in aggregate	Column values should be summarized	Single computed value per group
Column in both	Usually redundant	Works but unnecessary

The mathematical foundation:

A grouped query conceptually performs a many-to-one mapping: many input rows become one output row per group. For this mapping to be well-defined, every output column must produce exactly one value. Grouping columns guarantee this (same value for all grouped rows). Aggregates guarantee this (compute single summary). Non-grouped, non-aggregated columns cannot guarantee this.

MySQL's Historical Permissiveness

Older MySQL versions allowed non-aggregated columns not in GROUP BY, silently picking an arbitrary value. This caused subtle bugs and non-deterministic results. MySQL 5.7.5+ defaults to ONLY_FULL_GROUP_BY mode, enforcing the standard behavior. If you encounter old MySQL code without this mode, be very careful—results may be unpredictable.
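
SQLite is similarly permissive about bare columns, which makes it a convenient place to see the hazard. The sketch below assumes SQLite via Python's sqlite3 module and only Alice's three rows. It deliberately makes no claim about which amount comes back for the bare column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Alice", 450), ("Alice", 325), ("Alice", 750)],
)

# Bare column: accepted by SQLite, but the amount comes from some row of the group.
bare = conn.execute(
    "SELECT salesperson, amount FROM sales GROUP BY salesperson"
).fetchone()

# Aggregated column: exactly one well-defined value per group.
total = conn.execute(
    "SELECT salesperson, SUM(amount) FROM sales GROUP BY salesperson"
).fetchone()

print(bare)   # ('Alice', <one of 450, 325, 750>)
print(total)  # ('Alice', 1525)
```

The query with the bare column runs without any warning, which is exactly why this class of bug is easy to miss.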

How GROUP BY Determines Groups

A group is defined by the unique combination of values in the GROUP BY columns. Understanding exactly how groups form helps you predict query results:

• The database examines the GROUP BY column value(s) for each row
• Rows with identical values in all GROUP BY columns belong to the same group
• The number of output rows equals the number of distinct combinations

This applies regardless of whether you're grouping by one column or multiple columns.

group-formation.sql
-- Single-column grouping: groups = unique values in that column
SELECT salesperson, COUNT(*)
FROM sales
GROUP BY salesperson;
-- Groups: 'Alice', 'Bob', 'Carol' → 3 output rows
 
-- GROUP BY with expressions
SELECT UPPER(salesperson) AS name, COUNT(*)
FROM sales
GROUP BY UPPER(salesperson);
-- Groups based on uppercase names: 'ALICE', 'BOB', 'CAROL'
-- Useful for case-insensitive grouping
 
-- GROUP BY with date functions
SELECT EXTRACT(MONTH FROM sale_date) AS month, SUM(amount)
FROM sales
GROUP BY EXTRACT(MONTH FROM sale_date);
-- Groups by month number extracted from date
 
-- GROUP BY with CASE expressions
SELECT 
    CASE 
        WHEN amount >= 1000 THEN 'Large'
        WHEN amount >= 500 THEN 'Medium'
        ELSE 'Small'
    END AS sale_size,
    COUNT(*) AS count,
    SUM(amount) AS total
FROM sales
GROUP BY 
    CASE 
        WHEN amount >= 1000 THEN 'Large'
        WHEN amount >= 500 THEN 'Medium'
        ELSE 'Small'
    END;
-- Groups: 'Large', 'Medium', 'Small'

Grouping by expressions:

You can GROUP BY any expression, not just column names. This enables powerful patterns:

• Date truncation: GROUP BY DATE_TRUNC('month', created_at) for monthly summaries
• Conditional grouping: GROUP BY CASE WHEN ... END for custom categories
• Derived values: GROUP BY SUBSTRING(product_code, 1, 3) for the first 3 characters
• Mathematical bins: GROUP BY FLOOR(price / 100) * 100 for price ranges

Important: If you want the grouped value in the output, SELECT must repeat the same expression used in GROUP BY. In standard SQL, aliases defined in SELECT cannot be referenced in GROUP BY.
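
As an illustration of the binning pattern, the sketch below groups the sample amounts into 500-wide buckets. It assumes SQLite via Python's sqlite3 module and an INTEGER amount column, so plain integer division plays the role of FLOOR. The bucket width and the column names bin_start, n and total are arbitrary choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Alice", 450), ("Alice", 325), ("Alice", 750),
        ("Bob", 2100), ("Bob", 560), ("Bob", 1200),
        ("Carol", 890), ("Carol", 180),
    ],
)

# Integer division truncates, so (amount / 500) * 500 is the lower edge of the bin.
bins = conn.execute(
    """
    SELECT (amount / 500) * 500 AS bin_start,
           COUNT(*)             AS n,
           SUM(amount)          AS total
    FROM sales
    GROUP BY (amount / 500) * 500
    ORDER BY 1
    """
).fetchall()

print(bins)  # [(0, 3, 955), (500, 3, 2200), (1000, 1, 1200), (2000, 1, 2100)]
```

Bins with no rows, such as 1500 to 1999 here, simply do not appear, because a group only exists if at least one row falls into it.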

Predicting Row Counts

Before running a grouped query, ask: 'How many unique values (or combinations) exist in my grouping column(s)?' That's your maximum possible row count. WHERE may reduce it by eliminating groups entirely, and HAVING may reduce it further by filtering post-aggregation.
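
The prediction can be checked mechanically by comparing a COUNT(DISTINCT ...) against the number of rows the grouped query returns. This is a sketch assuming SQLite via Python's sqlite3 module. The helper names scalar and row_count are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
        ("Alice", "Furniture", 750), ("Bob", "Furniture", 2100),
        ("Bob", "Electronics", 560), ("Bob", "Furniture", 1200),
        ("Carol", "Electronics", 890), ("Carol", "Office", 180),
    ],
)

def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]

def row_count(sql):
    """Return how many rows a query produces."""
    return len(conn.execute(sql).fetchall())

# Prediction: distinct values (or combinations) in the grouping column(s).
predicted_single = scalar("SELECT COUNT(DISTINCT salesperson) FROM sales")
predicted_pairs = scalar(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT salesperson, product_category FROM sales) AS pairs"
)

actual_single = row_count("SELECT salesperson FROM sales GROUP BY salesperson")
actual_pairs = row_count(
    "SELECT salesperson, product_category FROM sales "
    "GROUP BY salesperson, product_category"
)

# WHERE and HAVING can only lower the count.
after_where = row_count(
    "SELECT salesperson FROM sales WHERE product_category = 'Office' "
    "GROUP BY salesperson"
)
after_having = row_count(
    "SELECT salesperson FROM sales GROUP BY salesperson HAVING SUM(amount) > 1200"
)

print(predicted_single, actual_single)  # 3 3
print(predicted_pairs, actual_pairs)    # 6 6
print(after_where, after_having)        # 1 2
```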

Practical Examples: Real-World Queries

Let's apply GROUP BY to realistic scenarios across different domains:

real-world-group-by.sql
-- E-COMMERCE: Revenue by category with metrics
SELECT 
    category,
    COUNT(*) AS order_count,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value,
    MAX(total_amount) AS largest_order
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY category
ORDER BY revenue DESC;
 
-- HR: Salary statistics by department
SELECT 
    department,
    COUNT(*) AS employee_count,
    ROUND(AVG(salary), 2) AS avg_salary,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    SUM(salary) AS total_payroll
FROM employees
WHERE status = 'active'
GROUP BY department
ORDER BY avg_salary DESC;
 
-- SAAS: User engagement by subscription tier
SELECT 
    subscription_tier,
    COUNT(*) AS user_count,
    AVG(logins_last_30_days) AS avg_logins,
    AVG(features_used) AS avg_features_used,
    SUM(CASE WHEN churned = true THEN 1 ELSE 0 END) AS churns,
    ROUND(100.0 * SUM(CASE WHEN churned = true THEN 1 ELSE 0 END) / COUNT(*), 2) AS churn_rate_pct
FROM users
GROUP BY subscription_tier;
 
-- ANALYTICS: Page views by day with comparison
SELECT 
    DATE(timestamp) AS date,
    COUNT(*) AS page_views,
    COUNT(DISTINCT user_id) AS unique_visitors,
    COUNT(*) * 1.0 / COUNT(DISTINCT user_id) AS pages_per_visitor
FROM page_views
WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'  -- PostgreSQL-style interval syntax; varies by database
GROUP BY DATE(timestamp)
ORDER BY date;

Query Patterns Worth Noting

• Multiple aggregates per query: You can compute many metrics in a single GROUP BY query, which avoids scanning the table once per metric.
• COUNT(DISTINCT ...): Extremely useful for 'unique' counts (unique customers, unique days, etc.)
• Conditional counting with CASE: SUM(CASE WHEN condition THEN 1 ELSE 0 END) counts rows matching a condition within each group.
• Computed metrics from aggregates: Calculate ratios, percentages, and derived values from aggregate results.
• ORDER BY aggregate: Sort grouped results by any aggregate to surface the most or least significant groups.

One Query, Many Insights

A well-designed GROUP BY query can answer multiple questions at once: count, sum, average, min, max, unique counts, and derived ratios, typically computed in a single pass over the data. This is usually more efficient than running a separate query for each metric.

Summary: Mastering GROUP BY Syntax

We've covered the essential syntax and mechanics of the GROUP BY clause. Let's consolidate the key points:

Key Takeaways

• GROUP BY appears after FROM and WHERE: It partitions the rows remaining after WHERE filtering.
• Every SELECT column must be in GROUP BY or an aggregate: This ensures deterministic, meaningful results. The one exception in standard SQL is a column that is functionally dependent on the grouping columns (for example, when grouping by a primary key), which not every database supports.
• Execution order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. WHERE cannot use aggregates; HAVING can.
• Groups are defined by unique value combinations: One output row per unique combination of GROUP BY column values.
• You can GROUP BY expressions: Date functions, CASE expressions, and string functions all work, not just column names.
• Multiple aggregates per query: Compute count, sum, avg, min, max, and derived metrics in one pass.
• NULL values group together: All NULLs in a grouping column form one group, even though NULL = NULL does not evaluate to true in comparisons.
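
The NULL behavior in the last takeaway is easy to confirm. The sketch below assumes SQLite via Python's sqlite3 module and a hypothetical leads table whose region column is sometimes NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration: region is unknown (NULL) for some rows.
conn.execute("CREATE TABLE leads (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO leads VALUES (?, ?)",
    [("North", 10), (None, 5), (None, 7), ("North", 1)],
)

# In a comparison, NULL = NULL yields NULL (unknown), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# In GROUP BY, all NULLs still land in a single group.
groups = {
    region: (n, total)
    for region, n, total in conn.execute(
        "SELECT region, COUNT(*), SUM(amount) FROM leads GROUP BY region"
    )
}

print(null_equals_null)  # None
print(groups[None])      # (2, 12)
print(groups["North"])   # (2, 11)
```

Python's None stands in for SQL NULL here, so the NULL group shows up under the None key.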

What's Next:

While single-column grouping is powerful, many analytical questions require grouping by multiple columns simultaneously. The next page explores multi-column GROUP BY: how combinations work, what happens to row counts, and how to design queries that capture the right level of detail.
