Loading content...
In the previous page, we established the conceptual foundation: grouping partitions data by category and computes aggregates per partition. Now we translate that understanding into executable SQL using the GROUP BY clause.
The GROUP BY clause is deceptively simple in syntax but profoundly powerful in capability. A single clause added to a SELECT statement transforms how SQL processes the entire query—changing the result from individual rows to categorical summaries.
This page covers exactly how to write GROUP BY queries: the precise syntax, where it appears in the query structure, how it interacts with other clauses, and the execution mechanics that determine your results.
By the end of this page, you will be able to write syntactically correct GROUP BY queries, understand exactly where GROUP BY fits in query execution order, combine GROUP BY with aggregate functions effectively, and predict exactly how many rows your grouped query will return.
The GROUP BY clause appears after the FROM and WHERE clauses (if present) and specifies which column(s) define the grouping:
SELECT column1, aggregate_function(column2)
FROM table_name
WHERE condition -- Optional: filter rows first
GROUP BY column1; -- Group by this column
The fundamental rule: Every column in the SELECT list must either:
Violating this rule causes an error in standards-compliant databases (though MySQL has historically been more permissive—a behavior that causes subtle bugs).
1234567891011121314151617181920212223242526272829303132333435363738
-- Sample data context:-- Table: sales (transaction_id, salesperson, product_category, amount, sale_date) -- Example 1: Count transactions per salespersonSELECT salesperson, COUNT(*) AS transaction_countFROM salesGROUP BY salesperson; -- Result:-- +-------------+-------------------+-- | salesperson | transaction_count |-- +-------------+-------------------+-- | Alice | 3 |-- | Bob | 3 |-- | Carol | 2 |-- +-------------+-------------------+ -- Example 2: Total revenue per product categorySELECT product_category, SUM(amount) AS total_revenueFROM salesGROUP BY product_category; -- Result:-- +------------------+---------------+-- | product_category | total_revenue |-- +------------------+---------------+-- | Electronics | 2225 |-- | Furniture | 4050 |-- | Office | 180 |-- +------------------+---------------+ -- Example 3: Average sale amount per salespersonSELECT salesperson, AVG(amount) AS avg_sale, MIN(amount) AS smallest_sale, MAX(amount) AS largest_saleFROM salesGROUP BY salesperson;Anatomy of a GROUP BY query:
| Component | Role | Example |
|---|---|---|
| SELECT (grouping column) | Identifies each group | salesperson |
| SELECT (aggregate) | Summarizes the group | COUNT(*), SUM(amount) |
| FROM | Source table(s) | sales |
| GROUP BY | Specifies grouping | GROUP BY salesperson |
The grouping column's value is the same for every row in the group (by definition), so it can appear directly in SELECT. The aggregate computes a single value from all rows in the group.
Read GROUP BY queries as a sentence: 'Show me [aggregates] for each unique value of [grouping column(s)] from [table].' For example: 'Show me the count and sum of sales for each unique salesperson from the sales table.'
Understanding execution order is essential for writing correct GROUP BY queries. SQL does not process clauses top-to-bottom as written; instead, it follows a specific logical sequence:
Logical Execution Order:
1. FROM → Determine source table(s)
2. WHERE → Filter individual rows
3. GROUP BY → Partition remaining rows into groups
4. HAVING → Filter groups (after aggregation)
5. SELECT → Compute output expressions and aggregates
6. DISTINCT → Remove duplicate rows (if specified)
7. ORDER BY → Sort final results
8. LIMIT → Restrict output row count
The critical insight: GROUP BY happens AFTER WHERE but BEFORE SELECT. This has important implications:
1234567891011121314151617181920212223242526272829
-- Complete example showing execution order SELECT salesperson, SUM(amount) AS total_sales -- Step 5: Compute aggregatesFROM sales -- Step 1: Source tableWHERE sale_date >= '2024-01-16' -- Step 2: Filter rows firstGROUP BY salesperson -- Step 3: Group remaining rowsHAVING SUM(amount) > 500 -- Step 4: Filter groups by aggregateORDER BY total_sales DESC; -- Step 7: Sort results -- Execution trace with sample data:-- Step 1: Start with 8 rows-- Step 2: WHERE removes 2 rows (sale_date < '2024-01-16') → 6 rows remain-- Step 3: GROUP BY partitions 6 rows:-- Alice: 2 rows ($325 + $750 = $1075)-- Bob: 2 rows ($2100 + $560 = $2660)-- Carol: 2 rows ($890 + $180 = $1070)-- Step 4: HAVING filters: all 3 groups have SUM > 500 → 3 groups remain-- Step 5: SELECT computes output columns → 3 result rows-- Step 7: ORDER BY sorts by total_sales descending -- Final Result:-- +-------------+-------------+-- | salesperson | total_sales |-- +-------------+-------------+-- | Bob | 2660 |-- | Alice | 1075 |-- | Carol | 1070 |-- +-------------+-------------+This often fails: SELECT YEAR(sale_date) AS year, ... GROUP BY year. In strict SQL, aliases don't exist during GROUP BY. Solution: Repeat the expression: GROUP BY YEAR(sale_date). Some databases (MySQL, PostgreSQL) allow alias references as an extension, but this isn't portable.
WHERE and GROUP BY work together in a specific sequence: WHERE filters individual rows, then GROUP BY partitions what remains. This allows powerful analytical queries that answer questions like:
The key is that WHERE acts as a pre-filter—removing rows before any grouping or aggregation occurs.
12345678910111213141516171819202122232425262728293031323334353637383940
-- Analyze Electronics sales onlySELECT salesperson, COUNT(*) AS electronics_count, SUM(amount) AS electronics_revenueFROM salesWHERE product_category = 'Electronics' -- Pre-filter: only ElectronicsGROUP BY salesperson; -- Processing:-- 1. FROM: All 8 rows-- 2. WHERE: Keep only 'Electronics' → 4 rows remain-- (Alice: $450, $325; Bob: $560; Carol: $890)-- 3. GROUP BY: Partition by salesperson-- Alice: 2 rows, Bob: 1 row, Carol: 1 row-- 4. SELECT: Compute COUNT and SUM per group -- Result:-- +-------------+-------------------+---------------------+-- | salesperson | electronics_count | electronics_revenue |-- +-------------+-------------------+---------------------+-- | Alice | 2 | 775 |-- | Bob | 1 | 560 |-- | Carol | 1 | 890 |-- +-------------+-------------------+---------------------+ -- Multiple WHERE conditionsSELECT product_category, COUNT(*) AS sales_count, AVG(amount) AS avg_amountFROM salesWHERE salesperson IN ('Alice', 'Bob') -- Only these salespeople AND sale_date >= '2024-01-16' -- Only recent salesGROUP BY product_category; -- Date-based filteringSELECT salesperson, SUM(amount) AS jan_17_totalFROM salesWHERE sale_date = '2024-01-17'GROUP BY salesperson;Impact of WHERE on group existence:
WHERE can completely remove a group if all its rows are filtered out. Consider:
SELECT salesperson, COUNT(*)
FROM sales
WHERE product_category = 'Office'
GROUP BY salesperson;
If only Carol sold Office products, Alice and Bob won't appear in results at all—they have no rows left after WHERE filtering. This is correct behavior, but it's important to understand: groups can disappear entirely based on WHERE conditions.
When writing analytical queries, think in two phases: (1) What subset of data am I analyzing? (WHERE) (2) How do I want to summarize that subset? (GROUP BY + aggregates). This mental separation prevents confusion about what gets filtered when.
When GROUP BY is present, aggregate functions operate on each group independently rather than the entire table. This changes their semantics significantly—each aggregate produces one value per group, not one value for the whole dataset.
The standard aggregate functions available in most SQL databases:
| Function | Description | NULL Handling | Grouped Behavior |
|---|---|---|---|
COUNT(*) | Count all rows | Counts NULLs | Rows in each group |
COUNT(column) | Count non-NULL values | Excludes NULLs | Non-NULL values per group |
COUNT(DISTINCT col) | Count unique non-NULL values | Excludes NULLs | Unique values per group |
SUM(column) | Sum of values | Ignores NULLs | Sum within each group |
AVG(column) | Average of values | Ignores NULLs | Average within each group |
MIN(column) | Minimum value | Ignores NULLs | Min within each group |
MAX(column) | Maximum value | Ignores NULLs | Max within each group |
12345678910111213141516171819202122232425262728
-- Multiple aggregates per groupSELECT salesperson, COUNT(*) AS total_sales, COUNT(DISTINCT product_category) AS categories_sold, SUM(amount) AS total_revenue, AVG(amount) AS avg_sale, MIN(amount) AS smallest, MAX(amount) AS largest, MAX(amount) - MIN(amount) AS sale_rangeFROM salesGROUP BY salesperson; -- Result:-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+-- | salesperson | total_sales | categories_sold | total_revenue | avg_sale | smallest | largest | sale_range |-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+-- | Alice | 3 | 2 | 1525 | 508.33 | 325 | 750 | 425 |-- | Bob | 3 | 2 | 3860 | 1286.67 | 560 | 2100 | 1540 |-- | Carol | 2 | 2 | 1070 | 535.00 | 180 | 890 | 710 |-- +-------------+-------------+-----------------+---------------+----------+----------+---------+------------+ -- Using aggregates with expressionsSELECT product_category, SUM(amount) AS revenue, SUM(amount) / COUNT(*) AS computed_avg, -- Same as AVG SUM(amount) * 1.0 / SUM(COUNT(*)) OVER() AS pct_of_total -- Window functionFROM salesGROUP BY product_category;Aggregates can appear in expressions:
Aggregate results are scalar values within each group, so they can be used in arithmetic, comparisons, and function calls:
SELECT salesperson,
SUM(amount) AS revenue,
ROUND(AVG(amount), 2) AS avg_rounded, -- Aggregate in function
SUM(amount) / COUNT(*) AS manual_avg, -- Aggregates in expression
CASE WHEN SUM(amount) > 2000
THEN 'High' ELSE 'Low' END AS performance -- Aggregate in CASE
FROM sales
GROUP BY salesperson;
COUNT(*) counts all rows in the group, including those with NULL values in any column. COUNT(column) counts only rows where that specific column is not NULL. For most grouping scenarios, COUNT(*) is correct unless you specifically need to count non-NULL values of a particular column.
SQL enforces a strict rule for GROUP BY queries:
Every column in SELECT must either be in GROUP BY or inside an aggregate function.
This rule is not arbitrary—it ensures query results are deterministic and semantically meaningful. Let's understand why through an example of what would happen without this rule.
12345678910111213141516171819202122232425
-- INCORRECT: What value should 'amount' have for Alice's group?SELECT salesperson, amount -- ❌ ERROR in standard SQLFROM salesGROUP BY salesperson; -- Alice's group contains amounts: $450, $325, $750-- The database cannot arbitrarily pick one - no defined behavior -- CORRECT: Aggregate the amountSELECT salesperson, SUM(amount) AS total -- ✓ UnambiguousFROM salesGROUP BY salesperson; -- CORRECT: Include in GROUP BYSELECT salesperson, amount -- ✓ Each combination is a groupFROM salesGROUP BY salesperson, amount;-- Now each (salesperson, amount) pair forms its own group -- CORRECT: Use ANY_VALUE() if you don't care which value-- (Available in MySQL 8.0+, some other databases)SELECT salesperson, ANY_VALUE(amount) AS sample_amountFROM salesGROUP BY salesperson;-- Explicitly states: give me any arbitrary amount from the groupUnderstanding the three valid patterns:
| Pattern | When to Use | Result |
|---|---|---|
| Column in GROUP BY | Column defines the grouping | Same value throughout group |
| Column in aggregate | Column values should be summarized | Single computed value per group |
| Column in both | Usually redundant | Works but unnecessary |
The mathematical foundation:
A grouped query conceptually performs a many-to-one mapping: many input rows become one output row per group. For this mapping to be well-defined, every output column must produce exactly one value. Grouping columns guarantee this (same value for all grouped rows). Aggregates guarantee this (compute single summary). Non-grouped, non-aggregated columns cannot guarantee this.
Older MySQL versions allowed non-aggregated columns not in GROUP BY, silently picking an arbitrary value. This caused subtle bugs and non-deterministic results. MySQL 5.7.5+ defaults to ONLY_FULL_GROUP_BY mode, enforcing the standard behavior. If you encounter old MySQL code without this mode, be very careful—results may be unpredictable.
A group is defined by the unique combination of values in the GROUP BY columns. Understanding exactly how groups form helps you predict query results:
This applies regardless of whether you're grouping by one column or multiple columns.
123456789101112131415161718192021222324252627282930313233343536
-- Single-column grouping: groups = unique values in that columnSELECT salesperson, COUNT(*)FROM salesGROUP BY salesperson;-- Groups: 'Alice', 'Bob', 'Carol' → 3 output rows -- GROUP BY with expressionsSELECT UPPER(salesperson) AS name, COUNT(*)FROM salesGROUP BY UPPER(salesperson);-- Groups based on uppercase names: 'ALICE', 'BOB', 'CAROL'-- Useful for case-insensitive grouping -- GROUP BY with date functionsSELECT EXTRACT(MONTH FROM sale_date) AS month, SUM(amount)FROM salesGROUP BY EXTRACT(MONTH FROM sale_date);-- Groups by month number extracted from date -- GROUP BY with CASE expressionsSELECT CASE WHEN amount >= 1000 THEN 'Large' WHEN amount >= 500 THEN 'Medium' ELSE 'Small' END AS sale_size, COUNT(*) AS count, SUM(amount) AS totalFROM salesGROUP BY CASE WHEN amount >= 1000 THEN 'Large' WHEN amount >= 500 THEN 'Medium' ELSE 'Small' END;-- Groups: 'Large', 'Medium', 'Small'Grouping by expressions:
You can GROUP BY any expression, not just column names. This enables powerful patterns:
GROUP BY DATE_TRUNC('month', created_at) — monthly summariesGROUP BY CASE WHEN ... END — custom categoriesGROUP BY SUBSTRING(product_code, 1, 3) — first 3 charactersGROUP BY FLOOR(price / 100) * 100 — price rangesImportant: When grouping by an expression, you must repeat the same expression in SELECT (or use an aggregate). Aliases defined in SELECT cannot be referenced in GROUP BY in standard SQL.
Before running a grouped query, ask: 'How many unique values (or combinations) exist in my grouping column(s)?' That's your maximum possible row count. WHERE may reduce it by eliminating groups entirely, and HAVING may reduce it further by filtering post-aggregation.
Let's apply GROUP BY to realistic scenarios across different domains:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647
-- E-COMMERCE: Revenue by category with metricsSELECT category, COUNT(*) AS order_count, COUNT(DISTINCT customer_id) AS unique_customers, SUM(total_amount) AS revenue, AVG(total_amount) AS avg_order_value, MAX(total_amount) AS largest_orderFROM ordersWHERE order_date >= '2024-01-01'GROUP BY categoryORDER BY revenue DESC; -- HR: Salary statistics by departmentSELECT department, COUNT(*) AS employee_count, ROUND(AVG(salary), 2) AS avg_salary, MIN(salary) AS min_salary, MAX(salary) AS max_salary, SUM(salary) AS total_payrollFROM employeesWHERE status = 'active'GROUP BY departmentORDER BY avg_salary DESC; -- SAAS: User engagement by subscription tierSELECT subscription_tier, COUNT(*) AS user_count, AVG(logins_last_30_days) AS avg_logins, AVG(features_used) AS avg_features_used, SUM(CASE WHEN churned = true THEN 1 ELSE 0 END) AS churns, ROUND(100.0 * SUM(CASE WHEN churned = true THEN 1 ELSE 0 END) / COUNT(*), 2) AS churn_rate_pctFROM usersGROUP BY subscription_tier; -- ANALYTICS: Page views by day with comparisonSELECT DATE(timestamp) AS date, COUNT(*) AS page_views, COUNT(DISTINCT user_id) AS unique_visitors, COUNT(*) * 1.0 / COUNT(DISTINCT user_id) AS pages_per_visitorFROM page_viewsWHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'GROUP BY DATE(timestamp)ORDER BY date;SUM(CASE WHEN condition THEN 1 ELSE 0 END) counts rows matching a condition within each group.A well-designed GROUP BY query can answer multiple questions simultaneously: count, sum, average, min, max, unique counts, and derived ratios—all computed in a single database pass. This is far more efficient than running separate queries for each metric.
We've covered the essential syntax and mechanics of the GROUP BY clause. Let's consolidate the key points:
What's Next:
While single-column grouping is powerful, many analytical questions require grouping by multiple columns simultaneously. The next page explores multi-column GROUP BY: how combinations work, what happens to row counts, and how to design queries that capture the right level of detail.
You can now write syntactically correct GROUP BY queries, understand execution order, and combine grouping with aggregate functions effectively. Next, we'll extend this to multi-column grouping for more nuanced categorical analysis.