Single-column grouping answers questions like 'revenue by salesperson' or 'count by category.' But real analytical needs are often more nuanced: revenue by salesperson and product category, order counts by region and month, average salary by department and job level.
These questions require multi-column GROUP BY—grouping by combinations of values across multiple dimensions simultaneously. This capability transforms SQL from simple summarization into powerful multi-dimensional analysis.
Understanding multi-column grouping is essential for building the kind of analytical queries that drive business intelligence dashboards, executive reports, and data-driven decision making.
By the end of this page, you will understand how multi-column GROUP BY creates combination-based groups, predict result row counts from distinct combinations, design queries at the appropriate level of granularity, and apply hierarchical grouping patterns for drill-down analysis.
When you GROUP BY multiple columns, each unique combination of values across all grouping columns forms one group. The columns are treated as a composite key—rows must match on ALL grouping columns to belong to the same group.
SELECT salesperson, product_category, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category;
This query produces one row for each unique (salesperson, product_category) pair that exists in the data.
Visualization of combination-based grouping:
Original Sales Data:
┌─────────────┬──────────────────┬─────────┐
│ salesperson │ product_category │ amount  │
├─────────────┼──────────────────┼─────────┤
│ Alice       │ Electronics      │ $450    │
│ Alice       │ Electronics      │ $325    │
│ Alice       │ Furniture        │ $750    │
│ Bob         │ Furniture        │ $1,200  │
│ Bob         │ Furniture        │ $2,100  │
│ Bob         │ Electronics      │ $560    │
│ Carol       │ Electronics      │ $890    │
│ Carol       │ Office           │ $180    │
└─────────────┴──────────────────┴─────────┘
GROUP BY salesperson, product_category:
┌──────────────────────────────────────┐
│ Groups formed (6 combinations):      │
├──────────────────────────────────────┤
│ (Alice, Electronics) → $450+$325     │
│ (Alice, Furniture)   → $750          │
│ (Bob, Furniture)     → $1,200+$2,100 │
│ (Bob, Electronics)   → $560          │
│ (Carol, Electronics) → $890          │
│ (Carol, Office)      → $180          │
└──────────────────────────────────────┘
Result (6 rows):
┌─────────────┬──────────────────┬─────────┐
│ salesperson │ product_category │ total   │
├─────────────┼──────────────────┼─────────┤
│ Alice       │ Electronics      │ $775    │
│ Alice       │ Furniture        │ $750    │
│ Bob         │ Furniture        │ $3,300  │
│ Bob         │ Electronics      │ $560    │
│ Carol       │ Electronics      │ $890    │
│ Carol       │ Office           │ $180    │
└─────────────┴──────────────────┴─────────┘
Key observation: The result has 6 rows because there are 6 distinct (salesperson, product_category) combinations in the data—not 3 salespeople × 3 categories = 9. Some combinations don't exist (Alice never sold Office, Carol never sold Furniture, etc.).
Multi-column GROUP BY returns only combinations that actually exist in your data. The maximum possible rows is the product of distinct values in each column, but the actual count depends on which combinations have data. Empty combinations produce no output row.
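This behavior is easy to verify directly. The sketch below rebuilds the sample sales table in an in-memory SQLite database (the table name and values mirror the example above; SQLite is just one convenient engine for demonstration) and confirms that GROUP BY over two columns yields 6 rows, not the 9 of the theoretical maximum:

```python
# Demonstration: multi-column GROUP BY emits one row per combination
# that actually exists in the data (6 here, not 3 x 3 = 9).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
    ("Alice", "Furniture", 750),   ("Bob", "Furniture", 1200),
    ("Bob", "Furniture", 2100),    ("Bob", "Electronics", 560),
    ("Carol", "Electronics", 890), ("Carol", "Office", 180),
])

grouped = conn.execute(
    "SELECT salesperson, product_category, SUM(amount) AS total "
    "FROM sales "
    "GROUP BY salesperson, product_category"
).fetchall()

# Only existing combinations appear; (Alice, Office) is absent entirely.
print(len(grouped))  # 6
```

Running this also shows the aggregated totals, e.g. `("Alice", "Electronics", 775)` and `("Bob", "Furniture", 3300)`, matching the result table above.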
Adding columns to GROUP BY increases granularity—you get more groups with fewer rows each. Conversely, removing columns decreases granularity—fewer groups, more rows aggregated into each.
Think of it as a zoom level on your data:
| GROUP BY | Granularity | Group Count | Rows per Group |
|---|---|---|---|
| (none) | Lowest | 1 (entire table) | All rows |
| salesperson | Low | 3 | ~2-3 each |
| salesperson, product_category | Medium | 6 | 1-2 each |
| salesperson, product_category, sale_date | High | 8 | 1 each |
Choosing the right granularity depends on your analytical question.
-- Low granularity: Total by salesperson only
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson;
-- Result: 3 rows (one per salesperson)

-- Medium granularity: By salesperson and category
SELECT salesperson, product_category, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category;
-- Result: 6 rows (one per salesperson-category combination)

-- High granularity: By salesperson, category, and date
SELECT salesperson, product_category, sale_date, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category, sale_date;
-- Result: 8 rows (each transaction becomes its own group)

-- At maximum granularity, GROUP BY essentially does nothing
-- (each row is its own group, aggregates just return the value)

If every combination results in a single-row group, your GROUP BY may be too granular. Aggregates become trivial (SUM of one value = that value, AVG of one value = that value). This suggests either your grouping is too detailed or the data itself is sparse at that level.
The order of columns in GROUP BY does not affect the grouping result—the same combinations form regardless of order. However, order can matter for readability (matching the SELECT list order) and, in some databases, for the incidental ordering of output rows; use ORDER BY whenever output order matters.
-- These produce identical groupings (same combinations)
SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY salesperson, product_category;

SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY product_category, salesperson;
-- Different order, same result

-- But the output row ORDER may differ
-- (use ORDER BY to guarantee specific ordering)

-- Good practice: Match SELECT and GROUP BY order
SELECT
    region,
    department,
    product_category,
    SUM(amount) AS total
FROM sales
GROUP BY
    region,           -- Primary dimension
    department,       -- Secondary dimension
    product_category  -- Tertiary dimension
ORDER BY region, department, product_category;

Conceptual hierarchy pattern:
When grouping by hierarchical data (region → department → team), list columns from broadest to most specific. This creates a natural reading order:
-- Hierarchical grouping (broad to specific)
SELECT
    country,     -- Broadest
    region,
    city,
    store_id,    -- Most specific
    SUM(sales) AS total
FROM retail_sales
GROUP BY country, region, city, store_id
ORDER BY country, region, city, store_id;
This pattern supports drill-down analysis where you can easily remove the rightmost column to zoom out to a higher level.
GROUP BY determines WHAT groups exist. ORDER BY determines HOW results are sorted. They are independent operations. Without ORDER BY, grouped results may appear in any order. Always add ORDER BY if you need specific output ordering.
Being able to predict the number of result rows helps validate query correctness and anticipate result set sizes:
Theoretical maximum: Product of distinct values in each grouping column
Actual count: Number of combinations that exist in your data
Distinct salespeople: 3 (Alice, Bob, Carol)
Distinct categories: 3 (Electronics, Furniture, Office)
Theoretical max: 3 × 3 = 9
Actual combinations: 6 (not everyone sold every category)
-- Query to find distinct values in each potential grouping column
SELECT
    COUNT(DISTINCT salesperson) AS distinct_salespeople,
    COUNT(DISTINCT product_category) AS distinct_categories,
    COUNT(DISTINCT sale_date) AS distinct_dates
FROM sales;
-- Example result: 3 salespeople, 3 categories, 4 dates
-- Theoretical max: 3 × 3 × 4 = 36 combinations

-- Count actual combinations at different granularities
SELECT COUNT(*) AS group_count
FROM (
    SELECT DISTINCT salesperson, product_category
    FROM sales
) AS combinations;
-- Result: 6 actual combinations

-- Or using GROUP BY with counting
SELECT COUNT(*) AS result_rows
FROM (
    SELECT salesperson, product_category
    FROM sales
    GROUP BY salesperson, product_category
) AS grouped;
-- Same result: 6 rows

| GROUP BY Columns | Distinct Values | Theoretical Max | Actual Groups* |
|---|---|---|---|
| salesperson | 3 | 3 | 3 |
| product_category | 3 | 3 | 3 |
| sale_date | 4 | 4 | 4 |
| salesperson, product_category | 3 × 3 | 9 | 6 |
| salesperson, sale_date | 3 × 4 | 12 | 7 |
| product_category, sale_date | 3 × 4 | 12 | 7 |
| salesperson, product_category, sale_date | 3 × 3 × 4 | 36 | 8 |
When actual combinations are much lower than theoretical maximum, your data is 'sparse' for those dimensions. This is common (not every product sells in every region). Very dense data (actual ≈ theoretical) suggests high coverage across all combinations.
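One way to quantify this is a density ratio (actual combinations divided by the theoretical maximum). The sketch below computes it for the sample data using in-memory SQLite; the table layout mirrors the running example, and the ratio formula itself is the only addition:

```python
# Sketch: density = actual combinations / theoretical maximum.
# A value near 1.0 means nearly every combination has data; a low
# value means the data is sparse for those dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
    ("Alice", "Furniture", 750),   ("Bob", "Furniture", 1200),
    ("Bob", "Furniture", 2100),    ("Bob", "Electronics", 560),
    ("Carol", "Electronics", 890), ("Carol", "Office", 180),
])

actual, theoretical = conn.execute("""
    SELECT
        (SELECT COUNT(*) FROM
            (SELECT DISTINCT salesperson, product_category FROM sales)),
        COUNT(DISTINCT salesperson) * COUNT(DISTINCT product_category)
    FROM sales
""").fetchone()

density = actual / theoretical
print(actual, theoretical, round(density, 2))  # 6 9 0.67
```

For the sample data, 6 of 9 possible combinations exist, a density of about 0.67.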
Certain multi-column grouping patterns appear frequently across analytical workloads. Recognizing these patterns accelerates query design:
-- PATTERN 1: Time-series breakdown
-- Metrics over time, segmented by category
SELECT
    DATE_TRUNC('month', order_date) AS month,
    product_category,
    SUM(amount) AS revenue,
    COUNT(*) AS orders
FROM orders
GROUP BY DATE_TRUNC('month', order_date), product_category
ORDER BY month, product_category;

-- PATTERN 2: Geographic hierarchy
-- Drill-down from country to city
SELECT
    country,
    region,
    city,
    SUM(sales) AS total_sales,
    COUNT(DISTINCT customer_id) AS customers
FROM transactions
GROUP BY country, region, city
ORDER BY country, region, city;

-- PATTERN 3: Cohort analysis
-- Metrics by user cohort and activity period
SELECT
    DATE_TRUNC('month', signup_date) AS cohort_month,
    DATE_TRUNC('month', activity_date) AS activity_month,
    COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY DATE_TRUNC('month', signup_date), DATE_TRUNC('month', activity_date)
ORDER BY cohort_month, activity_month;

-- PATTERN 4: Cross-dimensional comparison
-- Compare metrics across two independent dimensions
SELECT
    channel,
    device_type,
    SUM(revenue) AS total_revenue,
    AVG(session_duration) AS avg_duration,
    SUM(conversions) * 100.0 / COUNT(*) AS conversion_rate
FROM sessions
GROUP BY channel, device_type
ORDER BY total_revenue DESC;

Building a mental library of these patterns lets you recognize them in business requirements. When someone asks for 'revenue trends by product line,' you immediately know: GROUP BY month, product_category with SUM(revenue).
A subtle but important issue: GROUP BY only returns combinations that exist in your data. If Alice never sold Office products, (Alice, Office) won't appear in results—not even with zeros.
This can be problematic for:
- Time-series charts, where missing dates create gaps instead of zero points
- Reports and dashboards that expect a value in every cell
- Cross-dimension comparisons, where absent combinations silently disappear
-- Problem: Missing date gaps in time series
SELECT sale_date, COUNT(*) AS sales_count
FROM sales
GROUP BY sale_date;
-- If no sales on 2024-01-19, that date doesn't appear at all

-- Solution 1: LEFT JOIN from a dimension table
-- First, have or create a dates table
SELECT
    d.date_value,
    COALESCE(SUM(s.amount), 0) AS revenue,
    COUNT(s.transaction_id) AS sales_count
FROM dates d
LEFT JOIN sales s ON d.date_value = s.sale_date
WHERE d.date_value BETWEEN '2024-01-15' AND '2024-01-20'
GROUP BY d.date_value
ORDER BY d.date_value;
-- Now every date appears, with 0 for dates with no sales

-- Solution 2: Generate all combinations with CROSS JOIN
SELECT
    sp.salesperson,
    pc.product_category,
    COALESCE(SUM(s.amount), 0) AS total_sales
FROM (SELECT DISTINCT salesperson FROM sales) sp
CROSS JOIN (SELECT DISTINCT product_category FROM sales) pc
LEFT JOIN sales s
    ON s.salesperson = sp.salesperson
    AND s.product_category = pc.product_category
GROUP BY sp.salesperson, pc.product_category
ORDER BY sp.salesperson, pc.product_category;
-- Every (salesperson, category) combination appears

The CROSS JOIN + LEFT JOIN pattern:
1. CROSS JOIN the distinct values of each dimension to generate every possible combination
2. LEFT JOIN the fact table so combinations without data survive as NULLs
3. Aggregate with COALESCE to turn those NULLs into zeros
This ensures complete coverage for visualizations and reports that need every cell filled.
Be careful with CROSS JOIN! If you cross join 1000 products × 365 days × 50 regions = 18.25 million rows. Only generate complete combinations when necessary and limit the dimensions appropriately.
When grouping by multiple columns, you can still use any aggregate functions. The aggregates compute over the rows within each combination-group:
-- Comprehensive metrics per salesperson-category combination
SELECT
    salesperson,
    product_category,
    -- Count metrics
    COUNT(*) AS transaction_count,
    COUNT(DISTINCT sale_date) AS active_days,
    -- Sum metrics
    SUM(amount) AS total_revenue,
    -- Average metrics
    ROUND(AVG(amount), 2) AS avg_transaction,
    -- Range metrics
    MIN(amount) AS smallest_sale,
    MAX(amount) AS largest_sale,
    MAX(amount) - MIN(amount) AS sale_range,
    -- Conditional aggregates
    SUM(CASE WHEN amount >= 500 THEN 1 ELSE 0 END) AS large_sales_count,
    SUM(CASE WHEN amount >= 500 THEN amount ELSE 0 END) AS large_sales_revenue,
    -- Percentage calculation
    ROUND(100.0 * SUM(CASE WHEN amount >= 500 THEN 1 ELSE 0 END) / COUNT(*), 1) AS large_sale_pct
FROM sales
GROUP BY salesperson, product_category
ORDER BY total_revenue DESC;

-- Result shows complete profile for each salesperson-category pair:
-- Alice-Electronics: 2 transactions, 2 active days, $775 total, ...
-- Bob-Furniture: 2 transactions, 2 active days, $3300 total, ...
-- etc.

-- Nested logic within aggregates
SELECT
    department,
    job_level,
    COUNT(*) AS employees,
    AVG(salary) AS avg_salary,
    AVG(CASE WHEN tenure_years >= 5 THEN salary END) AS avg_senior_salary,
    AVG(CASE WHEN tenure_years < 5 THEN salary END) AS avg_junior_salary
FROM employees
GROUP BY department, job_level;

Key techniques for complex multi-column aggregations:
| Technique | Purpose | Example |
|---|---|---|
| Conditional COUNT | Count subset | SUM(CASE WHEN x THEN 1 ELSE 0 END) |
| Conditional SUM | Sum subset | SUM(CASE WHEN x THEN amount ELSE 0 END) |
| Conditional AVG | Average subset | AVG(CASE WHEN x THEN value END) |
| Ratio from aggregates | Percentage | SUM(a) * 100.0 / SUM(b) |
| Range calculation | Spread | MAX(x) - MIN(x) |
You cannot directly nest aggregates (AVG(SUM(amount)) is invalid). If you need to aggregate over aggregated results, use a subquery or CTE: first GROUP BY to get sums, then AVG over those sums in an outer query.
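A minimal sketch of that two-step pattern, again using in-memory SQLite with the sample data (the alias `per_person_total` is illustrative): the inner query computes per-salesperson sums, and the outer query averages those sums.

```python
# Sketch: AVG(SUM(amount)) is invalid, so aggregate in two stages:
# inner query -> one SUM per salesperson; outer query -> AVG of those sums.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("Alice", 450), ("Alice", 325), ("Alice", 750),
    ("Bob", 1200), ("Bob", 2100), ("Bob", 560),
    ("Carol", 890), ("Carol", 180),
])

avg_total = conn.execute("""
    SELECT AVG(per_person_total)
    FROM (
        SELECT salesperson, SUM(amount) AS per_person_total
        FROM sales
        GROUP BY salesperson
    ) AS totals
""").fetchone()[0]

# Alice: 1525, Bob: 3860, Carol: 1070 -> average 2151.67
print(round(avg_total, 2))  # 2151.67
```

The same structure works with a CTE (`WITH totals AS (...) SELECT AVG(per_person_total) FROM totals`) in databases that support it.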
Multi-column GROUP BY is the key to dimensional analysis in SQL. Let's consolidate the key concepts:
- Each unique combination of values across the grouping columns forms one group (the columns act as a composite key)
- The result contains only combinations that exist in the data, bounded above by the product of distinct values per column
- Adding grouping columns increases granularity (more, smaller groups); removing columns zooms out
- Missing combinations produce no rows; use a dimension table or the CROSS JOIN + LEFT JOIN pattern when complete coverage is needed
- Column order does not change the groups, but ORDER BY is required for deterministic output ordering
What's Next:
With grouping syntax mastered, we need to understand the rules and constraints that govern valid GROUP BY queries. The next page covers grouping rules in depth: what can appear in SELECT, the relationship between grouping columns and aggregates, and how different databases enforce (or don't enforce) these rules.
You now understand how multi-column GROUP BY creates combination-based groups, can predict row counts, and recognize common analytical patterns. Next, we'll formalize the rules that ensure grouped queries are valid and produce meaningful results.