Single-column grouping answers questions like 'revenue by salesperson' or 'count by category.' But real analytical needs are often more nuanced: revenue by salesperson and product category, order counts by region and month, average salary by department and job level.
These questions require multi-column GROUP BY—grouping by combinations of values across multiple dimensions simultaneously. This capability transforms SQL from simple summarization into powerful multi-dimensional analysis.
Understanding multi-column grouping is essential for building the kind of analytical queries that drive business intelligence dashboards, executive reports, and data-driven decision making.
By the end of this page, you will understand how multi-column GROUP BY creates combination-based groups, predict result row counts from distinct combinations, design queries at the appropriate level of granularity, and apply hierarchical grouping patterns for drill-down analysis.
When you GROUP BY multiple columns, each unique combination of values across all grouping columns forms one group. The columns are treated as a composite key—rows must match on ALL grouping columns to belong to the same group.
SELECT salesperson, product_category, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category;
This query produces one row for each unique (salesperson, product_category) pair that exists in the data.
Visualization of combination-based grouping:
Original Sales Data:
┌─────────────┬──────────────────┬─────────┐
│ salesperson │ product_category │ amount  │
├─────────────┼──────────────────┼─────────┤
│ Alice       │ Electronics      │ $450    │
│ Alice       │ Electronics      │ $325    │
│ Alice       │ Furniture        │ $750    │
│ Bob         │ Furniture        │ $1,200  │
│ Bob         │ Furniture        │ $2,100  │
│ Bob         │ Electronics      │ $560    │
│ Carol       │ Electronics      │ $890    │
│ Carol       │ Office           │ $180    │
└─────────────┴──────────────────┴─────────┘
GROUP BY salesperson, product_category:
┌──────────────────────────────────────┐
│ Groups formed (6 combinations):      │
├──────────────────────────────────────┤
│ (Alice, Electronics) → $450+$325     │
│ (Alice, Furniture)   → $750          │
│ (Bob, Furniture)     → $1,200+$2,100 │
│ (Bob, Electronics)   → $560          │
│ (Carol, Electronics) → $890          │
│ (Carol, Office)      → $180          │
└──────────────────────────────────────┘
Result (6 rows):
┌─────────────┬──────────────────┬─────────┐
│ salesperson │ product_category │ total   │
├─────────────┼──────────────────┼─────────┤
│ Alice       │ Electronics      │ $775    │
│ Alice       │ Furniture        │ $750    │
│ Bob         │ Furniture        │ $3,300  │
│ Bob         │ Electronics      │ $560    │
│ Carol       │ Electronics      │ $890    │
│ Carol       │ Office           │ $180    │
└─────────────┴──────────────────┴─────────┘
Key observation: The result has 6 rows because there are 6 distinct (salesperson, product_category) combinations in the data—not 3 salespeople × 3 categories = 9. Some combinations don't exist (Alice never sold Office, Carol never sold Furniture, etc.).
Multi-column GROUP BY returns only combinations that actually exist in your data. The maximum possible rows is the product of distinct values in each column, but the actual count depends on which combinations have data. Empty combinations produce no output row.
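This behavior is easy to verify directly. The sketch below rebuilds the sample sales table in an in-memory SQLite database (the table name and values mirror the example above; SQLite is just one convenient engine for demonstration) and confirms that GROUP BY over two columns yields 6 rows, not the 9 of the theoretical maximum:

```python
# Demonstration: multi-column GROUP BY emits one row per combination
# that actually exists in the data (6 here, not 3 x 3 = 9).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
    ("Alice", "Furniture", 750),   ("Bob", "Furniture", 1200),
    ("Bob", "Furniture", 2100),    ("Bob", "Electronics", 560),
    ("Carol", "Electronics", 890), ("Carol", "Office", 180),
])

grouped = conn.execute(
    "SELECT salesperson, product_category, SUM(amount) AS total "
    "FROM sales "
    "GROUP BY salesperson, product_category"
).fetchall()

# Only existing combinations appear; (Alice, Office) is absent entirely.
print(len(grouped))  # 6
```

Running this also shows the aggregated totals, e.g. `("Alice", "Electronics", 775)` and `("Bob", "Furniture", 3300)`, matching the result table above.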
Adding columns to GROUP BY increases granularity—you get more groups with fewer rows each. Conversely, removing columns decreases granularity—fewer groups, more rows aggregated into each.
Think of it as a zoom level on your data:
| GROUP BY | Granularity | Group Count | Rows per Group |
|---|---|---|---|
| (none) | Lowest | 1 (entire table) | All rows |
| salesperson | Low | 3 | ~2-3 each |
| salesperson, product_category | Medium | 6 | 1-2 each |
| salesperson, product_category, sale_date | High | 8 | 1 each |
Choosing the right granularity depends on your analytical question.
-- Low granularity: Total by salesperson only
SELECT salesperson, SUM(amount) AS total
FROM sales
GROUP BY salesperson;
-- Result: 3 rows (one per salesperson)

-- Medium granularity: By salesperson and category
SELECT salesperson, product_category, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category;
-- Result: 6 rows (one per salesperson-category combination)

-- High granularity: By salesperson, category, and date
SELECT salesperson, product_category, sale_date, SUM(amount) AS total
FROM sales
GROUP BY salesperson, product_category, sale_date;
-- Result: 8 rows (each transaction becomes its own group)

-- At maximum granularity, GROUP BY essentially does nothing
-- (each row is its own group, aggregates just return the value)

If every combination results in a single-row group, your GROUP BY may be too granular. Aggregates become trivial (SUM of one value = that value, AVG of one value = that value). This suggests either your grouping is too detailed or the data itself is sparse at that level.
The order of columns in GROUP BY does not affect the grouping result—the same combinations form regardless of order. However, order can matter for readability (matching the SELECT list order) and, in some databases, for the incidental ordering of output rows; use ORDER BY whenever output order matters.
-- These produce identical groupings (same combinations)
SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY salesperson, product_category;

SELECT salesperson, product_category, SUM(amount)
FROM sales
GROUP BY product_category, salesperson;
-- Different order, same result

-- But the output row ORDER may differ
-- (use ORDER BY to guarantee specific ordering)

-- Good practice: Match SELECT and GROUP BY order
SELECT
    region,
    department,
    product_category,
    SUM(amount) AS total
FROM sales
GROUP BY
    region,           -- Primary dimension
    department,       -- Secondary dimension
    product_category  -- Tertiary dimension
ORDER BY region, department, product_category;

Conceptual hierarchy pattern:
When grouping by hierarchical data (region → department → team), list columns from broadest to most specific. This creates a natural reading order:
-- Hierarchical grouping (broad to specific)
SELECT
    country,     -- Broadest
    region,
    city,
    store_id,    -- Most specific
    SUM(sales) AS total
FROM retail_sales
GROUP BY country, region, city, store_id
ORDER BY country, region, city, store_id;
This pattern supports drill-down analysis where you can easily remove the rightmost column to zoom out to a higher level.
GROUP BY determines WHAT groups exist. ORDER BY determines HOW results are sorted. They are independent operations. Without ORDER BY, grouped results may appear in any order. Always add ORDER BY if you need specific output ordering.
Being able to predict the number of result rows helps validate query correctness and anticipate result set sizes:
Theoretical maximum: Product of distinct values in each grouping column
Actual count: Number of combinations that exist in your data
Distinct salespeople: 3 (Alice, Bob, Carol)
Distinct categories: 3 (Electronics, Furniture, Office)
Theoretical max: 3 × 3 = 9
Actual combinations: 6 (not everyone sold every category)
-- Query to find distinct values in each potential grouping column
SELECT
    COUNT(DISTINCT salesperson) AS distinct_salespeople,
    COUNT(DISTINCT product_category) AS distinct_categories,
    COUNT(DISTINCT sale_date) AS distinct_dates
FROM sales;
-- Example result: 3 salespeople, 3 categories, 4 dates
-- Theoretical max: 3 × 3 × 4 = 36 combinations

-- Count actual combinations at different granularities
SELECT COUNT(*) AS group_count
FROM (
    SELECT DISTINCT salesperson, product_category
    FROM sales
) AS combinations;
-- Result: 6 actual combinations

-- Or using GROUP BY with counting
SELECT COUNT(*) AS result_rows
FROM (
    SELECT salesperson, product_category
    FROM sales
    GROUP BY salesperson, product_category
) AS grouped;
-- Same result: 6 rows

| GROUP BY Columns | Distinct Values | Theoretical Max | Actual Groups* |
|---|---|---|---|
| salesperson | 3 | 3 | 3 |
| product_category | 3 | 3 | 3 |
| sale_date | 4 | 4 | 4 |
| salesperson, product_category | 3 × 3 | 9 | 6 |
| salesperson, sale_date | 3 × 4 | 12 | 7 |
| product_category, sale_date | 3 × 4 | 12 | 7 |
| salesperson, product_category, sale_date | 3 × 3 × 4 | 36 | 8 |
When actual combinations are much lower than theoretical maximum, your data is 'sparse' for those dimensions. This is common (not every product sells in every region). Very dense data (actual ≈ theoretical) suggests high coverage across all combinations.
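One way to quantify this is a density ratio (actual combinations divided by the theoretical maximum). The sketch below computes it for the sample data using in-memory SQLite; the table layout mirrors the running example, and the ratio formula itself is the only addition:

```python
# Sketch: density = actual combinations / theoretical maximum.
# A value near 1.0 means nearly every combination has data; a low
# value means the data is sparse for those dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesperson TEXT, product_category TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", "Electronics", 450), ("Alice", "Electronics", 325),
    ("Alice", "Furniture", 750),   ("Bob", "Furniture", 1200),
    ("Bob", "Furniture", 2100),    ("Bob", "Electronics", 560),
    ("Carol", "Electronics", 890), ("Carol", "Office", 180),
])

actual, theoretical = conn.execute("""
    SELECT
        (SELECT COUNT(*) FROM
            (SELECT DISTINCT salesperson, product_category FROM sales)),
        COUNT(DISTINCT salesperson) * COUNT(DISTINCT product_category)
    FROM sales
""").fetchone()

density = actual / theoretical
print(actual, theoretical, round(density, 2))  # 6 9 0.67
```

For the sample data, 6 of 9 possible combinations exist, a density of about 0.67.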
Certain multi-column grouping patterns appear frequently across analytical workloads. Recognizing these patterns accelerates query design:
-- PATTERN 1: Time-series breakdown
-- Metrics over time, segmented by category
SELECT
    DATE_TRUNC('month', order_date) AS month,
    product_category,
    SUM(amount) AS revenue,
    COUNT(*) AS orders
FROM orders
GROUP BY DATE_TRUNC('month', order_date), product_category
ORDER BY month, product_category;

-- PATTERN 2: Geographic hierarchy
-- Drill-down from country to city
SELECT
    country,
    region,
    city,
    SUM(sales) AS total_sales,
    COUNT(DISTINCT customer_id) AS customers
FROM transactions
GROUP BY country, region, city
ORDER BY country, region, city;

-- PATTERN 3: Cohort analysis
-- Metrics by user cohort and activity period
SELECT
    DATE_TRUNC('month', signup_date) AS cohort_month,
    DATE_TRUNC('month', activity_date) AS activity_month,
    COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY DATE_TRUNC('month', signup_date), DATE_TRUNC('month', activity_date)
ORDER BY cohort_month, activity_month;

-- PATTERN 4: Cross-dimensional comparison
-- Compare metrics across two independent dimensions
SELECT
    channel,
    device_type,
    SUM(revenue) AS total_revenue,
    AVG(session_duration) AS avg_duration,
    SUM(conversions) * 100.0 / COUNT(*) AS conversion_rate
FROM sessions
GROUP BY channel, device_type
ORDER BY total_revenue DESC;

Building a mental library of these patterns lets you recognize them in business requirements. When someone asks for 'revenue trends by product line,' you immediately know: GROUP BY month, product_category with SUM(revenue).
A subtle but important issue: GROUP BY only returns combinations that exist in your data. If Alice never sold Office products, (Alice, Office) won't appear in results—not even with zeros.
This can be problematic for:
- Time-series charts, where missing dates create gaps instead of zero points
- Reports and dashboards that expect a value in every cell
- Cross-dimension comparisons, where absent combinations silently disappear
-- Problem: Missing date gaps in time series
SELECT sale_date, COUNT(*) AS sales_count
FROM sales
GROUP BY sale_date;
-- If no sales on 2024-01-19, that date doesn't appear at all

-- Solution 1: LEFT JOIN from a dimension table
-- First, have or create a dates table
SELECT
    d.date_value,
    COALESCE(SUM(s.amount), 0) AS revenue,
    COUNT(s.transaction_id) AS sales_count
FROM dates d
LEFT JOIN sales s ON d.date_value = s.sale_date
WHERE d.date_value BETWEEN '2024-01-15' AND '2024-01-20'
GROUP BY d.date_value
ORDER BY d.date_value;
-- Now every date appears, with 0 for dates with no sales

-- Solution 2: Generate all combinations with CROSS JOIN
SELECT
    sp.salesperson,
    pc.product_category,
    COALESCE(SUM(s.amount), 0) AS total_sales
FROM (SELECT DISTINCT salesperson FROM sales) sp
CROSS JOIN (SELECT DISTINCT product_category FROM sales) pc
LEFT JOIN sales s
    ON s.salesperson = sp.salesperson
    AND s.product_category = pc.product_category
GROUP BY sp.salesperson, pc.product_category
ORDER BY sp.salesperson, pc.product_category;
-- Every (salesperson, category) combination appears

The CROSS JOIN + LEFT JOIN pattern:
1. CROSS JOIN the distinct values of each dimension to generate every possible combination
2. LEFT JOIN the fact table so combinations without data survive as NULLs
3. Aggregate with COALESCE to turn those NULLs into zeros
This ensures complete coverage for visualizations and reports that need every cell filled.
Be careful with CROSS JOIN! If you cross join 1000 products × 365 days × 50 regions = 18.25 million rows. Only generate complete combinations when necessary and limit the dimensions appropriately.
When grouping by multiple columns, you can still use any aggregate functions. The aggregates compute over the rows within each combination-group:
-- Comprehensive metrics per salesperson-category combination
SELECT
    salesperson,
    product_category,
    -- Count metrics
    COUNT(*) AS transaction_count,
    COUNT(DISTINCT sale_date) AS active_days,
    -- Sum metrics
    SUM(amount) AS total_revenue,
    -- Average metrics
    ROUND(AVG(amount), 2) AS avg_transaction,
    -- Range metrics
    MIN(amount) AS smallest_sale,
    MAX(amount) AS largest_sale,
    MAX(amount) - MIN(amount) AS sale_range,
    -- Conditional aggregates
    SUM(CASE WHEN amount >= 500 THEN 1 ELSE 0 END) AS large_sales_count,
    SUM(CASE WHEN amount >= 500 THEN amount ELSE 0 END) AS large_sales_revenue,
    -- Percentage calculation
    ROUND(100.0 * SUM(CASE WHEN amount >= 500 THEN 1 ELSE 0 END) / COUNT(*), 1) AS large_sale_pct
FROM sales
GROUP BY salesperson, product_category
ORDER BY total_revenue DESC;

-- Result shows complete profile for each salesperson-category pair:
-- Alice-Electronics: 2 transactions, 2 active days, $775 total, ...
-- Bob-Furniture: 2 transactions, 2 active days, $3300 total, ...
-- etc.

-- Nested logic within aggregates
SELECT
    department,
    job_level,
    COUNT(*) AS employees,
    AVG(salary) AS avg_salary,
    AVG(CASE WHEN tenure_years >= 5 THEN salary END) AS avg_senior_salary,
    AVG(CASE WHEN tenure_years < 5 THEN salary END) AS avg_junior_salary
FROM employees
GROUP BY department, job_level;

Key techniques for complex multi-column aggregations:
| Technique | Purpose | Example |
|---|---|---|
| Conditional COUNT | Count subset | SUM(CASE WHEN x THEN 1 ELSE 0 END) |
| Conditional SUM | Sum subset | SUM(CASE WHEN x THEN amount ELSE 0 END) |
| Conditional AVG | Average subset | AVG(CASE WHEN x THEN value END) |
| Ratio from aggregates | Percentage | SUM(a) * 100.0 / SUM(b) |
| Range calculation | Spread | MAX(x) - MIN(x) |
You cannot directly nest aggregates (AVG(SUM(amount)) is invalid). If you need to aggregate over aggregated results, use a subquery or CTE: first GROUP BY to get sums, then AVG over those sums in an outer query.
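A minimal sketch of that two-step pattern, again using in-memory SQLite with the sample data (the alias `per_person_total` is illustrative): the inner query computes per-salesperson sums, and the outer query averages those sums.

```python
# Sketch: AVG(SUM(amount)) is invalid, so aggregate in two stages:
# inner query -> one SUM per salesperson; outer query -> AVG of those sums.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesperson TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("Alice", 450), ("Alice", 325), ("Alice", 750),
    ("Bob", 1200), ("Bob", 2100), ("Bob", 560),
    ("Carol", 890), ("Carol", 180),
])

avg_total = conn.execute("""
    SELECT AVG(per_person_total)
    FROM (
        SELECT salesperson, SUM(amount) AS per_person_total
        FROM sales
        GROUP BY salesperson
    ) AS totals
""").fetchone()[0]

# Alice: 1525, Bob: 3860, Carol: 1070 -> average 2151.67
print(round(avg_total, 2))  # 2151.67
```

The same structure works with a CTE (`WITH totals AS (...) SELECT AVG(per_person_total) FROM totals`) in databases that support it.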
Multi-column GROUP BY is the key to dimensional analysis in SQL. Let's consolidate the key concepts:
- Each unique combination of values across the grouping columns forms one group (the columns act as a composite key)
- The result contains only combinations that exist in the data, bounded above by the product of distinct values per column
- Adding grouping columns increases granularity (more, smaller groups); removing columns zooms out
- Missing combinations produce no rows; use a dimension table or the CROSS JOIN + LEFT JOIN pattern when complete coverage is needed
- Column order does not change the groups, but ORDER BY is required for deterministic output ordering
What's Next:
With grouping syntax mastered, we need to understand the rules and constraints that govern valid GROUP BY queries. The next page covers grouping rules in depth: what can appear in SELECT, the relationship between grouping columns and aggregates, and how different databases enforce (or don't enforce) these rules.
You now understand how multi-column GROUP BY creates combination-based groups, can predict row counts, and recognize common analytical patterns. Next, we'll formalize the rules that ensure grouped queries are valid and produce meaningful results.