SQL Aggregate Functions: Summarizing Data with Precision

Level: Intermediate

Duration: 75 mins

Topic: Aggregate Functions

COUNT: Counting Rows and Values with Precision

The Foundation of Data Summarization

Every meaningful interaction with data begins with a fundamental question: How many? How many customers placed orders last month? How many products are in stock? How many transactions failed? The COUNT function is SQL's answer to this universal need—a deceptively simple construct that, when understood deeply, becomes an indispensable tool in the data professional's arsenal.

Yet beneath this apparent simplicity lies nuance that trips up even experienced developers. The difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column) is not merely syntactic—it reflects fundamentally different semantic operations with distinct performance characteristics and results. Understanding these distinctions is the difference between getting correct answers and getting confident-but-wrong answers.

What You Will Learn

By the end of this page, you will master the COUNT function in all its variations. You'll understand its precise semantics, NULL handling behavior, performance implications, and when to choose each variant. You'll see how COUNT operates as a foundational aggregate that sets the pattern for understanding all other aggregation functions.

Understanding Aggregate Functions

Before diving into COUNT specifically, we must understand what aggregate functions are and how they differ fundamentally from scalar functions.

Scalar functions operate on individual values and return individual values. Functions like UPPER(), ROUND(), or LENGTH() take their arguments from a single row and produce one output for that row. If you have 1,000 rows, a scalar function produces 1,000 results.

Aggregate functions operate on sets of values and collapse them into single summary values. They transform many rows into fewer rows—often just one. This collapse is the essence of aggregation: converting detail-level data into summary-level insight.

Scalar vs. Aggregate Functions
Characteristic	Scalar Functions	Aggregate Functions
Input	Single value	Set of values (potentially many rows)
Output	One value per input row	One value per group (or entire result set)
Examples	UPPER(), LOWER(), ROUND(), CONCAT()	COUNT(), SUM(), AVG(), MIN(), MAX()
Row preservation	Maintains row count	Reduces row count
NULL handling	Usually returns NULL if input is NULL	Varies by function; often ignores NULLs

The aggregation contract:

When you invoke an aggregate function, you're making an implicit contract with the database engine:

Group identification: Which rows belong together for aggregation? (Determined by GROUP BY or, if absent, the entire result set forms one group)
Value extraction: What values should be aggregated? (The expression inside the aggregate function)
Reduction operation: How should the values be combined? (The specific aggregate function's logic)

This three-part contract applies to every aggregate function, and understanding it is essential for writing correct aggregation queries.

The Mental Model

Think of aggregate functions as "reducers" in functional programming. They take a collection and reduce it to a single value. Each group of rows becomes one row in the output. No GROUP BY means one group (all rows). With GROUP BY, you get as many output rows as there are distinct group combinations.
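The reducer analogy can be made concrete with a small sketch. It uses Python's built-in sqlite3 module and a hypothetical three-row employees table (the table, columns, and data are illustrative assumptions, not part of the lesson's schema) to show COUNT(*) written as a reduce, then the same count with and without GROUP BY.

```python
import sqlite3
from functools import reduce

# Hypothetical sample data: (department_id, name)
rows = [(10, "Alice"), (10, "Bob"), (20, "Carol")]

# COUNT(*) expressed as a reducer: start at 0 and add 1 for every row.
count_via_reduce = reduce(lambda acc, _row: acc + 1, rows, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# No GROUP BY: all rows form one group, so exactly one output row.
ungrouped = conn.execute("SELECT COUNT(*) FROM employees").fetchall()

# GROUP BY: one output row per distinct department_id.
grouped = conn.execute(
    "SELECT department_id, COUNT(*) FROM employees "
    "GROUP BY department_id ORDER BY department_id"
).fetchall()

print(count_via_reduce, ungrouped, grouped)
```

The ungrouped query returns a single row, while the grouped query returns one row per department.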

COUNT(*): Counting All Rows

The COUNT(*) form is the most straightforward variant: count every row in the group, unconditionally. The asterisk doesn't mean "all columns"—it means "the row itself." This distinction is crucial.

COUNT(*) answers: How many rows exist in this set?

It doesn't examine column values. It doesn't care about NULLs. It simply counts the presence of rows.

count_star_basics.sql
-- Total number of rows in the employees table
SELECT COUNT(*) AS total_employees
FROM employees;
-- Result: Returns the total row count, regardless of NULL values
 
-- Count rows with a WHERE filter
SELECT COUNT(*) AS active_employees
FROM employees
WHERE status = 'active';
-- Result: Counts only rows where the condition is true
 
-- Count within groups
SELECT department_id, COUNT(*) AS dept_size
FROM employees
GROUP BY department_id;
-- Result: One row per department, with employee count
 
-- Count with multiple conditions
SELECT COUNT(*) AS senior_engineers
FROM employees
WHERE job_title = 'Engineer'
  AND hire_date < '2020-01-01';

Why COUNT(*) and not COUNT(column)?

Beginners often write COUNT(id) or COUNT(some_column) when they mean COUNT(*). While these may produce identical results in many cases, they have different semantics:

COUNT(*) counts rows, including rows where every column is NULL
COUNT(column) counts non-NULL values in that column

If your intent is to count rows, use COUNT(*). It states the intent explicitly and can be marginally faster because the engine doesn't need to check a specific column for NULLs, although some optimizers treat COUNT(column) like COUNT(*) when the column is declared NOT NULL.
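A quick way to see the difference is a row in which every column is NULL. The sketch below uses Python's sqlite3 module with a hypothetical two-column table named t (an assumption made for illustration only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, note TEXT)")
# Second row is entirely NULL.
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (None, None)])

# COUNT(*) counts the all-NULL row; COUNT(id) skips it.
star_count, id_count = conn.execute(
    "SELECT COUNT(*), COUNT(id) FROM t"
).fetchone()

print(star_count, id_count)
```

The two counts differ by exactly the number of rows whose id is NULL.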

Performance Note

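In some databases, COUNT(*) can be answered from metadata or specialized counters when the whole table is counted with no WHERE clause; MySQL's MyISAM engine, for example, stores an exact row count. Transactional MVCC engines such as InnoDB and PostgreSQL generally cannot do this, because concurrent transactions may see different row counts, so they scan the table or an index. With complex predicates, the engine must still scan or seek through the data. Don't assume COUNT(*) is instantaneous for large tables.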

COUNT(*) Behavior Examples
employees table	department_id	name	salary
Row 1	10	Alice	75000
Row 2	10	Bob	NULL
Row 3	20	NULL	60000
Row 4	20	Diana	NULL
Row 5	NULL	Eve	55000

count_star_examples.sql
-- Given the table above:
 
SELECT COUNT(*) FROM employees;              -- Result: 5 (all rows counted)
 
SELECT COUNT(*) FROM employees 
WHERE department_id = 10;                    -- Result: 2 (Alice and Bob)
 
SELECT COUNT(*) FROM employees 
WHERE salary IS NOT NULL;                    -- Result: 3 (Alice, row 3, Eve)
 
SELECT department_id, COUNT(*) AS cnt
FROM employees
GROUP BY department_id;
-- Result:
-- department_id | cnt
-- 10            | 2
-- 20            | 2
-- NULL          | 1

COUNT(column): Counting Non-NULL Values

When you specify a column inside COUNT(), the semantic changes fundamentally: only non-NULL values are counted. This is one of the most important behaviors to understand in SQL aggregation.

COUNT(column) answers: How many non-NULL values exist for this column in the set?

This makes COUNT(column) a null-filtering operation, not just a counting operation.

count_column_basics.sql
-- Count employees with non-NULL salaries
SELECT COUNT(salary) AS employees_with_salary
FROM employees;
-- If salary is NULL for some employees, they're excluded
 
-- Compare COUNT(*) vs COUNT(column)
SELECT 
    COUNT(*) AS total_rows,
    COUNT(salary) AS with_salary,
    COUNT(bonus) AS with_bonus
FROM employees;
-- These numbers may all differ!
 
-- Using COUNT(column) to find data completeness
SELECT 
    COUNT(*) AS total_records,
    COUNT(email) AS has_email,
    COUNT(phone) AS has_phone,
    COUNT(address) AS has_address
FROM customers;

Practical use cases for COUNT(column):

Data quality assessment: Quickly identify how many records have values for optional fields
Completeness metrics: Calculate percentage of non-NULL values: COUNT(email) * 100.0 / COUNT(*)
Sparse data analysis: Understand fill rates for wide, sparse tables
Conditional counting: When combined with expressions, selectively count based on criteria
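The completeness metric can be sketched as follows, using Python's sqlite3 module and a hypothetical customers table with one email out of three rows. It also shows two details that are easy to miss: multiplying by 100.0 rather than 100 avoids integer division, and wrapping the denominator in NULLIF yields NULL instead of an error on an empty table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    [("a@example.com",), (None,), (None,)],
)

pct_query = "SELECT COUNT(email) * 100.0 / NULLIF(COUNT(*), 0) FROM customers"

# Decimal arithmetic: one of three rows has an email.
email_pct = conn.execute(pct_query).fetchone()[0]

# Integer arithmetic truncates in SQLite (and in several other engines).
truncated_pct = conn.execute(
    "SELECT COUNT(email) * 100 / COUNT(*) FROM customers"
).fetchone()[0]

# On an empty table the NULLIF guard produces NULL (None in Python).
conn.execute("DELETE FROM customers")
empty_pct = conn.execute(pct_query).fetchone()[0]

print(email_pct, truncated_pct, empty_pct)
```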

Expression Evaluation

COUNT(expression) evaluates its argument for each row. If the expression evaluates to NULL, that row doesn't increment the count. This means COUNT(NULL) always returns 0, and COUNT(column_a + column_b) counts only rows where both operands are non-NULL, since arithmetic on NULL yields NULL.
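These rules can be checked with a small sketch using Python's sqlite3 module. The four-row employees table with salary and bonus columns is a hypothetical example chosen so that only one row has both values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INTEGER, bonus INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(75000, 5000), (60000, None), (None, 1000), (None, None)],
)

count_null, count_sum, count_coalesced = conn.execute(
    "SELECT COUNT(NULL), "                # argument is always NULL
    "       COUNT(salary + bonus), "      # NULL unless both are present
    "       COUNT(COALESCE(bonus, 0)) "   # never NULL, so every row counts
    "FROM employees"
).fetchone()

print(count_null, count_sum, count_coalesced)
```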

count_expressions.sql
-- Counting with expressions
SELECT 
    COUNT(salary + bonus) AS with_total_comp,  -- NULL if either is NULL
    COUNT(COALESCE(bonus, 0)) AS all_rows      -- COALESCE prevents NULL
FROM employees;
 
-- Using CASE for conditional counting (alternative to FILTER)
SELECT 
    COUNT(*) AS total,
    COUNT(CASE WHEN status = 'active' THEN 1 END) AS active_count,
    COUNT(CASE WHEN salary > 100000 THEN 1 END) AS high_earners
FROM employees;
-- CASE returns NULL when condition is false, so COUNT excludes those
 
-- This pattern is so common that the SQL standard defines a FILTER clause for it:
-- PostgreSQL: COUNT(*) FILTER (WHERE status = 'active')
-- SQL Server has no FILTER clause, so the CASE form is used there:
--   COUNT(CASE WHEN status = 'active' THEN 1 END)
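One detail of the CASE pattern deserves a sketch: it only works because CASE without ELSE yields NULL. Adding ELSE 0 makes every row non-NULL, so COUNT counts them all. The example below uses Python's sqlite3 module and a hypothetical employees table with a status column; SUM with ELSE 0 is shown as an equivalent alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (status TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("active",), ("active",), ("inactive",), (None,)],
)

correct, else_zero, via_sum = conn.execute(
    "SELECT "
    "  COUNT(CASE WHEN status = 'active' THEN 1 END), "         # NULL when false
    "  COUNT(CASE WHEN status = 'active' THEN 1 ELSE 0 END), "  # 0 is not NULL
    "  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) "     # adds the 1s
    "FROM employees"
).fetchone()

print(correct, else_zero, via_sum)
```

The ELSE 0 variant silently returns the total row count, which is why the ELSE branch is omitted when counting.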

COUNT(*) vs COUNT(column) Comparison
Query	Result (using sample table)	Explanation
COUNT(*)	5	Counts all rows regardless of NULL values
COUNT(name)	4	Excludes row 3 where name is NULL
COUNT(salary)	3	Excludes rows 2 and 4 where salary is NULL
COUNT(department_id)	4	Excludes row 5 where department_id is NULL
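The comparison table can be reproduced by loading the five sample rows into an in-memory database. This sketch uses Python's sqlite3 module; the column types are assumptions, since the sample table lists only values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (department_id INTEGER, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (10, "Alice", 75000),
        (10, "Bob", None),
        (20, None, 60000),
        (20, "Diana", None),
        (None, "Eve", 55000),
    ],
)

counts = conn.execute(
    "SELECT COUNT(*), COUNT(name), COUNT(salary), COUNT(department_id) "
    "FROM employees"
).fetchone()

print(counts)
```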

COUNT(DISTINCT): Counting Unique Values

COUNT(DISTINCT expression) combines deduplication with counting: count each unique non-NULL value only once. This is remarkably powerful for understanding cardinality—the number of distinct values in a dataset.

COUNT(DISTINCT column) answers: How many different non-NULL values exist for this column?

count_distinct_basics.sql
-- How many distinct departments have employees?
SELECT COUNT(DISTINCT department_id) AS num_departments
FROM employees;
 
-- How many unique customers placed orders this month?
-- (DATE_TRUNC is PostgreSQL syntax; other engines use different date functions)
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE);
 
-- Compare total orders vs unique customers
-- (NULLIF avoids a division-by-zero error when there are no customers)
SELECT 
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) * 1.0 / NULLIF(COUNT(DISTINCT customer_id), 0) AS orders_per_customer
FROM orders;

The deduplication process:

Internally, COUNT(DISTINCT) must track which values it has already seen. This typically involves:

Extracting the column value from each row
Checking if it's NULL (excluded if so)
Checking if it's already in a tracking structure (hash set, sorted list, etc.)
Adding to the structure if new
Returning the final size of the structure

This explains why COUNT(DISTINCT) is often slower than COUNT(*) or COUNT(column)—it requires additional memory and computational overhead for deduplication.
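The steps above can be sketched in a few lines of Python. This is a conceptual model using a hash set, not how any particular engine implements it; the function name count_distinct is hypothetical.

```python
from typing import Hashable, Iterable, Optional


def count_distinct(values: Iterable[Optional[Hashable]]) -> int:
    """Model of COUNT(DISTINCT column) using a hash set."""
    seen = set()              # tracking structure; grows with cardinality
    for value in values:      # extract the value from each row
        if value is None:     # NULLs are excluded
            continue
        if value not in seen: # already tracked?
            seen.add(value)   # add if new
    return len(seen)          # final size of the structure


distinct_departments = count_distinct([10, 10, 20, None, 20, 30])
print(distinct_departments)
```

The set holds one entry per distinct value, which is where the extra memory cost comes from on high-cardinality columns.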

Performance Implications

COUNT(DISTINCT) on high-cardinality columns (many unique values) can be expensive. For millions of rows with mostly unique values, the engine must maintain millions of entries in its deduplication structure and may spill to disk once they no longer fit in memory. For approximate counts at scale, consider HyperLogLog-based alternatives such as the hll extension for PostgreSQL (postgresql-hll) or BigQuery's APPROX_COUNT_DISTINCT.

count_distinct_advanced.sql
-- Multiple DISTINCT counts in one query
SELECT 
    COUNT(DISTINCT department_id) AS num_departments,
    COUNT(DISTINCT job_title) AS num_job_titles,
    COUNT(DISTINCT manager_id) AS num_managers
FROM employees;
 
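-- DISTINCT on expressions (YEAR() is MySQL/SQL Server syntax;
-- standard SQL and PostgreSQL use EXTRACT(YEAR FROM hire_date))
SELECT COUNT(DISTINCT YEAR(hire_date)) AS years_with_hires
FROM employees;
 
-- Counting distinct combinations
-- Note: COUNT(DISTINCT col1, col2) is NOT standard SQL (MySQL accepts it as an extension).
-- Concatenating the columns can miscount when a value is NULL or contains the delimiter,
-- so a DISTINCT subquery is the safer portable pattern:
SELECT COUNT(*) AS unique_combinations
FROM (SELECT DISTINCT department_id, job_title FROM employees) AS combos;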
 
-- PostgreSQL alternative: row-based distinct
SELECT COUNT(DISTINCT (department_id, job_title)) AS unique_combinations
FROM employees;
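Counting distinct combinations by concatenating columns with a delimiter can miscount when a value itself contains the delimiter. The sketch below uses Python's sqlite3 module with deliberately awkward, hypothetical text values, and SQLite's || operator for concatenation, to compare that approach with a DISTINCT subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department_id TEXT, job_title TEXT)")
# Two different pairs that concatenate to the same string 'a-b-c'.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("a-b", "c"), ("a", "b-c")],
)

via_concat = conn.execute(
    "SELECT COUNT(DISTINCT department_id || '-' || job_title) FROM employees"
).fetchone()[0]

via_subquery = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT department_id, job_title FROM employees)"
).fetchone()[0]

print(via_concat, via_subquery)
```

The subquery compares the columns separately, so the two pairs stay distinct.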

Common use cases for COUNT(DISTINCT):

User metrics: Counting unique visitors, unique sessions, unique customers
Cardinality analysis: Understanding dataset characteristics for optimization
Data validation: Ensuring expected number of categories, codes, or identifiers
Funnel analysis: Counting users who performed specific actions
Cohort analysis: Counting unique users per time period or segment

COUNT Variants Summary
Variant	NULLs Included?	Deduplication?	Performance
COUNT(*)	Yes (counts all rows)	No	Generally fastest
COUNT(column)	No (excludes NULLs)	No	Fast (one column check)
COUNT(DISTINCT column)	No (excludes NULLs)	Yes	Slower (tracking overhead)

COUNT with GROUP BY: Segmented Counting

The true power of aggregate functions emerges when combined with GROUP BY. Instead of producing one count for the entire dataset, you produce one count per group—enabling segmented analysis.

When GROUP BY is present, the aggregation contract changes:

Without GROUP BY: All rows form one implicit group → one output row
With GROUP BY: Rows are partitioned by grouping columns → one output row per unique combination

count_group_by.sql
-- Count employees per department
SELECT 
    department_id,
    COUNT(*) AS employee_count
FROM employees
GROUP BY department_id
ORDER BY employee_count DESC;
 
-- Count with multiple grouping columns
SELECT 
    department_id,
    job_title,
    COUNT(*) AS headcount
FROM employees
GROUP BY department_id, job_title
ORDER BY department_id, headcount DESC;
 
-- Combining COUNT variants with GROUP BY
SELECT 
    department_id,
    COUNT(*) AS total_employees,
    COUNT(salary) AS with_salary,
    COUNT(DISTINCT job_title) AS unique_roles
FROM employees
GROUP BY department_id;

Understanding the grouping process:

Partition: The engine conceptually partitions rows by the grouping columns
Aggregate: For each partition, aggregate functions are computed
Output: One row per partition is produced with group values and aggregated results

This is why standard SQL rejects non-aggregated columns that aren't in the GROUP BY clause (unless they are functionally dependent on the grouping columns): the database wouldn't know which value to show when multiple rows collapse into one.
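The partition, aggregate, and output steps can be modeled in plain Python. This is a conceptual sketch with hypothetical data; real engines use hashing or sorting strategies chosen by the planner.

```python
from collections import defaultdict

# Hypothetical rows: (department_id, name)
rows = [(10, "Alice"), (10, "Bob"), (20, "Carol"), (None, "Eve")]

# Partition: bucket rows by the grouping column.
partitions = defaultdict(list)
for department_id, name in rows:
    partitions[department_id].append(name)

# Aggregate and output: one (group key, COUNT(*)) row per partition.
output = [(dept, len(names)) for dept, names in partitions.items()]

print(output)
```

Each partition holds several names but yields a single output row, which is why a bare name column has no well-defined value in the grouped result.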

NULL in GROUP BY

NULL values in grouping columns form their own group. If three employees have NULL department_id, they'll appear as one row with department_id = NULL and count = 3. This is often desired behavior but can surprise developers expecting NULLs to be excluded.
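The three-employee case can be reproduced with Python's sqlite3 module. The table and names below are hypothetical; the second query shows one way to drop the NULL group when it is not wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 10), ("Ben", None), ("Cy", None), ("Di", None)],
)

# NULL department_ids collapse into one group keyed by NULL (None in Python).
with_nulls = dict(conn.execute(
    "SELECT department_id, COUNT(*) FROM employees GROUP BY department_id"
).fetchall())

# Filtering before grouping removes the NULL group.
without_nulls = dict(conn.execute(
    "SELECT department_id, COUNT(*) FROM employees "
    "WHERE department_id IS NOT NULL GROUP BY department_id"
).fetchall())

print(with_nulls, without_nulls)
```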

count_group_by_advanced.sql
-- Time-based grouping
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
 
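-- Hierarchical grouping with ROLLUP (adds subtotal and grand-total rows)
-- Caveat: COALESCE cannot tell a subtotal row from a genuinely NULL region or
-- department; where that matters, test GROUPING(region) instead.
-- MySQL spells this GROUP BY region, department WITH ROLLUP.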
SELECT 
    COALESCE(region, 'ALL REGIONS') AS region,
    COALESCE(department, 'ALL DEPTS') AS department,
    COUNT(*) AS employee_count
FROM employees
GROUP BY ROLLUP(region, department);
 
-- Using GROUP BY with HAVING to filter groups
SELECT 
    department_id,
    COUNT(*) AS employee_count
FROM employees
GROUP BY department_id
HAVING COUNT(*) >= 5  -- Only departments with 5+ employees
ORDER BY employee_count DESC;

Practical Patterns and Common Pitfalls

Understanding COUNT conceptually is one thing; using it correctly in production is another. Let's examine common patterns and pitfalls that distinguish novice from expert usage.

Best Practice Patterns

•Use COUNT(*) for row counting: When you want to count rows, use COUNT(*) not COUNT(id). It's clearer in intent and may be optimized differently.
•Calculate percentages correctly: COUNT(email) * 100.0 / NULLIF(COUNT(*), 0) gives the percentage of rows with a non-NULL email. The 100.0 forces decimal arithmetic, and NULLIF turns a zero row count into NULL instead of a division-by-zero error.
•Combine aggregates strategically: One query with multiple COUNT variants usually needs a single pass over the data, which is typically cheaper than running a separate query for each count.
•Consider approximate counts for scale: For dashboards showing "~2.3M users", exact counts are often unnecessary, and approximate algorithms can be far faster and use much less memory.
•Index columns used in COUNT(DISTINCT): If you frequently count distinct values of a column, an index on it may let the engine read the values in sorted order or from the index alone. Check the query plan to confirm that it helps.

Common Pitfalls to Avoid

•Confusing COUNT(*) and COUNT(column): They're not interchangeable when NULLs exist. A table with 1000 rows where 100 have NULL email will give COUNT(*) = 1000 but COUNT(email) = 900.
•Ignoring NULL in grouping: When department_id is NULL, those employees form their own group, which may not be intended.
•Expensive COUNT(DISTINCT) on high-cardinality: Counting millions of distinct values consumes significant memory and CPU.
•COUNT in WHERE instead of HAVING: WHERE COUNT(*) > 5 is invalid; use HAVING COUNT(*) > 5 to filter groups.
•Forgetting that COUNT excludes NULLs: COUNT(column) on an all-NULL column returns 0, not NULL.
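The WHERE versus HAVING pitfall is easy to demonstrate. The sketch below uses Python's sqlite3 module and a hypothetical three-row employees table; other engines report the error with different wording, but the aggregate in WHERE is rejected in each case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 10), ("Ben", 10), ("Cy", 20)],
)

# WHERE is evaluated per row, before groups exist, so an aggregate is rejected.
try:
    conn.execute(
        "SELECT department_id FROM employees "
        "WHERE COUNT(*) >= 2 GROUP BY department_id"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# HAVING is evaluated per group, after aggregation.
big_departments = conn.execute(
    "SELECT department_id, COUNT(*) FROM employees "
    "GROUP BY department_id HAVING COUNT(*) >= 2"
).fetchall()

print(where_rejected, big_departments)
```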

count_patterns.sql
-- Pattern: Data completeness report
SELECT 
    'employees' AS table_name,
    COUNT(*) AS total_rows,
    COUNT(email) AS has_email,
    COUNT(phone) AS has_phone,
    COUNT(address) AS has_address,
    ROUND(COUNT(email) * 100.0 / NULLIF(COUNT(*), 0), 2) AS email_pct
FROM employees;
 
-- Pattern: Conditional counting with CASE
SELECT 
    department_id,
    COUNT(*) AS total,
    COUNT(CASE WHEN status = 'active' THEN 1 END) AS active,
    COUNT(CASE WHEN status = 'inactive' THEN 1 END) AS inactive,
    COUNT(CASE WHEN status IS NULL THEN 1 END) AS unknown
FROM employees
GROUP BY department_id;
 
-- Pattern: Per-parent count with a correlated subquery
SELECT 
    d.department_name,
    (SELECT COUNT(*) FROM employees e WHERE e.department_id = d.id) AS emp_count
FROM departments d;
 
-- Join-based alternative (often, but not always, planned more efficiently).
-- COUNT(e.id) rather than COUNT(*) keeps departments with no employees at 0;
-- grouping by d.id keeps departments that share a name separate.
SELECT 
    d.department_name,
    COUNT(e.id) AS emp_count
FROM departments d
LEFT JOIN employees e ON e.department_id = d.id
GROUP BY d.id, d.department_name;
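In the join-based form, the choice of COUNT(e.id) over COUNT(*) matters. A LEFT JOIN emits one row for a department with no employees, with every employee column set to NULL, so COUNT(*) would report 1 for it. The sketch below uses Python's sqlite3 module with hypothetical departments and employees tables to show both counts side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, department_name TEXT)")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, department_id INTEGER)")
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "Eng"), (2, "Ops")])
# Both employees belong to Eng; Ops is empty.
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 1), (2, 1)])

per_department = conn.execute(
    "SELECT d.department_name, COUNT(e.id), COUNT(*) "
    "FROM departments d "
    "LEFT JOIN employees e ON e.department_id = d.id "
    "GROUP BY d.id, d.department_name "
    "ORDER BY d.id"
).fetchall()

print(per_department)
```

For the empty department, COUNT(e.id) is 0 while COUNT(*) is 1.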

Database-Specific Variations

While COUNT is defined by the SQL standard, implementations vary across database systems. Understanding these variations is essential for writing portable code or leveraging database-specific optimizations.

count_postgresql.sql
-- PostgreSQL supports COUNT with FILTER clause (SQL:2003)
SELECT 
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE status = 'active') AS active,
    COUNT(*) FILTER (WHERE status = 'inactive') AS inactive
FROM employees;
 
-- COUNT(DISTINCT) on multiple columns using row comparison
SELECT COUNT(DISTINCT (department_id, job_title)) 
FROM employees;
 
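-- Planner estimate from pg_class (instant but approximate; refreshed by
-- VACUUM/ANALYZE, so it can be stale, or -1 on recent versions if never analyzed)
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'employees'::regclass;  -- resolves the table via search_path
 
-- pg_stat_user_tables exposes another estimate (live tuples per the statistics system)
SELECT n_live_tup FROM pg_stat_user_tables 
WHERE relname = 'employees';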

Summary: Mastering COUNT

The COUNT function is deceptively simple on the surface but rich in nuance. Let's consolidate what we've learned:

Key Takeaways

•COUNT(*) counts all rows — It doesn't examine column values and includes rows even if all columns are NULL. Use this when you want row counts.
•COUNT(column) counts non-NULL values — NULL values are excluded. Use this for data completeness metrics or when NULLs should be ignored.
•COUNT(DISTINCT column) counts unique non-NULL values — Provides cardinality metrics. More expensive due to deduplication overhead.
•GROUP BY segments counts — Without GROUP BY, you get one count for everything. With GROUP BY, you get counts per group.
•NULL handling is consistent — COUNT(column) and COUNT(DISTINCT column) both exclude NULLs. GROUP BY creates a separate group for NULL values.
•Performance varies by variant — COUNT(*) is typically fastest; COUNT(DISTINCT) on high-cardinality columns can be expensive.

What's next:

Now that we've mastered counting, we'll explore SUM, the aggregate function for total value calculations. You'll see how SUM skips NULL inputs just as COUNT(column) does, but introduces new considerations around numeric types and overflow.

Page Complete

You now understand COUNT in all its forms. You can count rows unconditionally, count non-NULL values, count unique values, and combine these with GROUP BY for segmented analysis. This foundational knowledge prepares you for all other aggregate functions.
