Running totals and moving averages accumulate raw values—but many analytical questions require understanding relative position within a distribution: What percentile is this value? What fraction of values fall below this point? How does this item rank as a percentage of all items?
Cumulative functions answer these distributional questions. They compute how values relate to the entire dataset or partition, expressing position as ranks, percentiles, or probabilities.
These functions are essential for customer segmentation, percentile-based thresholds, Pareto (80/20) contribution analysis, and understanding how values are distributed.
SQL provides several cumulative distribution functions: CUME_DIST(), PERCENT_RANK(), NTILE(), and patterns for cumulative percentages. Mastering these completes your analytical toolkit.
By the end of this page, you will be able to:

- Understand CUME_DIST and PERCENT_RANK for relative positioning
- Master NTILE for bucket distribution
- Compute cumulative percentages and contribution analysis
- Apply these functions to real-world segmentation and analysis
- Combine cumulative functions with other window operations for advanced analytics
In statistics, a Cumulative Distribution Function (CDF) gives the probability that a random variable is less than or equal to a given value. In SQL, cumulative distribution functions serve a similar purpose: they tell you what fraction of values fall at or below each point in your ordered data.
Key Concepts:
- **Percentile**: A value below which a given percentage of data falls. The 90th percentile means 90% of values are below this point.
- **Rank as percentage**: Expressing position as a fraction of total items rather than absolute position.
- **Distribution shape**: How values are spread—concentrated at extremes, uniform, normal, skewed, etc.
CUME_DIST vs PERCENT_RANK:
These two functions seem similar but differ in how they handle the current row:
- **CUME_DIST()**: The fraction of rows with values less than or equal to the current row's value.

  `CUME_DIST = (rows with value ≤ current value) / (total rows)`

- **PERCENT_RANK()**: The relative rank expressed as a percentage; the fraction of rows that rank below the current row.

  `PERCENT_RANK = (rank - 1) / (total rows - 1)`
Key difference: CUME_DIST ranges from 1/n to 1 (never 0). PERCENT_RANK ranges from 0 to 1 (first row is always 0).
Use CUME_DIST when you want 'what percentage of data is at or below this point' (inclusive). Use PERCENT_RANK when you want 'what percentage of data ranks lower than this' (exclusive of current value). PERCENT_RANK gives the first-ranked item 0%, while CUME_DIST always shows some positive percentage.
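The contrast is easy to check on a small dataset. Below is a minimal sketch using Python's built-in `sqlite3` module (SQLite 3.25+ supports these window functions); the five-person sales table is purely illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_summary (salesperson TEXT, total_sales INTEGER)")
con.executemany(
    "INSERT INTO sales_summary VALUES (?, ?)",
    [("Alice", 50000), ("Bob", 75000), ("Charlie", 90000),
     ("Diana", 120000), ("Eve", 150000)],
)

rows = con.execute("""
    SELECT salesperson,
           CUME_DIST()    OVER (ORDER BY total_sales) AS cume_dist,
           PERCENT_RANK() OVER (ORDER BY total_sales) AS pct_rank
    FROM sales_summary
    ORDER BY total_sales
""").fetchall()

for name, cd, pr in rows:
    print(f"{name:8s} cume_dist={cd:.2f} pct_rank={pr:.2f}")
# The first row gets cume_dist = 1/5 = 0.20 but pct_rank = 0.00;
# the last row gets 1.00 for both.
```

Note how the two measures diverge most at the bottom of the distribution: CUME_DIST counts the row itself (inclusive), while PERCENT_RANK counts only rows ranked strictly lower (exclusive).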
CUME_DIST() calculates the cumulative distribution of a value within a partition. It returns the fraction of rows that have a value less than or equal to the current row's value.
```sql
-- CUME_DIST syntax
CUME_DIST() OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Formula: (number of rows with value <= current row) / (total rows in partition)

-- Basic example: Sales performance distribution
SELECT
    salesperson,
    total_sales,
    ROUND(CUME_DIST() OVER (ORDER BY total_sales), 4) AS cume_dist
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | cume_dist |
-- |-------------|-------------|-----------|
-- | Alice       | 50000       | 0.2000    | (1/5 = 20% at or below)
-- | Bob         | 75000       | 0.4000    | (2/5 = 40%)
-- | Charlie     | 90000       | 0.6000    | (3/5 = 60%)
-- | Diana       | 120000      | 0.8000    | (4/5 = 80%)
-- | Eve         | 150000      | 1.0000    | (5/5 = 100%)
```

Handling Ties:
When multiple rows have the same value, they all receive the same CUME_DIST—the cumulative distribution accounts for all tied values together.
```sql
-- Example with tied values
-- | score | cume_dist calculation                        |
-- |-------|----------------------------------------------|
-- | 70    | 2/6 = 0.333 (2 rows have score <= 70)        |
-- | 70    | 2/6 = 0.333 (same - ties get same value)     |
-- | 80    | 3/6 = 0.500                                  |
-- | 90    | 5/6 = 0.833 (2 rows at 90, so 5 total <= 90) |
-- | 90    | 5/6 = 0.833 (same tie behavior)              |
-- | 100   | 6/6 = 1.000                                  |

SELECT
    student_id,
    score,
    CUME_DIST() OVER (ORDER BY score) AS cume_dist,
    RANK() OVER (ORDER BY score) AS rank,
    COUNT(*) OVER () AS total_students
FROM exam_scores;
```

Practical Applications:
```sql
-- Find items in the top 10% of sales
SELECT *
FROM (
    SELECT
        product_name,
        sales_amount,
        CUME_DIST() OVER (ORDER BY sales_amount DESC) AS cume_dist_desc
    FROM product_sales
) ranked
WHERE cume_dist_desc <= 0.10;  -- Top 10%

-- Identify the median (50th percentile) threshold
WITH ranked AS (
    SELECT
        value,
        CUME_DIST() OVER (ORDER BY value) AS cume_dist
    FROM data_points
)
SELECT MIN(value) AS median_threshold
FROM ranked
WHERE cume_dist >= 0.50;

-- Segment customers by spending distribution
SELECT
    customer_id,
    lifetime_value,
    CASE
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.10 THEN 'Top 10%'
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.25 THEN 'Top 25%'
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.50 THEN 'Top 50%'
        ELSE 'Bottom 50%'
    END AS customer_tier
FROM customers;
```

The minimum CUME_DIST is 1/n (where n is the partition size). Even the lowest-ranked row has at least itself in the 'at or below' count, so it's never 0. The maximum is always 1.0 for the highest value(s).
PERCENT_RANK() calculates the relative rank of a row as a percentage. It answers: What percentage of rows rank lower than this one?
```sql
-- PERCENT_RANK syntax
PERCENT_RANK() OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Formula: (rank - 1) / (total_rows - 1)
-- First row always gets 0; last row always gets 1 (if no ties)

-- Basic example
SELECT
    salesperson,
    total_sales,
    RANK() OVER (ORDER BY total_sales) AS rank,
    ROUND(PERCENT_RANK() OVER (ORDER BY total_sales), 4) AS pct_rank
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | rank | pct_rank |
-- |-------------|-------------|------|----------|
-- | Alice       | 50000       | 1    | 0.0000   | (1-1)/(5-1) = 0/4
-- | Bob         | 75000       | 2    | 0.2500   | (2-1)/(5-1) = 1/4
-- | Charlie     | 90000       | 3    | 0.5000   | (3-1)/(5-1) = 2/4
-- | Diana       | 120000      | 4    | 0.7500   | (4-1)/(5-1) = 3/4
-- | Eve         | 150000      | 5    | 1.0000   | (5-1)/(5-1) = 4/4
```

| Aspect | CUME_DIST | PERCENT_RANK |
|---|---|---|
| Question answered | What % is at or below? | What % ranks below? |
| First row value | 1/n (never 0) | 0 (always) |
| Last row value | 1.0 | 1.0 (if no ties at max) |
| Formula | count(≤ current) / n | (rank - 1) / (n - 1) |
| Tie handling | All ties get same value | All ties get same value |
| Single-row partition | 1.0 | 0.0 (division by zero → 0) |
```sql
-- Compare CUME_DIST and PERCENT_RANK side by side
SELECT
    employee_id,
    salary,
    RANK() OVER (ORDER BY salary) AS rank,
    ROUND(CUME_DIST() OVER (ORDER BY salary), 4) AS cume_dist,
    ROUND(PERCENT_RANK() OVER (ORDER BY salary), 4) AS pct_rank
FROM employees
ORDER BY salary;

-- Identify employees above the 75th percentile
SELECT *
FROM (
    SELECT
        employee_id,
        department,
        salary,
        PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) AS pct_rank_in_dept
    FROM employees
) ranked
WHERE pct_rank_in_dept >= 0.75;  -- Top 25% of each department

-- Performance rating distribution within teams
SELECT
    team_id,
    employee_name,
    performance_score,
    ROUND(100 * PERCENT_RANK() OVER (
        PARTITION BY team_id
        ORDER BY performance_score
    ), 1) AS percentile_in_team
FROM employee_reviews
ORDER BY team_id, performance_score;
```

When a partition has only one row, PERCENT_RANK returns 0 (the formula would divide by zero, so 0 is returned by convention). CUME_DIST returns 1.0. Be aware of this when working with highly partitioned data that may create single-row groups.
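This edge case is easy to demonstrate. Below is a minimal sketch using Python's `sqlite3` module, with a hypothetical table in which the Legal department has exactly one employee:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Eng", 90000), (2, "Eng", 110000),
     (3, "Legal", 120000)],  # Legal is a single-row partition
)

rows = con.execute("""
    SELECT department,
           PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) AS pct_rank,
           CUME_DIST()    OVER (PARTITION BY department ORDER BY salary) AS cume_dist
    FROM employees
""").fetchall()

legal = [r for r in rows if r[0] == "Legal"][0]
print(legal)  # Legal's only employee: pct_rank is 0.0 by convention, cume_dist is 1.0
```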
NTILE(n) divides an ordered partition into n roughly equal groups (tiles) and assigns a group number to each row. This is powerful for creating quantile-based segments: quartiles (4), quintiles (5), deciles (10), percentiles (100).
```sql
-- NTILE syntax
NTILE(number_of_buckets) OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Create quartiles (4 groups) of sales performance
SELECT
    salesperson,
    total_sales,
    NTILE(4) OVER (ORDER BY total_sales) AS quartile
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | quartile |
-- |-------------|-------------|----------|
-- | Alice       | 50000       | 1        | (bottom 25%)
-- | Bob         | 60000       | 1        |
-- | Charlie     | 75000       | 2        |
-- | Diana       | 80000       | 2        |
-- | Eve         | 100000      | 3        |
-- | Frank       | 110000      | 3        |
-- | Grace       | 130000      | 4        | (top 25%)
-- | Henry       | 150000      | 4        |
```

Uneven Distribution:
When the number of rows isn't evenly divisible by n, NTILE distributes the remainder across the first few buckets. With 10 rows and NTILE(4):
The first (10 mod 4) = 2 buckets get one extra row each.
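You can verify the remainder rule directly. A quick sketch using Python's `sqlite3` module (the ten-row table `t` is hypothetical):

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# Assign each of the 10 rows to one of 4 buckets
buckets = [r[0] for r in con.execute("SELECT NTILE(4) OVER (ORDER BY v) FROM t")]
print(Counter(buckets))  # buckets 1 and 2 get 3 rows each; buckets 3 and 4 get 2 each
```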
```sql
-- Customer segmentation into quintiles by lifetime value
SELECT
    customer_id,
    lifetime_value,
    NTILE(5) OVER (ORDER BY lifetime_value DESC) AS value_quintile,
    CASE NTILE(5) OVER (ORDER BY lifetime_value DESC)
        WHEN 1 THEN 'Platinum'
        WHEN 2 THEN 'Gold'
        WHEN 3 THEN 'Silver'
        WHEN 4 THEN 'Bronze'
        WHEN 5 THEN 'Basic'
    END AS customer_tier
FROM customers;

-- Decile analysis for distribution understanding
SELECT
    decile,
    COUNT(*) AS customers_in_decile,
    MIN(lifetime_value) AS min_value,
    MAX(lifetime_value) AS max_value,
    ROUND(AVG(lifetime_value), 2) AS avg_value,
    SUM(lifetime_value) AS total_value
FROM (
    SELECT
        customer_id,
        lifetime_value,
        NTILE(10) OVER (ORDER BY lifetime_value) AS decile
    FROM customers
) deciled
GROUP BY decile
ORDER BY decile;

-- Percentile buckets for fine-grained analysis
SELECT
    product_id,
    revenue,
    NTILE(100) OVER (ORDER BY revenue) AS percentile
FROM products;
```

| NTILE(n) | Name | Each Bucket Represents | Use Case |
|---|---|---|---|
| NTILE(2) | Median split | 50% | Above/below median classification |
| NTILE(3) | Tertiles | 33.3% | Low/Medium/High classification |
| NTILE(4) | Quartiles | 25% | Standard statistical quartiles |
| NTILE(5) | Quintiles | 20% | Customer tier segmentation |
| NTILE(10) | Deciles | 10% | Distribution analysis |
| NTILE(100) | Percentiles | 1% | Fine-grained percentile assignment |
NTILE assigns rows to buckets of roughly equal size. For true percentile values (the actual cutoff value at each percentile), use PERCENTILE_CONT or PERCENTILE_DISC aggregate functions if your database supports them. NTILE tells you which bucket a row belongs to; PERCENTILE functions tell you the boundary values.
Beyond distribution functions, a common need is computing what percentage of the total each value represents, and what the cumulative percentage is. This is essential for Pareto analysis (80/20 rule), contribution analysis, and understanding concentration.
```sql
-- Individual and cumulative percentage of total
SELECT
    product_name,
    revenue,
    ROUND(100.0 * revenue / SUM(revenue) OVER (), 2) AS pct_of_total,
    ROUND(100.0 * SUM(revenue) OVER (ORDER BY revenue DESC)
          / SUM(revenue) OVER (), 2) AS cumulative_pct
FROM product_sales
ORDER BY revenue DESC;

-- Result:
-- | product_name | revenue | pct_of_total | cumulative_pct |
-- |--------------|---------|--------------|----------------|
-- | Product A    | 50000   | 25.00        | 25.00          |
-- | Product B    | 40000   | 20.00        | 45.00          |
-- | Product C    | 35000   | 17.50        | 62.50          |
-- | Product D    | 30000   | 15.00        | 77.50          |
-- | Product E    | 25000   | 12.50        | 90.00          |
-- | Product F    | 20000   | 10.00        | 100.00         |
```

Pareto (80/20) Analysis:
The Pareto principle suggests that roughly 80% of effects come from 20% of causes. Cumulative percentages help identify the vital few vs. the trivial many.
```sql
-- Pareto analysis: Find products contributing to 80% of revenue
WITH product_contribution AS (
    SELECT
        product_name,
        revenue,
        100.0 * revenue / SUM(revenue) OVER () AS pct_of_total,
        100.0 * SUM(revenue) OVER (ORDER BY revenue DESC)
            / SUM(revenue) OVER () AS cumulative_pct,
        ROW_NUMBER() OVER (ORDER BY revenue DESC) AS rank,
        COUNT(*) OVER () AS total_products
    FROM product_sales
)
SELECT
    product_name,
    revenue,
    ROUND(pct_of_total, 2) AS pct_of_total,
    ROUND(cumulative_pct, 2) AS cumulative_pct,
    ROUND(100.0 * rank / total_products, 2) AS pct_of_products,
    CASE
        WHEN cumulative_pct <= 80 THEN 'Vital Few (80% revenue)'
        ELSE 'Trivial Many'
    END AS pareto_classification
FROM product_contribution
ORDER BY revenue DESC;

-- Identify the minimum number of customers generating 50% of revenue
WITH customer_contribution AS (
    SELECT
        customer_id,
        total_purchases,
        100.0 * SUM(total_purchases) OVER (ORDER BY total_purchases DESC)
            / SUM(total_purchases) OVER () AS cumulative_pct
    FROM customer_summary
)
SELECT COUNT(*) AS customers_for_50_pct
FROM customer_contribution
WHERE cumulative_pct <= 50;
```

Partitioned Cumulative Percentages:
```sql
-- Cumulative percentage within each category
SELECT
    category,
    product_name,
    revenue,
    ROUND(100.0 * revenue / SUM(revenue) OVER (PARTITION BY category), 2) AS pct_of_category,
    ROUND(100.0 * SUM(revenue) OVER (
              PARTITION BY category
              ORDER BY revenue DESC
          ) / SUM(revenue) OVER (PARTITION BY category), 2) AS cumulative_pct_in_category
FROM product_sales
ORDER BY category, revenue DESC;

-- Sales concentration by region
SELECT
    region,
    salesperson,
    sales_amount,
    ROUND(100.0 * sales_amount / SUM(sales_amount) OVER (PARTITION BY region), 2) AS pct_of_region,
    ROUND(100.0 * SUM(sales_amount) OVER (
              PARTITION BY region
              ORDER BY sales_amount DESC
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
          ) / SUM(sales_amount) OVER (PARTITION BY region), 2) AS cumulative_pct
FROM regional_sales
ORDER BY region, sales_amount DESC;
```

The pattern is: `SUM(value) OVER (ORDER BY value DESC ROWS UNBOUNDED PRECEDING) / SUM(value) OVER ()`. The numerator is the running total (in descending value order), and the denominator is the total across all rows. This gives the cumulative percentage from highest to lowest.
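The arithmetic of this pattern can be checked against the Product A through F figures shown earlier. A sketch using Python's `sqlite3` module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product_sales (product_name TEXT, revenue INTEGER)")
con.executemany(
    "INSERT INTO product_sales VALUES (?, ?)",
    [("Product A", 50000), ("Product B", 40000), ("Product C", 35000),
     ("Product D", 30000), ("Product E", 25000), ("Product F", 20000)],
)

# Running total (descending) divided by grand total = cumulative percentage
rows = con.execute("""
    SELECT product_name,
           ROUND(100.0 * SUM(revenue) OVER (
                     ORDER BY revenue DESC
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                 ) / SUM(revenue) OVER (), 2) AS cumulative_pct
    FROM product_sales
    ORDER BY revenue DESC
""").fetchall()

print(rows)
# Cumulative percentages climb 25.0 -> 45.0 -> 62.5 -> 77.5 -> 90.0 -> 100.0
```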
While NTILE assigns rows to percentile buckets, sometimes you need the actual value at a given percentile. SQL provides PERCENTILE_CONT (continuous, interpolated) and PERCENTILE_DISC (discrete, actual value) for this purpose.
```sql
-- PERCENTILE_CONT: Interpolated percentile (may return value not in dataset)
-- PERCENTILE_DISC: Discrete percentile (returns actual value from dataset)

-- Syntax (varies by database):
-- PostgreSQL, Oracle:
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column)  -- Median
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY column)  -- Median

-- Used as aggregate functions (not window functions by default)
SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90
FROM employees;

-- PERCENTILE_DISC returns an actual salary from the dataset
SELECT PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;
```

| Function | Result Type | Interpolation | Best For |
|---|---|---|---|
| PERCENTILE_CONT | Continuous (may interpolate) | Yes, linear between adjacent values | Statistical analysis requiring smooth percentiles |
| PERCENTILE_DISC | Discrete (actual data value) | No, picks closest actual value | When you need a real value from the dataset |
```sql
-- Percentiles by partition (department salaries)
SELECT
    department,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median_salary,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90_salary
FROM employees
GROUP BY department;

-- Multiple percentiles in one query
SELECT
    department,
    PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY salary) AS p10,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS p50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90
FROM employees
GROUP BY department;

-- IQR (Interquartile Range) calculation for outlier detection
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM data_points
)
SELECT
    q1,
    q3,
    q3 - q1 AS iqr,
    q1 - 1.5 * (q3 - q1) AS lower_fence,
    q3 + 1.5 * (q3 - q1) AS upper_fence
FROM quartiles;
```

PERCENTILE_CONT and PERCENTILE_DISC are part of SQL:2003 but not universally supported. PostgreSQL and Oracle support them as ordered-set aggregates; SQL Server supports them only as analytic functions with an OVER clause. MySQL doesn't have direct equivalents, so you'd need to compute percentiles manually using NTILE or subqueries.
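If your database lacks these functions, the linear interpolation that PERCENTILE_CONT performs is simple to reproduce. The sketch below implements the standard rule (zero-based target position = p * (n - 1), interpolating between the two neighboring sorted values); it's a reference implementation for checking results, not vendor code:

```python
def percentile_cont(values, p):
    """Linear-interpolated percentile matching PERCENTILE_CONT semantics:
    find the fractional position p * (n - 1) in the sorted values and,
    if it falls between two rows, interpolate linearly between them."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n - 1)      # 0-based fractional index
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return float(xs[lo])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

salaries = [40000, 50000, 60000, 80000, 100000]
print(percentile_cont(salaries, 0.50))  # 60000.0 (the middle value)
print(percentile_cont(salaries, 0.25))  # 50000.0
print(percentile_cont(salaries, 0.90))  # position 3.6: between 80000 and 100000
```

PERCENTILE_DISC differs only in that it returns an actual value from the dataset (the first value whose cumulative distribution reaches p) instead of interpolating.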
The real power emerges when you combine multiple cumulative and window functions. Let's explore advanced analytical patterns that leverage these combinations.
```sql
-- Comprehensive distribution analysis
SELECT
    employee_id,
    department,
    salary,
    -- Absolute position
    RANK() OVER (PARTITION BY department ORDER BY salary) AS dept_rank,
    COUNT(*) OVER (PARTITION BY department) AS dept_size,
    -- Relative position
    ROUND(PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary), 4) AS pct_rank,
    ROUND(CUME_DIST() OVER (PARTITION BY department ORDER BY salary), 4) AS cume_dist,
    -- Bucket assignment
    NTILE(4) OVER (PARTITION BY department ORDER BY salary) AS salary_quartile,
    -- Comparison to department statistics
    ROUND(salary / AVG(salary) OVER (PARTITION BY department), 4) AS ratio_to_dept_avg,
    salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_dept_avg
FROM employees;
```
```sql
-- Customer lifetime value analysis with multiple perspectives
WITH customer_analysis AS (
    SELECT
        customer_id,
        signup_date,
        lifetime_value,
        -- Distribution measures
        NTILE(10) OVER (ORDER BY lifetime_value DESC) AS value_decile,
        ROUND(100 * PERCENT_RANK() OVER (ORDER BY lifetime_value), 2) AS percentile,
        -- Contribution analysis
        ROUND(100.0 * lifetime_value / SUM(lifetime_value) OVER (), 4) AS pct_of_total_value,
        ROUND(100.0 * SUM(lifetime_value) OVER (ORDER BY lifetime_value DESC)
              / SUM(lifetime_value) OVER (), 2) AS cumulative_value_pct,
        -- Cohort comparison
        AVG(lifetime_value) OVER (
            PARTITION BY DATE_TRUNC('month', signup_date)
        ) AS cohort_avg_ltv
    FROM customers
)
SELECT
    *,
    ROUND(lifetime_value / NULLIF(cohort_avg_ltv, 0), 2) AS ratio_to_cohort
FROM customer_analysis
ORDER BY lifetime_value DESC;

-- Time-based distribution shifts
SELECT
    DATE_TRUNC('quarter', order_date) AS quarter,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median_order,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY order_total) AS p90_order,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY order_total)
      - PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median_to_p90_gap
FROM orders
GROUP BY DATE_TRUNC('quarter', order_date)
ORDER BY quarter;
```

Cohort-Based Distribution Analysis:
```sql
-- Compare individual to cohort distribution
WITH employee_cohort AS (
    SELECT
        employee_id,
        hire_date,
        department,
        salary,
        DATE_TRUNC('year', hire_date) AS hire_year,
        PERCENT_RANK() OVER (
            PARTITION BY DATE_TRUNC('year', hire_date), department
            ORDER BY salary
        ) AS pct_rank_in_cohort,
        PERCENT_RANK() OVER (
            PARTITION BY department
            ORDER BY salary
        ) AS pct_rank_in_dept
    FROM employees
)
SELECT
    employee_id,
    department,
    hire_year,
    salary,
    ROUND(100 * pct_rank_in_cohort, 1) AS cohort_percentile,
    ROUND(100 * pct_rank_in_dept, 1) AS dept_percentile,
    ROUND(100 * (pct_rank_in_cohort - pct_rank_in_dept), 1) AS cohort_vs_dept_diff
FROM employee_cohort
ORDER BY department, hire_year, salary;
```

Complex analysis often benefits from building in layers: first compute basic cumulative measures in a CTE, then add comparisons in a second CTE, and finally select with business logic in the main query. This makes the query more readable and debuggable.
Cumulative functions complete your analytical toolkit, enabling distribution analysis, relative positioning, and contribution measurement—all within SQL. Let's consolidate the essential concepts.
Module Complete:
You have now mastered the full range of SQL analytic functions, from running totals and moving averages to ranking, cumulative distribution, and percentile functions.
These capabilities transform SQL from a simple data retrieval language into a powerful analytical engine, enabling sophisticated time-series analysis, statistical computation, and business intelligence—all without leaving the database.
Congratulations! You've completed the Analytic Functions module. You now possess the skills to perform sophisticated row-by-row analysis, compute distributions, detect trends, and measure contributions—all within SQL. These patterns form the foundation of modern data analytics and business intelligence. Apply them to your own datasets to solidify mastery.