Running totals and moving averages accumulate raw values—but many analytical questions require understanding relative position within a distribution: What percentile is this value? What fraction of values fall below this point? How does this item rank as a percentage of all items?
Cumulative functions answer these distributional questions. They compute how values relate to the entire dataset or partition, expressing position as ranks, percentiles, or probabilities.
These functions are essential for customer segmentation, percentile-based thresholds, Pareto (80/20) contribution analysis, and understanding how values are distributed.
SQL provides several cumulative distribution functions: CUME_DIST(), PERCENT_RANK(), NTILE(), and patterns for cumulative percentages. Mastering these completes your analytical toolkit.
By the end of this page, you will be able to:

- Understand CUME_DIST and PERCENT_RANK for relative positioning
- Master NTILE for bucket distribution
- Compute cumulative percentages and contribution analysis
- Apply these functions to real-world segmentation and analysis
- Combine cumulative functions with other window operations for advanced analytics
In statistics, a Cumulative Distribution Function (CDF) gives the probability that a random variable is less than or equal to a given value. In SQL, cumulative distribution functions serve a similar purpose: they tell you what fraction of values fall at or below each point in your ordered data.
Key Concepts:
- **Percentile**: A value below which a given percentage of data falls. The 90th percentile means 90% of values are below this point.
- **Rank as percentage**: Expressing position as a fraction of total items rather than absolute position.
- **Distribution shape**: How values are spread—concentrated at extremes, uniform, normal, skewed, etc.
CUME_DIST vs PERCENT_RANK:
These two functions seem similar but differ in how they handle the current row:
- **CUME_DIST()**: The fraction of rows with values less than or equal to the current row's value.

  `CUME_DIST = (rows with value ≤ current value) / (total rows)`

- **PERCENT_RANK()**: The relative rank expressed as a percentage; the fraction of rows that rank below the current row.

  `PERCENT_RANK = (rank - 1) / (total rows - 1)`
Key difference: CUME_DIST ranges from 1/n to 1 (never 0). PERCENT_RANK ranges from 0 to 1 (first row is always 0).
Use CUME_DIST when you want 'what percentage of data is at or below this point' (inclusive). Use PERCENT_RANK when you want 'what percentage of data ranks lower than this' (exclusive of current value). PERCENT_RANK gives the first-ranked item 0%, while CUME_DIST always shows some positive percentage.
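The contrast is easy to check on a small dataset. Below is a minimal sketch using Python's built-in `sqlite3` module (SQLite 3.25+ supports these window functions); the five-person sales table is purely illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_summary (salesperson TEXT, total_sales INTEGER)")
con.executemany(
    "INSERT INTO sales_summary VALUES (?, ?)",
    [("Alice", 50000), ("Bob", 75000), ("Charlie", 90000),
     ("Diana", 120000), ("Eve", 150000)],
)

rows = con.execute("""
    SELECT salesperson,
           CUME_DIST()    OVER (ORDER BY total_sales) AS cume_dist,
           PERCENT_RANK() OVER (ORDER BY total_sales) AS pct_rank
    FROM sales_summary
    ORDER BY total_sales
""").fetchall()

for name, cd, pr in rows:
    print(f"{name:8s} cume_dist={cd:.2f} pct_rank={pr:.2f}")
# The first row gets cume_dist = 1/5 = 0.20 but pct_rank = 0.00;
# the last row gets 1.00 for both.
```

Note how the two measures diverge most at the bottom of the distribution: CUME_DIST counts the row itself (inclusive), while PERCENT_RANK counts only rows ranked strictly lower (exclusive).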
CUME_DIST() calculates the cumulative distribution of a value within a partition. It returns the fraction of rows that have a value less than or equal to the current row's value.
```sql
-- CUME_DIST syntax
CUME_DIST() OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Formula: (number of rows with value <= current row) / (total rows in partition)

-- Basic example: Sales performance distribution
SELECT
    salesperson,
    total_sales,
    ROUND(CUME_DIST() OVER (ORDER BY total_sales), 4) AS cume_dist
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | cume_dist |
-- |-------------|-------------|-----------|
-- | Alice       | 50000       | 0.2000    | (1/5 = 20% at or below)
-- | Bob         | 75000       | 0.4000    | (2/5 = 40%)
-- | Charlie     | 90000       | 0.6000    | (3/5 = 60%)
-- | Diana       | 120000      | 0.8000    | (4/5 = 80%)
-- | Eve         | 150000      | 1.0000    | (5/5 = 100%)
```

Handling Ties:
When multiple rows have the same value, they all receive the same CUME_DIST—the cumulative distribution accounts for all tied values together.
```sql
-- Example with tied values
-- | score | cume_dist calculation                        |
-- |-------|----------------------------------------------|
-- | 70    | 2/6 = 0.333 (2 rows have score <= 70)        |
-- | 70    | 2/6 = 0.333 (same - ties get same value)     |
-- | 80    | 3/6 = 0.500                                  |
-- | 90    | 5/6 = 0.833 (2 rows at 90, so 5 total <= 90) |
-- | 90    | 5/6 = 0.833 (same tie behavior)              |
-- | 100   | 6/6 = 1.000                                  |

SELECT
    student_id,
    score,
    CUME_DIST() OVER (ORDER BY score) AS cume_dist,
    RANK() OVER (ORDER BY score) AS rank,
    COUNT(*) OVER () AS total_students
FROM exam_scores;
```

Practical Applications:
```sql
-- Find items in the top 10% of sales
SELECT *
FROM (
    SELECT
        product_name,
        sales_amount,
        CUME_DIST() OVER (ORDER BY sales_amount DESC) AS cume_dist_desc
    FROM product_sales
) ranked
WHERE cume_dist_desc <= 0.10;  -- Top 10%

-- Identify the median (50th percentile) threshold
WITH ranked AS (
    SELECT
        value,
        CUME_DIST() OVER (ORDER BY value) AS cume_dist
    FROM data_points
)
SELECT MIN(value) AS median_threshold
FROM ranked
WHERE cume_dist >= 0.50;

-- Segment customers by spending distribution
SELECT
    customer_id,
    lifetime_value,
    CASE
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.10 THEN 'Top 10%'
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.25 THEN 'Top 25%'
        WHEN CUME_DIST() OVER (ORDER BY lifetime_value DESC) <= 0.50 THEN 'Top 50%'
        ELSE 'Bottom 50%'
    END AS customer_tier
FROM customers;
```

The minimum CUME_DIST is 1/n (where n is the partition size). Even the lowest-ranked row has at least itself in the 'at or below' count, so it's never 0. The maximum is always 1.0 for the highest value(s).
PERCENT_RANK() calculates the relative rank of a row as a percentage. It answers: What percentage of rows rank lower than this one?
```sql
-- PERCENT_RANK syntax
PERCENT_RANK() OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Formula: (rank - 1) / (total_rows - 1)
-- First row always gets 0; last row always gets 1 (if no ties)

-- Basic example
SELECT
    salesperson,
    total_sales,
    RANK() OVER (ORDER BY total_sales) AS rank,
    ROUND(PERCENT_RANK() OVER (ORDER BY total_sales), 4) AS pct_rank
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | rank | pct_rank |
-- |-------------|-------------|------|----------|
-- | Alice       | 50000       | 1    | 0.0000   | (1-1)/(5-1) = 0/4
-- | Bob         | 75000       | 2    | 0.2500   | (2-1)/(5-1) = 1/4
-- | Charlie     | 90000       | 3    | 0.5000   | (3-1)/(5-1) = 2/4
-- | Diana       | 120000      | 4    | 0.7500   | (4-1)/(5-1) = 3/4
-- | Eve         | 150000      | 5    | 1.0000   | (5-1)/(5-1) = 4/4
```

| Aspect | CUME_DIST | PERCENT_RANK |
|---|---|---|
| Question answered | What % is at or below? | What % ranks below? |
| First row value | 1/n (never 0) | 0 (always) |
| Last row value | 1.0 | 1.0 (if no ties at max) |
| Formula | count(≤ current) / n | (rank - 1) / (n - 1) |
| Tie handling | All ties get same value | All ties get same value |
| Single-row partition | 1.0 | 0.0 (division by zero → 0) |
```sql
-- Compare CUME_DIST and PERCENT_RANK side by side
SELECT
    employee_id,
    salary,
    RANK() OVER (ORDER BY salary) AS rank,
    ROUND(CUME_DIST() OVER (ORDER BY salary), 4) AS cume_dist,
    ROUND(PERCENT_RANK() OVER (ORDER BY salary), 4) AS pct_rank
FROM employees
ORDER BY salary;

-- Identify employees above the 75th percentile
SELECT *
FROM (
    SELECT
        employee_id,
        department,
        salary,
        PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) AS pct_rank_in_dept
    FROM employees
) ranked
WHERE pct_rank_in_dept >= 0.75;  -- Top 25% of each department

-- Performance rating distribution within teams
SELECT
    team_id,
    employee_name,
    performance_score,
    ROUND(100 * PERCENT_RANK() OVER (
        PARTITION BY team_id
        ORDER BY performance_score
    ), 1) AS percentile_in_team
FROM employee_reviews
ORDER BY team_id, performance_score;
```

When a partition has only one row, PERCENT_RANK returns 0 (the formula would divide by zero, so 0 is returned by convention). CUME_DIST returns 1.0. Be aware of this when working with highly partitioned data that may create single-row groups.
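This edge case is easy to demonstrate. Below is a minimal sketch using Python's `sqlite3` module, with a hypothetical table in which the Legal department has exactly one employee:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Eng", 90000), (2, "Eng", 110000),
     (3, "Legal", 120000)],  # Legal is a single-row partition
)

rows = con.execute("""
    SELECT department,
           PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) AS pct_rank,
           CUME_DIST()    OVER (PARTITION BY department ORDER BY salary) AS cume_dist
    FROM employees
""").fetchall()

legal = [r for r in rows if r[0] == "Legal"][0]
print(legal)  # Legal's only employee: pct_rank is 0.0 by convention, cume_dist is 1.0
```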
NTILE(n) divides an ordered partition into n roughly equal groups (tiles) and assigns a group number to each row. This is powerful for creating quantile-based segments: quartiles (4), quintiles (5), deciles (10), percentiles (100).
```sql
-- NTILE syntax
NTILE(number_of_buckets) OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC]
)

-- Create quartiles (4 groups) of sales performance
SELECT
    salesperson,
    total_sales,
    NTILE(4) OVER (ORDER BY total_sales) AS quartile
FROM sales_summary
ORDER BY total_sales;

-- Result:
-- | salesperson | total_sales | quartile |
-- |-------------|-------------|----------|
-- | Alice       | 50000       | 1        | (bottom 25%)
-- | Bob         | 60000       | 1        |
-- | Charlie     | 75000       | 2        |
-- | Diana       | 80000       | 2        |
-- | Eve         | 100000      | 3        |
-- | Frank       | 110000      | 3        |
-- | Grace       | 130000      | 4        | (top 25%)
-- | Henry       | 150000      | 4        |
```

Uneven Distribution:
When the number of rows isn't evenly divisible by n, NTILE distributes the remainder across the first few buckets. With 10 rows and NTILE(4):
The first (10 mod 4) = 2 buckets get one extra row each.
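You can verify the remainder rule directly. A quick sketch using Python's `sqlite3` module (the ten-row table `t` is hypothetical):

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# Assign each of the 10 rows to one of 4 buckets
buckets = [r[0] for r in con.execute("SELECT NTILE(4) OVER (ORDER BY v) FROM t")]
print(Counter(buckets))  # buckets 1 and 2 get 3 rows each; buckets 3 and 4 get 2 each
```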
```sql
-- Customer segmentation into quintiles by lifetime value
SELECT
    customer_id,
    lifetime_value,
    NTILE(5) OVER (ORDER BY lifetime_value DESC) AS value_quintile,
    CASE NTILE(5) OVER (ORDER BY lifetime_value DESC)
        WHEN 1 THEN 'Platinum'
        WHEN 2 THEN 'Gold'
        WHEN 3 THEN 'Silver'
        WHEN 4 THEN 'Bronze'
        WHEN 5 THEN 'Basic'
    END AS customer_tier
FROM customers;

-- Decile analysis for distribution understanding
SELECT
    decile,
    COUNT(*) AS customers_in_decile,
    MIN(lifetime_value) AS min_value,
    MAX(lifetime_value) AS max_value,
    ROUND(AVG(lifetime_value), 2) AS avg_value,
    SUM(lifetime_value) AS total_value
FROM (
    SELECT
        customer_id,
        lifetime_value,
        NTILE(10) OVER (ORDER BY lifetime_value) AS decile
    FROM customers
) deciled
GROUP BY decile
ORDER BY decile;

-- Percentile buckets for fine-grained analysis
SELECT
    product_id,
    revenue,
    NTILE(100) OVER (ORDER BY revenue) AS percentile
FROM products;
```

| NTILE(n) | Name | Each Bucket Represents | Use Case |
|---|---|---|---|
| NTILE(2) | Median split | 50% | Above/below median classification |
| NTILE(3) | Tertiles | 33.3% | Low/Medium/High classification |
| NTILE(4) | Quartiles | 25% | Standard statistical quartiles |
| NTILE(5) | Quintiles | 20% | Customer tier segmentation |
| NTILE(10) | Deciles | 10% | Distribution analysis |
| NTILE(100) | Percentiles | 1% | Fine-grained percentile assignment |
NTILE assigns rows to buckets of roughly equal size. For true percentile values (the actual cutoff value at each percentile), use PERCENTILE_CONT or PERCENTILE_DISC aggregate functions if your database supports them. NTILE tells you which bucket a row belongs to; PERCENTILE functions tell you the boundary values.
Beyond distribution functions, a common need is computing what percentage of the total each value represents, and what the cumulative percentage is. This is essential for Pareto analysis (80/20 rule), contribution analysis, and understanding concentration.
```sql
-- Individual and cumulative percentage of total
SELECT
    product_name,
    revenue,
    ROUND(100.0 * revenue / SUM(revenue) OVER (), 2) AS pct_of_total,
    ROUND(100.0 * SUM(revenue) OVER (ORDER BY revenue DESC)
          / SUM(revenue) OVER (), 2) AS cumulative_pct
FROM product_sales
ORDER BY revenue DESC;

-- Result:
-- | product_name | revenue | pct_of_total | cumulative_pct |
-- |--------------|---------|--------------|----------------|
-- | Product A    | 50000   | 25.00        | 25.00          |
-- | Product B    | 40000   | 20.00        | 45.00          |
-- | Product C    | 35000   | 17.50        | 62.50          |
-- | Product D    | 30000   | 15.00        | 77.50          |
-- | Product E    | 25000   | 12.50        | 90.00          |
-- | Product F    | 20000   | 10.00        | 100.00         |
```

Pareto (80/20) Analysis:
The Pareto principle suggests that roughly 80% of effects come from 20% of causes. Cumulative percentages help identify the vital few vs. the trivial many.
```sql
-- Pareto analysis: Find products contributing to 80% of revenue
WITH product_contribution AS (
    SELECT
        product_name,
        revenue,
        100.0 * revenue / SUM(revenue) OVER () AS pct_of_total,
        100.0 * SUM(revenue) OVER (ORDER BY revenue DESC)
            / SUM(revenue) OVER () AS cumulative_pct,
        ROW_NUMBER() OVER (ORDER BY revenue DESC) AS rank,
        COUNT(*) OVER () AS total_products
    FROM product_sales
)
SELECT
    product_name,
    revenue,
    ROUND(pct_of_total, 2) AS pct_of_total,
    ROUND(cumulative_pct, 2) AS cumulative_pct,
    ROUND(100.0 * rank / total_products, 2) AS pct_of_products,
    CASE
        WHEN cumulative_pct <= 80 THEN 'Vital Few (80% revenue)'
        ELSE 'Trivial Many'
    END AS pareto_classification
FROM product_contribution
ORDER BY revenue DESC;

-- Identify the minimum number of customers generating 50% of revenue
WITH customer_contribution AS (
    SELECT
        customer_id,
        total_purchases,
        100.0 * SUM(total_purchases) OVER (ORDER BY total_purchases DESC)
            / SUM(total_purchases) OVER () AS cumulative_pct
    FROM customer_summary
)
SELECT COUNT(*) AS customers_for_50_pct
FROM customer_contribution
WHERE cumulative_pct <= 50;
```

Partitioned Cumulative Percentages:
```sql
-- Cumulative percentage within each category
SELECT
    category,
    product_name,
    revenue,
    ROUND(100.0 * revenue / SUM(revenue) OVER (PARTITION BY category), 2) AS pct_of_category,
    ROUND(100.0 * SUM(revenue) OVER (
              PARTITION BY category
              ORDER BY revenue DESC
          ) / SUM(revenue) OVER (PARTITION BY category), 2) AS cumulative_pct_in_category
FROM product_sales
ORDER BY category, revenue DESC;

-- Sales concentration by region
SELECT
    region,
    salesperson,
    sales_amount,
    ROUND(100.0 * sales_amount / SUM(sales_amount) OVER (PARTITION BY region), 2) AS pct_of_region,
    ROUND(100.0 * SUM(sales_amount) OVER (
              PARTITION BY region
              ORDER BY sales_amount DESC
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
          ) / SUM(sales_amount) OVER (PARTITION BY region), 2) AS cumulative_pct
FROM regional_sales
ORDER BY region, sales_amount DESC;
```

The pattern is: `SUM(value) OVER (ORDER BY value DESC ROWS UNBOUNDED PRECEDING) / SUM(value) OVER ()`. The numerator is the running total (in descending value order), and the denominator is the total across all rows. This gives the cumulative percentage from highest to lowest.
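The arithmetic of this pattern can be checked against the Product A through F figures shown earlier. A sketch using Python's `sqlite3` module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product_sales (product_name TEXT, revenue INTEGER)")
con.executemany(
    "INSERT INTO product_sales VALUES (?, ?)",
    [("Product A", 50000), ("Product B", 40000), ("Product C", 35000),
     ("Product D", 30000), ("Product E", 25000), ("Product F", 20000)],
)

# Running total (descending) divided by grand total = cumulative percentage
rows = con.execute("""
    SELECT product_name,
           ROUND(100.0 * SUM(revenue) OVER (
                     ORDER BY revenue DESC
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                 ) / SUM(revenue) OVER (), 2) AS cumulative_pct
    FROM product_sales
    ORDER BY revenue DESC
""").fetchall()

print(rows)
# Cumulative percentages climb 25.0 -> 45.0 -> 62.5 -> 77.5 -> 90.0 -> 100.0
```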
While NTILE assigns rows to percentile buckets, sometimes you need the actual value at a given percentile. SQL provides PERCENTILE_CONT (continuous, interpolated) and PERCENTILE_DISC (discrete, actual value) for this purpose.
```sql
-- PERCENTILE_CONT: Interpolated percentile (may return value not in dataset)
-- PERCENTILE_DISC: Discrete percentile (returns actual value from dataset)

-- Syntax (varies by database):
-- PostgreSQL, Oracle:
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column)  -- Median
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY column)  -- Median

-- Used as aggregate functions (not window functions by default)
SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90
FROM employees;

-- PERCENTILE_DISC returns an actual salary from the dataset
SELECT PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;
```

| Function | Result Type | Interpolation | Best For |
|---|---|---|---|
| PERCENTILE_CONT | Continuous (may interpolate) | Yes, linear between adjacent values | Statistical analysis requiring smooth percentiles |
| PERCENTILE_DISC | Discrete (actual data value) | No, picks closest actual value | When you need a real value from the dataset |
```sql
-- Percentiles by partition (department salaries)
SELECT
    department,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS median_salary,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90_salary
FROM employees
GROUP BY department;

-- Multiple percentiles in one query
SELECT
    department,
    PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY salary) AS p10,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary) AS p50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY salary) AS p90
FROM employees
GROUP BY department;

-- IQR (Interquartile Range) calculation for outlier detection
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM data_points
)
SELECT
    q1,
    q3,
    q3 - q1 AS iqr,
    q1 - 1.5 * (q3 - q1) AS lower_fence,
    q3 + 1.5 * (q3 - q1) AS upper_fence
FROM quartiles;
```

PERCENTILE_CONT and PERCENTILE_DISC are part of SQL:2003 but not universally supported. PostgreSQL and Oracle support them as ordered-set aggregates; SQL Server supports them only as analytic functions with an OVER clause. MySQL doesn't have direct equivalents, so you'd need to compute percentiles manually using NTILE or subqueries.
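If your database lacks these functions, the linear interpolation that PERCENTILE_CONT performs is simple to reproduce. The sketch below implements the standard rule (zero-based target position = p * (n - 1), interpolating between the two neighboring sorted values); it's a reference implementation for checking results, not vendor code:

```python
def percentile_cont(values, p):
    """Linear-interpolated percentile matching PERCENTILE_CONT semantics:
    find the fractional position p * (n - 1) in the sorted values and,
    if it falls between two rows, interpolate linearly between them."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n - 1)      # 0-based fractional index
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return float(xs[lo])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

salaries = [40000, 50000, 60000, 80000, 100000]
print(percentile_cont(salaries, 0.50))  # 60000.0 (the middle value)
print(percentile_cont(salaries, 0.25))  # 50000.0
print(percentile_cont(salaries, 0.90))  # position 3.6: between 80000 and 100000
```

PERCENTILE_DISC differs only in that it returns an actual value from the dataset (the first value whose cumulative distribution reaches p) instead of interpolating.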
The real power emerges when you combine multiple cumulative and window functions. Let's explore advanced analytical patterns that leverage these combinations.
```sql
-- Comprehensive distribution analysis
SELECT
    employee_id,
    department,
    salary,
    -- Absolute position
    RANK() OVER (PARTITION BY department ORDER BY salary) AS dept_rank,
    COUNT(*) OVER (PARTITION BY department) AS dept_size,
    -- Relative position
    ROUND(PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary), 4) AS pct_rank,
    ROUND(CUME_DIST() OVER (PARTITION BY department ORDER BY salary), 4) AS cume_dist,
    -- Bucket assignment
    NTILE(4) OVER (PARTITION BY department ORDER BY salary) AS salary_quartile,
    -- Comparison to department statistics
    ROUND(salary / AVG(salary) OVER (PARTITION BY department), 4) AS ratio_to_dept_avg,
    salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_dept_avg
FROM employees;
```
```sql
-- Customer lifetime value analysis with multiple perspectives
WITH customer_analysis AS (
    SELECT
        customer_id,
        signup_date,
        lifetime_value,
        -- Distribution measures
        NTILE(10) OVER (ORDER BY lifetime_value DESC) AS value_decile,
        ROUND(100 * PERCENT_RANK() OVER (ORDER BY lifetime_value), 2) AS percentile,
        -- Contribution analysis
        ROUND(100.0 * lifetime_value / SUM(lifetime_value) OVER (), 4) AS pct_of_total_value,
        ROUND(100.0 * SUM(lifetime_value) OVER (ORDER BY lifetime_value DESC)
              / SUM(lifetime_value) OVER (), 2) AS cumulative_value_pct,
        -- Cohort comparison
        AVG(lifetime_value) OVER (
            PARTITION BY DATE_TRUNC('month', signup_date)
        ) AS cohort_avg_ltv
    FROM customers
)
SELECT
    *,
    ROUND(lifetime_value / NULLIF(cohort_avg_ltv, 0), 2) AS ratio_to_cohort
FROM customer_analysis
ORDER BY lifetime_value DESC;

-- Time-based distribution shifts
SELECT
    DATE_TRUNC('quarter', order_date) AS quarter,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median_order,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY order_total) AS p90_order,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY order_total)
      - PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median_to_p90_gap
FROM orders
GROUP BY DATE_TRUNC('quarter', order_date)
ORDER BY quarter;
```

Cohort-Based Distribution Analysis:
```sql
-- Compare individual to cohort distribution
WITH employee_cohort AS (
    SELECT
        employee_id,
        hire_date,
        department,
        salary,
        DATE_TRUNC('year', hire_date) AS hire_year,
        PERCENT_RANK() OVER (
            PARTITION BY DATE_TRUNC('year', hire_date), department
            ORDER BY salary
        ) AS pct_rank_in_cohort,
        PERCENT_RANK() OVER (
            PARTITION BY department
            ORDER BY salary
        ) AS pct_rank_in_dept
    FROM employees
)
SELECT
    employee_id,
    department,
    hire_year,
    salary,
    ROUND(100 * pct_rank_in_cohort, 1) AS cohort_percentile,
    ROUND(100 * pct_rank_in_dept, 1) AS dept_percentile,
    ROUND(100 * (pct_rank_in_cohort - pct_rank_in_dept), 1) AS cohort_vs_dept_diff
FROM employee_cohort
ORDER BY department, hire_year, salary;
```

Complex analysis often benefits from building in layers: first compute basic cumulative measures in a CTE, then add comparisons in a second CTE, and finally select with business logic in the main query. This makes the query more readable and debuggable.
Cumulative functions complete your analytical toolkit, enabling distribution analysis, relative positioning, and contribution measurement—all within SQL. Let's consolidate the essential concepts.
Module Complete:
You have now mastered the full range of SQL analytic functions, from running totals and moving averages to ranking, cumulative distribution, and percentile functions.
These capabilities transform SQL from a simple data retrieval language into a powerful analytical engine, enabling sophisticated time-series analysis, statistical computation, and business intelligence—all without leaving the database.
Congratulations! You've completed the Analytic Functions module. You now possess the skills to perform sophisticated row-by-row analysis, compute distributions, detect trends, and measure contributions—all within SQL. These patterns form the foundation of modern data analytics and business intelligence. Apply them to your own datasets to solidify mastery.