Consider a business intelligence dashboard displaying metrics such as this month's total revenue, order counts, average order value, and unique visitors.
In a normalized schema, every dashboard refresh recomputes these statistics from raw transactional data. As your data grows, these queries become progressively slower. What took milliseconds with 10,000 orders takes seconds with 10 million orders, and minutes with 100 million orders.
Pre-computed aggregates solve this by calculating and storing summary statistics in advance. Instead of scanning millions of rows to answer "what's this month's revenue?", you query a single pre-computed row.
By the end of this page, you will master the design and implementation of pre-computed aggregates. You'll understand granularity hierarchies, incremental vs. full refresh strategies, consistency guarantees, and the architectural patterns that make aggregates maintainable at scale.
A pre-computed aggregate is a stored summary statistic calculated from detailed source data. Unlike derived columns (which compute values for individual rows), aggregates combine data across multiple rows into summary records.
Terminology:
| Term | Definition | Example |
|---|---|---|
| Source data | The detailed transactional records | Individual orders, page views, transactions |
| Aggregate | The computed summary value | SUM, COUNT, AVG, MIN, MAX, etc. |
| Dimension | The grouping key(s) for aggregation | Date, region, category, customer segment |
| Granularity | The level of detail in the aggregate | Daily, weekly, monthly, by-region, by-store |
| Aggregate table | The table storing pre-computed summaries | daily_sales_summary, monthly_revenue_by_region |
How aggregates transform queries:
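As a sketch of the before/after shape of such a query, assuming the order_items and daily_sales_summary tables defined later on this page:

```sql
-- Before (normalized schema): every dashboard refresh scans and groups raw rows
SELECT category_id,
       SUM(quantity * unit_price) AS total_revenue,
       COUNT(DISTINCT order_id)   AS order_count
FROM order_items
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY category_id;

-- After (pre-computed aggregate): read a handful of summary rows instead
SELECT category_id,
       SUM(total_revenue) AS total_revenue,
       SUM(order_count)   AS order_count
FROM daily_sales_summary
WHERE summary_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY category_id;
```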
Common aggregate types:
| Aggregate Function | Use Case | Incremental Maintenance |
|---|---|---|
| COUNT | Row counts, visitor counts | Easy: +1 or -1 per change |
| SUM | Revenue, quantities, totals | Easy: +value or -value per change |
| AVG | Average order value, ratings | Moderate: Store sum and count separately |
| MIN / MAX | First/last values, extremes | Hard: May need full recalc on delete |
| COUNT DISTINCT | Unique visitors, unique products | Hard: Requires probabilistic structures or full recalc |
| PERCENTILE | Median, 95th percentile | Very hard: Typically requires full recalc |
The complexity of maintaining an aggregate dictates whether incremental updates are practical or periodic full recalculation is required.
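AVG illustrates the component trick: rather than maintaining the average itself, keep the incrementally maintainable pieces (sum and count) and divide at read time. A minimal sketch against the daily_sales_summary table defined below:

```sql
-- avg_order_value is derived from components that are cheap to maintain incrementally
SELECT summary_date,
       category_id,
       total_revenue / NULLIF(order_count, 0) AS avg_order_value
FROM daily_sales_summary
WHERE summary_date >= CURRENT_DATE - 7;
```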
Effective aggregate design requires understanding your query patterns and choosing appropriate granularity levels. The goal is to pre-answer the questions your applications actually ask.
Granularity hierarchy example (temporal):
Raw transactions (most granular)
↓ aggregate ↓
Hourly summaries
↓ aggregate ↓
Daily summaries
↓ aggregate ↓
Monthly summaries
↓ aggregate ↓
Yearly summaries (least granular)
Each level in the hierarchy is both a consumer of more granular data and a source for less granular aggregates. This layered approach enables efficient queries at any time scale.
```sql
-- Example: Multi-granularity sales aggregate schema

-- Source table: Individual order items (millions of rows)
CREATE TABLE order_items (
    order_item_id SERIAL PRIMARY KEY,
    order_id      INT NOT NULL,
    product_id    INT NOT NULL,
    category_id   INT NOT NULL,
    region_id     INT NOT NULL,
    order_date    DATE NOT NULL,
    quantity      INT NOT NULL,
    unit_price    DECIMAL(10, 2) NOT NULL,
    total_amount  DECIMAL(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);

-- Aggregate Level 1: Daily sales by category and region
-- Granularity: Day × Category × Region (tens of thousands of rows)
CREATE TABLE daily_sales_summary (
    summary_id     SERIAL PRIMARY KEY,
    summary_date   DATE NOT NULL,
    category_id    INT NOT NULL,
    region_id      INT NOT NULL,
    -- Aggregate columns
    total_revenue  DECIMAL(14, 2) NOT NULL DEFAULT 0,
    total_quantity INT NOT NULL DEFAULT 0,
    order_count    INT NOT NULL DEFAULT 0,
    -- For AVG calculation: store components separately
    -- avg_order_value = total_revenue / order_count
    -- Metadata
    last_updated   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (summary_date, category_id, region_id)
);

-- Aggregate Level 2: Monthly sales by category
-- Granularity: Month × Category (thousands of rows)
CREATE TABLE monthly_category_summary (
    summary_id       SERIAL PRIMARY KEY,
    summary_month    DATE NOT NULL,  -- First day of month
    category_id      INT NOT NULL,
    total_revenue    DECIMAL(16, 2) NOT NULL DEFAULT 0,
    total_quantity   INT NOT NULL DEFAULT 0,
    order_count      INT NOT NULL DEFAULT 0,
    unique_customers INT NOT NULL DEFAULT 0,  -- Requires special handling
    last_updated     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (summary_month, category_id)
);

-- Aggregate Level 3: Yearly company-wide totals
-- Granularity: Year (tens of rows)
CREATE TABLE yearly_company_summary (
    summary_id      SERIAL PRIMARY KEY,
    summary_year    INT NOT NULL,
    total_revenue   DECIMAL(18, 2) NOT NULL DEFAULT 0,
    total_orders    INT NOT NULL DEFAULT 0,
    total_customers INT NOT NULL DEFAULT 0,
    avg_order_value DECIMAL(12, 2) NOT NULL DEFAULT 0,
    last_updated    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (summary_year)
);

-- Index design for aggregate tables
CREATE INDEX idx_daily_summary_date ON daily_sales_summary (summary_date DESC);
CREATE INDEX idx_daily_summary_category ON daily_sales_summary (category_id, summary_date DESC);
CREATE INDEX idx_monthly_summary_category ON monthly_category_summary (category_id, summary_month DESC);
```

Before creating aggregate tables, catalog your actual query patterns. What dimensions are always filtered? What time granularity is needed? What metrics are displayed together? Design aggregates that directly answer these queries without additional computation.
Multi-dimensional aggregates:
Real-world analytics often require drilling down across multiple dimensions. A well-designed aggregate schema supports common combinations:
| Aggregate Table | Dimensions | Query Examples |
|---|---|---|
| daily_sales_by_region | date, region | Regional daily trends |
| daily_sales_by_category | date, category | Category daily trends |
| monthly_sales_by_region_category | month, region, category | Regional category performance |
| customer_segment_summary | segment, acquisition_month | Cohort analysis |
| product_performance | product, month | Product-level trends |
Each aggregate table trades storage for query speed. The key is identifying which dimension combinations are frequently queried together.
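As a sketch of what such a table buys you, assume monthly_sales_by_region_category has columns summary_month, region_id, category_id, and total_revenue (hypothetical names); a regional performance query then reads a few hundred summary rows instead of millions of order items:

```sql
-- Hypothetical aggregate table: monthly_sales_by_region_category
SELECT region_id,
       SUM(total_revenue) AS revenue
FROM monthly_sales_by_region_category
WHERE category_id = 5
  AND summary_month >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '12 months'
GROUP BY region_id
ORDER BY revenue DESC;
```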
Maintaining aggregate accuracy as source data changes is the central challenge. There are three fundamental approaches, each with distinct trade-offs.
Strategy 1: Full Refresh (Complete Recalculation)
Periodically recalculate aggregates from scratch by querying source data.
```sql
-- Full refresh: Recalculate daily summary from source data
-- Typically run as a scheduled job (e.g., nightly, hourly)

-- Step 1: Calculate new aggregates
CREATE TEMP TABLE new_daily_summary AS
SELECT
    order_date AS summary_date,
    category_id,
    region_id,
    SUM(total_amount) AS total_revenue,
    SUM(quantity) AS total_quantity,
    COUNT(DISTINCT order_id) AS order_count
FROM order_items
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'  -- Scope to recent data
GROUP BY order_date, category_id, region_id;

-- Step 2: Merge into aggregate table (upsert pattern)
INSERT INTO daily_sales_summary
    (summary_date, category_id, region_id, total_revenue, total_quantity, order_count, last_updated)
SELECT
    summary_date, category_id, region_id,
    total_revenue, total_quantity, order_count,
    CURRENT_TIMESTAMP
FROM new_daily_summary
ON CONFLICT (summary_date, category_id, region_id)
DO UPDATE SET
    total_revenue = EXCLUDED.total_revenue,
    total_quantity = EXCLUDED.total_quantity,
    order_count = EXCLUDED.order_count,
    last_updated = CURRENT_TIMESTAMP;

-- Step 3: Cleanup
DROP TABLE new_daily_summary;
```

| Pros | Cons |
|---|---|
| Simple to implement and debug | Expensive for large datasets |
| Guaranteed accuracy (no drift) | Must scan all source data |
| Handles all aggregate types (including MIN/MAX) | Stale data between refreshes |
| No triggers or complex maintenance code | Resource-intensive during refresh |
Strategy 2: Incremental Update (Real-time Maintenance)
Update aggregates immediately when source data changes, using triggers or application logic.
```sql
-- Incremental update: Maintain aggregates via triggers

CREATE OR REPLACE FUNCTION maintain_daily_sales_aggregate()
RETURNS TRIGGER AS $$
DECLARE
    affected_date     DATE;
    affected_category INT;
    affected_region   INT;
    delta_revenue     DECIMAL(12, 2);
    delta_quantity    INT;
    delta_orders      INT;
BEGIN
    -- Determine affected dimensions and deltas based on operation
    IF TG_OP = 'INSERT' THEN
        affected_date := NEW.order_date;
        affected_category := NEW.category_id;
        affected_region := NEW.region_id;
        delta_revenue := NEW.quantity * NEW.unit_price;
        delta_quantity := NEW.quantity;
        delta_orders := 1;  -- Simplified; see note on unique order counting
    ELSIF TG_OP = 'DELETE' THEN
        affected_date := OLD.order_date;
        affected_category := OLD.category_id;
        affected_region := OLD.region_id;
        delta_revenue := -(OLD.quantity * OLD.unit_price);
        delta_quantity := -OLD.quantity;
        delta_orders := -1;
    ELSIF TG_OP = 'UPDATE' THEN
        -- Handle updates by processing as delete + insert if dimensions changed
        IF OLD.order_date != NEW.order_date
           OR OLD.category_id != NEW.category_id
           OR OLD.region_id != NEW.region_id THEN
            -- Decrement old aggregate
            INSERT INTO daily_sales_summary
                (summary_date, category_id, region_id, total_revenue, total_quantity, order_count)
            VALUES
                (OLD.order_date, OLD.category_id, OLD.region_id,
                 -(OLD.quantity * OLD.unit_price), -OLD.quantity, -1)
            ON CONFLICT (summary_date, category_id, region_id)
            DO UPDATE SET
                total_revenue = daily_sales_summary.total_revenue + EXCLUDED.total_revenue,
                total_quantity = daily_sales_summary.total_quantity + EXCLUDED.total_quantity,
                order_count = daily_sales_summary.order_count + EXCLUDED.order_count,
                last_updated = CURRENT_TIMESTAMP;

            -- The new dimension row receives the full NEW values
            delta_revenue := NEW.quantity * NEW.unit_price;
            delta_quantity := NEW.quantity;
            delta_orders := 1;
        ELSE
            -- Same dimensions: apply only the difference
            delta_revenue := (NEW.quantity * NEW.unit_price) - (OLD.quantity * OLD.unit_price);
            delta_quantity := NEW.quantity - OLD.quantity;
            delta_orders := 0;  -- Same order, different values
        END IF;

        affected_date := NEW.order_date;
        affected_category := NEW.category_id;
        affected_region := NEW.region_id;
    END IF;

    -- Apply delta to aggregate table (upsert with increment)
    INSERT INTO daily_sales_summary
        (summary_date, category_id, region_id, total_revenue, total_quantity, order_count, last_updated)
    VALUES
        (affected_date, affected_category, affected_region,
         delta_revenue, delta_quantity, delta_orders, CURRENT_TIMESTAMP)
    ON CONFLICT (summary_date, category_id, region_id)
    DO UPDATE SET
        total_revenue = daily_sales_summary.total_revenue + EXCLUDED.total_revenue,
        total_quantity = daily_sales_summary.total_quantity + EXCLUDED.total_quantity,
        order_count = daily_sales_summary.order_count + EXCLUDED.order_count,
        last_updated = CURRENT_TIMESTAMP;

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_aggregate_trigger
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION maintain_daily_sales_aggregate();
```

| Pros | Cons |
|---|---|
| Real-time aggregate accuracy | Complex trigger logic |
| No periodic refresh overhead | Potential for drift over time |
| Consistent query latency | Adds latency to every write |
| No stale data window | Difficult for MIN/MAX/COUNT DISTINCT |
Strategy 3: Hybrid Approach (Incremental + Periodic Reconciliation)
The most robust approach combines real-time incremental updates with periodic full reconciliation to detect and correct drift.
```sql
-- Hybrid approach: Incremental updates + periodic reconciliation

-- 1. Use triggers for real-time incremental updates (as shown above)
-- 2. Run periodic reconciliation job to detect and fix drift

-- Reconciliation job: Compare computed vs. stored aggregates
CREATE OR REPLACE FUNCTION reconcile_daily_sales_aggregates(
    p_start_date DATE DEFAULT CURRENT_DATE - INTERVAL '7 days',
    p_end_date   DATE DEFAULT CURRENT_DATE
)
RETURNS TABLE (
    summary_date     DATE,
    category_id      INT,
    region_id        INT,
    stored_revenue   DECIMAL(14,2),
    computed_revenue DECIMAL(14,2),
    discrepancy      DECIMAL(14,2),
    corrected        BOOLEAN
) AS $$
BEGIN
    RETURN QUERY
    WITH computed AS (
        SELECT
            oi.order_date AS summary_date,
            oi.category_id,
            oi.region_id,
            COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS computed_revenue,
            COALESCE(SUM(oi.quantity), 0) AS computed_quantity,
            COUNT(DISTINCT oi.order_id) AS computed_orders
        FROM order_items oi
        WHERE oi.order_date BETWEEN p_start_date AND p_end_date
        GROUP BY oi.order_date, oi.category_id, oi.region_id
    ),
    comparison AS (
        SELECT
            COALESCE(c.summary_date, s.summary_date) AS summary_date,
            COALESCE(c.category_id, s.category_id) AS category_id,
            COALESCE(c.region_id, s.region_id) AS region_id,
            COALESCE(s.total_revenue, 0) AS stored_revenue,
            COALESCE(c.computed_revenue, 0) AS computed_revenue,
            COALESCE(c.computed_revenue, 0) - COALESCE(s.total_revenue, 0) AS discrepancy,
            COALESCE(c.computed_quantity, 0) AS computed_quantity,
            COALESCE(c.computed_orders, 0) AS computed_orders
        FROM computed c
        FULL OUTER JOIN daily_sales_summary s
            ON c.summary_date = s.summary_date
            AND c.category_id = s.category_id
            AND c.region_id = s.region_id
        WHERE s.summary_date BETWEEN p_start_date AND p_end_date
           OR c.summary_date BETWEEN p_start_date AND p_end_date
    )
    -- Auto-fix discrepancies
    UPDATE daily_sales_summary dst
    SET total_revenue = cmp.computed_revenue,
        total_quantity = cmp.computed_quantity,
        order_count = cmp.computed_orders,
        last_updated = CURRENT_TIMESTAMP
    FROM comparison cmp
    WHERE dst.summary_date = cmp.summary_date
      AND dst.category_id = cmp.category_id
      AND dst.region_id = cmp.region_id
      AND ABS(cmp.discrepancy) > 0.01  -- Tolerance for rounding differences
    RETURNING
        dst.summary_date, dst.category_id, dst.region_id,
        cmp.stored_revenue, cmp.computed_revenue, cmp.discrepancy, true;
END;
$$ LANGUAGE plpgsql;

-- Schedule reconciliation to run during off-peak hours
-- Example: Run nightly at 3 AM via pg_cron or external scheduler
-- SELECT * FROM reconcile_daily_sales_aggregates();
```

If your application requires real-time accurate aggregates (financial systems, inventory counts), use incremental updates with frequent reconciliation. If slight staleness is acceptable (analytics dashboards, reporting), periodic full refresh is simpler and less error-prone.
Not all aggregates are equally maintainable. While SUM and COUNT can be updated incrementally with simple delta arithmetic, other aggregates require special treatment.
COUNT DISTINCT (Unique Counts)
Counting unique values (unique customers, unique products, etc.) is challenging because the stored count alone does not carry enough information to update it incrementally: on insert you must know whether the value has already been counted for that period, and on delete you must know whether the value still appears in other source rows. You must either track the underlying set of values, accept approximation, or recalculate periodically.
Solutions for COUNT DISTINCT:
```sql
-- Strategy 1: Store component elements, not just the count
-- Maintains a set of unique values; count is derived

CREATE TABLE monthly_unique_customers (
    summary_month    DATE NOT NULL,
    category_id      INT NOT NULL,
    customer_id      INT NOT NULL,  -- Store each unique customer
    first_order_date DATE,
    last_order_date  DATE,
    order_count      INT DEFAULT 1,
    PRIMARY KEY (summary_month, category_id, customer_id)
);

-- Count is derived by counting rows
SELECT summary_month, category_id, COUNT(*) AS unique_customers
FROM monthly_unique_customers
GROUP BY summary_month, category_id;

-- Strategy 2: HyperLogLog for approximate counts (PostgreSQL extension)
-- Trades perfect accuracy for dramatically reduced storage and computation

CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE monthly_unique_visitors_hll (
    summary_month   DATE PRIMARY KEY,
    unique_visitors hll  -- HyperLogLog sketch
);

-- Insert/update by adding to HLL sketch
INSERT INTO monthly_unique_visitors_hll (summary_month, unique_visitors)
VALUES (DATE_TRUNC('month', CURRENT_DATE), hll_empty())
ON CONFLICT (summary_month) DO NOTHING;

UPDATE monthly_unique_visitors_hll
SET unique_visitors = hll_add(unique_visitors, hll_hash_text('user_12345'))
WHERE summary_month = DATE_TRUNC('month', CURRENT_DATE);

-- Query approximate count (typically within 2% accuracy)
SELECT summary_month, hll_cardinality(unique_visitors) AS approx_unique_visitors
FROM monthly_unique_visitors_hll;

-- Strategy 3: Periodic recalculation (simplest, most accurate)
-- Accept staleness, recalculate nightly

UPDATE monthly_category_summary mcs
SET unique_customers = (
    SELECT COUNT(DISTINCT o.customer_id)
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    WHERE DATE_TRUNC('month', o.order_date) = mcs.summary_month
      AND oi.category_id = mcs.category_id
);
```

MIN / MAX Aggregates
Minimum and maximum values are difficult to maintain incrementally because inserts and deletes are asymmetric: a new row can only improve the stored extreme (compare and replace), but deleting the row that holds the current minimum or maximum leaves no next-best candidate unless you rescan the source data or keep extra candidates on hand.
Solutions for MIN/MAX:
```sql
-- Strategy 1: Store top-N values, not just the single min/max
-- Deleting the current max still leaves other candidates

CREATE TABLE daily_top_orders (
    summary_date DATE NOT NULL,
    rank         INT NOT NULL CHECK (rank BETWEEN 1 AND 10),
    order_id     INT NOT NULL,
    order_total  DECIMAL(12, 2) NOT NULL,
    PRIMARY KEY (summary_date, rank)
);

-- When a new order arrives, check if it enters top-10
-- When an order is deleted, promote next candidate

-- Strategy 2: Accept eventual consistency for min/max
-- Store current min/max, mark as "potentially stale" on delete

CREATE TABLE daily_sales_extremes (
    summary_date    DATE PRIMARY KEY,
    max_order_id    INT,
    max_order_total DECIMAL(12, 2),
    min_order_id    INT,
    min_order_total DECIMAL(12, 2),
    needs_recalc    BOOLEAN DEFAULT FALSE  -- Flag when max/min deleted
);

-- Trigger on delete: flag for recalculation
CREATE OR REPLACE FUNCTION flag_extreme_recalc()
RETURNS TRIGGER AS $$
BEGIN
    UPDATE daily_sales_extremes
    SET needs_recalc = TRUE
    WHERE summary_date = OLD.order_date
      AND (max_order_id = OLD.order_id OR min_order_id = OLD.order_id);
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

-- Attach to the source table (assumes an orders table with order_date and order_total)
CREATE TRIGGER orders_flag_extreme_recalc
AFTER DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION flag_extreme_recalc();

-- Background job recalculates flagged rows
UPDATE daily_sales_extremes dse
SET max_order_id = sub.max_order_id,
    max_order_total = sub.max_total,
    min_order_id = sub.min_order_id,
    min_order_total = sub.min_total,
    needs_recalc = FALSE
FROM (
    SELECT
        order_date,
        (ARRAY_AGG(order_id ORDER BY order_total DESC))[1] AS max_order_id,
        MAX(order_total) AS max_total,
        (ARRAY_AGG(order_id ORDER BY order_total ASC))[1] AS min_order_id,
        MIN(order_total) AS min_total
    FROM orders
    WHERE order_date IN (SELECT summary_date FROM daily_sales_extremes WHERE needs_recalc)
    GROUP BY order_date
) sub
WHERE dse.summary_date = sub.order_date;
```

Different aggregate types may require different maintenance strategies in the same system. A dashboard might need real-time SUM/COUNT but accept nightly-recalculated COUNT DISTINCT and MIN/MAX. Design your aggregate tables to accommodate mixed refresh cadences.
A well-designed aggregate hierarchy enables efficient querying at any granularity level. Roll-up combines finer-grained aggregates into coarser ones. Drill-down queries more granular aggregates (or source data) for details.
Hierarchical aggregate example:
```sql
-- Level 0: Source data (most granular)
-- order_items table (millions of rows)

-- Level 1: Hourly aggregates
CREATE TABLE hourly_sales (
    summary_hour TIMESTAMP NOT NULL,  -- truncated to hour
    region_id    INT NOT NULL,
    category_id  INT NOT NULL,
    revenue      DECIMAL(14, 2) NOT NULL DEFAULT 0,
    units_sold   INT NOT NULL DEFAULT 0,
    order_count  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (summary_hour, region_id, category_id)
);

-- Level 2: Daily aggregates (built from hourly)
CREATE TABLE daily_sales (
    summary_date DATE NOT NULL,
    region_id    INT NOT NULL,
    category_id  INT NOT NULL,
    revenue      DECIMAL(14, 2) NOT NULL DEFAULT 0,
    units_sold   INT NOT NULL DEFAULT 0,
    order_count  INT NOT NULL DEFAULT 0,
    PRIMARY KEY (summary_date, region_id, category_id)
);

-- Level 3: Monthly aggregates (built from daily)
CREATE TABLE monthly_sales (
    summary_month DATE NOT NULL,  -- first of month
    region_id     INT NOT NULL,
    category_id   INT NOT NULL,
    revenue       DECIMAL(16, 2) NOT NULL DEFAULT 0,
    units_sold    INT NOT NULL DEFAULT 0,
    order_count   INT NOT NULL DEFAULT 0,
    PRIMARY KEY (summary_month, region_id, category_id)
);

-- Roll-up function: Aggregate hourly → daily
CREATE OR REPLACE FUNCTION rollup_hourly_to_daily(p_date DATE)
RETURNS void AS $$
BEGIN
    INSERT INTO daily_sales (summary_date, region_id, category_id, revenue, units_sold, order_count)
    SELECT p_date, region_id, category_id,
           SUM(revenue), SUM(units_sold), SUM(order_count)
    FROM hourly_sales
    WHERE summary_hour >= p_date
      AND summary_hour < p_date + INTERVAL '1 day'
    GROUP BY region_id, category_id
    ON CONFLICT (summary_date, region_id, category_id)
    DO UPDATE SET
        revenue = EXCLUDED.revenue,
        units_sold = EXCLUDED.units_sold,
        order_count = EXCLUDED.order_count;
END;
$$ LANGUAGE plpgsql;

-- Roll-up function: Aggregate daily → monthly
CREATE OR REPLACE FUNCTION rollup_daily_to_monthly(p_month DATE)
RETURNS void AS $$
BEGIN
    INSERT INTO monthly_sales (summary_month, region_id, category_id, revenue, units_sold, order_count)
    SELECT DATE_TRUNC('month', p_month), region_id, category_id,
           SUM(revenue), SUM(units_sold), SUM(order_count)
    FROM daily_sales
    WHERE summary_date >= DATE_TRUNC('month', p_month)
      AND summary_date < DATE_TRUNC('month', p_month) + INTERVAL '1 month'
    GROUP BY region_id, category_id
    ON CONFLICT (summary_month, region_id, category_id)
    DO UPDATE SET
        revenue = EXCLUDED.revenue,
        units_sold = EXCLUDED.units_sold,
        order_count = EXCLUDED.order_count;
END;
$$ LANGUAGE plpgsql;

-- Query patterns at different granularities:

-- Dashboard: Show today's hourly trend (queries hourly aggregates)
SELECT summary_hour, SUM(revenue) AS total_revenue
FROM hourly_sales
WHERE summary_hour >= CURRENT_DATE
  AND region_id = 1
GROUP BY summary_hour
ORDER BY summary_hour;

-- Report: Show this month's daily trend (queries daily aggregates)
SELECT summary_date, SUM(revenue) AS daily_revenue
FROM daily_sales
WHERE summary_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY summary_date
ORDER BY summary_date;

-- Exec summary: Year-over-year monthly comparison (queries monthly aggregates)
SELECT summary_month, SUM(revenue) AS monthly_revenue
FROM monthly_sales
WHERE summary_month >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY summary_month
ORDER BY summary_month;
```

Benefits of hierarchical aggregates:
Query efficiency at all scales — Yearly trend queries don't scan billions of source rows
Incremental maintenance — Each level only needs to process from the level below it
Selective drill-down — Start at coarse level, drill into finer granularity only when needed
Data retention tiers — Keep hourly data for 30 days, daily for 2 years, monthly forever (see the pruning sketch after this list)
Parallel processing — Different aggregate levels can be maintained by different jobs/servers
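A minimal sketch of the retention-tier idea, assuming the hourly_sales and daily_sales tables defined above and that the corresponding roll-ups have already run for the affected periods:

```sql
-- Prune fine-grained aggregates once coarser levels have been rolled up
DELETE FROM hourly_sales
WHERE summary_hour < CURRENT_DATE - INTERVAL '30 days';  -- keep hourly data for 30 days

DELETE FROM daily_sales
WHERE summary_date < CURRENT_DATE - INTERVAL '2 years';  -- keep daily data for 2 years

-- monthly_sales is retained indefinitely
```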
Temporal: Second → Minute → Hour → Day → Week → Month → Quarter → Year. Organizational: Store → District → Region → Country → Global. Product: SKU → Product → Subcategory → Category → Department. Design your hierarchy to match how your business naturally analyzes data.
Most modern database systems support materialized views—stored query results that can be periodically refreshed. Materialized views provide a declarative way to create pre-computed aggregates without manual maintenance code.
Advantages of materialized views: the aggregate is defined declaratively as a single SQL query, refreshed with one command, and can be indexed and queried exactly like a table, with no trigger or maintenance code to write.
Disadvantages: in PostgreSQL, a refresh recomputes the entire view from source data (incremental maintenance is limited and database-dependent), results are stale between refreshes, and a non-concurrent refresh blocks readers while a concurrent refresh requires a unique index.
```sql
-- PostgreSQL: Materialized Views for Pre-computed Aggregates

-- Create materialized view for daily sales summary
CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
SELECT
    DATE(o.order_date) AS summary_date,
    oi.category_id,
    oi.region_id,
    SUM(oi.quantity * oi.unit_price) AS total_revenue,
    SUM(oi.quantity) AS total_quantity,
    COUNT(DISTINCT o.order_id) AS order_count,
    COUNT(DISTINCT o.customer_id) AS unique_customers,
    AVG(oi.quantity * oi.unit_price) AS avg_line_item_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY DATE(o.order_date), oi.category_id, oi.region_id
WITH NO DATA;  -- Create structure only, don't populate yet

-- Create indexes on the materialized view (can be done before the initial refresh)
CREATE UNIQUE INDEX idx_mv_daily_sales_pk
    ON mv_daily_sales_summary (summary_date, category_id, region_id);
CREATE INDEX idx_mv_daily_sales_date
    ON mv_daily_sales_summary (summary_date DESC);
CREATE INDEX idx_mv_daily_sales_category
    ON mv_daily_sales_summary (category_id, summary_date DESC);

-- Initial population
REFRESH MATERIALIZED VIEW mv_daily_sales_summary;

-- Subsequent refreshes (exclusive lock - blocks queries during refresh)
REFRESH MATERIALIZED VIEW mv_daily_sales_summary;

-- Concurrent refresh (no lock - queries can continue, requires unique index)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales_summary;

-- Query the materialized view exactly like a regular table
SELECT summary_date, SUM(total_revenue) AS daily_total
FROM mv_daily_sales_summary
WHERE category_id = 5
  AND summary_date >= CURRENT_DATE - 30
GROUP BY summary_date
ORDER BY summary_date;

-- Monthly rollup materialized view (aggregates the daily view)
CREATE MATERIALIZED VIEW mv_monthly_sales_summary AS
SELECT
    DATE_TRUNC('month', summary_date) AS summary_month,
    category_id,
    region_id,
    SUM(total_revenue) AS total_revenue,
    SUM(total_quantity) AS total_quantity,
    SUM(order_count) AS order_count
    -- Note: Cannot directly sum unique_customers (would overcount)
FROM mv_daily_sales_summary
GROUP BY DATE_TRUNC('month', summary_date), category_id, region_id;

-- Unique index so the monthly view can also be refreshed CONCURRENTLY
CREATE UNIQUE INDEX idx_mv_monthly_sales_pk
    ON mv_monthly_sales_summary (summary_month, category_id, region_id);

-- Scheduled refresh (using pg_cron extension)
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Refresh daily MV every hour
SELECT cron.schedule(
    'refresh-daily-sales',
    '0 * * * *',  -- Every hour on the hour
    'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales_summary'
);

-- Refresh monthly MV once per day
SELECT cron.schedule(
    'refresh-monthly-sales',
    '0 2 * * *',  -- 2 AM daily
    'REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_sales_summary'
);
```

REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the materialized view. Without a unique index, you must use blocking refresh, which prevents queries during the refresh operation. Plan your materialized view indexes carefully.
Comparing Materialized Views vs. Manual Aggregate Tables:
| Aspect | Materialized Views | Manual Aggregate Tables |
|---|---|---|
| Definition | Declarative SQL query | Explicit schema + maintenance code |
| Refresh | Single REFRESH command | Custom update logic (triggers/jobs) |
| Incremental update | Limited (database-dependent) | Full control over delta logic |
| Query complexity | Direct SELECT from source | Must match aggregate schema |
| Index control | Full index control | Full index control |
| Cross-table refresh | Automatic via view query | Must coordinate manually |
| Production maturity | Varies by database | Battle-tested patterns available |
Pre-computed aggregates are among the most impactful denormalization techniques for analytical workloads. When designed and maintained correctly, they transform query performance from seconds to milliseconds.
What's Next:
Pre-computed aggregates address analytical query performance. But many transactional queries are slowed by joins—retrieving related data from normalized tables. The next page explores duplicating foreign key data: storing copies of related data in the same table to eliminate join operations.
You now understand pre-computed aggregates as a denormalization technique. You can design aggregate schemas, implement maintenance strategies, handle complex aggregate types, and build hierarchical aggregate structures. Next, we explore duplicating foreign key data for join elimination.