Every time you execute a SQL query, the database optimizer makes critical decisions in fractions of a second: Should it use an index or scan the entire table? Which join algorithm will be fastest? In what order should tables be processed? These decisions can mean the difference between a query completing in 10 milliseconds or 10 minutes.
But how does the optimizer make these decisions? It relies on database statistics—metadata that describes the shape, distribution, and characteristics of your data. Statistics are the optimizer's eyes into your data, and when those eyes become clouded by stale or missing information, query performance suffers dramatically.
Statistics maintenance is not optional—it is fundamental to database health.
By the end of this page, you will understand the critical role of database statistics in query optimization, master the types of statistics collected by modern databases, know when and how to update statistics, and develop strategies for automated statistics maintenance in production environments.
Database statistics are metadata collections that describe the characteristics of data stored in tables and indexes. The query optimizer uses these statistics to estimate the cost of various execution strategies and select the most efficient plan.
Why statistics matter:
Imagine you have a query that filters on a column status with the condition WHERE status = 'active'. Without statistics, the optimizer cannot know:

- How many rows the table contains
- How many distinct status values exist
- What fraction of rows have status = 'active'
- Whether an index on status would be selective enough to be worth using
With accurate statistics, the optimizer knows exactly what to expect and can make intelligent decisions. With stale or missing statistics, it essentially guesses—often badly.
| Statistic Type | What It Measures | Optimizer Use Case |
|---|---|---|
| Row Count (Cardinality) | Total number of rows in a table | Estimating result set sizes for joins and filters |
| Column Cardinality | Number of distinct values in a column | Evaluating selectivity of equality predicates |
| Null Count | Number of NULL values in a column | Estimating rows returned by IS NULL/IS NOT NULL |
| Average Column Width | Average byte size of column values | Memory allocation and I/O cost estimation |
| Value Distribution (Histograms) | Frequency distribution of column values | Handling non-uniform data distributions |
| Correlation | Physical ordering relative to index ordering | Deciding between index scan and bitmap scan |
| Index Statistics | Depth, leaf pages, clustering factor | Index selection and cost estimation |
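Most of these statistics are directly visible in system catalogs. As a brief sketch in PostgreSQL (the orders table name is illustrative), table-level and column-level statistics can be inspected like this:

```sql
-- Table-level statistics: estimated row count and page count
SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE relname = 'orders';

-- Column-level statistics: distinct values, null fraction, width, correlation
SELECT attname, n_distinct, null_frac, avg_width, correlation
FROM pg_stats
WHERE tablename = 'orders';
```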
Don't confuse statistics with constraints. Constraints (UNIQUE, NOT NULL, FOREIGN KEY) enforce data integrity rules. Statistics describe data as it exists. A UNIQUE constraint guarantees uniqueness; column cardinality statistics measure observed uniqueness. Both inform optimization, but they serve different purposes.
The query optimizer uses statistics to perform selectivity estimation—predicting how many rows will result from each operation in a query plan. This estimation cascades through the entire plan, affecting every decision the optimizer makes.
The selectivity chain:
```sql
-- Consider this query joining orders with customers
SELECT c.name, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
  AND o.order_date >= '2024-01-01'
  AND o.status = 'completed';

-- Statistics provide these estimates:
-- customers table: 1,000,000 rows
-- country column: 200 distinct values
-- 'USA' appears in ~40% of rows (histogram data)
-- → Estimated 400,000 rows after country filter

-- orders table: 50,000,000 rows
-- order_date: 10% of orders are from 2024
-- status: 5 distinct values, 'completed' = 60%
-- → Estimated 3,000,000 rows after date + status filter

-- Join estimate:
-- With avg 50 orders per customer
-- Final estimate: ~1,200,000 result rows

-- Without statistics:
-- Optimizer might assume 10% selectivity for each predicate
-- Leading to vastly different (and wrong) estimates
```

What happens with stale statistics:
When statistics don't reflect current data reality, the optimizer makes poor decisions:
Underestimation: Optimizer expects 1,000 rows but gets 1,000,000. Hash tables overflow to disk, memory is exhausted, query runs 100x slower than expected.
Overestimation: Optimizer expects 1,000,000 rows but gets 1,000. Allocates unnecessary resources, may choose suboptimal algorithms suited for larger data.
Wrong join order: Table sizes have changed, but statistics reflect old data. Optimizer joins in an order that explodes intermediate results.
Index neglect: Index would be perfect, but statistics suggest it's not selective enough. Full table scan ensues.
In production systems, stale statistics commonly cause queries to run 10x-1000x slower than optimal. A query that should complete in 100ms may take 10 minutes. This isn't theoretical—it's one of the most common performance problems in database administration.
Simple statistics like cardinality and null counts aren't sufficient for columns with non-uniform data distributions. Consider a country column where 40% of customers are from the USA, 15% from UK, 10% from Germany, and the remaining 35% spread across 197 other countries.
A simple "200 distinct values" statistic doesn't capture this skew. The optimizer would estimate 0.5% selectivity for any country (1/200), but 'USA' actually returns 40%—an 80x underestimate.
Histograms solve this problem by capturing the actual distribution of values.
| Histogram Type | Bucket Formation | Best For | Database Support |
|---|---|---|---|
| Equi-width | Equal value ranges per bucket | Uniformly distributed numerical data | MySQL, older systems |
| Equi-height (Equi-depth) | Equal row counts per bucket | Skewed distributions, most general purpose | PostgreSQL, Oracle, SQL Server |
| Frequency | One bucket per distinct value | Low-cardinality columns (< 256 values) | PostgreSQL, Oracle |
| Top-N + Frequency | Frequent values tracked individually | High skew with popular values | Oracle |
| Hybrid | Combination of above methods | Complex real-world distributions | SQL Server, Oracle |
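Some systems only build histograms on request. For example, MySQL 8.0 uses an explicit ANALYZE TABLE variant (the bucket count below is arbitrary):

```sql
-- MySQL 8.0: build (or rebuild) a histogram on a skewed column
ANALYZE TABLE customers UPDATE HISTOGRAM ON country WITH 64 BUCKETS;

-- Remove a histogram that is no longer useful
ANALYZE TABLE customers DROP HISTOGRAM ON country;
```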
```sql
-- View histogram statistics in PostgreSQL
SELECT
    attname AS column_name,
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';

-- Example output:
-- column_name: status
-- n_distinct: 5
-- most_common_vals: {completed,pending,shipped,cancelled,refunded}
-- most_common_freqs: {0.45,0.25,0.15,0.10,0.05}
-- histogram_bounds: NULL (frequency histogram used instead)

-- For high-cardinality columns, histogram_bounds shows bucket boundaries.
-- (pg_stats exposes it as anyarray, so cast before subscripting.)
SELECT
    attname,
    array_length(histogram_bounds, 1) AS num_buckets,
    (histogram_bounds::text::text[])[1:5] AS first_5_bounds,
    (histogram_bounds::text::text[])[96:100] AS last_5_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'order_date';
```

More histogram buckets mean higher accuracy but require more storage and collection time. PostgreSQL defaults to 100 buckets; SQL Server uses up to 200 steps. For extremely critical columns, you can increase resolution—but more buckets means more maintenance overhead.
Collecting statistics requires reading data from tables and indexes. For large tables, this can be expensive. Database systems offer multiple collection strategies that trade accuracy for performance.
```sql
-- PostgreSQL: ANALYZE command collects statistics

-- Analyze entire database (samples by default)
ANALYZE;

-- Analyze specific table
ANALYZE orders;

-- Analyze specific columns
ANALYZE orders (customer_id, order_date, status);

-- Verbose mode shows what was collected
ANALYZE VERBOSE orders;
-- Output includes: pages sampled, rows sampled,
-- which columns got histograms, etc.

-- Control sampling target (number of histogram buckets)
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;  -- More buckets = higher accuracy

-- Then re-analyze to apply
ANALYZE orders (customer_id);

-- Check when statistics were last updated
SELECT
    schemaname,
    relname AS table_name,
    last_analyze,
    last_autoanalyze,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_analyze DESC NULLS LAST;
```

For tables under 1 million rows, full scans are often acceptable. For larger tables, start with 10-30% sampling. Monitor query plan changes—if critical queries regress, increase sampling or run full scans on affected tables. The goal is statistics accurate enough to prevent plan regressions, not perfect accuracy.
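The same trade-off exists in other engines. For instance, SQL Server lets you choose the sample size when statistics are refreshed (the schema and table names here are illustrative):

```sql
-- SQL Server: sampled vs full-scan statistics updates
UPDATE STATISTICS dbo.orders WITH SAMPLE 25 PERCENT;  -- cheaper, approximate
UPDATE STATISTICS dbo.orders WITH FULLSCAN;           -- reads every row, most accurate
```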
Modern databases include auto-statistics features that detect stale statistics and refresh them automatically. Understanding these mechanisms—and their limitations—is essential for production database administration.
| Database | Feature Name | Trigger Threshold | Default Behavior |
|---|---|---|---|
| PostgreSQL | autovacuum (autoanalyze) | 50 + 10% of table rows changed | Enabled by default, runs in background |
| SQL Server | AUTO_UPDATE_STATISTICS | 500 + 20% of rows changed (older) / Adaptive (2016+) | Enabled by default, synchronous by default |
| MySQL (InnoDB) | innodb_stats_auto_recalc | 10% of rows changed | Enabled by default |
| Oracle | Auto Statistics Jobs | 10% staleness threshold | Runs nightly in maintenance window by default |
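As a hedged example of inspecting and enabling these settings in two of the systems above (the SalesDB database name is illustrative):

```sql
-- SQL Server: confirm and enable automatic statistics updates
SELECT name, is_auto_update_stats_on, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'SalesDB';

ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS_ASYNC ON;  -- refresh without blocking the triggering query

-- MySQL (InnoDB): confirm automatic recalculation is enabled
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';
```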
```sql
-- PostgreSQL autovacuum handles automatic ANALYZE

-- View current autovacuum settings
SHOW autovacuum;
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;

-- Default: analyze after 50 + 10% of rows changed
-- For 1M row table: triggers after 100,050 row changes

-- Customize per-table for high-churn tables
ALTER TABLE orders SET (
    autovacuum_analyze_threshold = 100,
    autovacuum_analyze_scale_factor = 0.02  -- 2% instead of 10%
);
-- Now analyzes after 100 + 2% = 20,100 changes for 1M rows

-- For append-only tables, can be more aggressive
ALTER TABLE event_log SET (
    autovacuum_analyze_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.001  -- 0.1%
);

-- Monitor autovacuum activity
SELECT
    relname,
    n_tup_ins AS inserts,
    n_tup_upd AS updates,
    n_tup_del AS deletes,
    n_mod_since_analyze AS changes_since_analyze,
    last_autoanalyze,
    autoanalyze_count
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY n_mod_since_analyze DESC;
```

Auto-statistics can lag behind data changes, especially during bulk loads or partition operations. Large tables may take hours to analyze after the threshold is reached. Async updates mean queries may use stale statistics. Critical systems often require supplementary manual statistics maintenance.
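A common supplement is an event-driven refresh: run ANALYZE explicitly right after a bulk load or partition swap instead of waiting for autovacuum to notice. A minimal sketch (the table name and file path are illustrative):

```sql
-- Bulk load followed by an immediate, explicit ANALYZE
COPY orders_2024 FROM '/data/orders_2024.csv' WITH (FORMAT csv, HEADER true);
ANALYZE orders_2024;  -- make the new rows visible to the planner right away
```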
Relying solely on automatic statistics is insufficient for production databases with performance SLAs. A comprehensive statistics maintenance strategy combines automatic updates, scheduled maintenance, and event-driven updates.
```sql
-- Comprehensive PostgreSQL Statistics Maintenance

-- 1. Find tables needing analysis
WITH analysis_candidates AS (
    SELECT
        schemaname,
        relname,
        n_live_tup,
        n_dead_tup,
        n_mod_since_analyze,
        last_analyze,
        last_autoanalyze,
        GREATEST(last_analyze, last_autoanalyze) AS last_stats,
        CASE
            WHEN n_live_tup = 0 THEN 0
            ELSE ROUND(100.0 * n_mod_since_analyze / n_live_tup, 2)
        END AS pct_changed
    FROM pg_stat_user_tables
)
SELECT *
FROM analysis_candidates
WHERE pct_changed > 5
   OR last_stats < NOW() - INTERVAL '7 days'
   OR last_stats IS NULL
ORDER BY n_mod_since_analyze DESC;

-- 2. Analyze only tables that need it (dynamic SQL in function)
CREATE OR REPLACE FUNCTION maintain_statistics(
    threshold_pct NUMERIC DEFAULT 5.0,
    max_age_days INTEGER DEFAULT 7
) RETURNS TABLE(analyzed_table TEXT, reason TEXT) AS $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN
        SELECT
            schemaname,
            relname,
            n_live_tup,
            n_mod_since_analyze,
            GREATEST(last_analyze, last_autoanalyze) AS last_stats
        FROM pg_stat_user_tables
        WHERE (n_live_tup > 0
               AND 100.0 * n_mod_since_analyze / n_live_tup > threshold_pct)
           OR GREATEST(last_analyze, last_autoanalyze) < NOW() - (max_age_days || ' days')::INTERVAL
           OR (last_analyze IS NULL AND last_autoanalyze IS NULL)
    LOOP
        EXECUTE format('ANALYZE %I.%I', rec.schemaname, rec.relname);
        analyzed_table := rec.schemaname || '.' || rec.relname;
        reason := CASE
            WHEN rec.last_stats IS NULL THEN 'never analyzed'
            WHEN rec.n_live_tup > 0  -- guard against division by zero on empty tables
                 AND 100.0 * rec.n_mod_since_analyze / rec.n_live_tup > threshold_pct
                THEN 'high modification count'
            ELSE 'age exceeded threshold'
        END;
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run maintenance
SELECT * FROM maintain_statistics(5.0, 7);
```
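To run this routine on a schedule, one option is the pg_cron extension. A minimal sketch, assuming pg_cron is installed (the job name and 02:00 schedule are arbitrary):

```sql
-- Nightly statistics maintenance at 02:00 server time
SELECT cron.schedule('nightly-stats-maintenance', '0 2 * * *',
                     'SELECT * FROM maintain_statistics(5.0, 7)');

-- Review or remove the job later
SELECT jobid, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-stats-maintenance');
```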
Effective statistics management requires ongoing monitoring. You need to detect stale statistics before they cause performance problems and diagnose statistics-related issues when queries regress.

```sql
-- Diagnosing statistics issues in PostgreSQL

-- 1. Compare estimated vs actual rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE status = 'pending'
  AND order_date > '2024-01-01';

-- Look for a mismatch between planned and actual rows, e.g.:
-- "Seq Scan on orders (cost=... rows=5000 ...) (actual time=... rows=500000 loops=1)"
-- "Rows Removed by Filter: 1000000"
-- planned rows=5000 vs actual rows=500000 ← 100x underestimate!

-- 2. Check if column has statistics
SELECT
    attname,
    n_distinct,
    null_frac,
    avg_width,
    most_common_vals IS NOT NULL AS has_mcv,
    histogram_bounds IS NOT NULL AS has_histogram
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'order_date', 'customer_id');

-- 3. Verify histogram accuracy
-- (cast needed because pg_stats exposes histogram_bounds as anyarray)
SELECT
    attname,
    array_length(histogram_bounds, 1) AS bucket_count,
    (histogram_bounds::text::text[])[1] AS min_value,
    (histogram_bounds::text::text[])[array_length(histogram_bounds, 1)] AS max_value
FROM pg_stats
WHERE tablename = 'orders'
  AND histogram_bounds IS NOT NULL;

-- 4. Check MCV accuracy for specific value
SELECT
    attname,
    most_common_vals,
    most_common_freqs,
    (SELECT COUNT(*)::float / (SELECT COUNT(*) FROM orders)
     FROM orders WHERE status = 'pending') AS actual_freq
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';

-- 5. Identify correlation issues (index efficiency)
SELECT
    attname,
    correlation  -- -1 to 1: how sorted the data is relative to physical order
FROM pg_stats
WHERE tablename = 'orders';
-- Low correlation on an indexed column = index less effective
```

Statistics maintenance is the invisible foundation of database performance. Without accurate statistics, even the most powerful optimizer becomes blind, guessing at data distributions and making suboptimal decisions that can slow queries by orders of magnitude.
What's next:
Statistics tell the optimizer about data distribution. But the physical organization of data matters too. In the next page, we'll explore Index Maintenance—how to keep indexes efficient, when to rebuild or reorganize them, and strategies for managing index bloat in production systems.
You now understand the critical role of database statistics in query optimization, the types of statistics collected, collection methods, and maintenance strategies. This knowledge is essential for maintaining high-performance production databases.