Statistics collected today describe the data that exists today. But databases are living systems—rows are inserted, updated, and deleted continuously. A histogram built last month for a 10-million row table might be wildly inaccurate if the table now contains 50 million rows with entirely different value distributions.
The challenge is clear: statistics must be maintained as data evolves. Too frequent updates waste resources. Too infrequent updates lead to optimizer decisions based on outdated information. Finding the right balance is both science and art.
By the end of this page, you will understand how automatic statistics maintenance works across major databases, when and how to trigger manual statistics collection, strategies for detecting and responding to stale statistics, and operational best practices for different workload patterns.
Statistics decay through several mechanisms, each with different implications for optimizer accuracy.
| Change Type | Effect on Statistics | Detection Difficulty | Impact Severity |
|---|---|---|---|
| Bulk INSERT | Row count underestimated; new value ranges not in histogram | Easy (row count delta) | High |
| Bulk DELETE | Row count overestimated; deleted values still in histogram | Easy (row count delta) | High |
| Gradual INSERT | Slow drift in row count and distribution | Moderate (cumulative) | Moderate |
| UPDATE on indexed columns | Value distribution shifts; correlation degrades | Hard to detect | Moderate to High |
| UPDATE on non-indexed columns | Column statistics inaccurate | Hard to detect | Low to Moderate |
| Seasonal patterns | Time-based distributions shift | Very hard | Varies |
The Drift Window:
Statistics Quality Over Time
100% ─┬────────────────────────────────
│ Excellent estimates
│ ╔═══════════════════╗
80% ─┼─║ Statistics ║───────────
│ ║ collected here ║
60% ─┼─║ ║───────────
│ ╚═════════╤═════════╝
40% ─┼───────────│─────────────────────
│ │ ← Drift begins
20% ─┼───────────│─────────────────────
│ ▼
0% ─┴─────────────────────────────────
T₀ T₁ T₂ Time
↑
Heavy data modifications
cause accelerated drift
The goal of statistics maintenance is to refresh statistics before quality degrades to the point of causing bad query plans.
The most dangerous staleness is undetectable. If you DELETE 90% of a table's rows but they were uniformly distributed, histograms remain 'correct' in shape but row count estimates are 10x too high. Table-level modification counters help, but aren't foolproof.
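A toy calculation (with made-up numbers) shows why this failure mode is invisible to histogram checks — the shape of the distribution is still right, so only the row count is lying:

```python
# Toy model of a uniform 90% DELETE: the histogram's selectivity for any
# predicate is unchanged, but the stale row count inflates every estimate.
def estimate_rows(selectivity: float, rowcount: int) -> int:
    """Optimizer-style cardinality estimate: selectivity x table rows."""
    return round(selectivity * rowcount)

selectivity = 0.05           # histogram says 5% of rows match -- still true!
stale_rowcount = 10_000_000  # row count recorded at the last ANALYZE
actual_rowcount = 1_000_000  # after deleting 90% of rows uniformly

estimated = estimate_rows(selectivity, stale_rowcount)
actual = estimate_rows(selectivity, actual_rowcount)
print(f"estimated {estimated}, actual {actual}, off by {estimated / actual:.0f}x")
```

Every estimate built on the stale row count is off by the same 10x factor, yet no per-column statistic has changed.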
Modern databases include sophisticated automatic statistics collection. Understanding how they work helps you configure them appropriately.
PostgreSQL: Autovacuum Daemon
PostgreSQL combines VACUUM (dead tuple cleanup) with ANALYZE in the autovacuum daemon.
Trigger Condition:
tuples_modified > autovacuum_analyze_threshold +
(autovacuum_analyze_scale_factor × table_size)
Defaults:
autovacuum_analyze_threshold = 50 tuples
autovacuum_analyze_scale_factor = 0.1 (10%)
Example: 1,000,000 row table
Threshold = 50 + (0.1 × 1,000,000) = 100,050 modifications
→ ANALYZE triggers after ~10% of table modified
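The trigger condition above is simple enough to sketch in a few lines (a minimal model of the formula, not PostgreSQL's actual implementation):

```python
def autoanalyze_trigger(table_rows: int,
                        threshold: int = 50,
                        scale_factor: float = 0.1) -> float:
    """Modifications needed before autovacuum schedules an ANALYZE."""
    return threshold + scale_factor * table_rows

# The worked example from above: a 1,000,000-row table
assert autoanalyze_trigger(1_000_000) == 100_050
# A lower per-table scale factor shrinks the drift window dramatically
assert autoanalyze_trigger(1_000_000, threshold=1000, scale_factor=0.01) == 11_000
```

Note the consequence for large tables: with the defaults, a billion-row table tolerates ~100 million modifications before autoanalyze fires, which is why per-table overrides matter.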
Configuration:
-- View current settings
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;
-- Adjust globally
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05;
SELECT pg_reload_conf();
-- Per-table override for critical tables
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
ALTER TABLE orders SET (autovacuum_analyze_threshold = 1000);
Monitoring:
SELECT
relname,
last_analyze,
last_autoanalyze,
n_mod_since_analyze,
n_live_tup,
ROUND(100.0 * n_mod_since_analyze / NULLIF(n_live_tup, 0), 2) AS mod_percent
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC;
All automatic systems are reactive—they detect changes after they occur. For bulk operations (ETL loads, batch deletes), the first query after the load runs before statistics update. Always ANALYZE immediately after bulk data operations.
While automatic maintenance handles steady-state operations, many scenarios require deliberate manual intervention.
-- Post-ETL statistics refresh
BEGIN;
  -- Bulk load
  COPY orders FROM '/data/orders_2024.csv' WITH CSV HEADER;
  -- Immediate statistics update
  ANALYZE orders;
COMMIT;

-- Targeted column analysis
ANALYZE orders (customer_id, order_date, status);

-- Full database refresh (use sparingly)
ANALYZE;

-- Verbose mode for monitoring
ANALYZE VERBOSE orders;

-- Statistics collection with increased precision
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 1000;
ANALYZE orders (amount);

-- Combined script for a maintenance window
DO $$
DECLARE
  tbl record;
BEGIN
  FOR tbl IN
    SELECT schemaname, relname
    FROM pg_stat_user_tables
    WHERE n_mod_since_analyze > 10000
  LOOP
    EXECUTE format('ANALYZE %I.%I', tbl.schemaname, tbl.relname);
    RAISE NOTICE 'Analyzed: %.%', tbl.schemaname, tbl.relname;
  END LOOP;
END $$;
For very large tables (100M+ rows), full scans are expensive. Use sampling: 10-30% samples provide excellent accuracy at much lower overhead. Most databases default to AUTO sampling, which adapts the sample size to the table size.
Proactively identifying stale statistics prevents performance surprises. Multiple approaches help detect staleness at different levels.
Approach 1: Modification Counters
Most databases track how many rows have been modified since the last statistics collection.
-- PostgreSQL: Modifications since analyze
SELECT
relname,
n_live_tup AS current_rows,
n_mod_since_analyze AS modifications,
ROUND(100.0 * n_mod_since_analyze / NULLIF(n_live_tup, 0), 1) AS mod_pct,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > n_live_tup * 0.1 -- More than 10% modified
ORDER BY mod_pct DESC;
Approach 2: Age-Based Detection
Simple but effective: statistics older than a threshold are suspect.
-- PostgreSQL: Statistics older than 7 days
SELECT
relname,
GREATEST(last_analyze, last_autoanalyze) AS last_stats,
NOW() - GREATEST(last_analyze, last_autoanalyze) AS age
FROM pg_stat_user_tables
WHERE GREATEST(last_analyze, last_autoanalyze) < NOW() - INTERVAL '7 days'
ORDER BY age DESC;
-- Oracle: Stale stats flag
SELECT table_name, stale_stats, last_analyzed
FROM dba_tab_statistics
WHERE stale_stats = 'YES' AND owner = 'SALES';
Approach 3: Estimate vs. Actual Comparison
The gold standard: compare optimizer estimates to actual execution results.
-- PostgreSQL: EXPLAIN ANALYZE shows both
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE status = 'pending';
-- Look for: "rows=1000 actual rows=150000"
-- 150x difference indicates serious statistics problem!
-- SQL Server: Include Actual Execution Plan
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Execute query and compare EstimatedRows vs ActualRows
If actual rows differ from estimated rows by more than 10x, statistics are likely stale or inadequate. Immediately ANALYZE the affected tables and check for missing histograms or extended statistics.
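This audit can be automated by walking the JSON form of the plan (EXPLAIN (ANALYZE, FORMAT JSON) in PostgreSQL) and flagging nodes with extreme estimate-to-actual ratios. A minimal sketch — the plan dict below is a hand-written stand-in for real EXPLAIN output:

```python
def worst_estimate_ratio(node: dict) -> float:
    """Largest estimate/actual discrepancy in a plan tree (either direction)."""
    est = max(node.get("Plan Rows", 1), 1)
    act = max(node.get("Actual Rows", 1), 1)
    ratio = max(est / act, act / est)
    for child in node.get("Plans", []):
        ratio = max(ratio, worst_estimate_ratio(child))
    return ratio

# Hand-written stand-in for EXPLAIN (ANALYZE, FORMAT JSON) output
plan = {
    "Node Type": "Hash Join", "Plan Rows": 1_000, "Actual Rows": 150_000,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 50_000, "Actual Rows": 49_500},
    ],
}
if worst_estimate_ratio(plan) > 10:
    print("Likely stale statistics -- ANALYZE the tables involved")
```

Taking the ratio in both directions matters: a 100x underestimate is just as plan-breaking as a 100x overestimate.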
-- Comprehensive staleness dashboard
SELECT
  schemaname || '.' || relname AS table_name,
  n_live_tup AS rows,
  n_dead_tup AS dead_rows,
  n_mod_since_analyze AS modifications,
  CASE
    WHEN n_live_tup = 0 THEN 'Empty table'
    WHEN n_mod_since_analyze > n_live_tup THEN 'CRITICAL: More mods than rows'
    WHEN n_mod_since_analyze > n_live_tup * 0.3 THEN 'HIGH: >30% modified'
    WHEN n_mod_since_analyze > n_live_tup * 0.1 THEN 'MEDIUM: >10% modified'
    ELSE 'OK'
  END AS staleness_status,
  COALESCE(last_autoanalyze, last_analyze) AS last_stats_update,
  NOW() - COALESCE(last_autoanalyze, last_analyze) AS stats_age
FROM pg_stat_user_tables
WHERE n_live_tup > 1000  -- Ignore tiny tables
ORDER BY
  CASE
    WHEN n_live_tup = 0 THEN 0
    ELSE n_mod_since_analyze::float / n_live_tup
  END DESC;
Different workload patterns require different statistics maintenance strategies. Here are battle-tested best practices.
| Task | Frequency | Trigger | Priority |
|---|---|---|---|
| Auto-stats enabled | Continuous | Modification threshold | Critical |
| Post-ETL ANALYZE | After each load | Load completion | Critical |
| Full database ANALYZE | Weekly | Maintenance window | High |
| Staleness monitoring | Daily | Scheduled check | High |
| Estimate vs actual audit | Weekly | Query review | Medium |
| Histogram tuning | As needed | Poor estimates detected | Medium |
| Extended statistics review | Quarterly | Correlation issues | Low |
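The cadence in the table can be folded into a single per-table decision routine. A sketch — the thresholds are illustrative, mirroring the staleness tiers used elsewhere on this page:

```python
from datetime import timedelta

def next_action(rows: int, mods: int, stats_age: timedelta) -> str:
    """Pick the most urgent statistics-maintenance action for one table."""
    if rows == 0:
        return "skip: empty table"
    if mods > rows * 0.5:
        return "ANALYZE now: critical drift"
    if mods > rows * 0.1:
        return "ANALYZE in next window: moderate drift"
    if stats_age > timedelta(days=7):
        return "ANALYZE in next window: age-based refresh"
    return "ok: rely on auto-stats"

print(next_action(1_000_000, 600_000, timedelta(days=1)))
# -> ANALYZE now: critical drift
```

Checking modification volume before age keeps the priorities consistent with the table above: drift is an immediate signal, while age alone is only a scheduled refresh.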
Special Considerations for Partitioned Tables:
-- PostgreSQL 14+: children's modification counts propagate to the
-- partitioned parent, so autoanalyze now covers partitioned tables
-- (ANALYZE on the parent still samples the child partitions)
-- Oracle: Incremental statistics for partitions
EXEC DBMS_STATS.SET_TABLE_PREFS(
'SALES', 'ORDERS',
'INCREMENTAL', 'TRUE'
);
-- Only new/modified partitions are scanned; global stats updated incrementally
-- SQL Server: Partition-level statistics
CREATE STATISTICS stat_orders_p202401
ON orders(order_date)
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01'
WITH FULLSCAN;
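The idea behind incremental statistics is that per-partition summaries can be merged into global statistics without rescanning old partitions. A much-simplified sketch of that merge — real engines also merge distinct-value counts via synopses/sketches, which is deliberately omitted here:

```python
def merge_partition_stats(partitions: list[dict]) -> dict:
    """Combine per-partition summaries into global table statistics.
    Row counts add; min/max take the extremes. NDV is NOT simply
    additive -- real systems merge sketches (e.g., Oracle synopses)."""
    return {
        "rows": sum(p["rows"] for p in partitions),
        "min": min(p["min"] for p in partitions),
        "max": max(p["max"] for p in partitions),
    }

# Hypothetical monthly partitions of an orders table
jan = {"rows": 120_000, "min": "2024-01-01", "max": "2024-01-31"}
feb = {"rows": 90_000,  "min": "2024-02-01", "max": "2024-02-29"}
print(merge_partition_stats([jan, feb]))
```

When a new partition arrives, only its own summary must be computed; the global merge is then cheap, which is exactly why incremental mode pays off on large partitioned tables.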
When in doubt, ANALYZE. Statistics collection is vastly cheaper than running suboptimal query plans. A 10-second ANALYZE that prevents a 10-minute query regression pays for itself immediately.
Fresh statistics are generally desirable, but they can cause plan instability—queries suddenly using different plans after statistics refresh. This is both a feature (adapting to new data) and a risk (regression).
The Plan Regression Problem:
Before ANALYZE:
Query uses Hash Join (runs in 2 seconds)
Estimated rows: 1,000
After ANALYZE:
Statistics show actual rows: 500,000
Optimizer now chooses Sort-Merge Join (runs in 30 seconds!)
The 'better' statistics led to a worse plan.
How is this possible?
Causes of Post-Statistics Regression:
- Compensating errors: the old estimates were wrong, but they happened to steer the optimizer to a plan that performed well in practice.
- Cost tipping points: corrected row counts push one plan's estimated cost past another's, flipping join methods or access paths.
- Cost model weaknesses: accurate cardinalities fed into miscalibrated cost settings can still rank plans incorrectly.
- Parameter sniffing: the first execution after the refresh compiles a plan for one parameter value that is poor for others.
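One cause of such flips is a cost tipping point: the optimizer ranks plans by estimated cost, and a corrected row count can push one plan past another. A toy cost model makes this concrete — the operators and constants here are invented for illustration and bear no relation to any real optimizer:

```python
def nested_loop_cost(outer_rows: int) -> float:
    """Toy model: one index probe (cost 5) per outer row."""
    return 5.0 * outer_rows

def hash_join_cost(outer_rows: int, inner_rows: int) -> float:
    """Toy model: build a hash on inner (cost 2/row), probe with outer."""
    return 2.0 * inner_rows + 1.0 * outer_rows

def choose_join(outer_rows: int, inner_rows: int) -> str:
    nl = nested_loop_cost(outer_rows)
    hj = hash_join_cost(outer_rows, inner_rows)
    return "Nested Loop" if nl <= hj else "Hash Join"

# Estimated 1,000 outer rows: nested loop looks cheapest (5,000 vs 21,000)
print(choose_join(1_000, 10_000))    # -> Nested Loop
# Refreshed stats reveal 500,000 rows: the choice flips past the tipping point
print(choose_join(500_000, 10_000))  # -> Hash Join
```

Near the crossover, tiny changes in estimates flip the plan entirely — which is why a statistics refresh can change plans even when the data barely moved.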
Mitigation Strategies:
-- 1. Lock statistics on stable tables
-- PostgreSQL (no direct stats lock; control via autovacuum settings)
-- Caution: this also disables automatic VACUUM for the table (bloat risk)
ALTER TABLE stable_table SET (autovacuum_enabled = false);
-- Oracle: Explicit lock
EXEC DBMS_STATS.LOCK_TABLE_STATS('SALES', 'REFERENCE_DATA');
-- SQL Server: Disable auto-update on specific stats
ALTER INDEX ALL ON stable_table SET (STATISTICS_NORECOMPUTE = ON);
-- 2. Plan guides/baselines to enforce specific plans
-- SQL Server: Plan Guide
sp_create_plan_guide ...
-- Oracle: SQL Plan Baseline
EXEC DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(...);
-- PostgreSQL: Hint extension or pg_plan_guarantee (3rd party)
-- 3. Test statistics before promoting
-- Oracle: Pending statistics
EXEC DBMS_STATS.SET_TABLE_PREFS('SALES', 'ORDERS', 'PUBLISH', 'FALSE');
EXEC DBMS_STATS.GATHER_TABLE_STATS('SALES', 'ORDERS');
-- Test queries with pending stats
ALTER SESSION SET OPTIMIZER_USE_PENDING_STATISTICS = TRUE;
-- If good, publish
EXEC DBMS_STATS.PUBLISH_PENDING_STATS('SALES', 'ORDERS');
Plan regressions after statistics updates are usually symptoms of underlying issues (poor indexes, suboptimal cost model settings, or parameter sniffing). Address the root cause rather than avoiding statistics maintenance.
Proactive monitoring prevents statistics-related performance incidents. Set up automated checks for early warning.
-- Create monitoring view
CREATE OR REPLACE VIEW stats_health AS
SELECT
  schemaname,
  relname,
  schemaname || '.' || relname AS table_name,
  n_live_tup AS rows,
  n_mod_since_analyze AS mods,
  ROUND(100.0 * n_mod_since_analyze / NULLIF(n_live_tup, 0), 1) AS mod_pct,
  GREATEST(last_analyze, last_autoanalyze) AS last_stats,
  NOW() - GREATEST(last_analyze, last_autoanalyze) AS stats_age,
  CASE
    WHEN n_live_tup = 0 THEN 'EMPTY'
    WHEN GREATEST(last_analyze, last_autoanalyze) IS NULL THEN 'NEVER'
    WHEN n_mod_since_analyze > n_live_tup * 0.5 THEN 'CRITICAL'
    WHEN n_mod_since_analyze > n_live_tup * 0.2 THEN 'HIGH'
    WHEN NOW() - GREATEST(last_analyze, last_autoanalyze) > INTERVAL '7 days' THEN 'STALE'
    ELSE 'OK'
  END AS health
FROM pg_stat_user_tables;

-- Alert query (run from monitoring system)
SELECT table_name, rows, mod_pct, health
FROM stats_health
WHERE health IN ('CRITICAL', 'HIGH', 'NEVER')
ORDER BY
  CASE health
    WHEN 'CRITICAL' THEN 1
    WHEN 'HIGH' THEN 2
    WHEN 'NEVER' THEN 3
  END;

-- Auto-fix: analyze critical tables
DO $$
DECLARE
  rec RECORD;
BEGIN
  FOR rec IN
    SELECT schemaname, relname
    FROM stats_health
    WHERE health = 'CRITICAL'
  LOOP
    EXECUTE format('ANALYZE %I.%I', rec.schemaname, rec.relname);
    RAISE NOTICE 'Analyzed critical table: %.%', rec.schemaname, rec.relname;
  END LOOP;
END $$;
Statistics maintenance is the operational backbone of query optimization. Fresh, accurate statistics enable the optimizer to make intelligent decisions; stale or missing statistics lead to performance chaos.
What's next:
Statistics feed into one critical output: cardinality estimation—predicting how many rows flow through each operator in a query plan. The next page explores how the optimizer combines table statistics, column statistics, histograms, and selectivity formulas to produce cardinality estimates, and how errors propagate through query plans.
You now understand how to maintain statistics across major database systems, detect staleness, and implement operational best practices. You can configure auto-stats, schedule manual refreshes, and monitor for issues proactively. Next: the art and science of cardinality estimation.