Every time you execute a SQL query, the database optimizer makes critical decisions in fractions of a second: Should it use an index or scan the entire table? Which join algorithm will be fastest? In what order should tables be processed? These decisions can mean the difference between a query completing in 10 milliseconds or 10 minutes.
But how does the optimizer make these decisions? It relies on database statistics—metadata that describes the shape, distribution, and characteristics of your data. Statistics are the optimizer's eyes into your data, and when those eyes become clouded by stale or missing information, query performance suffers dramatically.
Statistics maintenance is not optional—it is fundamental to database health.
By the end of this page, you will understand the critical role of database statistics in query optimization, master the types of statistics collected by modern databases, know when and how to update statistics, and develop strategies for automated statistics maintenance in production environments.
Database statistics are metadata collections that describe the characteristics of data stored in tables and indexes. The query optimizer uses these statistics to estimate the cost of various execution strategies and select the most efficient plan.
Why statistics matter:
Imagine you have a query that filters on a column status with the condition WHERE status = 'active'. Without statistics, the optimizer cannot know:

- How many rows the table contains
- How many distinct status values exist
- What fraction of rows have status = 'active'
- Whether an index on status would be selective enough to be worth using
With accurate statistics, the optimizer knows exactly what to expect and can make intelligent decisions. With stale or missing statistics, it essentially guesses—often badly.
| Statistic Type | What It Measures | Optimizer Use Case |
|---|---|---|
| Row Count (Cardinality) | Total number of rows in a table | Estimating result set sizes for joins and filters |
| Column Cardinality | Number of distinct values in a column | Evaluating selectivity of equality predicates |
| Null Count | Number of NULL values in a column | Estimating rows returned by IS NULL/IS NOT NULL |
| Average Column Width | Average byte size of column values | Memory allocation and I/O cost estimation |
| Value Distribution (Histograms) | Frequency distribution of column values | Handling non-uniform data distributions |
| Correlation | Physical ordering relative to index ordering | Deciding between index scan and bitmap scan |
| Index Statistics | Depth, leaf pages, clustering factor | Index selection and cost estimation |
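Most of these statistics are directly visible in system catalogs. As a brief sketch in PostgreSQL (the orders table name is illustrative), table-level and column-level statistics can be inspected like this:

```sql
-- Table-level statistics: estimated row count and page count
SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE relname = 'orders';

-- Column-level statistics: distinct values, null fraction, width, correlation
SELECT attname, n_distinct, null_frac, avg_width, correlation
FROM pg_stats
WHERE tablename = 'orders';
```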
Don't confuse statistics with constraints. Constraints (UNIQUE, NOT NULL, FOREIGN KEY) enforce data integrity rules. Statistics describe data as it exists. A UNIQUE constraint guarantees uniqueness; column cardinality statistics measure observed uniqueness. Both inform optimization, but they serve different purposes.
The query optimizer uses statistics to perform selectivity estimation—predicting how many rows will result from each operation in a query plan. This estimation cascades through the entire plan, affecting every decision the optimizer makes.
The selectivity chain:
```sql
-- Consider this query joining orders with customers
SELECT c.name, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
  AND o.order_date >= '2024-01-01'
  AND o.status = 'completed';

-- Statistics provide these estimates:
-- customers table: 1,000,000 rows
-- country column: 200 distinct values
-- 'USA' appears in ~40% of rows (histogram data)
-- → Estimated 400,000 rows after country filter

-- orders table: 50,000,000 rows
-- order_date: 10% of orders are from 2024
-- status: 5 distinct values, 'completed' = 60%
-- → Estimated 3,000,000 rows after date + status filter

-- Join estimate:
-- With avg 50 orders per customer
-- Final estimate: ~1,200,000 result rows

-- Without statistics:
-- Optimizer might assume 10% selectivity for each predicate
-- Leading to vastly different (and wrong) estimates
```

What happens with stale statistics:
When statistics don't reflect current data reality, the optimizer makes poor decisions:
Underestimation: Optimizer expects 1,000 rows but gets 1,000,000. Hash tables overflow to disk, memory is exhausted, query runs 100x slower than expected.
Overestimation: Optimizer expects 1,000,000 rows but gets 1,000. Allocates unnecessary resources, may choose suboptimal algorithms suited for larger data.
Wrong join order: Table sizes have changed, but statistics reflect old data. Optimizer joins in an order that explodes intermediate results.
Index neglect: Index would be perfect, but statistics suggest it's not selective enough. Full table scan ensues.
In production systems, stale statistics commonly cause queries to run 10x-1000x slower than optimal. A query that should complete in 100ms may take 10 minutes. This isn't theoretical—it's one of the most common performance problems in database administration.
Simple statistics like cardinality and null counts aren't sufficient for columns with non-uniform data distributions. Consider a country column where 40% of customers are from the USA, 15% from UK, 10% from Germany, and the remaining 35% spread across 197 other countries.
A simple "200 distinct values" statistic doesn't capture this skew. The optimizer would estimate 0.5% selectivity for any country (1/200), but 'USA' actually returns 40%—an 80x underestimate.
Histograms solve this problem by capturing the actual distribution of values.
| Histogram Type | Bucket Formation | Best For | Database Support |
|---|---|---|---|
| Equi-width | Equal value ranges per bucket | Uniformly distributed numerical data | MySQL, older systems |
| Equi-height (Equi-depth) | Equal row counts per bucket | Skewed distributions, most general purpose | PostgreSQL, Oracle, SQL Server |
| Frequency | One bucket per distinct value | Low-cardinality columns (< 256 values) | PostgreSQL, Oracle |
| Top-N + Frequency | Frequent values tracked individually | High skew with popular values | Oracle |
| Hybrid | Combination of above methods | Complex real-world distributions | SQL Server, Oracle |
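Some systems only build histograms on request. For example, MySQL 8.0 uses an explicit ANALYZE TABLE variant (the bucket count below is arbitrary):

```sql
-- MySQL 8.0: build (or rebuild) a histogram on a skewed column
ANALYZE TABLE customers UPDATE HISTOGRAM ON country WITH 64 BUCKETS;

-- Remove a histogram that is no longer useful
ANALYZE TABLE customers DROP HISTOGRAM ON country;
```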
```sql
-- View histogram statistics in PostgreSQL
SELECT
    attname AS column_name,
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';

-- Example output:
-- column_name: status
-- n_distinct: 5
-- most_common_vals: {completed,pending,shipped,cancelled,refunded}
-- most_common_freqs: {0.45,0.25,0.15,0.10,0.05}
-- histogram_bounds: NULL (frequency histogram used instead)

-- For high-cardinality columns, histogram_bounds shows bucket boundaries.
-- (pg_stats exposes it as anyarray, so cast before subscripting.)
SELECT
    attname,
    array_length(histogram_bounds, 1) AS num_buckets,
    (histogram_bounds::text::text[])[1:5] AS first_5_bounds,
    (histogram_bounds::text::text[])[96:100] AS last_5_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'order_date';
```

More histogram buckets mean higher accuracy but require more storage and collection time. PostgreSQL defaults to 100 buckets; SQL Server uses up to 200 steps. For extremely critical columns, you can increase resolution—but more buckets means more maintenance overhead.
Collecting statistics requires reading data from tables and indexes. For large tables, this can be expensive. Database systems offer multiple collection strategies that trade accuracy for performance.
```sql
-- PostgreSQL: ANALYZE command collects statistics

-- Analyze entire database (samples by default)
ANALYZE;

-- Analyze specific table
ANALYZE orders;

-- Analyze specific columns
ANALYZE orders (customer_id, order_date, status);

-- Verbose mode shows what was collected
ANALYZE VERBOSE orders;
-- Output includes: pages sampled, rows sampled,
-- which columns got histograms, etc.

-- Control sampling target (number of histogram buckets)
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;  -- More buckets = higher accuracy

-- Then re-analyze to apply
ANALYZE orders (customer_id);

-- Check when statistics were last updated
SELECT
    schemaname,
    relname AS table_name,
    last_analyze,
    last_autoanalyze,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_analyze DESC NULLS LAST;
```

For tables under 1 million rows, full scans are often acceptable. For larger tables, start with 10-30% sampling. Monitor query plan changes—if critical queries regress, increase sampling or run full scans on affected tables. The goal is statistics accurate enough to prevent plan regressions, not perfect accuracy.
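The same trade-off exists in other engines. For instance, SQL Server lets you choose the sample size when statistics are refreshed (the schema and table names here are illustrative):

```sql
-- SQL Server: sampled vs full-scan statistics updates
UPDATE STATISTICS dbo.orders WITH SAMPLE 25 PERCENT;  -- cheaper, approximate
UPDATE STATISTICS dbo.orders WITH FULLSCAN;           -- reads every row, most accurate
```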
Modern databases include auto-statistics features that detect stale statistics and refresh them automatically. Understanding these mechanisms—and their limitations—is essential for production database administration.
| Database | Feature Name | Trigger Threshold | Default Behavior |
|---|---|---|---|
| PostgreSQL | autovacuum (autoanalyze) | 50 + 10% of table rows changed | Enabled by default, runs in background |
| SQL Server | AUTO_UPDATE_STATISTICS | 500 + 20% of rows changed (older) / Adaptive (2016+) | Enabled by default, synchronous by default |
| MySQL (InnoDB) | innodb_stats_auto_recalc | 10% of rows changed | Enabled by default |
| Oracle | Auto Statistics Jobs | 10% staleness threshold | Runs nightly in maintenance window by default |
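As a hedged example of inspecting and enabling these settings in two of the systems above (the SalesDB database name is illustrative):

```sql
-- SQL Server: confirm and enable automatic statistics updates
SELECT name, is_auto_update_stats_on, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'SalesDB';

ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE SalesDB SET AUTO_UPDATE_STATISTICS_ASYNC ON;  -- refresh without blocking the triggering query

-- MySQL (InnoDB): confirm automatic recalculation is enabled
SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';
```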
```sql
-- PostgreSQL autovacuum handles automatic ANALYZE

-- View current autovacuum settings
SHOW autovacuum;
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;

-- Default: analyze after 50 + 10% of rows changed
-- For 1M row table: triggers after 100,050 row changes

-- Customize per-table for high-churn tables
ALTER TABLE orders SET (
    autovacuum_analyze_threshold = 100,
    autovacuum_analyze_scale_factor = 0.02  -- 2% instead of 10%
);
-- Now analyzes after 100 + 2% = 20,100 changes for 1M rows

-- For append-only tables, can be more aggressive
ALTER TABLE event_log SET (
    autovacuum_analyze_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.001  -- 0.1%
);

-- Monitor autovacuum activity
SELECT
    relname,
    n_tup_ins AS inserts,
    n_tup_upd AS updates,
    n_tup_del AS deletes,
    n_mod_since_analyze AS changes_since_analyze,
    last_autoanalyze,
    autoanalyze_count
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY n_mod_since_analyze DESC;
```

Auto-statistics can lag behind data changes, especially during bulk loads or partition operations. Large tables may take hours to analyze after the threshold is reached. Async updates mean queries may use stale statistics. Critical systems often require supplementary manual statistics maintenance.
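A common supplement is an event-driven refresh: run ANALYZE explicitly right after a bulk load or partition swap instead of waiting for autovacuum to notice. A minimal sketch (the table name and file path are illustrative):

```sql
-- Bulk load followed by an immediate, explicit ANALYZE
COPY orders_2024 FROM '/data/orders_2024.csv' WITH (FORMAT csv, HEADER true);
ANALYZE orders_2024;  -- make the new rows visible to the planner right away
```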
Relying solely on automatic statistics is insufficient for production databases with performance SLAs. A comprehensive statistics maintenance strategy combines automatic updates, scheduled maintenance, and event-driven updates.
```sql
-- Comprehensive PostgreSQL Statistics Maintenance

-- 1. Find tables needing analysis
WITH analysis_candidates AS (
    SELECT
        schemaname,
        relname,
        n_live_tup,
        n_dead_tup,
        n_mod_since_analyze,
        last_analyze,
        last_autoanalyze,
        GREATEST(last_analyze, last_autoanalyze) AS last_stats,
        CASE
            WHEN n_live_tup = 0 THEN 0
            ELSE ROUND(100.0 * n_mod_since_analyze / n_live_tup, 2)
        END AS pct_changed
    FROM pg_stat_user_tables
)
SELECT *
FROM analysis_candidates
WHERE pct_changed > 5
   OR last_stats < NOW() - INTERVAL '7 days'
   OR last_stats IS NULL
ORDER BY n_mod_since_analyze DESC;

-- 2. Analyze only tables that need it (dynamic SQL in function)
CREATE OR REPLACE FUNCTION maintain_statistics(
    threshold_pct NUMERIC DEFAULT 5.0,
    max_age_days INTEGER DEFAULT 7
) RETURNS TABLE(analyzed_table TEXT, reason TEXT) AS $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN
        SELECT
            schemaname,
            relname,
            n_live_tup,
            n_mod_since_analyze,
            GREATEST(last_analyze, last_autoanalyze) AS last_stats
        FROM pg_stat_user_tables
        WHERE (n_live_tup > 0
               AND 100.0 * n_mod_since_analyze / n_live_tup > threshold_pct)
           OR GREATEST(last_analyze, last_autoanalyze) < NOW() - (max_age_days || ' days')::INTERVAL
           OR (last_analyze IS NULL AND last_autoanalyze IS NULL)
    LOOP
        EXECUTE format('ANALYZE %I.%I', rec.schemaname, rec.relname);
        analyzed_table := rec.schemaname || '.' || rec.relname;
        reason := CASE
            WHEN rec.last_stats IS NULL THEN 'never analyzed'
            WHEN rec.n_live_tup > 0  -- guard against division by zero on empty tables
                 AND 100.0 * rec.n_mod_since_analyze / rec.n_live_tup > threshold_pct
                THEN 'high modification count'
            ELSE 'age exceeded threshold'
        END;
        RETURN NEXT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run maintenance
SELECT * FROM maintain_statistics(5.0, 7);
```
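To run this routine on a schedule, one option is the pg_cron extension. A minimal sketch, assuming pg_cron is installed (the job name and 02:00 schedule are arbitrary):

```sql
-- Nightly statistics maintenance at 02:00 server time
SELECT cron.schedule('nightly-stats-maintenance', '0 2 * * *',
                     'SELECT * FROM maintain_statistics(5.0, 7)');

-- Review or remove the job later
SELECT jobid, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-stats-maintenance');
```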
Effective statistics management requires ongoing monitoring. You need to detect stale statistics before they cause performance problems and diagnose statistics-related issues when queries regress.

```sql
-- Diagnosing statistics issues in PostgreSQL

-- 1. Compare estimated vs actual rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE status = 'pending'
  AND order_date > '2024-01-01';

-- Look for a mismatch between planned and actual rows, e.g.:
-- "Seq Scan on orders (cost=... rows=5000 ...) (actual time=... rows=500000 loops=1)"
-- "Rows Removed by Filter: 1000000"
-- planned rows=5000 vs actual rows=500000 ← 100x underestimate!

-- 2. Check if column has statistics
SELECT
    attname,
    n_distinct,
    null_frac,
    avg_width,
    most_common_vals IS NOT NULL AS has_mcv,
    histogram_bounds IS NOT NULL AS has_histogram
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'order_date', 'customer_id');

-- 3. Verify histogram accuracy
-- (cast needed because pg_stats exposes histogram_bounds as anyarray)
SELECT
    attname,
    array_length(histogram_bounds, 1) AS bucket_count,
    (histogram_bounds::text::text[])[1] AS min_value,
    (histogram_bounds::text::text[])[array_length(histogram_bounds, 1)] AS max_value
FROM pg_stats
WHERE tablename = 'orders'
  AND histogram_bounds IS NOT NULL;

-- 4. Check MCV accuracy for specific value
SELECT
    attname,
    most_common_vals,
    most_common_freqs,
    (SELECT COUNT(*)::float / (SELECT COUNT(*) FROM orders)
     FROM orders WHERE status = 'pending') AS actual_freq
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';

-- 5. Identify correlation issues (index efficiency)
SELECT
    attname,
    correlation  -- -1 to 1: how sorted the data is relative to physical order
FROM pg_stats
WHERE tablename = 'orders';
-- Low correlation on an indexed column = index less effective
```

Statistics maintenance is the invisible foundation of database performance. Without accurate statistics, even the most powerful optimizer becomes blind, guessing at data distributions and making suboptimal decisions that can slow queries by orders of magnitude.
What's next:
Statistics tell the optimizer about data distribution. But the physical organization of data matters too. In the next page, we'll explore Index Maintenance—how to keep indexes efficient, when to rebuild or reorganize them, and strategies for managing index bloat in production systems.
You now understand the critical role of database statistics in query optimization, the types of statistics collected, collection methods, and maintenance strategies. This knowledge is essential for maintaining high-performance production databases.