When you execute a SQL query, the database optimizer faces a profound challenge: choosing the best execution plan among potentially millions of alternatives—without actually running any of them. How can an optimizer predict that a hash join will outperform a nested loop join, or that an index scan is superior to a full table scan, without executing the query?
The answer lies in table statistics—metadata that describes the size, structure, and distribution of data within tables. These statistics form the empirical foundation upon which all cost-based optimization decisions rest. Without accurate statistics, even the most sophisticated optimizer operates in the dark, making decisions based on guesswork rather than data-driven analysis.
By the end of this page, you will understand what table statistics are, why they are indispensable for query optimization, how databases collect and store them, and how the optimizer uses them to estimate costs. You'll also learn the practical implications of stale or missing statistics for query performance.
Table statistics are metadata that summarize the characteristics of a database table. Unlike the actual data stored in tables, statistics are compact, derived information designed to help the query optimizer make informed decisions without scanning every row.
Key insight: The optimizer doesn't need to know every individual value in a table—it needs to understand patterns, distributions, and sizes. Statistics provide exactly this abstracted view.
You might wonder: why not just run SELECT COUNT(*) when we need the row count? For a 100-million-row table, that full scan could take minutes. Query optimization happens for every query, often hundreds of times per second. Pre-computed statistics provide instant answers that would otherwise require expensive full-table operations.
Cost-based optimizers evaluate query plans by estimating their execution cost—typically measured in terms of I/O operations, CPU cycles, and memory usage. Table statistics directly feed into every aspect of these calculations.
The cost estimation pipeline:
| Statistic | Cost Impact | Example Use |
|---|---|---|
| Row Count (n) | Determines base cardinality for all operators | Nested loop join cost: O(n × m) requires accurate n, m |
| Block Count (b) | Directly estimates sequential I/O cost | Table scan cost ≈ b × page_read_cost |
| Average Row Width | Estimates tuples per block and memory needs | Buffer pool sizing for hash joins |
| Table Size | Informs parallel scan partitioning | Dividing 100GB table across 10 workers |
| Fill Factor | Adjusts I/O estimates for sparse tables | Half-full pages double expected I/O |
Example: Table Scan Cost Estimation
Consider estimating the cost of a full table scan on table orders:
```
Table Statistics:
  Row count (n)   = 10,000,000
  Block count (b) = 125,000
  Block size      = 8 KB

Cost Model Parameters:
  Sequential page read cost (seq_page_cost) = 1.0
  CPU tuple processing cost (cpu_tuple_cost) = 0.01

Estimated Scan Cost:
  I/O Cost   = b × seq_page_cost  = 125,000    × 1.0  = 125,000
  CPU Cost   = n × cpu_tuple_cost = 10,000,000 × 0.01 = 100,000
  Total Cost = 225,000
```
Without the row count and block count statistics, the optimizer would have no way to compute this estimate. It would either have to assume default values (often wildly inaccurate) or refuse to optimize entirely.
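To make the arithmetic concrete, here is a minimal Python sketch of the same sequential-scan cost formula. The cost constants mirror PostgreSQL's default seq_page_cost and cpu_tuple_cost settings, but the function itself is an illustration, not any engine's actual planner code.

```python
# Minimal sketch of the sequential-scan cost formula shown above.
# The constants mirror PostgreSQL's default cost parameters; the
# function is illustrative, not any engine's real implementation.

SEQ_PAGE_COST = 1.0    # cost of reading one page sequentially
CPU_TUPLE_COST = 0.01  # cost of processing one row

def seq_scan_cost(row_count: int, block_count: int) -> float:
    """Estimate the cost of a full table scan from table statistics."""
    io_cost = block_count * SEQ_PAGE_COST
    cpu_cost = row_count * CPU_TUPLE_COST
    return io_cost + cpu_cost

# Statistics for the hypothetical `orders` table from the example
print(seq_scan_cost(row_count=10_000_000, block_count=125_000))  # 225000.0
```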
Database systems employ several strategies to collect table statistics, each with different trade-offs between accuracy, overhead, and freshness.
Sampling-Based Collection
Read a random subset of rows and extrapolate statistics for the entire table.
Common Approaches:
- Row-level (Bernoulli) sampling: include each row independently with a fixed probability.
- Block-level sampling: read a random set of pages and use every row on them, trading some accuracy for far less I/O.
- Fixed-size sampling: collect a target number of rows (for example via reservoir sampling) regardless of table size.
Statistical Foundation:
For a random sample of size s drawn from a population of N rows with standard deviation σ, the sample mean x̄ estimates the population mean μ with standard error:
SE = σ / √s
Larger samples reduce error, but even small samples (1-10%) often provide sufficient accuracy for optimization.
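The sketch below illustrates this effect on synthetic data: it estimates the average row width of a hypothetical one-million-row table from progressively larger random samples and prints the standard error next to each estimate. All values are made up for illustration.

```python
# Illustration of the standard-error formula SE = sigma / sqrt(s):
# estimate the average row width of a synthetic 1,000,000-row "table"
# from progressively larger random samples. Purely synthetic data.

import random
import statistics

random.seed(42)
population = [random.gauss(100, 25) for _ in range(1_000_000)]  # row widths in bytes
true_mean = statistics.fmean(population)
sigma = statistics.pstdev(population)

for s in (100, 1_000, 30_000):
    sample = random.sample(population, s)
    estimate = statistics.fmean(sample)
    std_err = sigma / s ** 0.5
    print(f"sample={s:>6}  estimate={estimate:7.2f}  "
          f"true={true_mean:7.2f}  SE={std_err:5.2f}")
```

Even the 30,000-row sample reads only 3% of this table yet lands within a fraction of a byte of the true mean, which is why fixed sampling targets work so well in practice.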
Advantages:
- Low overhead: only a small fraction of the table is read.
- Fast enough to run frequently, so statistics stay reasonably fresh.
- Collection cost stays roughly constant even as tables grow.
Disadvantages:
- Estimates carry sampling error, which matters most for skewed distributions and rare values.
- Block-level sampling can be biased when values are physically clustered.
- Results vary between runs, which can occasionally flip plan choices.
Most production systems use sampling with adaptive sample sizes: larger samples for small tables (where full scans are cheap) and proportionally smaller samples for very large tables (where even 1% provides millions of rows). PostgreSQL's default sampling target is 30,000 rows regardless of table size—enough for robust statistical inference.
Database systems store statistics in system catalogs—special internal tables that the optimizer queries during query planning. Understanding where statistics live helps with diagnostics and tuning.
```sql
-- Table-level statistics in pg_class
SELECT
    relname AS table_name,
    reltuples AS estimated_row_count,
    relpages AS page_count,
    relallvisible AS all_visible_pages
FROM pg_class
WHERE relname = 'orders';

-- Result example:
-- table_name | estimated_row_count | page_count | all_visible_pages
-- -----------+---------------------+------------+-------------------
-- orders     |            10000000 |     125000 |            124500

-- More detailed statistics in pg_statistic
-- (complex structure, usually accessed via pg_stats view)
SELECT
    tablename,
    attname AS column_name,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'orders';
```

Key observations across systems:

- PostgreSQL keeps table-level counts in pg_class and column-level detail in pg_statistic (exposed through the pg_stats view).
- MySQL (InnoDB) persists statistics in mysql.innodb_table_stats and mysql.innodb_index_stats.
- SQL Server exposes statistics metadata through sys.stats and sys.dm_db_stats_properties.
- Oracle stores statistics in the data dictionary, visible through views such as DBA_TAB_STATISTICS and DBA_TAB_COL_STATISTICS.
Notice that row counts in system catalogs are often labeled 'estimated' or 'approximate'. These values come from the last statistics collection and may not reflect current reality. A table with 10M rows in pg_class might actually contain 12M rows if 2M were inserted since the last ANALYZE.
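Some planners partially compensate for this drift without re-analyzing. PostgreSQL, for instance, scales the stored row count by the table's current physical size in pages to get a density-based estimate; the sketch below shows that idea in simplified form, with hypothetical page counts.

```python
# Stale catalog counts can be partially corrected by scaling the old row
# count by how much the table has physically grown. PostgreSQL's planner
# applies a similar density-based adjustment; this is a simplified sketch.

def adjusted_row_estimate(stale_reltuples: float, stale_relpages: int,
                          current_pages: int) -> float:
    """Scale the last-analyzed row count by the table's current page count."""
    if stale_relpages == 0:
        return stale_reltuples
    rows_per_page = stale_reltuples / stale_relpages   # density at last ANALYZE
    return rows_per_page * current_pages

# 10M rows across 125k pages at last ANALYZE; the table has since grown to 150k pages
print(adjusted_row_estimate(10_000_000, 125_000, 150_000))  # 12000000.0
```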
Statistics decay over time as data changes. When statistics become stale—significantly divergent from actual data—the optimizer makes decisions based on outdated information. The consequences can be severe.
Case Study: The Growing Orders Table
Consider an e-commerce system where the orders table was last analyzed when it contained 1 million rows. Since then, a successful marketing campaign drove growth to 50 million rows.
```
Statistics (stale):   n = 1,000,000
Reality:              n = 50,000,000

Query:     SELECT * FROM orders
           WHERE status = 'pending'
             AND order_date > '2024-01-01'

Optimizer's estimate: ~1,000 rows (based on old selectivity)
Actual result:        ~500,000 rows

Optimizer's plan:     Index Nested Loop Join (optimal for 1,000 rows)
Optimal plan:         Hash Join with parallel scan

Performance impact:   Query takes 45 seconds instead of 2 seconds
```
Stale statistics don't degrade performance gradually—they often cause cliff-like drops. A plan that's optimal for estimated 1,000 rows can be catastrophically wrong for actual 500,000 rows. The optimizer has no way to know it's wrong until you analyze the table.
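A toy cost model makes the cliff concrete. The sketch below compares a simplified index nested loop cost against a simplified hash join cost at the estimated cardinality (1,000 rows) and the actual one (500,000 rows). The constants and the inner-table sizes are illustrative only, not taken from any real optimizer.

```python
# Toy cost model showing why a cardinality misestimate flips the plan choice.
# All constants and table sizes are illustrative.

RANDOM_PAGE_COST = 4.0   # one index probe ~ one random page read
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def index_nested_loop_cost(outer_rows: int, probes_per_row: float = 1.0) -> float:
    """Each qualifying outer row triggers a random index probe on the inner table."""
    return outer_rows * probes_per_row * RANDOM_PAGE_COST

def hash_join_cost(outer_rows: int, inner_pages: int, inner_rows: int) -> float:
    """Scan the inner table once to build a hash table, then probe per outer row."""
    build = inner_pages * SEQ_PAGE_COST + inner_rows * CPU_TUPLE_COST
    probe = outer_rows * CPU_TUPLE_COST
    return build + probe

inner_pages, inner_rows = 20_000, 2_000_000   # hypothetical inner table

for outer_rows in (1_000, 500_000):           # estimated vs. actual cardinality
    nl = index_nested_loop_cost(outer_rows)
    hj = hash_join_cost(outer_rows, inner_pages, inner_rows)
    winner = "nested loop" if nl < hj else "hash join"
    print(f"outer_rows={outer_rows:>7}: nested_loop={nl:>12,.0f}  "
          f"hash_join={hj:>10,.0f}  -> {winner}")
```

At 1,000 outer rows the nested loop is roughly ten times cheaper; at 500,000 rows it is roughly forty times more expensive. That reversal is exactly the kind of cliff behind the 45-second query above.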
Every major database system provides explicit commands to collect (or refresh) statistics. Understanding these commands is essential for database administration and performance tuning.
```sql
-- Analyze a specific table
ANALYZE orders;

-- Analyze specific columns only
ANALYZE orders (customer_id, order_date, status);

-- Analyze entire database
ANALYZE;

-- Verbose output showing what was analyzed
ANALYZE VERBOSE orders;

-- Configure sampling detail
-- (default_statistics_target: per-column statistics detail, default 100;
--  ANALYZE samples roughly 300 × target rows)
SET default_statistics_target = 500;  -- More detailed stats

ALTER TABLE orders
    ALTER COLUMN status SET STATISTICS 1000;  -- Higher precision for this column

-- Make autovacuum re-analyze this table after any modifications
-- (trigger threshold becomes 0 + 0 × row_count = 0)
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0);
ALTER TABLE orders SET (autovacuum_analyze_threshold = 0);
```

For most production systems: enable automatic statistics updates, but also schedule comprehensive statistics collection during maintenance windows. Pay special attention to tables with high modification rates—they may need more frequent explicit ANALYZE operations than auto-stats provides.
Modern database systems include automatic statistics maintenance features that reduce the need for manual intervention. Understanding how these systems work helps you configure them appropriately.
| Database | Mechanism | Trigger Condition | Configuration |
|---|---|---|---|
| PostgreSQL | autovacuum daemon | ≥ 50 + 10% of rows modified | autovacuum_analyze_scale_factor, autovacuum_analyze_threshold |
| MySQL (InnoDB) | Background thread | ≥ 10% of rows modified | innodb_stats_auto_recalc, innodb_stats_persistent |
| SQL Server | Auto Update Statistics | ≥ 500 + 20% of rows modified | AUTO_UPDATE_STATISTICS, AUTO_UPDATE_STATISTICS_ASYNC |
| Oracle | Maintenance Window | Stale > 10% (configurable) | DBMS_STATS.SET_GLOBAL_PREFS, maintenance jobs |
Auto-Stats Workflow (PostgreSQL Example):
```
┌─────────────────────────────────────────────────────┐
│                   DML Operations                     │
│              (INSERT, UPDATE, DELETE)                │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│                 Modification Counter                 │
│       pg_stat_user_tables.n_mod_since_analyze        │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│             Autovacuum Daemon (periodic)             │
│   Checks: n_mod > threshold + (scale_factor × n)?    │
└───────────────┬─────────────────────┬───────────────┘
                │ Yes                 │ No
                ▼                     ▼
┌───────────────────────┐   ┌────────────────────────┐
│  Run ANALYZE          │   │  Continue monitoring   │
│  Update pg_statistic  │   │                        │
└───────────────────────┘   └────────────────────────┘
```
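The trigger test in the diagram reduces to a single comparison. Below is a sketch using PostgreSQL's default autovacuum_analyze_threshold (50) and autovacuum_analyze_scale_factor (0.1); the modification counts are hypothetical.

```python
# Sketch of the trigger test in the diagram above, using PostgreSQL's
# default settings (autovacuum_analyze_threshold = 50,
# autovacuum_analyze_scale_factor = 0.1).

def needs_auto_analyze(n_mod_since_analyze: int, reltuples: float,
                       threshold: int = 50, scale_factor: float = 0.1) -> bool:
    """Return True when modifications exceed threshold + scale_factor * rows."""
    return n_mod_since_analyze > threshold + scale_factor * reltuples

# 10M-row table: auto-analyze fires only after ~1,000,050 modifications
print(needs_auto_analyze(500_000, 10_000_000))    # False -- stats considered fresh
print(needs_auto_analyze(1_200_000, 10_000_000))  # True  -- ANALYZE scheduled
```

With these defaults, a 10-million-row table must accumulate just over a million modifications before auto-analyze fires, which is one reason the bulk-load advice below matters.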
Auto-stats is reactive, not proactive. After a massive data load, the first query runs before statistics update. For bulk ETL operations, always run ANALYZE immediately after loading—don't wait for auto-stats to catch up.
Table statistics form the empirical foundation of query optimization. Without them, cost-based optimization is impossible—the optimizer simply cannot estimate I/O costs, compare join strategies, or allocate resources accurately.
What's next:
Table statistics provide the foundation, but column statistics add crucial detail. The next page explores how databases track per-column properties—distinct values, null counts, ranges, and distributions—to enable precise selectivity estimation for predicates.
You now understand what table statistics are, how they're collected and stored, and why they're essential for query optimization. Next, we'll drill down to column-level statistics and see how they enable filter selectivity estimation.