When you execute a SQL query, the database optimizer faces a profound challenge: choosing the best execution plan among potentially millions of alternatives—without actually running any of them. How can an optimizer predict that a hash join will outperform a nested loop join, or that an index scan is superior to a full table scan, without executing the query?
The answer lies in table statistics—metadata that describes the size, structure, and distribution of data within tables. These statistics form the empirical foundation upon which all cost-based optimization decisions rest. Without accurate statistics, even the most sophisticated optimizer operates in the dark, making decisions based on guesswork rather than data-driven analysis.
By the end of this page, you will understand what table statistics are, why they are indispensable for query optimization, how databases collect and store them, and how the optimizer uses them to estimate costs. You'll also learn the practical implications of stale or missing statistics for query performance.
Table statistics are metadata that summarize the characteristics of a database table. Unlike the actual data stored in tables, statistics are compact, derived information designed to help the query optimizer make informed decisions without scanning every row.
Key insight: The optimizer doesn't need to know every individual value in a table—it needs to understand patterns, distributions, and sizes. Statistics provide exactly this abstracted view.
You might wonder: why not just run SELECT COUNT(*) when we need the row count? For a 100-million-row table, that full scan could take minutes. Query optimization happens for every query, often hundreds of times per second. Pre-computed statistics provide instant answers that would otherwise require expensive full-table operations.
Cost-based optimizers evaluate query plans by estimating their execution cost—typically measured in terms of I/O operations, CPU cycles, and memory usage. Table statistics directly feed into every aspect of these calculations.
The cost estimation pipeline:
| Statistic | Cost Impact | Example Use |
|---|---|---|
| Row Count (n) | Determines base cardinality for all operators | Nested loop join cost: O(n × m) requires accurate n, m |
| Block Count (b) | Directly estimates sequential I/O cost | Table scan cost ≈ b × page_read_cost |
| Average Row Width | Estimates tuples per block and memory needs | Buffer pool sizing for hash joins |
| Table Size | Informs parallel scan partitioning | Dividing 100GB table across 10 workers |
| Fill Factor | Adjusts I/O estimates for sparse tables | Half-full pages double expected I/O |
Example: Table Scan Cost Estimation
Consider estimating the cost of a full table scan on table orders:
```
Table Statistics:
  Row count (n)   = 10,000,000
  Block count (b) = 125,000
  Block size      = 8 KB

Cost Model Parameters:
  Sequential page read cost (seq_page_cost) = 1.0
  CPU tuple processing cost (cpu_tuple_cost) = 0.01

Estimated Scan Cost:
  I/O Cost   = b × seq_page_cost  = 125,000    × 1.0  = 125,000
  CPU Cost   = n × cpu_tuple_cost = 10,000,000 × 0.01 = 100,000
  Total Cost = 225,000
```
Without the row count and block count statistics, the optimizer would have no way to compute this estimate. It would either have to assume default values (often wildly inaccurate) or refuse to optimize entirely.
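To make the arithmetic concrete, here is a minimal Python sketch of the same sequential-scan cost formula. The cost constants mirror PostgreSQL's default seq_page_cost and cpu_tuple_cost settings, but the function itself is an illustration, not any engine's actual planner code.

```python
# Minimal sketch of the sequential-scan cost formula shown above.
# The constants mirror PostgreSQL's default cost parameters; the
# function is illustrative, not any engine's real implementation.

SEQ_PAGE_COST = 1.0    # cost of reading one page sequentially
CPU_TUPLE_COST = 0.01  # cost of processing one row

def seq_scan_cost(row_count: int, block_count: int) -> float:
    """Estimate the cost of a full table scan from table statistics."""
    io_cost = block_count * SEQ_PAGE_COST
    cpu_cost = row_count * CPU_TUPLE_COST
    return io_cost + cpu_cost

# Statistics for the hypothetical `orders` table from the example
print(seq_scan_cost(row_count=10_000_000, block_count=125_000))  # 225000.0
```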
Database systems employ several strategies to collect table statistics, each with different trade-offs between accuracy, overhead, and freshness.
Sampling-Based Collection
Read a random subset of rows and extrapolate statistics for the entire table.
Common Approaches:
- Row-level (Bernoulli) sampling: include each row independently with a fixed probability.
- Block-level sampling: read a random set of pages and use every row on them, trading some accuracy for far less I/O.
- Fixed-size sampling: collect a target number of rows (for example via reservoir sampling) regardless of table size.
Statistical Foundation:
For a random sample of size s drawn from a population of N rows with standard deviation σ, the sample mean x̄ estimates the population mean μ with standard error:
SE = σ / √s
Larger samples reduce error, but even small samples (1-10%) often provide sufficient accuracy for optimization.
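The sketch below illustrates this effect on synthetic data: it estimates the average row width of a hypothetical one-million-row table from progressively larger random samples and prints the standard error next to each estimate. All values are made up for illustration.

```python
# Illustration of the standard-error formula SE = sigma / sqrt(s):
# estimate the average row width of a synthetic 1,000,000-row "table"
# from progressively larger random samples. Purely synthetic data.

import random
import statistics

random.seed(42)
population = [random.gauss(100, 25) for _ in range(1_000_000)]  # row widths in bytes
true_mean = statistics.fmean(population)
sigma = statistics.pstdev(population)

for s in (100, 1_000, 30_000):
    sample = random.sample(population, s)
    estimate = statistics.fmean(sample)
    std_err = sigma / s ** 0.5
    print(f"sample={s:>6}  estimate={estimate:7.2f}  "
          f"true={true_mean:7.2f}  SE={std_err:5.2f}")
```

Even the 30,000-row sample reads only 3% of this table yet lands within a fraction of a byte of the true mean, which is why fixed sampling targets work so well in practice.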
Advantages:
- Low overhead: only a small fraction of the table is read.
- Fast enough to run frequently, so statistics stay reasonably fresh.
- Collection cost stays roughly constant even as tables grow.
Disadvantages:
- Estimates carry sampling error, which matters most for skewed distributions and rare values.
- Block-level sampling can be biased when values are physically clustered.
- Results vary between runs, which can occasionally flip plan choices.
Most production systems use sampling with adaptive sample sizes: larger samples for small tables (where full scans are cheap) and proportionally smaller samples for very large tables (where even 1% provides millions of rows). PostgreSQL's default sampling target is 30,000 rows regardless of table size—enough for robust statistical inference.
Database systems store statistics in system catalogs—special internal tables that the optimizer queries during query planning. Understanding where statistics live helps with diagnostics and tuning.
```sql
-- Table-level statistics in pg_class
SELECT
    relname AS table_name,
    reltuples AS estimated_row_count,
    relpages AS page_count,
    relallvisible AS all_visible_pages
FROM pg_class
WHERE relname = 'orders';

-- Result example:
-- table_name | estimated_row_count | page_count | all_visible_pages
-- -----------+---------------------+------------+-------------------
-- orders     |            10000000 |     125000 |            124500

-- More detailed statistics in pg_statistic
-- (complex structure, usually accessed via pg_stats view)
SELECT
    tablename,
    attname AS column_name,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'orders';
```

Key observations across systems:

- PostgreSQL keeps table-level counts in pg_class and column-level detail in pg_statistic (exposed through the pg_stats view).
- MySQL (InnoDB) persists statistics in mysql.innodb_table_stats and mysql.innodb_index_stats.
- SQL Server exposes statistics metadata through sys.stats and sys.dm_db_stats_properties.
- Oracle stores statistics in the data dictionary, visible through views such as DBA_TAB_STATISTICS and DBA_TAB_COL_STATISTICS.
Notice that row counts in system catalogs are often labeled 'estimated' or 'approximate'. These values come from the last statistics collection and may not reflect current reality. A table with 10M rows in pg_class might actually contain 12M rows if 2M were inserted since the last ANALYZE.
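Some planners partially compensate for this drift without re-analyzing. PostgreSQL, for instance, scales the stored row count by the table's current physical size in pages to get a density-based estimate; the sketch below shows that idea in simplified form, with hypothetical page counts.

```python
# Stale catalog counts can be partially corrected by scaling the old row
# count by how much the table has physically grown. PostgreSQL's planner
# applies a similar density-based adjustment; this is a simplified sketch.

def adjusted_row_estimate(stale_reltuples: float, stale_relpages: int,
                          current_pages: int) -> float:
    """Scale the last-analyzed row count by the table's current page count."""
    if stale_relpages == 0:
        return stale_reltuples
    rows_per_page = stale_reltuples / stale_relpages   # density at last ANALYZE
    return rows_per_page * current_pages

# 10M rows across 125k pages at last ANALYZE; the table has since grown to 150k pages
print(adjusted_row_estimate(10_000_000, 125_000, 150_000))  # 12000000.0
```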
Statistics decay over time as data changes. When statistics become stale—significantly divergent from actual data—the optimizer makes decisions based on outdated information. The consequences can be severe.
Case Study: The Growing Orders Table
Consider an e-commerce system where the orders table was last analyzed when it contained 1 million rows. Since then, a successful marketing campaign drove growth to 50 million rows.
```
Statistics (stale):   n = 1,000,000
Reality:              n = 50,000,000

Query:     SELECT * FROM orders
           WHERE status = 'pending'
             AND order_date > '2024-01-01'

Optimizer's estimate: ~1,000 rows (based on old selectivity)
Actual result:        ~500,000 rows

Optimizer's plan:     Index Nested Loop Join (optimal for 1,000 rows)
Optimal plan:         Hash Join with parallel scan

Performance impact:   Query takes 45 seconds instead of 2 seconds
```
Stale statistics don't degrade performance gradually—they often cause cliff-like drops. A plan that's optimal for estimated 1,000 rows can be catastrophically wrong for actual 500,000 rows. The optimizer has no way to know it's wrong until you analyze the table.
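A toy cost model makes the cliff concrete. The sketch below compares a simplified index nested loop cost against a simplified hash join cost at the estimated cardinality (1,000 rows) and the actual one (500,000 rows). The constants and the inner-table sizes are illustrative only, not taken from any real optimizer.

```python
# Toy cost model showing why a cardinality misestimate flips the plan choice.
# All constants and table sizes are illustrative.

RANDOM_PAGE_COST = 4.0   # one index probe ~ one random page read
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def index_nested_loop_cost(outer_rows: int, probes_per_row: float = 1.0) -> float:
    """Each qualifying outer row triggers a random index probe on the inner table."""
    return outer_rows * probes_per_row * RANDOM_PAGE_COST

def hash_join_cost(outer_rows: int, inner_pages: int, inner_rows: int) -> float:
    """Scan the inner table once to build a hash table, then probe per outer row."""
    build = inner_pages * SEQ_PAGE_COST + inner_rows * CPU_TUPLE_COST
    probe = outer_rows * CPU_TUPLE_COST
    return build + probe

inner_pages, inner_rows = 20_000, 2_000_000   # hypothetical inner table

for outer_rows in (1_000, 500_000):           # estimated vs. actual cardinality
    nl = index_nested_loop_cost(outer_rows)
    hj = hash_join_cost(outer_rows, inner_pages, inner_rows)
    winner = "nested loop" if nl < hj else "hash join"
    print(f"outer_rows={outer_rows:>7}: nested_loop={nl:>12,.0f}  "
          f"hash_join={hj:>10,.0f}  -> {winner}")
```

At 1,000 outer rows the nested loop is roughly ten times cheaper; at 500,000 rows it is roughly forty times more expensive. That reversal is exactly the kind of cliff behind the 45-second query above.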
Every major database system provides explicit commands to collect (or refresh) statistics. Understanding these commands is essential for database administration and performance tuning.
```sql
-- Analyze a specific table
ANALYZE orders;

-- Analyze specific columns only
ANALYZE orders (customer_id, order_date, status);

-- Analyze entire database
ANALYZE;

-- Verbose output showing what was analyzed
ANALYZE VERBOSE orders;

-- Configure sampling detail
-- (default_statistics_target: per-column statistics detail, default 100;
--  ANALYZE samples roughly 300 × target rows)
SET default_statistics_target = 500;  -- More detailed stats

ALTER TABLE orders
    ALTER COLUMN status SET STATISTICS 1000;  -- Higher precision for this column

-- Make autovacuum re-analyze this table after any modifications
-- (trigger threshold becomes 0 + 0 × row_count = 0)
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0);
ALTER TABLE orders SET (autovacuum_analyze_threshold = 0);
```

For most production systems: enable automatic statistics updates, but also schedule comprehensive statistics collection during maintenance windows. Pay special attention to tables with high modification rates—they may need more frequent explicit ANALYZE operations than auto-stats provides.
Modern database systems include automatic statistics maintenance features that reduce the need for manual intervention. Understanding how these systems work helps you configure them appropriately.
| Database | Mechanism | Trigger Condition | Configuration |
|---|---|---|---|
| PostgreSQL | autovacuum daemon | ≥ 50 + 10% of rows modified | autovacuum_analyze_scale_factor, autovacuum_analyze_threshold |
| MySQL (InnoDB) | Background thread | ≥ 10% of rows modified | innodb_stats_auto_recalc, innodb_stats_persistent |
| SQL Server | Auto Update Statistics | ≥ 500 + 20% of rows modified | AUTO_UPDATE_STATISTICS, AUTO_UPDATE_STATISTICS_ASYNC |
| Oracle | Maintenance Window | Stale > 10% (configurable) | DBMS_STATS.SET_GLOBAL_PREFS, maintenance jobs |
Auto-Stats Workflow (PostgreSQL Example):
```
┌─────────────────────────────────────────────────────┐
│                   DML Operations                     │
│              (INSERT, UPDATE, DELETE)                │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│                 Modification Counter                 │
│       pg_stat_user_tables.n_mod_since_analyze        │
└─────────────────────────┬───────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│             Autovacuum Daemon (periodic)             │
│   Checks: n_mod > threshold + (scale_factor × n)?    │
└───────────────┬─────────────────────┬───────────────┘
                │ Yes                 │ No
                ▼                     ▼
┌───────────────────────┐   ┌────────────────────────┐
│  Run ANALYZE          │   │  Continue monitoring   │
│  Update pg_statistic  │   │                        │
└───────────────────────┘   └────────────────────────┘
```
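The trigger test in the diagram reduces to a single comparison. Below is a sketch using PostgreSQL's default autovacuum_analyze_threshold (50) and autovacuum_analyze_scale_factor (0.1); the modification counts are hypothetical.

```python
# Sketch of the trigger test in the diagram above, using PostgreSQL's
# default settings (autovacuum_analyze_threshold = 50,
# autovacuum_analyze_scale_factor = 0.1).

def needs_auto_analyze(n_mod_since_analyze: int, reltuples: float,
                       threshold: int = 50, scale_factor: float = 0.1) -> bool:
    """Return True when modifications exceed threshold + scale_factor * rows."""
    return n_mod_since_analyze > threshold + scale_factor * reltuples

# 10M-row table: auto-analyze fires only after ~1,000,050 modifications
print(needs_auto_analyze(500_000, 10_000_000))    # False -- stats considered fresh
print(needs_auto_analyze(1_200_000, 10_000_000))  # True  -- ANALYZE scheduled
```

With these defaults, a 10-million-row table must accumulate just over a million modifications before auto-analyze fires, which is one reason the bulk-load advice below matters.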
Auto-stats is reactive, not proactive. After a massive data load, the first query runs before statistics update. For bulk ETL operations, always run ANALYZE immediately after loading—don't wait for auto-stats to catch up.
Table statistics form the empirical foundation of query optimization. Without them, cost-based optimization is impossible—the optimizer simply cannot estimate I/O costs, compare join strategies, or allocate resources accurately.
What's next:
Table statistics provide the foundation, but column statistics add crucial detail. The next page explores how databases track per-column properties—distinct values, null counts, ranges, and distributions—to enable precise selectivity estimation for predicates.
You now understand what table statistics are, how they're collected and stored, and why they're essential for query optimization. Next, we'll drill down to column-level statistics and see how they enable filter selectivity estimation.