When a query optimizer estimates that a filter will match 10,000 rows, where does this number come from? When it predicts a join will produce 500,000 result tuples, what information supports this prediction? The answer lies in database statistics—meta-information about the data that enables intelligent query planning without examining every row.
Statistics are the optimizer's memory of what the data looks like. They capture table sizes, column distributions, frequent values, and correlation patterns—all distilled into compact summaries that guide billions of daily decisions about query execution.
The quality of statistics directly determines optimization quality. With accurate, current statistics, the optimizer makes brilliant choices. With stale or missing statistics, it makes blind guesses that can lead to performance disasters. Understanding statistics is essential for anyone who manages, tunes, or develops database systems.
By the end of this page, you will understand the types of statistics databases maintain (table, column, index), how histograms capture data distribution and enable accurate selectivity estimation, extended statistics for correlated columns, when and how statistics are collected and updated, and how to diagnose and resolve statistics-related performance problems.
The most fundamental statistics describe tables as a whole. These provide the baseline for all cost and cardinality calculations.
| Statistic | Description | Used For |
|---|---|---|
| Row Count (n_tuples) | Number of live rows in the table | Base cardinality for all estimates |
| Page Count (n_pages) | Number of disk pages the table occupies | I/O cost estimation for scans |
| Dead Tuple Count | Number of deleted/updated but not vacuumed rows | VACUUM scheduling, bloat assessment |
| Last Vacuum Time | When table was last vacuumed | Maintenance monitoring |
| Last Analyze Time | When statistics were last gathered | Statistics freshness assessment |
| Modifications Since Analyze | Rows inserted/updated/deleted since last analyze | AutoAnalyze triggering |
```sql
-- PostgreSQL: Viewing table-level statistics
-- ================================================================

-- Core table statistics from pg_stat_user_tables
SELECT
    schemaname,
    relname AS table_name,
    n_live_tup AS row_count,
    n_dead_tup AS dead_rows,
    n_mod_since_analyze AS modifications_pending,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Example output:
-- table_name | row_count  | dead_rows | modifications_pending | last_analyze
-- orders     | 10,234,567 | 45,678    | 123,456               | 2024-03-15 03:45:12

-- Physical storage statistics from pg_class
SELECT
    relname AS table_name,
    reltuples::bigint AS estimated_rows,
    relpages AS page_count,
    pg_size_pretty(pg_relation_size(oid)) AS table_size
FROM pg_class
WHERE relname = 'orders';

-- Example output:
-- table_name | estimated_rows | page_count | table_size
-- orders     | 10,234,567     | 856,214    | 6.5 GB

-- Understanding the relationship:
-- 856,214 pages × 8KB/page ≈ 6.5 GB
-- 10,234,567 rows / 856,214 pages = ~12 rows per page
-- Average row size: 8KB / 12 = ~680 bytes
```

Row Count Accuracy:
The stored row count (reltuples) is an estimate, not an exact count. It's updated by ANALYZE operations and may drift between analyses. For a table with 10 million rows that sees 500K inserts daily, the row count could be off by 5% within a day of the last ANALYZE.
This matters because every cardinality estimate starts with:
Estimated result rows = Table row count × Selectivity
If the row count is wrong, all downstream estimates inherit that error.
For critical capacity planning, pg_stat_user_tables.n_live_tup provides a more current row count than pg_class.reltuples: the statistics collector adjusts it on every INSERT and DELETE (an UPDATE leaves the live count unchanged). For exact counts, SELECT COUNT(*) remains necessary, but it requires a full table scan.
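How a row-count error propagates can be sketched in a few lines of Python. This is an illustration of the formula above with made-up numbers, not PostgreSQL internals:

```python
def estimate_rows(table_row_count, selectivity):
    """Core cardinality formula: estimated rows = row count x selectivity."""
    return round(table_row_count * selectivity)

# Stored statistics say 10M rows, but 500K rows arrived since the last ANALYZE
stale_count = 10_000_000
true_count = 10_500_000
selectivity = 0.02  # a filter the optimizer believes matches 2% of rows

stale_estimate = estimate_rows(stale_count, selectivity)  # 200000
fresh_estimate = estimate_rows(true_count, selectivity)   # 210000

# The estimate inherits the row-count error unchanged (~5% here)
error = abs(stale_estimate - fresh_estimate) / fresh_estimate
print(f"{stale_estimate} vs {fresh_estimate}: {error:.1%} off")
```

Because selectivity multiplies the base count, every percentage point of row-count drift shows up as the same percentage error in every downstream estimate.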
While table statistics provide row and page counts, column statistics enable selectivity estimation for predicates and joins. PostgreSQL maintains rich per-column statistics in the pg_statistic catalog table.
```sql
-- PostgreSQL: Exploring column statistics via pg_stats view
-- ================================================================

-- View human-readable column statistics
SELECT
    tablename,
    attname AS column_name,
    null_frac,
    n_distinct,
    most_common_vals,
    most_common_freqs,
    histogram_bounds,
    correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Example output for 'status' column:
/*
tablename         | orders
column            | status
null_frac         | 0.00  (no nulls)
n_distinct        | 5.00  (exactly 5 distinct values)
most_common_vals  | {pending,processing,shipped,delivered,cancelled}
most_common_freqs | {0.35, 0.25, 0.20, 0.15, 0.05}
histogram_bounds  | null  (all values in MCV list)
correlation       | 0.08  (low correlation with row order)
*/

-- For a numeric column with range queries
-- (most_common_vals/histogram_bounds are anyarray; cast via text first):
SELECT
    attname AS column_name,
    null_frac,
    n_distinct,
    array_length(most_common_vals::text::text[], 1) AS mcv_count,
    array_length(histogram_bounds::text::text[], 1) - 1 AS histogram_buckets
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'total_amount';

-- Example output:
/*
column            | total_amount
null_frac         | 0.001  (0.1% null)
n_distinct        | -0.85  (~85% of rows have unique values)
mcv_count         | 10     (only 10 common values - unusual for amounts)
histogram_buckets | 99     (100 bounds = 99 buckets for range estimation)
*/
```

Understanding n_distinct:
The n_distinct value uses a clever encoding:
Positive value (e.g., 100): Exactly 100 distinct values
Negative value (e.g., -0.5): the distinct count is that fraction of the row count, here ~50%
For a 1M row table: ~500,000 distinct values
Examples:
status column: n_distinct = 5 (exactly 5 values: pending, active, etc.)
email column: n_distinct = -1.0 (every row has unique email)
country column: n_distinct = 195 (fixed set of countries)
order_id: n_distinct = -1.0 (primary key, all unique)
category: n_distinct = 50 (50 product categories)
The negative encoding handles tables that grow: a -0.5 n_distinct on a table that doubles remains accurate (50% unique), while a fixed 500,000 would become stale.
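The encoding is easy to make concrete with a small decoder. A sketch of the convention (not PostgreSQL code):

```python
def decode_n_distinct(n_distinct, row_count):
    """Decode pg_stats.n_distinct into an absolute distinct-value count.

    Positive: an exact count of distinct values.
    Negative: the distinct count as a fraction of the row count, so the
    statistic stays valid as the table grows.
    """
    if n_distinct >= 0:
        return int(n_distinct)
    return round(-n_distinct * row_count)

print(decode_n_distinct(5, 1_000_000))     # 5: status column
print(decode_n_distinct(-0.5, 1_000_000))  # 500000: half the rows distinct
print(decode_n_distinct(-1.0, 1_000_000))  # 1000000: fully unique column
print(decode_n_distinct(-0.5, 2_000_000))  # 1000000: table doubled, still 50%
```

The last two calls show why the fractional form is used: the same stored statistic yields the right answer at any table size.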
Histograms are perhaps the most important statistical structure for query optimization. They enable accurate estimation for range predicates (>, <, BETWEEN), inequality joins, and ORDER BY planning.
What Is a Histogram?
A histogram divides the value space into buckets, each containing approximately equal numbers of rows. PostgreSQL uses equi-height histograms where each bucket represents ~1% of values (with 100 buckets by default).
Column: order_date (1M orders over 2 years)
Histogram bounds: ['2022-01-01', '2022-01-08', '2022-01-15', ..., '2023-12-31']
Interpretation:
- Consecutive bounds delimit buckets that each hold ~1% of the rows
- Bucket 1: 2022-01-01 to 2022-01-08 (~10,000 rows)
- Bucket 2: 2022-01-08 to 2022-01-15 (~10,000 rows)
- ...
- Bucket 100: 2023-12-24 to 2023-12-31 (~10,000 rows)
Note: Most Common Values (MCVs) are excluded from histograms. A histogram represents only the 'uniform' remainder of the distribution.
Selectivity Estimation with Histograms:
Query: WHERE total_amount BETWEEN 100 AND 500
Histogram bounds: [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, ...]
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10
Buckets fully in range: B2, B3, B4, B5, B6, B7, B8, B9 = 8 buckets
Boundary at start (100): B2 begins exactly at 100, so it is fully included
Boundary at end (500): B10 begins at 500, so only a sliver of it qualifies
With 100 buckets, each represents 1%:
Full buckets: 8 × 1% = 8%
Partial B10 (500 at lower bound): ~0.5%
Estimated selectivity: ~8.5%
For 1M rows: ~85,000 matching rows
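The bucket walk above can be coded directly. A sketch of equi-height estimation under the uniform-within-bucket assumption; for brevity it uses only the 11 bounds shown (10 buckets of 10% each) rather than a full 100-bucket histogram:

```python
def range_selectivity(bounds, lo, hi):
    """Estimate selectivity of `lo <= x <= hi` from equi-height histogram
    bounds: each of the len(bounds) - 1 buckets holds an equal share of
    rows, and partial overlap is pro-rated by linear interpolation."""
    n_buckets = len(bounds) - 1
    per_bucket = 1.0 / n_buckets
    selectivity = 0.0
    for b_lo, b_hi in zip(bounds, bounds[1:]):
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        selectivity += per_bucket * overlap / (b_hi - b_lo)
    return selectivity

# 11 bounds → 10 buckets over [0, 500]; each bucket holds 10% of the rows
bounds = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
sel = range_selectivity(bounds, 100, 500)       # buckets B2..B9 fully covered
print(f"selectivity = {sel:.0%}")               # 80% with these 10 buckets
print(f"for 1M rows: ~{round(sel * 1_000_000):,} matching")
```

With 100 buckets instead of 10, the same walk produces the finer ~8.5% estimate from the worked example; more buckets means smaller interpolation error at the range endpoints.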
```sql
-- Analyzing histogram quality and coverage
-- ================================================================

-- View histogram bounds for a column
-- (histogram_bounds is anyarray; cast via text first)
SELECT
    attname,
    array_length(histogram_bounds::text::text[], 1) AS num_bounds,
    (histogram_bounds::text::text[])[1:3] AS first_few_bounds,
    (histogram_bounds::text::text[])[array_length(histogram_bounds::text::text[], 1) - 2:]
        AS last_few_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'total_amount';

-- Increase histogram resolution for important columns
-- default_statistics_target = 100 → 100 buckets
ALTER TABLE orders ALTER COLUMN total_amount SET STATISTICS 500;
ANALYZE orders;

-- Now the column has 500 histogram buckets instead of 100
-- More granular estimation for range queries

-- View the effect on statistics storage
SELECT
    attname,
    array_length(histogram_bounds::text::text[], 1) - 1 AS histogram_buckets,
    array_length(most_common_vals::text::text[], 1) AS mcv_entries
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'total_amount';

-- For very important columns that drive critical queries:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
-- Maximum useful value depends on n_distinct and query patterns
```

Histograms assume a uniform distribution within each bucket. For data with extreme local clustering, this assumption breaks down. Additionally, histograms are per-column—they cannot capture multi-column patterns. For correlated columns, extended statistics are needed.
Standard column statistics assume columns are independent—the probability of city = 'NYC' AND state = 'NY' is estimated as:
P(city='NYC') × P(state='NY') = 0.05 × 0.10 = 0.005 (0.5%)
But these columns are functionally dependent: NYC is always in NY. The actual selectivity is 0.05 (5%), not 0.5%—a 10× underestimate. This independence assumption causes massive cardinality estimation errors for correlated data.
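The 10× gap is plain arithmetic; a minimal Python sketch using the example's numbers:

```python
rows = 10_000
p_city = 0.05   # P(city = 'NYC')
p_state = 0.10  # P(state = 'NY')

# Independence assumption: multiply per-column selectivities
independent_estimate = round(rows * p_city * p_state)

# Functional dependency (city determines state): the second predicate
# filters out nothing, so the true selectivity is just P(city)
dependent_estimate = round(rows * p_city)

print(independent_estimate, dependent_estimate)  # 50 500, a 10x gap
```

The error compounds: a plan with several correlated predicates multiplies several too-small factors together, which is how joins end up misestimated by orders of magnitude.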
Extended statistics explicitly capture multi-column relationships:
```sql
-- PostgreSQL: Creating and using extended statistics
-- ================================================================

-- Problem: Correlated columns cause estimation errors
-- city and state are functionally dependent
EXPLAIN ANALYZE
SELECT * FROM customers
WHERE city = 'New York' AND state = 'NY';

-- Without extended stats:
-- Estimated rows: 50   (0.05 × 0.10 × 10000)
-- Actual rows:    500  (0.05 × 10000)
-- 10× underestimate!

-- Solution: Create extended statistics for dependencies
CREATE STATISTICS customers_city_state_dep (dependencies)
ON city, state FROM customers;

ANALYZE customers;

-- Now the same query:
EXPLAIN ANALYZE
SELECT * FROM customers
WHERE city = 'New York' AND state = 'NY';
-- Estimated rows: 500 (correct!)

-- ================================================================
-- N-Distinct extended statistics for GROUP BY accuracy
-- ================================================================

-- Problem: GROUP BY (year, month) distinct count estimation
EXPLAIN ANALYZE
SELECT
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    COUNT(*)
FROM orders
GROUP BY year, month;

-- Without extended stats:
-- Estimated groups: 5 × 12 = 60 (assumes independence)
-- Actual groups:    24 (2 years × 12 months; not all combinations exist)

CREATE STATISTICS orders_date_parts_ndistinct (ndistinct)
ON EXTRACT(YEAR FROM order_date), EXTRACT(MONTH FROM order_date)
FROM orders;

ANALYZE orders;
-- Now estimates ~24 groups correctly

-- ================================================================
-- MCV extended statistics for common combinations
-- ================================================================

CREATE STATISTICS products_cat_brand_mcv (mcv)
ON category, brand FROM products;

ANALYZE products;

-- Captures patterns like:
-- (Electronics, Apple):   15%
-- (Electronics, Samsung): 12%
-- (Clothing, Nike):        8%
-- Enables accurate estimation for multi-column equality predicates
```

Create extended statistics when: (1) EXPLAIN shows large estimation errors on multi-column predicates, (2) columns have obvious real-world relationships (city/state, product/category, date components), (3) GROUP BY on multiple columns shows poor estimates. Monitor pg_stat_user_tables.n_mod_since_analyze to ensure statistics stay current.
Statistics don't collect themselves—the database must explicitly gather them through the ANALYZE command. Understanding this process is critical for maintaining optimizer accuracy.
How ANALYZE Works:
1. SAMPLE COLLECTION
- Read a random sample of table pages
- Sample size: 300 × default_statistics_target rows (300 × 100 = 30,000 by default)
- Small tables are sampled in full
2. PER-COLUMN ANALYSIS
For each column:
a) Count NULLs → null_frac
b) Count distinct values → n_distinct
c) Identify most common values → most_common_vals, most_common_freqs
d) Build histogram from remaining values → histogram_bounds
e) Calculate value/row order correlation → correlation
3. TABLE STATISTICS UPDATE
- Update reltuples (row count) based on sample extrapolation
- Update relpages (page count) from actual storage
4. EXTENDED STATISTICS COMPUTATION
- If extended stats defined, compute multi-column structures
- Functional dependencies, n-distinct, MCVs for combinations
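Steps 1 and 2 can be sketched end to end: sample rows, then derive null_frac, n_distinct, the MCV list, and histogram bounds. This is a toy model of the process, not PostgreSQL's actual algorithm (which extrapolates n_distinct and applies frequency cutoffs to MCVs):

```python
import random
from collections import Counter

def analyze_column(values, stats_target=100, mcv_limit=5):
    """Toy model of ANALYZE's per-column pass (steps 1-2 above)."""
    # 1. Sample: 300 rows per unit of statistics target
    sample_size = 300 * stats_target
    sample = random.sample(values, min(sample_size, len(values)))

    # 2a. NULL fraction
    non_null = [v for v in sample if v is not None]
    null_frac = 1 - len(non_null) / len(sample)

    # 2b. Distinct count (real ANALYZE extrapolates to the whole table)
    counts = Counter(non_null)
    n_distinct = len(counts)

    # 2c. Most common values and their frequencies
    mcv = counts.most_common(mcv_limit)
    mcv_vals = {v for v, _ in mcv}
    mcv_freqs = [c / len(sample) for _, c in mcv]

    # 2d. Equi-height histogram bounds over the non-MCV remainder
    rest = sorted(v for v in non_null if v not in mcv_vals)
    step = max(1, len(rest) // stats_target)
    bounds = rest[::step] if rest else []

    return null_frac, n_distinct, sorted(mcv_vals), mcv_freqs, bounds

random.seed(42)
values = [random.randint(1, 1000) for _ in range(100_000)]
null_frac, nd, mcv_v, mcv_f, bounds = analyze_column(values)
print(null_frac, nd, len(bounds))
```

Even this toy version shows the key trade-off: a fixed-size sample keeps ANALYZE cheap on huge tables, at the cost of possibly missing rare values.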
```sql
-- PostgreSQL: ANALYZE operations and configuration
-- ================================================================

-- Analyze entire database
ANALYZE;

-- Analyze specific table
ANALYZE orders;

-- Analyze specific columns (faster when only some columns changed)
ANALYZE orders (status, updated_at);

-- Verbose analyze shows progress and statistics gathered
ANALYZE VERBOSE orders;
/*
Output example:
INFO:  analyzing "public.orders"
INFO:  "orders": scanned 30000 of 856214 pages, containing 354789 live rows
       and 12345 dead rows; 30000 rows in sample, 10123456 estimated total rows
*/

-- ================================================================
-- AutoAnalyze Configuration
-- ================================================================

-- View current autovacuum/autoanalyze settings
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name LIKE '%autovacuum%' OR name LIKE '%analyze%';

-- Key settings:
-- autovacuum_analyze_threshold = 50      -- Min rows modified to trigger
-- autovacuum_analyze_scale_factor = 0.1  -- Fraction of table size

-- Trigger formula:
-- threshold = autovacuum_analyze_threshold +
--             autovacuum_analyze_scale_factor × reltuples

-- For a 1M row table: 50 + 0.1 × 1,000,000 = 100,050 modifications trigger analyze

-- Override for specific tables
ALTER TABLE orders SET (autovacuum_analyze_threshold = 1000);
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.02);
-- Now: 1000 + 0.02 × 1,000,000 = 21,000 modifications trigger analyze

-- Force more frequent analysis for critical tables
ALTER TABLE transactions SET (autovacuum_analyze_scale_factor = 0.01);
```

ANALYZE samples ~30,000 rows regardless of table size (at the default statistics target). For a 100M row table, this is 0.03%—enough for accurate statistics in most cases, but outliers or rare patterns may be missed. For especially important columns, increase the STATISTICS target to sample more rows and capture more MCV entries and histogram buckets.
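The autoanalyze trigger formula is worth wrapping in a small helper for capacity planning; a sketch using the default and overridden settings from this section:

```python
def autoanalyze_trigger(reltuples, threshold=50, scale_factor=0.1):
    """Modifications needed before autoanalyze fires:
    autovacuum_analyze_threshold + autovacuum_analyze_scale_factor x reltuples
    """
    return round(threshold + scale_factor * reltuples)

print(autoanalyze_trigger(1_000_000))              # 100050 with defaults
print(autoanalyze_trigger(1_000_000, 1000, 0.02))  # 21000 with the override
```

Note how the scale factor dominates on large tables: with defaults, a 100M row table must accumulate ~10M modifications before autoanalyze runs, which is why busy large tables usually get a lower per-table scale factor.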
Statistics represent a snapshot of data at analysis time. As tables change through inserts, updates, and deletes, statistics become stale. The impact ranges from subtle performance degradation to catastrophic query plan choices.
```sql
-- Detecting and addressing stale statistics
-- ================================================================

-- Find tables with potentially stale statistics
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    n_mod_since_analyze,
    last_analyze,
    ROUND(n_mod_since_analyze::numeric /
          NULLIF(n_live_tup, 0) * 100, 2) AS pct_modified_since_analyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 1000
ORDER BY pct_modified_since_analyze DESC;

-- Tables with >10% modification since analyze likely need refresh

-- ================================================================
-- Comparing estimated vs stored row counts
-- ================================================================
SELECT
    c.relname,
    c.reltuples::bigint AS optimizer_knows,
    s.n_live_tup AS actual_live_rows,
    ROUND((c.reltuples - s.n_live_tup)::numeric /
          NULLIF(s.n_live_tup, 0) * 100, 2) AS pct_difference
FROM pg_class c
JOIN pg_stat_user_tables s ON c.relname = s.relname
WHERE c.reltuples > 1000
ORDER BY ABS((c.reltuples - s.n_live_tup)::numeric /
             NULLIF(s.n_live_tup, 0)) DESC;

-- Tables where the optimizer's row count differs >20% from reality
-- are prime candidates for ANALYZE

-- ================================================================
-- Automated freshness check for critical tables
-- ================================================================
CREATE OR REPLACE FUNCTION check_statistics_freshness()
RETURNS TABLE(table_name text, issue text, recommendation text) AS $$
BEGIN
    -- Tables with no recent analyze
    RETURN QUERY
    SELECT
        s.relname::text,
        'Statistics older than 24 hours'::text,
        'ANALYZE ' || s.relname::text
    FROM pg_stat_user_tables s
    WHERE s.last_analyze < NOW() - INTERVAL '24 hours'
      AND s.n_live_tup > 10000;

    -- Tables with high modification rate
    RETURN QUERY
    SELECT
        s.relname::text,
        format('%s%% modified since analyze',
               ROUND(s.n_mod_since_analyze::numeric /
                     NULLIF(s.n_live_tup, 0) * 100, 1))::text,
        'ANALYZE ' || s.relname::text
    FROM pg_stat_user_tables s
    WHERE s.n_mod_since_analyze::numeric /
          NULLIF(s.n_live_tup, 0) > 0.1
      AND s.n_live_tup > 10000;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM check_statistics_freshness();
```

After bulk loading data (ETL, data migrations), statistics are completely absent or wildly wrong. Always ANALYZE immediately after bulk operations. Some systems pause autovacuum during bulk loads for performance—remember to re-enable it and force an ANALYZE afterward.
Indexes have their own statistics that influence whether the optimizer chooses index scans versus table scans.
| Statistic | Description | Optimizer Use |
|---|---|---|
| reltuples (index) | Number of index entries | Compare with table tuples for coverage analysis |
| relpages (index) | Number of index pages | I/O cost for index scan |
| Index height | Levels in B-tree | Startup cost for index navigation |
| Leaf page density | How full leaf pages are | Affects range scan I/O |
| Correlation with heap | Index order vs heap order | Random vs sequential heap access |
```sql
-- Viewing index statistics
-- ================================================================

-- Index size and tuple counts
SELECT
    i.indexrelid::regclass AS index_name,
    i.indrelid::regclass AS table_name,
    c.reltuples::bigint AS index_entries,
    c.relpages AS index_pages,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
WHERE i.indrelid = 'orders'::regclass;

-- Index usage statistics
SELECT
    indexrelname,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;

-- B-tree specific statistics via pageinspect extension
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Get tree height and stats from metapage
SELECT * FROM bt_metap('orders_customer_id_idx');
/*
magic     | 340322
version   | 4
root      | 290
level     | 2   -- Height is level + 1 = 3
fastroot  | 290
fastlevel | 2
...
*/

-- Correlation between index column and heap order
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

-- correlation close to 1.0 or -1.0: Index range scans are efficient
-- correlation close to 0: Index range scans cause random heap I/O
```

High correlation (>0.9 or <-0.9) means rows with similar index values are stored close together on disk, so range scans read sequential heap pages. Low correlation (close to 0) means random heap access for each index entry. For low-correlation columns, consider CLUSTER to reorder the table by the index, or accept that index scans will be expensive for large result sets.
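The correlation statistic is essentially a Pearson correlation between a column's values and their physical row positions. A self-contained sketch of that computation (pure Python, not the ANALYZE implementation):

```python
def correlation(values):
    """Pearson correlation between column values and their physical
    positions (0, 1, 2, ...). Near +/-1: heap order tracks value order;
    near 0: each index probe lands on an arbitrary heap page."""
    n = len(values)
    mean_v = sum(values) / n
    mean_p = (n - 1) / 2
    cov = sum((v - mean_v) * (p - mean_p) for p, v in enumerate(values))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_p = sum((p - mean_p) ** 2 for p in range(n))
    return cov / (var_v * var_p) ** 0.5

print(correlation([10, 20, 30, 40, 50]))   # 1.0 (perfectly ordered)
print(correlation([50, 40, 30, 20, 10]))   # -1.0 (reverse ordered)
print(correlation([30, 10, 50, 20, 40]))   # 0.3 (weakly ordered)
```

This is why a freshly CLUSTERed table estimates range scans so cheaply: correlation near 1.0 lets the optimizer cost the heap fetches as sequential rather than random I/O.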
Statistics are the foundation of accurate cost estimation: table-level counts anchor every cardinality calculation, column statistics and histograms drive selectivity, extended statistics handle correlated columns, and regular ANALYZE keeps all of it current.
What's Next:
With I/O cost, CPU cost, and statistics understood, we're ready for the culmination of cost modeling: Cardinality Estimation. The next page explores how databases combine all this statistical information to predict how many rows flow through each query operator—the single most important factor in query optimization success or failure.
You now have deep expertise in database statistics—how they're structured, collected, maintained, and used for query optimization. This knowledge enables you to diagnose estimation errors, configure statistics targets appropriately, and maintain optimizer accuracy in production systems.