"You can't improve what you can't measure." This maxim applies powerfully to database indexing. While intuition and experience guide initial index design, performance metrics provide the objective data needed to validate designs, diagnose problems, and optimize for specific workloads.
Index performance isn't a single number—it's a multi-dimensional space encompassing I/O costs, CPU usage, memory consumption, and maintenance overhead. Different metrics matter for different operations: point lookups care about tree height, range scans care about leaf clustering, and write-heavy workloads care about split frequency.
This page introduces the key metrics used by database professionals to analyze index performance, along with formulas for estimation and tools for measurement.
By the end of this page, you will understand the core performance metrics for B+-tree indexes, how to calculate expected I/O costs, what selectivity and cardinality mean for index effectiveness, and how to use database-provided statistics for performance analysis. You'll be equipped to make data-driven index decisions.
Index metrics fall into several categories, each addressing different aspects of performance. A holistic understanding requires familiarity with all categories.
| Category | Metrics | What It Measures | Primary Concern |
|---|---|---|---|
| Structure | Height, fanout, node count, depth | Tree shape and size | Lookup cost, storage overhead |
| Storage | Total size, page count, fill factor | Disk space consumption | Storage cost, cache efficiency |
| Access Cost | I/O per lookup, I/O per range scan | Read operation expense | Query latency, throughput |
| Selectivity | Cardinality, distinct values, density | Index filtering power | Optimizer decisions, query benefit |
| Maintenance | Split frequency, merge frequency, fragmentation | Write operation overhead | Insert/update cost, degradation |
| Cache | Hit ratio, buffer pool residency | Memory efficiency | Effective vs theoretical I/O |
Index design often involves tradeoffs between read and write performance. Metrics help quantify these tradeoffs:
• Higher fill factor → Better read performance, worse write (more splits)
• More indexes → Better read flexibility, worse write (more maintenance)
• Wider indexes (covering) → Better read (no lookup), worse write (more data to maintain)
For disk-based databases, I/O is the dominant cost factor. Understanding and predicting I/O costs is fundamental to index performance analysis.
I/O Cost Formulas:
Point Lookup:
Logical I/O = Height (h) = ⌈log_F(N)⌉
Effective I/O = h - cached_levels ≈ 1-2 for hot indexes
Range Scan:
Logical I/O = h + ⌈matching_rows / rows_per_leaf⌉
For a tree of height 4, 10,000 matching rows, and 100 rows per leaf:
I/O = 4 + ⌈10,000/100⌉ = 4 + 100 = 104 pages
Index-Only Scan (covering index):
I/O = height + result_leaf_pages
(No additional table access required)
Non-Covering Index Scan:
I/O = height + result_leaf_pages + bookmark_lookups
Each bookmark lookup may add 1-4 I/Os depending on table structure
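The formulas above can be sketched as a small cost estimator. This is a simplified model: it assumes a uniform fanout and ignores cache warm-up; the function names are illustrative.

```python
import math


def point_lookup_ios(n_rows: int, fanout: int, cached_levels: int = 0) -> int:
    """Physical I/Os for a point lookup: tree height minus cached levels."""
    height = math.ceil(math.log(n_rows, fanout))
    return max(0, height - cached_levels)


def range_scan_ios(n_rows: int, fanout: int, matching_rows: int,
                   rows_per_leaf: int) -> int:
    """Logical I/Os for a range scan: descent plus the leaf pages holding the result."""
    height = math.ceil(math.log(n_rows, fanout))
    return height + math.ceil(matching_rows / rows_per_leaf)
```

For the running example (10M rows, fanout 500, two cached levels), `point_lookup_ios` yields 1 physical I/O per lookup.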
At some selectivity level, sequential table scan becomes cheaper than index scan + bookmark lookups. This typically occurs when selecting 5-20% of rows. The query optimizer estimates costs to choose between:
• Sequential scan: table_pages I/Os
• Index scan: index_pages + rows × bookmark_cost I/Os
Indexes help most when retrieving small fractions of data. For large result sets, full scans may win.
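A minimal sketch of that crossover decision, using the two cost formulas above. Real optimizers use far richer models (CPU cost, correlation, caching), so treat this as an illustration only; all names are hypothetical.

```python
def prefer_index_scan(table_pages: int, index_pages: int,
                      matching_rows: int, bookmark_cost: int = 2) -> bool:
    """True if the estimated index-scan cost beats a sequential scan."""
    seq_cost = table_pages                                    # read every table page
    index_cost = index_pages + matching_rows * bookmark_cost  # index pages + lookups
    return index_cost < seq_cost
```

With a 100,000-page table and a 300-page index, the index scan wins for 10,000 matching rows but loses for 60,000, illustrating the tipping point.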
Selectivity and cardinality measure how well an index can narrow down results. These metrics fundamentally determine whether an index will be used and how much benefit it provides.
Definitions:
| Metric | Definition | Formula | Ideal Value |
|---|---|---|---|
| Cardinality | Number of distinct values in the index key | COUNT(DISTINCT key) | Equals row count (unique) |
| Selectivity | Fraction of rows matching a specific value | 1 / cardinality | As low as possible |
| Density | Average fraction of rows per distinct value | 1 / cardinality | Low for indexes |
| Index Selectivity | Cardinality / Total rows | DISTINCT / COUNT(*) | Close to 1.0 |
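The definitions in the table translate directly into code; a quick sketch:

```python
def selectivity(cardinality: int) -> float:
    """Average fraction of rows matching one specific key value."""
    return 1.0 / cardinality


def index_selectivity(cardinality: int, total_rows: int) -> float:
    """Distinct values relative to row count; 1.0 means every key is unique."""
    return cardinality / total_rows
```

For example, a `country` column with 200 distinct values has selectivity 1/200 = 0.005, matching the table below.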
Examples:
| Column | Rows | Distinct Values | Cardinality | Selectivity | Index Quality |
|---|---|---|---|---|---|
| user_id (PK) | 1,000,000 | 1,000,000 | 1,000,000 | 0.000001 | Excellent |
| country | 1,000,000 | 200 | 200 | 0.005 | Good for filtering |
| gender | 1,000,000 | 3 | 3 | 0.333 | Poor standalone |
| is_active | 1,000,000 | 2 | 2 | 0.5 | Very poor |
Interpretation: high-cardinality columns like user_id make excellent index keys, since each lookup narrows to a single row. Low-cardinality columns like gender or is_active filter out so little that the optimizer will usually ignore a standalone index on them.
Composite Index Selectivity:
For composite indexes on (A, B), selectivity compounds (assuming A and B are statistically independent):
Selectivity(A, B) ≈ Selectivity(A) × Selectivity(B)
If country has selectivity 1/200 and city has 1/500:
Composite selectivity ≈ 1/100,000
This is why composite indexes are powerful—multiplying selectivities dramatically reduces result sets. Be aware that correlated columns (city largely determines country, for instance) violate the independence assumption, so the true combined selectivity can be worse than the product suggests.
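The compounding rule is a one-liner, under the same independence assumption noted above:

```python
def composite_selectivity(*selectivities: float) -> float:
    """Multiply per-column selectivities (assumes independent columns)."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result
```

For country (1/200) and city (1/500), this gives roughly 1/100,000, as in the example.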
Query optimizers rely on statistics for cardinality estimates:
• PostgreSQL: ANALYZE table_name (or autovacuum)
• MySQL: ANALYZE TABLE table_name
• SQL Server: UPDATE STATISTICS table_name
Stale statistics lead to poor query plans. After large data changes, always refresh statistics.
Index storage metrics affect both disk costs and cache efficiency. Understanding storage composition helps optimize resource usage.
```sql
-- Index size and page counts
SELECT
    pg_class.relname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    pg_relation_size(indexrelid) / 8192 AS page_count,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
JOIN pg_class ON pg_class.oid = indexrelid
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Detailed page statistics (requires pgstattuple extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
    'idx_users_email' AS index_name,
    tree_level AS height,
    index_size,
    leaf_pages,
    internal_pages,
    avg_leaf_density,
    leaf_fragmentation
FROM pgstatindex('idx_users_email');

-- Index to table size ratio
SELECT
    schemaname || '.' || relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_table_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS indexes_size,
    ROUND(100.0 * pg_indexes_size(relid) / NULLIF(pg_table_size(relid), 0), 1) AS index_to_table_percent
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```

Over time, indexes can become fragmented—their logical and physical organization diverge, reducing performance. Understanding fragmentation metrics helps schedule maintenance appropriately.
| Type | Description | Impact | Detection |
|---|---|---|---|
| Logical Fragmentation | Leaf pages not in key order on disk | Sequential reads become random | Fragmentation % |
| Internal Fragmentation | Pages not fully utilized (low fill) | More I/O for same data | Avg page space used % |
| Free Space Fragmentation | Small gaps scattered across pages | Can't insert without split | Free space distribution |
| Extent Fragmentation | Index extents not contiguous | Prefetch less effective | Extent distribution |
Fragmentation Thresholds (General Guidelines):
| Metric | Good | Acceptable | Needs Attention | Critical |
|---|---|---|---|---|
| Logical Fragmentation | <5% | 5-15% | 15-30% | >30% |
| Page Fill (Internal) | >85% | 70-85% | 50-70% | <50% |
| Free Space | <15% | 15-30% | 30-50% | >50% |
Remediation Options:
REINDEX / REBUILD — Completely rebuilds index with optimal structure. Most thorough but most expensive.
REORGANIZE — In-place defragmentation, compacts pages. Less intrusive, can run online.
ALTER INDEX ... FILL FACTOR — Rebuild with specified fill factor to leave room for future inserts.
Drop and Recreate — Sometimes fastest for heavily fragmented indexes.
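One way to operationalize the threshold table is a small decision helper. The 5/15/30% cutoffs follow the general guidelines above and are not hard rules; the function name and return values are illustrative.

```python
def recommend_action(logical_frag_pct: float) -> str:
    """Map logical fragmentation % to a maintenance action per the guideline table."""
    if logical_frag_pct < 5:
        return "none"          # Good: no action needed
    if logical_frag_pct < 15:
        return "monitor"       # Acceptable: watch the trend
    if logical_frag_pct < 30:
        return "REORGANIZE"    # Needs attention: online defragmentation
    return "REBUILD"           # Critical: full rebuild
```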
When to Address Fragmentation: as a general rule, consider REORGANIZE once logical fragmentation enters the 15-30% range and a full REBUILD beyond roughly 30%, scheduled during low-traffic windows. Below those thresholds, the maintenance cost usually outweighs the benefit.
Common belief: "Fragmentation doesn't matter on SSDs." Reality: Logical fragmentation matters less (random ≈ sequential latency), but internal fragmentation still wastes space and cache. Low page fill means more pages cached for same data, reducing buffer pool effectiveness. Don't ignore internal fragmentation even on SSD storage.
While indexes accelerate reads, they impose overhead on writes. Quantifying this overhead helps balance query performance against update costs.
Estimating Index Write Overhead:
Write Cost Per Row = Base_Row_Cost + Sum(Index_Costs)
Where Index_Cost ≈ (
Key_Size × Log_Overhead +
Tree_Descent_Cost +
Split_Probability × Split_Cost
)
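The formula can be turned into a rough estimator. All parameters are in abstract cost units, and the function names are illustrative, not a real API:

```python
def index_write_cost(key_size: float, log_overhead: float, descent_cost: float,
                     split_probability: float, split_cost: float) -> float:
    """Per-row maintenance cost of one index, per the formula above."""
    return key_size * log_overhead + descent_cost + split_probability * split_cost


def row_write_cost(base_row_cost: float, index_costs: list[float]) -> float:
    """Total write cost per row: base row write plus every index's maintenance."""
    return base_row_cost + sum(index_costs)
```

The second function makes the key point explicit: write cost grows with every index the table carries.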
Rule of Thumb: every secondary index adds roughly the cost of one additional indexed write per INSERT and DELETE (and per UPDATE that changes its key), so write overhead grows approximately linearly with index count. A table with five indexes pays several times the write cost of an unindexed table.
Identifying Unused Indexes:
Unused indexes impose cost without benefit. Identify them using usage statistics:
```sql
-- Unused indexes (never scanned since stats reset)
SELECT
    schemaname || '.' || relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used,
    idx_tup_read AS rows_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey'  -- Exclude primary keys
ORDER BY pg_relation_size(indexrelid) DESC;

-- Low-value indexes (many updates, few reads)
SELECT
    schemaname || '.' || relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS reads,
    idx_tup_read AS rows_read,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 100                           -- Rarely used
  AND pg_relation_size(indexrelid) > 10000000  -- But >10MB
ORDER BY pg_relation_size(indexrelid) DESC;
```

Before dropping 'unused' indexes, consider:
• Monthly/quarterly reports may need them infrequently
• Stats may have been reset recently
• Some indexes support UNIQUE constraints
• Covering indexes may be valuable for specific queries
Run in a test environment first, or use the invisible index feature (MySQL 8+, Oracle) to disable an index without dropping it.
With the metrics and formulas covered, let's build complete performance estimation models for common scenarios.
Model 1: Point Lookup Time
Input:
F = Fanout (e.g., 500)
N = Row count (e.g., 10,000,000)
t_disk = Disk read latency (e.g., 5ms HDD, 0.1ms SSD)
cache_levels = Levels fully cached (e.g., 2)
Calculation:
h = ⌈log_F(N)⌉ = ⌈log_500(10,000,000)⌉ = 3
physical_ios = max(0, h - cache_levels) = max(0, 3 - 2) = 1
lookup_time = physical_ios × t_disk = 1 × 0.1ms = 0.1ms
Result: ~0.1ms per point lookup on SSD with cached internal nodes
Model 2: Range Scan Time
Input:
N = Row count = 10,000,000
result_rows = 10,000
rows_per_leaf = 100
h = 3 (tree height)
t_disk = 0.1ms
Calculation:
descent_ios = 1 (internal nodes cached)
leaf_ios = ⌈result_rows / rows_per_leaf⌉ = ⌈10000/100⌉ = 100
total_ios = descent_ios + leaf_ios = 101
scan_time = total_ios × t_disk = 101 × 0.1ms = 10.1ms
Result: ~10ms for 10K-row range scan (covering index)
Model 3: Non-Covering Index Cost
Input (continuing from Model 2):
leaf_ios = 100 (same as above)
bookmark_lookup_cost = 2 (clustered index descent)
clustering_factor = 0.8 (rows somewhat clustered)
Calculation:
bookmark_ios = result_rows × (1 - clustering_factor) × bookmark_lookup_cost
= 10000 × 0.2 × 2 = 4000 additional I/Os
total_ios = 1 + 100 + 4000 = 4101
Result: Non-covering is ~40x more expensive than covering for this query!
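The three models can be checked end to end in a few lines, using the same assumed parameters as above (SSD latency, cached internal nodes, clustering factor 0.8):

```python
import math

T_DISK_SSD_MS = 0.1  # assumed per-page read latency on SSD


def model_point_lookup_ms(n_rows=10_000_000, fanout=500, cached_levels=2,
                          t_disk=T_DISK_SSD_MS):
    """Model 1: physical I/Os are the uncached levels of the tree."""
    height = math.ceil(math.log(n_rows, fanout))       # 3 for these inputs
    return max(0, height - cached_levels) * t_disk


def model_range_scan_ms(result_rows=10_000, rows_per_leaf=100,
                        descent_ios=1, t_disk=T_DISK_SSD_MS):
    """Model 2: one descent I/O (internal nodes cached) plus the result's leaf pages."""
    leaf_ios = math.ceil(result_rows / rows_per_leaf)  # 100 for these inputs
    return (descent_ios + leaf_ios) * t_disk


def model_noncovering_ios(result_rows=10_000, leaf_ios=100, descent_ios=1,
                          clustering_factor=0.8, bookmark_cost=2):
    """Model 3: add bookmark lookups, discounted by how clustered the rows are."""
    bookmark_ios = result_rows * (1 - clustering_factor) * bookmark_cost
    return descent_ios + leaf_ios + bookmark_ios
```

Running these reproduces the results above: ~0.1ms per point lookup, ~10.1ms for the covering range scan, and ~4,101 I/Os for the non-covering scan.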
Clustering factor measures correlation between index key order and table row physical order:
• 1.0 = Perfect clustering (index order = table order)
• 0.0 = No correlation (each index entry requires a separate table page)
For non-clustered indexes, poor clustering dramatically increases bookmark lookup cost. This is why clustered index choice is so important—it determines the physical row order that affects all other indexes.
Performance metrics transform index optimization from guesswork into science. Armed with these measurements and formulas, you can make data-driven decisions about index design, identify problems before users notice them, and justify infrastructure investments.
Module Complete!
With this page, you've completed Module 6: Index Terminology. You now have a comprehensive vocabulary and conceptual framework for discussing, analyzing, and optimizing database indexes. From fanout to height, from leaf nodes to internal nodes, and from qualitative understanding to quantitative metrics—you're equipped to work with indexes at a professional level.
Next Steps:
The next chapter explores B-Trees & B+-Trees in depth, diving into the algorithms for search, insert, delete, and the structural properties that make these trees so effective for database indexing.
Congratulations! You've mastered the essential terminology and metrics for database indexing. You understand fanout's impact on tree height, how leaf and internal nodes differ structurally and functionally, and the quantitative metrics that measure index performance. This foundation prepares you for advanced topics in B-tree implementations and index optimization strategies.