"You can't improve what you can't measure." This maxim applies powerfully to database indexing. While intuition and experience guide initial index design, performance metrics provide the objective data needed to validate designs, diagnose problems, and optimize for specific workloads.
Index performance isn't a single number—it's a multi-dimensional space encompassing I/O costs, CPU usage, memory consumption, and maintenance overhead. Different metrics matter for different operations: point lookups care about tree height, range scans care about leaf clustering, and write-heavy workloads care about split frequency.
This page introduces the key metrics used by database professionals to analyze index performance, along with formulas for estimation and tools for measurement.
By the end of this page, you will understand the core performance metrics for B+-tree indexes, how to calculate expected I/O costs, what selectivity and cardinality mean for index effectiveness, and how to use database-provided statistics for performance analysis. You'll be equipped to make data-driven index decisions.
Index metrics fall into several categories, each addressing different aspects of performance. A holistic understanding requires familiarity with all categories.
| Category | Metrics | What It Measures | Primary Concern |
|---|---|---|---|
| Structure | Height, fanout, node count, depth | Tree shape and size | Lookup cost, storage overhead |
| Storage | Total size, page count, fill factor | Disk space consumption | Storage cost, cache efficiency |
| Access Cost | I/O per lookup, I/O per range scan | Read operation expense | Query latency, throughput |
| Selectivity | Cardinality, distinct values, density | Index filtering power | Optimizer decisions, query benefit |
| Maintenance | Split frequency, merge frequency, fragmentation | Write operation overhead | Insert/update cost, degradation |
| Cache | Hit ratio, buffer pool residency | Memory efficiency | Effective vs theoretical I/O |
Index design often involves tradeoffs between read and write performance. Metrics help quantify these tradeoffs:
• Higher fill factor → Better read performance, worse write (more splits)
• More indexes → Better read flexibility, worse write (more maintenance)
• Wider indexes (covering) → Better read (no lookup), worse write (more data to maintain)
For disk-based databases, I/O is the dominant cost factor. Understanding and predicting I/O costs is fundamental to index performance analysis.
I/O Cost Formulas:
Point Lookup:
Logical I/O = Height (h) = ⌈log_F(N)⌉
Effective I/O = h - cached_levels ≈ 1-2 for hot indexes
Range Scan:
Logical I/O = h + ⌈matching_rows / rows_per_leaf⌉
For a tree of height 4, 10,000 matching rows, and 100 rows per leaf:
I/O = 4 + ⌈10,000/100⌉ = 4 + 100 = 104 pages
Index-Only Scan (covering index):
I/O = height + result_leaf_pages
(No additional table access required)
Non-Covering Index Scan:
I/O = height + result_leaf_pages + bookmark_lookups
Each bookmark lookup may add 1-4 I/Os depending on table structure
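The formulas above can be sketched as a small cost estimator. This is a simplified model: it assumes a uniform fanout and ignores cache warm-up; the function names are illustrative.

```python
import math


def point_lookup_ios(n_rows: int, fanout: int, cached_levels: int = 0) -> int:
    """Physical I/Os for a point lookup: tree height minus cached levels."""
    height = math.ceil(math.log(n_rows, fanout))
    return max(0, height - cached_levels)


def range_scan_ios(n_rows: int, fanout: int, matching_rows: int,
                   rows_per_leaf: int) -> int:
    """Logical I/Os for a range scan: descent plus the leaf pages holding the result."""
    height = math.ceil(math.log(n_rows, fanout))
    return height + math.ceil(matching_rows / rows_per_leaf)
```

For the running example (10M rows, fanout 500, two cached levels), `point_lookup_ios` yields 1 physical I/O per lookup.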
At some selectivity level, sequential table scan becomes cheaper than index scan + bookmark lookups. This typically occurs when selecting 5-20% of rows. The query optimizer estimates costs to choose between:
• Sequential scan: table_pages I/Os
• Index scan: index_pages + rows × bookmark_cost I/Os
Indexes help most when retrieving small fractions of data. For large result sets, full scans may win.
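A minimal sketch of that crossover decision, using the two cost formulas above. Real optimizers use far richer models (CPU cost, correlation, caching), so treat this as an illustration only; all names are hypothetical.

```python
def prefer_index_scan(table_pages: int, index_pages: int,
                      matching_rows: int, bookmark_cost: int = 2) -> bool:
    """True if the estimated index-scan cost beats a sequential scan."""
    seq_cost = table_pages                                    # read every table page
    index_cost = index_pages + matching_rows * bookmark_cost  # index pages + lookups
    return index_cost < seq_cost
```

With a 100,000-page table and a 300-page index, the index scan wins for 10,000 matching rows but loses for 60,000, illustrating the tipping point.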
Selectivity and cardinality measure how well an index can narrow down results. These metrics fundamentally determine whether an index will be used and how much benefit it provides.
Definitions:
| Metric | Definition | Formula | Ideal Value |
|---|---|---|---|
| Cardinality | Number of distinct values in the index key | COUNT(DISTINCT key) | Equals row count (unique) |
| Selectivity | Fraction of rows matching a specific value | 1 / cardinality | As low as possible |
| Density | Average fraction of rows per distinct value | 1 / cardinality | Low for indexes |
| Index Selectivity | Cardinality / Total rows | DISTINCT / COUNT(*) | Close to 1.0 |
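The definitions in the table translate directly into code; a quick sketch:

```python
def selectivity(cardinality: int) -> float:
    """Average fraction of rows matching one specific key value."""
    return 1.0 / cardinality


def index_selectivity(cardinality: int, total_rows: int) -> float:
    """Distinct values relative to row count; 1.0 means every key is unique."""
    return cardinality / total_rows
```

For example, a `country` column with 200 distinct values has selectivity 1/200 = 0.005, matching the table below.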
Examples:
| Column | Rows | Distinct Values | Cardinality | Selectivity | Index Quality |
|---|---|---|---|---|---|
| user_id (PK) | 1,000,000 | 1,000,000 | 1,000,000 | 0.000001 | Excellent |
| country | 1,000,000 | 200 | 200 | 0.005 | Good for filtering |
| gender | 1,000,000 | 3 | 3 | 0.333 | Poor standalone |
| is_active | 1,000,000 | 2 | 2 | 0.5 | Very poor |
Interpretation: high-cardinality columns like user_id make excellent index keys, since each lookup narrows to a single row. Low-cardinality columns like gender or is_active filter out so little that the optimizer will usually ignore a standalone index on them.
Composite Index Selectivity:
For composite indexes on (A, B), selectivity compounds (assuming A and B are statistically independent):
Selectivity(A, B) ≈ Selectivity(A) × Selectivity(B)
If country has selectivity 1/200 and city has 1/500:
Composite selectivity ≈ 1/100,000
This is why composite indexes are powerful—multiplying selectivities dramatically reduces result sets. Be aware that correlated columns (city largely determines country, for instance) violate the independence assumption, so the true combined selectivity can be worse than the product suggests.
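The compounding rule is a one-liner, under the same independence assumption noted above:

```python
def composite_selectivity(*selectivities: float) -> float:
    """Multiply per-column selectivities (assumes independent columns)."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result
```

For country (1/200) and city (1/500), this gives roughly 1/100,000, as in the example.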
Query optimizers rely on statistics for cardinality estimates:
• PostgreSQL: ANALYZE table_name (or autovacuum)
• MySQL: ANALYZE TABLE table_name
• SQL Server: UPDATE STATISTICS table_name
Stale statistics lead to poor query plans. After large data changes, always refresh statistics.
Index storage metrics affect both disk costs and cache efficiency. Understanding storage composition helps optimize resource usage.
```sql
-- Index size and page counts
SELECT
    pg_class.relname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    pg_relation_size(indexrelid) / 8192 AS page_count,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
JOIN pg_class ON pg_class.oid = indexrelid
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Detailed page statistics (requires pgstattuple extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
    'idx_users_email' AS index_name,
    tree_level AS height,
    index_size,
    leaf_pages,
    internal_pages,
    avg_leaf_density,
    leaf_fragmentation
FROM pgstatindex('idx_users_email');

-- Index to table size ratio
SELECT
    schemaname || '.' || relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_table_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS indexes_size,
    ROUND(100.0 * pg_indexes_size(relid) / NULLIF(pg_table_size(relid), 0), 1) AS index_to_table_percent
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```

Over time, indexes can become fragmented—their logical and physical organization diverge, reducing performance. Understanding fragmentation metrics helps schedule maintenance appropriately.
| Type | Description | Impact | Detection |
|---|---|---|---|
| Logical Fragmentation | Leaf pages not in key order on disk | Sequential reads become random | Fragmentation % |
| Internal Fragmentation | Pages not fully utilized (low fill) | More I/O for same data | Avg page space used % |
| Free Space Fragmentation | Small gaps scattered across pages | Can't insert without split | Free space distribution |
| Extent Fragmentation | Index extents not contiguous | Prefetch less effective | Extent distribution |
Fragmentation Thresholds (General Guidelines):
| Metric | Good | Acceptable | Needs Attention | Critical |
|---|---|---|---|---|
| Logical Fragmentation | <5% | 5-15% | 15-30% | >30% |
| Page Fill (Internal) | >85% | 70-85% | 50-70% | <50% |
| Free Space | <15% | 15-30% | 30-50% | >50% |
Remediation Options:
REINDEX / REBUILD — Completely rebuilds index with optimal structure. Most thorough but most expensive.
REORGANIZE — In-place defragmentation, compacts pages. Less intrusive, can run online.
ALTER INDEX ... FILL FACTOR — Rebuild with specified fill factor to leave room for future inserts.
Drop and Recreate — Sometimes fastest for heavily fragmented indexes.
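One way to operationalize the threshold table is a small decision helper. The 5/15/30% cutoffs follow the general guidelines above and are not hard rules; the function name and return values are illustrative.

```python
def recommend_action(logical_frag_pct: float) -> str:
    """Map logical fragmentation % to a maintenance action per the guideline table."""
    if logical_frag_pct < 5:
        return "none"          # Good: no action needed
    if logical_frag_pct < 15:
        return "monitor"       # Acceptable: watch the trend
    if logical_frag_pct < 30:
        return "REORGANIZE"    # Needs attention: online defragmentation
    return "REBUILD"           # Critical: full rebuild
```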
When to Address Fragmentation: as a general rule, consider REORGANIZE once logical fragmentation enters the 15-30% range and a full REBUILD beyond roughly 30%, scheduled during low-traffic windows. Below those thresholds, the maintenance cost usually outweighs the benefit.
Common belief: "Fragmentation doesn't matter on SSDs." Reality: Logical fragmentation matters less (random ≈ sequential latency), but internal fragmentation still wastes space and cache. Low page fill means more pages cached for same data, reducing buffer pool effectiveness. Don't ignore internal fragmentation even on SSD storage.
While indexes accelerate reads, they impose overhead on writes. Quantifying this overhead helps balance query performance against update costs.
Estimating Index Write Overhead:
Write Cost Per Row = Base_Row_Cost + Sum(Index_Costs)
Where Index_Cost ≈ (
Key_Size × Log_Overhead +
Tree_Descent_Cost +
Split_Probability × Split_Cost
)
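The formula can be turned into a rough estimator. All parameters are in abstract cost units, and the function names are illustrative, not a real API:

```python
def index_write_cost(key_size: float, log_overhead: float, descent_cost: float,
                     split_probability: float, split_cost: float) -> float:
    """Per-row maintenance cost of one index, per the formula above."""
    return key_size * log_overhead + descent_cost + split_probability * split_cost


def row_write_cost(base_row_cost: float, index_costs: list[float]) -> float:
    """Total write cost per row: base row write plus every index's maintenance."""
    return base_row_cost + sum(index_costs)
```

The second function makes the key point explicit: write cost grows with every index the table carries.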
Rule of Thumb: every secondary index adds roughly the cost of one additional indexed write per INSERT and DELETE (and per UPDATE that changes its key), so write overhead grows approximately linearly with index count. A table with five indexes pays several times the write cost of an unindexed table.
Identifying Unused Indexes:
Unused indexes impose cost without benefit. Identify them using usage statistics:
```sql
-- Unused indexes (never scanned since stats reset)
SELECT
    schemaname || '.' || relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used,
    idx_tup_read AS rows_read
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey'  -- Exclude primary keys
ORDER BY pg_relation_size(indexrelid) DESC;

-- Low-value indexes (many updates, few reads)
SELECT
    schemaname || '.' || relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS reads,
    idx_tup_read AS rows_read,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan < 100                           -- Rarely used
  AND pg_relation_size(indexrelid) > 10000000  -- But >10MB
ORDER BY pg_relation_size(indexrelid) DESC;
```

Before dropping 'unused' indexes, consider:
• Monthly/quarterly reports may need them infrequently
• Stats may have been reset recently
• Some indexes support UNIQUE constraints
• Covering indexes may be valuable for specific queries
Run in a test environment first, or use the invisible index feature (MySQL 8+, Oracle) to disable an index without dropping it.
With the metrics and formulas covered, let's build complete performance estimation models for common scenarios.
Model 1: Point Lookup Time
Input:
F = Fanout (e.g., 500)
N = Row count (e.g., 10,000,000)
t_disk = Disk read latency (e.g., 5ms HDD, 0.1ms SSD)
cache_levels = Levels fully cached (e.g., 2)
Calculation:
h = ⌈log_F(N)⌉ = ⌈log_500(10,000,000)⌉ = 3
physical_ios = max(0, h - cache_levels) = max(0, 3 - 2) = 1
lookup_time = physical_ios × t_disk = 1 × 0.1ms = 0.1ms
Result: ~0.1ms per point lookup on SSD with cached internal nodes
Model 2: Range Scan Time
Input:
N = Row count = 10,000,000
result_rows = 10,000
rows_per_leaf = 100
h = 3 (tree height)
t_disk = 0.1ms
Calculation:
descent_ios = 1 (internal nodes cached)
leaf_ios = ⌈result_rows / rows_per_leaf⌉ = ⌈10000/100⌉ = 100
total_ios = descent_ios + leaf_ios = 101
scan_time = total_ios × t_disk = 101 × 0.1ms = 10.1ms
Result: ~10ms for 10K-row range scan (covering index)
Model 3: Non-Covering Index Cost
Input (continuing from Model 2):
leaf_ios = 100 (same as above)
bookmark_lookup_cost = 2 (clustered index descent)
clustering_factor = 0.8 (rows somewhat clustered)
Calculation:
bookmark_ios = result_rows × (1 - clustering_factor) × bookmark_lookup_cost
= 10000 × 0.2 × 2 = 4000 additional I/Os
total_ios = 1 + 100 + 4000 = 4101
Result: Non-covering is ~40x more expensive than covering for this query!
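The three models can be checked end to end in a few lines, using the same assumed parameters as above (SSD latency, cached internal nodes, clustering factor 0.8):

```python
import math

T_DISK_SSD_MS = 0.1  # assumed per-page read latency on SSD


def model_point_lookup_ms(n_rows=10_000_000, fanout=500, cached_levels=2,
                          t_disk=T_DISK_SSD_MS):
    """Model 1: physical I/Os are the uncached levels of the tree."""
    height = math.ceil(math.log(n_rows, fanout))       # 3 for these inputs
    return max(0, height - cached_levels) * t_disk


def model_range_scan_ms(result_rows=10_000, rows_per_leaf=100,
                        descent_ios=1, t_disk=T_DISK_SSD_MS):
    """Model 2: one descent I/O (internal nodes cached) plus the result's leaf pages."""
    leaf_ios = math.ceil(result_rows / rows_per_leaf)  # 100 for these inputs
    return (descent_ios + leaf_ios) * t_disk


def model_noncovering_ios(result_rows=10_000, leaf_ios=100, descent_ios=1,
                          clustering_factor=0.8, bookmark_cost=2):
    """Model 3: add bookmark lookups, discounted by how clustered the rows are."""
    bookmark_ios = result_rows * (1 - clustering_factor) * bookmark_cost
    return descent_ios + leaf_ios + bookmark_ios
```

Running these reproduces the results above: ~0.1ms per point lookup, ~10.1ms for the covering range scan, and ~4,101 I/Os for the non-covering scan.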
Clustering factor measures correlation between index key order and table row physical order:
• 1.0 = Perfect clustering (index order = table order)
• 0.0 = No correlation (each index entry requires a separate table page)
For non-clustered indexes, poor clustering dramatically increases bookmark lookup cost. This is why clustered index choice is so important—it determines the physical row order that affects all other indexes.
Performance metrics transform index optimization from guesswork into science. Armed with these measurements and formulas, you can make data-driven decisions about index design, identify problems before users notice them, and justify infrastructure investments.
Module Complete!
With this page, you've completed Module 6: Index Terminology. You now have a comprehensive vocabulary and conceptual framework for discussing, analyzing, and optimizing database indexes. From fanout to height, from leaf nodes to internal nodes, and from qualitative understanding to quantitative metrics—you're equipped to work with indexes at a professional level.
Next Steps:
The next chapter explores B-Trees & B+-Trees in depth, diving into the algorithms for search, insert, delete, and the structural properties that make these trees so effective for database indexing.
Congratulations! You've mastered the essential terminology and metrics for database indexing. You understand fanout's impact on tree height, how leaf and internal nodes differ structurally and functionally, and the quantitative metrics that measure index performance. This foundation prepares you for advanced topics in B-tree implementations and index optimization strategies.