If indexing were simply about making queries fast, every database would have an index on every column. The reason they don't reveals the fundamental truth of database engineering: every index is a trade-off.
Indexes accelerate reads but slow writes. They consume storage but save CPU. They benefit some queries while being useless for others. The art of indexing lies not in knowing how to create indexes, but in knowing when to create them, which columns to include, and when the costs outweigh the benefits.
This page synthesizes everything we've learned about indexes into a comprehensive framework for making indexing decisions in production systems. You'll learn to reason systematically about trade-offs, avoid common pitfalls, and design indexing strategies that serve your application's unique needs.
By the end of this page, you'll understand the read/write trade-off quantitatively, know how to evaluate storage costs, master index maintenance strategies, design for different workload profiles, and develop a systematic approach to indexing decisions that balances performance, cost, and complexity.
The core trade-off in indexing is simple: indexes make reads faster but writes slower. Every index on a table creates additional work during INSERT, UPDATE, and DELETE operations.
Write Amplification Explained:
When you insert a row into a table with 5 indexes:
- 1 write to the table itself (the heap)
- 5 writes, one to each index
Total: 6 write operations for 1 logical insert. This is write amplification.
For updates, it's even worse if the updated column is indexed:
| # of Indexes | INSERT Cost | UPDATE Cost* | DELETE Cost |
|---|---|---|---|
| 0 | 1 write | 1 write | 1 write |
| 1 | 2 writes | 1-3 writes | 2 writes |
| 3 | 4 writes | 1-7 writes | 4 writes |
| 5 | 6 writes | 1-11 writes | 6 writes |
| 10 | 11 writes | 1-21 writes | 11 writes |
*UPDATE cost depends on which columns are modified. Updating an unindexed column may only affect the heap.
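The table's arithmetic can be sketched directly. This is a simplified model, not any database's actual cost function: the worst-case UPDATE assumes every indexed column changed, forcing a delete plus an insert in each index.

```python
def insert_cost(num_indexes: int) -> int:
    """One heap write plus one entry write per index."""
    return 1 + num_indexes

def update_cost(indexed_cols_changed: int) -> int:
    """One heap write; each index on a changed column needs a delete + insert."""
    return 1 + 2 * indexed_cols_changed

def delete_cost(num_indexes: int) -> int:
    """One heap delete plus one entry removal per index."""
    return 1 + num_indexes

# Reproduce the table; the UPDATE range runs from "no indexed columns
# touched" to "every indexed column touched"
for n in (0, 1, 3, 5, 10):
    print(f"{n} indexes: INSERT={insert_cost(n)}, "
          f"UPDATE={update_cost(0)}-{update_cost(n)}, DELETE={delete_cost(n)}")
```

For 5 indexes this yields INSERT=6, UPDATE=1-11, DELETE=6, matching the row above.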
Quantifying the Read Benefit:
To justify an index, the read performance benefit must outweigh the write performance cost:
Benefit = (queries_per_second) × (time_saved_per_query)
Cost = (writes_per_second) × (time_added_per_write)
Index is worthwhile when: Benefit > Cost
Example Calculation:
Query without index: 100ms (full table scan)
Query with index: 1ms (index lookup)
Time saved per query: 99ms
Query frequency: 1000 queries/second
Read benefit: 99 seconds of CPU time saved per second
Write without index: 1ms
Write with index: 2ms
Time added per write: 1ms
Write frequency: 100 writes/second
Write cost: 0.1 seconds of CPU time added per second
Net benefit: 98.9 seconds saved per second → Index is clearly worthwhile
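The worked example above can be reproduced in a few lines; all figures are the ones from the text.

```python
# Figures from the worked example above
query_saving_s = 0.100 - 0.001   # 100ms full scan -> 1ms index lookup
queries_per_s = 1000
write_penalty_s = 0.002 - 0.001  # 1ms write -> 2ms write
writes_per_s = 100

benefit = queries_per_s * query_saving_s   # CPU-seconds saved per second
cost = writes_per_s * write_penalty_s      # CPU-seconds added per second
net = benefit - cost

print(f"benefit={benefit:.1f}s, cost={cost:.1f}s, net={net:.1f}s per second")
```

With a net saving of 98.9 CPU-seconds per second, the index is clearly worthwhile; the same arithmetic with your own query and write rates gives a first-pass answer for any candidate index.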
As a rough guideline: If your read/write ratio is 10:1 or higher, indexes almost always help. At 1:1 ratio, carefully evaluate each index. Below 1:1 (write-heavy), be very selective about indexes—each one significantly impacts throughput.
Indexes consume storage proportional to the data they index. Understanding storage costs helps you make informed decisions about index design.
Index Size Calculation:
For a B-tree index, approximate size:
index_size ≈ num_rows × (key_size + pointer_size) × overhead_factor
Where:
- num_rows — number of rows in the table
- key_size — bytes per indexed key value (e.g. 4 for INT, 8 for BIGINT, 16 for UUID)
- pointer_size — bytes for the pointer back to the heap row (typically 6-8)
- overhead_factor — page and structure overhead, commonly around 1.3-1.5
Example sizes for a 100-million-row table:
| Index Type | Key Size | Size at 100M Rows | Notes |
|---|---|---|---|
| Single (INT) | 4 bytes | ~1.5 GB | Most compact |
| Single (BIGINT) | 8 bytes | ~2 GB | Common for IDs |
| Single (UUID) | 16 bytes | ~3 GB | Random distribution |
| Single (VARCHAR 100) | up to 100 bytes | ~10-15 GB | Variable length |
| Composite (INT, INT) | 8 bytes | ~2 GB | Efficient |
| Composite (INT, TIMESTAMP) | 12 bytes | ~2.5 GB | Common pattern |
| Covering (5 columns) | varies | ~5-20 GB | Duplicates table data |
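The table's estimates follow from the formula. A minimal sketch, assuming a 6-byte row pointer and a 1.4 overhead factor (both rough constants chosen to match the table, not values from any specific database):

```python
def btree_size_bytes(num_rows: int, key_size: int,
                     pointer_size: int = 6, overhead: float = 1.4) -> float:
    """Rough B-tree size: rows x (key + pointer) x overhead."""
    return num_rows * (key_size + pointer_size) * overhead

rows = 100_000_000
for label, key in [("INT", 4), ("BIGINT", 8), ("UUID", 16), ("(INT, INT)", 8)]:
    gb = btree_size_bytes(rows, key) / 1e9
    print(f"{label:>10}: ~{gb:.1f} GB")
```

This reproduces the ballpark figures above (~1.4 GB for INT, ~2 GB for BIGINT, ~3.1 GB for UUID); actual sizes depend on fill factor, page size, and bloat.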
Memory Pressure:
Indexes don't just consume disk space—they compete for memory in the buffer pool:
The Working Set Problem:
Your database's working set is the data actively accessed by queries. Ideally, the entire working set fits in memory:
working_set = hot_table_data + hot_index_data + temp_space
buffer_pool_size >= working_set → mostly memory operations
buffer_pool_size < working_set → disk I/O, slower queries
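The working-set check above reduces to a simple comparison; the sizes here are illustrative, not measurements.

```python
def fits_in_memory(hot_table_gb: float, hot_index_gb: float,
                   temp_gb: float, buffer_pool_gb: float) -> bool:
    """True when the working set fits the buffer pool (mostly memory operations)."""
    working_set = hot_table_gb + hot_index_gb + temp_gb
    return buffer_pool_gb >= working_set

# Illustrative: a 16 GB buffer pool before and after adding a 5 GB index
print(fits_in_memory(8, 5, 1, 16))      # working set 14 GB fits
print(fits_in_memory(8, 5 + 5, 1, 16))  # new index pushes it to 19 GB
```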
Practical Impact:
Adding a 5 GB index to a database with 16 GB buffer pool might push other hot data out of memory, degrading overall performance even for queries that don't use the new index.
Monitoring Storage Impact:
-- PostgreSQL: Index sizes for a table
SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Relation (table + indexes) total size
SELECT
    relname,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS indexes_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- PostgreSQL: Buffer pool usage by relation
SELECT
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size_in_buffer
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY count(*) DESC
LIMIT 20;

Indexes can bloat over time due to fragmentation and dead tuples. A heavily updated table might have indexes 2-3x larger than necessary. Regular REINDEX or VACUUM operations compact indexes back to their optimal size. Monitor index bloat as part of database maintenance.
Beyond read/write trade-offs, indexes create ongoing maintenance overhead that affects database operations.
Transaction Logging:
In databases with write-ahead logging (WAL), every index modification generates log entries:
Vacuum and Maintenance Operations:
Indexes must be maintained during vacuum/analyze operations:
Lock Contention:
Index operations can create lock contention:
Concurrent Index Operations:
Modern databases support concurrent index operations to minimize disruption:
-- PostgreSQL: Create index without locking writes
CREATE INDEX CONCURRENTLY idx_orders_user
ON orders(user_id);
-- Takes longer but doesn't block other operations
-- PostgreSQL: Reindex without heavy locking
REINDEX INDEX CONCURRENTLY idx_orders_user;
However, concurrent operations:
Schedule index maintenance during low-traffic periods. For critical systems, use concurrent operations and monitor progress. Consider creating new indexes concurrently, then dropping old redundant ones, rather than modifying indexes in place.
Different application workloads demand different indexing approaches. Understanding your workload profile guides index design.
| Workload | Read/Write Ratio | Typical Indexes | Strategy |
|---|---|---|---|
| OLTP (transactions) | 10:1 to 100:1 | 3-5 per table | Index hot query paths, avoid over-indexing |
| OLAP (analytics) | 1000:1+ | Many, including wide covering | Comprehensive indexes, prioritize query speed |
| Write-heavy (logging) | 1:10 to 1:100 | 0-2 per table | Minimal indexes, bulk-load patterns |
| Hybrid (mixed) | varies | Selective | Partition by workload if possible |
| Time-series | Write-heavy, read varies | Time-based partitioning | Partition indexes, retention policies |
OLTP (Transactional) Systems:
Typical characteristics:
Index strategy:
OLAP (Analytical) Systems:
Typical characteristics:
Index strategy:
Use read replicas to experiment with different indexing strategies. Add indexes to a replica, benchmark query performance, then promote the strategy to production if it helps. This avoids impacting production writes during experimentation.
Partial indexes (PostgreSQL) or filtered indexes (SQL Server) index only a subset of rows matching a specified condition. They offer significant trade-off advantages for certain query patterns.
Why Partial Indexes?
Consider orders table with 100 million rows:
Queries almost always filter for active orders. A full index on status wastes 98% of its space on rows that are rarely queried.
Partial Index Solution:
-- Full index (indexes all 100M rows)
CREATE INDEX idx_orders_status ON orders(status);
-- Size: ~2 GB

-- Partial index (indexes only active orders)
CREATE INDEX idx_orders_active ON orders(status)
WHERE status IN ('pending', 'processing', 'shipped');
-- Size: ~40 MB (98% smaller!)

-- Even more focused: just one status
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Tiny index, only pending orders, sorted by creation time

-- Query that uses partial index
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;
-- Uses idx_orders_pending perfectly

-- Soft-delete pattern
CREATE INDEX idx_active_users ON users(email)
WHERE deleted_at IS NULL;
-- Index only non-deleted users (much smaller than full index)

-- Unique constraint on subset
CREATE UNIQUE INDEX idx_unique_active_email ON users(email)
WHERE deleted_at IS NULL;
-- Allows re-registration of deleted emails while ensuring
-- only one active user per email

Benefits of Partial Indexes:
Limitations:
Common Partial Index Patterns:
Partial indexes are one of the most powerful yet underused indexing features. If your queries consistently filter to a small subset of data, partial indexes can provide 10-100x storage savings while maintaining identical query performance. Always consider them for status-based, soft-delete, or time-windowed query patterns.
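The storage arithmetic behind the 98% figure is easy to check. A sketch assuming ~2% of rows are active and a rough 20 bytes per index entry (both assumed constants consistent with the example's ~2 GB / ~40 MB sizes):

```python
rows_total = 100_000_000
active_fraction = 0.02   # assumed: queries target ~2% of rows
entry_bytes = 20         # assumed: rough average bytes per index entry

full_index_gb = rows_total * entry_bytes / 1e9
partial_index_mb = rows_total * active_fraction * entry_bytes / 1e6
savings = 1 - active_fraction

print(f"full: ~{full_index_gb:.0f} GB, partial: ~{partial_index_mb:.0f} MB")
print(f"savings: {savings:.0%}")
```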
Recognizing common indexing mistakes helps you avoid them in your own systems.
The 'More Indexes is Better' Fallacy:
A common belief is that adding indexes always improves performance. Reality:
The Pareto Principle of Indexes:
80% of query patterns can be served by 20% of possible indexes. Identify your critical query paths and index those. Accept that rare queries may be slower—the trade-off is usually worth it.
Watch for: Write latency increasing over time. INSERT throughput declining. Vacuum operations taking hours. Index size exceeding table size by 3x+. Queries using unexpected indexes (optimizer confusion). These all suggest over-indexing.
Effective indexing requires ongoing monitoring to identify both missing indexes and unused ones.
-- PostgreSQL: Find unused indexes
SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS size,
    idx_scan AS scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT i.indisunique  -- Exclude primary keys and unique constraints
  AND idx_scan < 50      -- Rarely used
  AND pg_relation_size(i.indexrelid) > 5000000  -- > 5MB
ORDER BY pg_relation_size(i.indexrelid) DESC;

-- PostgreSQL: Tables missing primary key indexes
SELECT relname
FROM pg_class
WHERE relkind = 'r'
  AND relname NOT IN (
    SELECT indrelid::regclass::text
    FROM pg_index
    WHERE indisprimary
  );

-- PostgreSQL: Find slow queries potentially needing indexes
-- (Requires pg_stat_statements extension)
SELECT
    query,
    calls,
    mean_time::numeric(10,2) AS avg_ms,
    total_time::numeric(10,2) AS total_ms
FROM pg_stat_statements
WHERE mean_time > 100  -- Queries averaging over 100ms
ORDER BY total_time DESC
LIMIT 20;

-- PostgreSQL: Identify sequential scans on large tables
SELECT
    relname AS table,
    seq_scan,
    seq_tup_read,
    idx_scan,
    pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > 100                      -- Many sequential scans
  AND pg_relation_size(relid) > 10000000  -- > 10MB
  AND idx_scan < seq_scan * 0.1           -- Very low index usage
ORDER BY seq_tup_read DESC;

Key Metrics to Monitor:
| Metric | What It Indicates | Action |
|---|---|---|
| idx_scan = 0 | Index never used | Consider dropping |
| seq_scan high on large tables | Missing index | Analyze query patterns |
| idx_tup_read >> idx_tup_fetch | Low selectivity | Index may not help |
| Index size >> expected | Bloat or fragmentation | REINDEX |
| Buffer cache hit ratio low | Not enough memory | Increase buffer pool or reduce indexes |
EXPLAIN ANALYZE for Index Investigation:
When optimizing specific queries, always use EXPLAIN ANALYZE to understand execution:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE user_id = 123 AND created_at > '2024-01-01';
Look for:
Schedule quarterly index reviews: Identify and drop unused indexes. Analyze slow query logs for indexing opportunities. Check index bloat and reindex if needed. Verify index sizes are proportional to their value. Document why each index exists.
When faced with an indexing decision, follow this systematic framework to make informed choices.
The Decision Matrix:
| Query Frequency | Write Frequency | Recommendation |
|---|---|---|
| High (100+/sec) | Low (< 10/sec) | ✅ Index aggressively, consider covering indexes |
| High (100+/sec) | High (100+/sec) | ⚠️ Index carefully, monitor write latency |
| Medium (10-100/sec) | Low (< 10/sec) | ✅ Index if query is slow without it |
| Medium (10-100/sec) | High (100+/sec) | ⚠️ Index only critical paths |
| Low (< 10/sec) | Low (< 10/sec) | ⚠️ Index if query is unacceptably slow |
| Low (< 10/sec) | High (100+/sec) | ❌ Usually don't index, accept slower reads |
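The decision matrix can be encoded as a small helper for triaging index candidates; the rate thresholds and recommendations are the table's own, with a fallback for the mixed cells the table leaves out.

```python
def index_recommendation(reads_per_s: float, writes_per_s: float) -> str:
    """Map query and write frequency to the decision matrix's recommendation."""
    def bucket(rate: float) -> str:
        if rate >= 100:
            return "high"
        if rate >= 10:
            return "medium"
        return "low"

    matrix = {
        ("high", "low"):    "index aggressively, consider covering indexes",
        ("high", "high"):   "index carefully, monitor write latency",
        ("medium", "low"):  "index if query is slow without it",
        ("medium", "high"): "index only critical paths",
        ("low", "low"):     "index if query is unacceptably slow",
        ("low", "high"):    "usually don't index, accept slower reads",
    }
    key = (bucket(reads_per_s), bucket(writes_per_s))
    return matrix.get(key, "evaluate case by case")

print(index_recommendation(500, 5))   # hot read path, cold writes
print(index_recommendation(5, 200))   # write-heavy, rarely read
```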
Production Deployment Best Practices:
Expert database engineers don't add indexes reflexively. They analyze query patterns, quantify trade-offs, consider alternatives, validate with data, and monitor continuously. This systematic approach, not intuition, is what separates great database performance from the mediocre.
We've explored the full spectrum of index trade-offs. Let's consolidate the key principles:
Module Complete:
You've now completed a comprehensive exploration of database indexes and query performance:
With this knowledge, you can design indexing strategies that optimize query performance while managing costs and complexity—the mark of a true database expert.
You've mastered the art and science of database indexing. From B-tree internals to composite index column ordering to trade-off analysis, you possess the comprehensive knowledge needed to design high-performance database systems. Apply these principles systematically, monitor continuously, and your databases will serve applications reliably at any scale.