Indexes are among the most powerful tools in a database administrator's arsenal—yet they are also among the most misused. The ability to reduce a query from seconds to milliseconds is seductive, leading many to believe that more indexes are always better. This assumption is not just wrong; it can be catastrophically wrong.
The core paradox of indexing: Indexes accelerate reads but decelerate writes. Every index you create is a trade-off, a balance between the benefit of faster lookups and the cost of maintaining additional data structures. The art of indexing lies not in creating indexes, but in knowing when to create them—and more importantly, when not to.
By the end of this page, you will understand the fundamental decision framework for index creation, the key factors that influence indexing decisions, common anti-patterns to avoid, and the structured methodology used by database architects to make strategic indexing choices.
Before creating any index, you must answer one fundamental question: Will this index provide more benefit than cost?
This question seems simple, but answering it correctly requires understanding the complex interplay between read patterns, write patterns, data distribution, query types, and system constraints. Let's establish the foundational principles that govern this decision.
The most common mistake is to create an index for every column referenced in a WHERE clause. This naive approach ignores selectivity, query frequency, write patterns, and system constraints. It leads to over-indexed databases that are slow to write, expensive to maintain, and often no faster to read.
Professional database architects use a structured decision framework to evaluate potential indexes. This framework considers multiple factors and applies them systematically to determine whether an index should be created.
The Index Decision Matrix:
Every proposed index should be evaluated against several key criteria. The matrix below provides a structured approach to this evaluation:
| Factor | Favorable for Indexing (✓) | Unfavorable for Indexing (✗) |
|---|---|---|
| Query Frequency | Query executes thousands of times per hour | Query runs once per day in batch job |
| Base Table Size | Table contains millions of rows | Table has fewer than 1,000 rows |
| Selectivity | Predicate returns < 5% of rows | Predicate returns > 30% of rows |
| Write Frequency | Table is updated rarely (< 10% write ratio) | Table has high write frequency (> 50% write ratio) |
| Column Cardinality | Column has high cardinality (many distinct values) | Column has low cardinality (few distinct values) |
| Query Performance Gap | Current query takes seconds; target is milliseconds | Query already meets SLA requirements |
| Available Storage | Disk space is abundant | Storage is constrained and expensive |
| Memory Availability | Buffer pool can accommodate index | Memory is already over-committed |
Applying the Framework:
For each proposed index, score it against each factor. A strong candidate should have favorable conditions for most factors. However, some factors carry more weight than others: selectivity and write frequency are typically decisive, while storage and memory act more as constraints than as drivers.
Think of this framework as your first-pass filter. Indexes that fail multiple criteria are unlikely to provide net benefit.
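As a rough illustration of this first-pass filter, the matrix can be turned into a simple favorable-factor count. This is a hypothetical sketch, not part of any database engine: the thresholds mirror the table above, the equal weighting of factors and the 10% cardinality cutoff are simplifying assumptions.

```python
# Hypothetical first-pass scorer for an index candidate, mirroring the
# decision matrix above. Equal weighting is a simplification; in practice
# selectivity and write frequency should dominate.

def score_index_candidate(queries_per_hour, table_rows, selectivity,
                          write_ratio, distinct_values):
    """Return (favorable_count, total_factors) for a proposed index."""
    checks = [
        queries_per_hour >= 1000,           # Query frequency: thousands per hour
        table_rows >= 1_000_000,            # Base table size: millions of rows
        selectivity < 0.05,                 # Predicate returns < 5% of rows
        write_ratio < 0.10,                 # Table is rarely updated
        distinct_values / table_rows > 0.1  # High cardinality (assumed cutoff)
    ]
    return sum(checks), len(checks)

# A frequently queried, highly selective column on a large, read-mostly table:
favorable, total = score_index_candidate(
    queries_per_hour=12_500, table_rows=5_000_000,
    selectivity=0.001, write_ratio=0.05, distinct_values=4_800_000)
print(f"{favorable}/{total} factors favorable")  # 5/5 factors favorable
```

A candidate scoring well here still deserves the ROI analysis later on this page; a candidate failing multiple checks can usually be rejected without further work.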
In most systems, 80% of database activity comes from 20% of queries. Focus your indexing efforts on this critical 20%. A perfectly indexed rare query is worth less than a moderately indexed frequent query.
Selectivity measures what fraction of rows a predicate returns. It is calculated as:
$$\text{Selectivity} = \frac{\text{Number of rows matching predicate}}{\text{Total number of rows}}$$
Selectivity determines whether an index can provide meaningful benefit. Understanding this concept is essential for index decision-making.
| Selectivity Range | Interpretation | Index Recommendation |
|---|---|---|
| < 1% | Highly selective | Strong candidate — Index almost always beneficial |
| 1-5% | Selective | Good candidate — Index likely beneficial for most workloads |
| 5-15% | Moderately selective | Evaluate carefully — Benefits depend on other factors |
| 15-30% | Low selectivity | Unlikely beneficial — Full scan often more efficient |
| > 30% | Non-selective | Avoid indexing — Index adds overhead without benefit |
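As a minimal sketch, the selectivity formula and the recommendation bands in the table above translate directly into code (the function names here are illustrative, not from any database API):

```python
def selectivity(matching_rows, total_rows):
    """Selectivity = rows matching the predicate / total rows."""
    return matching_rows / total_rows

def recommendation(s):
    """Map a selectivity value to the guidance bands in the table above."""
    if s < 0.01:
        return "Strong candidate"
    elif s < 0.05:
        return "Good candidate"
    elif s < 0.15:
        return "Evaluate carefully"
    elif s < 0.30:
        return "Unlikely beneficial"
    return "Avoid indexing"

print(recommendation(selectivity(1_000, 1_000_000)))    # Strong candidate
print(recommendation(selectivity(500_000, 1_000_000)))  # Avoid indexing
```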
Why Selectivity Matters:
Consider a table with 1,000,000 rows. An index lookup typically involves traversing the B+-tree to locate the first matching key, then fetching each matching row from the table—roughly one random I/O per row.
If your predicate returns 50% of rows (500,000 rows), the index lookup could require 500,000+ I/O operations, while a sequential table scan would require only the number of pages containing the table (perhaps 10,000 operations for a typical table). The index is dramatically worse.
Conversely, if your predicate returns 0.1% of rows (1,000 rows), the index lookup requires roughly 1,000 operations plus B+-tree traversal, while the full scan still requires 10,000 operations. The index wins decisively.
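The two scenarios above can be captured in a back-of-the-envelope cost model. The constants (10,000 pages, a B+-tree height of 3) are the illustrative assumptions from the worked example, not measured values:

```python
# Rough I/O comparison: an index lookup costs ~1 random I/O per matching
# row plus a short B+-tree traversal; a sequential scan reads every table
# page exactly once.

TABLE_PAGES = 10_000    # table from the example: 1M rows, ~100 rows/page
TREE_TRAVERSAL = 3      # assumed B+-tree height

def index_cost(matching_rows):
    return TREE_TRAVERSAL + matching_rows  # one row fetch per match

def scan_cost():
    return TABLE_PAGES

print(index_cost(500_000) > scan_cost())  # True: at 50% selectivity the scan wins
print(index_cost(1_000) < scan_cost())    # True: at 0.1% selectivity the index wins
```

Real optimizers use far richer cost models (caching, clustering, prefetch), but the crossover behavior is exactly what this sketch shows.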
Selective predicates (strong index candidates):

- `WHERE user_id = 12345` — Unique or near-unique lookups
- `WHERE order_date = '2024-01-15'` — Specific date in a large range
- `WHERE email = 'user@example.com'` — Unique constraint column
- `WHERE transaction_id = 'TXN-789012'` — Primary key lookup
- `WHERE ssn = '123-45-6789'` — Natural unique identifier

Non-selective predicates (poor index candidates):

- `WHERE status = 'active'` — 90% of users are active
- `WHERE gender = 'M'` — Binary/low-cardinality column
- `WHERE is_deleted = 0` — Boolean flag
- `WHERE country = 'USA'` — Dominant value in a skewed distribution
- `WHERE year >= 2020` — Range covering most of the data

The same column can have different selectivity depending on the query. An index on `country` might be excellent for `WHERE country = 'Vatican City'` (highly selective) but useless for `WHERE country = 'USA'` (non-selective). The query optimizer uses statistics to make these determinations at runtime.
Write amplification refers to the phenomenon where a single logical write operation in your application results in multiple physical write operations in the database. Indexes are a primary source of write amplification.
Consider what happens when you execute a simple INSERT:
```sql
INSERT INTO orders (order_id, customer_id, order_date, status, total)
VALUES (12345, 789, '2024-01-15', 'pending', 299.99);
```
If this table has 8 indexes, the database must write the new row to the table and then insert a corresponding entry into each of the 8 index structures—updating B+-tree pages, splitting nodes when they fill, and logging every change.
The Amplification Factor:
A single logical INSERT becomes 9+ physical operations. For UPDATE operations affecting indexed columns, the cost is even higher—the old key must be removed and the new key inserted, effectively doubling the index maintenance work.
Quantifying the Impact:
Let's calculate the write overhead for a realistic scenario:
| Scenario | Base Write Cost | With 5 Indexes | With 10 Indexes | With 20 Indexes |
|---|---|---|---|---|
| INSERT (1 row) | 1 operation | 6 operations (6×) | 11 operations (11×) | 21 operations (21×) |
| UPDATE (1 indexed column) | 1 operation | 3 operations (3×) | 3 operations (3×) | 3 operations (3×) |
| UPDATE (all indexed columns) | 1 operation | 11 operations (11×) | 21 operations (21×) | 41 operations (41×) |
| DELETE (1 row) | 1 operation | 6 operations (6×) | 11 operations (11×) | 21 operations (21×) |
| Bulk INSERT (10,000 rows) | 10K operations | 60K operations | 110K operations | 210K operations |
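The amplification table above follows from a simple counting rule, sketched here under the same assumptions: each secondary index costs one entry write per INSERT or DELETE, and an UPDATE of an indexed column costs two index writes (remove the old key, insert the new one).

```python
# Write-amplification counting, reproducing the table above.

def insert_ops(num_indexes, rows=1):
    """One table write plus one index-entry write per index, per row."""
    return rows * (1 + num_indexes)

def update_ops(num_indexes_touched):
    """One table write plus delete-old-key + insert-new-key per touched index."""
    return 1 + 2 * num_indexes_touched

def delete_ops(num_indexes):
    """One table write plus one index-entry removal per index."""
    return 1 + num_indexes

print(insert_ops(5))                # 6   (INSERT with 5 indexes)
print(update_ops(1))                # 3   (UPDATE touching 1 indexed column)
print(update_ops(10))               # 21  (UPDATE touching all 10 indexed columns)
print(insert_ops(20, rows=10_000))  # 210000 (bulk INSERT with 20 indexes)
```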
A production system processing 10,000 writes per second with 15 indexes is actually performing 150,000+ index operations per second. This overhead directly impacts transaction latency, lock contention, and system throughput. Over-indexing is a common cause of unexplained write performance degradation.
The read/write ratio of your workload is a critical factor in indexing decisions. This ratio determines whether the read acceleration from an index outweighs its write penalties.
Analyzing Your Workload:
For any given table, you should know its read/write ratio, which queries dominate the workload, and what latency each must meet:
| Workload Type | Read/Write Ratio | Index Strategy |
|---|---|---|
| Read-Heavy (OLAP) | 95% reads | Index liberally. Read optimization is paramount. Write overhead is acceptable. |
| Read-Dominant | 80-95% reads | Index strategically. Focus on high-impact queries. Monitor write latency. |
| Balanced | 50-80% reads | Index conservatively. Justify each index with measurable read benefit. |
| Write-Dominant | 20-50% reads | Index minimally. Only index for critical queries. Consider materialized views. |
| Write-Heavy (OLTP) | < 20% reads | Index sparingly. Prefer covering indexes over wide indexes. Minimize index count. |
Calculating Index ROI:
For a proposed index, calculate the return on investment:
$$\text{Index ROI} = \frac{\text{Read Benefit} \times \text{Read Frequency}}{\text{Write Cost} \times \text{Write Frequency}}$$
Where:
- **Read Benefit** — latency saved per query by the index (e.g., in milliseconds)
- **Read Frequency** — how often the query runs per period
- **Write Cost** — latency added per write by index maintenance
- **Write Frequency** — writes to the table per period
An index with ROI > 1 provides net benefit; ROI < 1 indicates the index costs more than it saves.
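The formula is trivial to compute; here it is with the illustrative numbers from the worked example that follows (450 ms → 15 ms per query, 12,500 reads/hour, 0.3 ms of write overhead, 850 writes/hour):

```python
# Index ROI = (Read Benefit x Read Frequency) / (Write Cost x Write Frequency)

def index_roi(read_benefit_ms, reads_per_hour, write_cost_ms, writes_per_hour):
    return (read_benefit_ms * reads_per_hour) / (write_cost_ms * writes_per_hour)

roi = index_roi(read_benefit_ms=450 - 15,  # 435 ms saved per query
                reads_per_hour=12_500,
                write_cost_ms=0.3,
                writes_per_hour=850)
print(round(roi))  # 21324 -> far above 1, a strong candidate
```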
```sql
-- Step 1: Analyze current query performance
-- Average response time without index: 450ms

-- Step 2: Estimate performance with index (from test environment)
-- Average response time with index: 15ms
-- Read Benefit = 450ms - 15ms = 435ms

-- Step 3: Count read frequency
SELECT COUNT(*) AS query_count
FROM query_log
WHERE query_text LIKE '%customer_id = %'
  AND timestamp > NOW() - INTERVAL '1 hour';
-- Result: 12,500 queries per hour

-- Step 4: Estimate write overhead
-- Benchmark shows index adds 0.3ms per INSERT
-- Write Cost = 0.3ms

-- Step 5: Count write frequency
SELECT COUNT(*) AS write_count
FROM audit_log
WHERE table_name = 'customers'
  AND operation IN ('INSERT', 'UPDATE', 'DELETE')
  AND timestamp > NOW() - INTERVAL '1 hour';
-- Result: 850 writes per hour

-- Step 6: Calculate ROI
-- Read Benefit × Read Frequency = 435ms × 12,500 = 5,437,500ms saved
-- Write Cost × Write Frequency = 0.3ms × 850 = 255ms added
-- ROI = 5,437,500 / 255 = 21,324

-- Interpretation: This index returns 21,000× its cost!
-- Strong candidate for creation.
```

The ROI calculation above uses estimates. Before creating indexes in production, always validate these assumptions in a staging environment with production-like data. Selectivity, caching behavior, and lock contention can significantly affect actual performance.
While the general framework and selectivity guidelines cover most scenarios, there are important special cases where standard indexing rules require modification.
Foreign key columns should almost always be indexed. Even if the column has low selectivity for SELECT queries, the index is critical for maintaining referential integrity.
Why:
When you UPDATE or DELETE a row in the parent table, the database must verify no child rows reference it. Without an index on the foreign key column, this check requires a full table scan of the child table—for every parent modification.
Example:
Consider orders.customer_id → customers.id. If you delete a customer, the database must verify no orders reference that customer. Without an index on orders.customer_id, that check is a full scan of the orders table—repeated for every parent modification. This can make parent table modifications excruciatingly slow or even cause timeouts.
Unlike some databases, PostgreSQL does NOT automatically create indexes on foreign key columns. You must create them explicitly. This is a common oversight that causes mysterious performance problems in production.
Learning what not to do is as important as learning what to do. These anti-patterns appear frequently in production databases and cause predictable problems.
- **Redundant indexes** — Creating both `INDEX(a, b)` and `INDEX(a)`. The composite index already supports prefix queries; the standalone index wastes space and maintenance. Audit regularly for redundant indexes.
- **Low-cardinality indexes** — Indexing columns like `is_active`, `gender`, or `status` with only 2-5 distinct values. These indexes are almost never used because selectivity is too low. Exception: partial indexes for rare values.
- **Unusable composite indexes** — Creating `INDEX(a, b, c, d, e)` when you only query on columns b, c, or d individually. The index only helps queries that start with column 'a'. Understand composite index usage rules.
- **Function-wrapped predicates** — Creating `INDEX(column)` to help with `WHERE UPPER(column) = 'VALUE'`. Functions on the column prevent index usage. Create expression indexes when needed.

Anti-patterns compound over time. A system with 50 poorly chosen indexes requires massive maintenance overhead, fragments storage, competes for buffer pool space, and makes every write operation slower. The technical debt accumulates invisibly until the system becomes untenable.
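The first anti-pattern, redundant prefix indexes, is mechanical enough to detect automatically. This is an illustrative sketch (the function and index names are hypothetical), using the rule that an index is redundant when its column list is a leading prefix of another index's columns:

```python
# Detect redundant prefix indexes: INDEX(a) is redundant next to INDEX(a, b)
# because the composite index already serves queries on its leading column.

def redundant_indexes(indexes):
    """indexes: dict of index name -> tuple of columns. Returns redundant names."""
    redundant = set()
    for name_a, cols_a in indexes.items():
        for name_b, cols_b in indexes.items():
            if (name_a != name_b
                    and len(cols_a) < len(cols_b)
                    and cols_b[:len(cols_a)] == cols_a):
                redundant.add(name_a)
    return redundant

print(redundant_indexes({
    "idx_a":   ("a",),       # prefix of idx_a_b -> redundant
    "idx_a_b": ("a", "b"),
    "idx_b":   ("b",),       # not a leading prefix of anything -> kept
}))  # {'idx_a'}
```

Note the asymmetry: `idx_b` is not redundant even though column `b` appears in `idx_a_b`, because an index only helps queries that start with its leading column.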
Before creating any index, walk through this checklist. A 'No' answer to any critical question should give you pause:

- Does the target query run frequently enough to matter?
- Is the predicate selective (ideally returning less than 5% of rows)?
- Is the table large enough that a full scan is genuinely expensive?
- Can the workload absorb the extra write cost of maintaining the index?
- Is there no existing index (or index prefix) that already covers the query?
- Have you validated the expected gain in a staging environment with production-like data?
You now understand the fundamental decision framework for index creation. You know when to index, when not to, and how to evaluate the trade-offs. Next, we'll explore the costs of index maintenance in detail—understanding not just that indexes have overhead, but precisely what that overhead entails.