Loading content...
If storage structures form the foundation of physical design, indexes are the superstructure—the mechanisms that transform slow scans into lightning-fast lookups. A well-indexed database can answer queries in milliseconds that would otherwise require minutes. Conversely, poor indexing leads to either unnecessary overhead or query performance disasters.
Index selection is both art and science. The science involves understanding index data structures, their complexity characteristics, and how query optimizers utilize them. The art involves anticipating workload patterns, balancing read and write costs, and making strategic trade-offs when perfect indexing is impossible.
The fundamental promise: An index is an additional data structure that maintains sorted/hashed mappings from attribute values to record locations, enabling the database to locate data without scanning entire tables.
This page covers the major index types (B-tree, hash, bitmap, specialized indexes), their internal structures, when to use each, and the systematic methodology for index selection. You'll learn to analyze workloads and design index strategies that optimize the queries that matter most.
Consider a table with 10 million customer records requiring 100,000 disk blocks. Without an index, finding a customer by name requires:
With a B-tree index on customer_name:
This represents a 10,000x improvement—the difference between an unusable system and an excellent one.
| Table Size | Full Scan Time | B-tree Index Time | Speedup |
|---|---|---|---|
| 10,000 rows | ~100 ms | ~5 ms | 20x |
| 1 million rows | ~10 sec | ~8 ms | 1,250x |
| 100 million rows | ~15 min | ~12 ms | 75,000x |
| 1 billion rows | ~2.5 hrs | ~15 ms | 600,000x |
Indexes are not free. Every index: (1) consumes storage space (often 10-30% of table size per index), (2) slows INSERT, UPDATE, DELETE operations (each must update all affected indexes), (3) increases complexity for the query optimizer, and (4) requires maintenance (fragmentation, statistics updates). The goal is not 'add all indexes' but 'add the right indexes for the workload.'
The B-tree (and its variant, the B+ tree) is the dominant index structure in virtually all relational databases. It provides efficient support for both equality and range queries while remaining balanced under insertions and deletions.
B+ Tree structure:
Why B+ Trees dominate:
| Operation | Cost | I/O Pattern |
|---|---|---|
| Equality search | O(log_f n) — 3-5 I/O | Random reads down tree |
| Range search (k results) | O(log_f n + k/f) | Traverse to start, sequential scan leaves |
| Insert | O(log_f n) | Find leaf, insert, possibly split |
| Delete | O(log_f n) | Find leaf, remove, possibly merge |
| Min/Max | O(log_f n) | Traverse to leftmost/rightmost leaf |
Where f = fanout (typically 100-500), n = number of records
Practical implications:
B-tree variations:
123456789101112131415161718192021
-- Standard B-tree index (default in most databases)CREATE INDEX idx_customer_name ON customers(name); -- Composite B-tree index (multiple columns)CREATE INDEX idx_order_customer_date ON orders(customer_id, order_date);-- This index supports:-- WHERE customer_id = ? (uses first column)-- WHERE customer_id = ? AND order_date = ? (uses both columns)-- WHERE customer_id = ? AND order_date > ? (uses both columns)-- Does NOT efficiently support:-- WHERE order_date = ? (second column without first) -- Unique B-tree index (enforces uniqueness + provides index)CREATE UNIQUE INDEX idx_email ON customers(email); -- Descending index (optimizes ORDER BY DESC)CREATE INDEX idx_created_desc ON orders(created_at DESC); -- Partial index (indexes only subset of rows)CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';-- Smaller index, faster updates, but only covers active usersFor composite (multi-column) B-tree indexes, column order matters critically. The index is sorted by the first column, then by the second within each first-column value, and so on. Place the most selective column first, or the column most frequently used alone. A query filtering only on the second column cannot efficiently use the index.
Hash indexes use a hash function to map key values directly to bucket locations. They provide O(1) average-case lookups—theoretically faster than B-trees—but with significant limitations.
Structure:
Performance characteristics:
| Operation | Cost | Notes |
|---|---|---|
| Equality lookup | O(1) average | Single hash computation + bucket read |
| Range query | O(n) | Hash provides no ordering—must scan all |
| Insert | O(1) average | Hash + append to bucket |
| Delete | O(1) average | Hash + remove from bucket |
PostgreSQL supports explicit hash indexes (CREATE INDEX ... USING HASH), but recommends B-tree for most cases due to better concurrency and crash recovery. MySQL/InnoDB uses hash indexes only for its internal Adaptive Hash Index (automatic, not user-created). Oracle uses hash clusters, not hash indexes. In practice, B-trees are preferred for nearly all OLTP index needs.
123456789101112131415
-- Explicit hash index in PostgreSQLCREATE INDEX idx_session_token_hash ON sessions USING HASH (token); -- Use case: exact match lookups on session tokens-- SELECT * FROM sessions WHERE token = 'abc123def456'; -- Note: This CANNOT accelerate:-- SELECT * FROM sessions WHERE token LIKE 'abc%';-- SELECT * FROM sessions WHERE token > 'abc';-- SELECT * FROM sessions ORDER BY token; -- In most cases, B-tree is preferred:CREATE INDEX idx_session_token_btree ON sessions(token);-- B-tree handles all the above cases efficientlyBitmap indexes take a radically different approach from B-trees and hash indexes. Instead of mapping key values to record locations, they create bit vectors indicating which records contain each distinct value.
Structure:
For a column with distinct values {v₁, v₂, ..., vₖ}:
Example:
| Row | Gender | Region |
|---|---|---|
| 1 | M | East |
| 2 | F | West |
| 3 | M | East |
| 4 | F | East |
| 5 | M | West |
Bitmap for Gender:
Bitmap for Region:
Query evaluation with bitmaps:
Bitmaps shine for complex predicates combining multiple conditions:
SELECT * FROM table
WHERE gender = 'M' AND region = 'East';
Execution:
101011011010101 AND 10110 = 10100Performance characteristics:
| Characteristic | Bitmap Index | B-tree Index |
|---|---|---|
| Best for cardinality | Low (< 100 distinct values) | High (many distinct values) |
| Multi-column AND/OR | Excellent (bitwise ops) | Requires index intersection |
| Space for low cardinality | Very small | Larger |
| Space for high cardinality | Explodes | Scales linearly |
| Insert/Update cost | High (rebuild bitmaps) | O(log n) |
| Typical use case | Data warehouses (OLAP) | Transactional systems (OLTP) |
Bitmap indexes are catastrophic for write-heavy workloads. A single INSERT may require updating multiple bitmap vectors and their compression. Worse, bitmap operations typically lock entire bitmaps, creating severe concurrency bottlenecks. Use bitmap indexes only in read-heavy analytical environments where data is bulk-loaded periodically.
Beyond the fundamental index types, modern databases offer specialized indexes for specific data types and query patterns.
Full-Text Indexes:
Optimized for text search operations (word matching, phrase search, relevance ranking):
WHERE MATCH(content) AGAINST ('database optimization')Spatial Indexes (R-Tree, Quad-Tree):
Optimized for geometric data (points, polygons, geographic coordinates):
GIN (Generalized Inverted Index):
PostgreSQL's index for composite/array types:
WHERE tags @> ARRAY['urgent', 'important']| Index Type | Data Type | Query Pattern | Database Support |
|---|---|---|---|
| Full-Text | Text/VARCHAR | MATCH, CONTAINS, phrase search | All major RDBMS |
| R-Tree/Spatial | Geometry, Geography | ST_Contains, ST_Distance, bounding box | PostGIS, MySQL, Oracle Spatial |
| GIN | Array, JSONB, tsvector | Array containment, JSONB path | PostgreSQL |
| GiST | Custom types, ranges | Range overlap, geometric operatores | PostgreSQL |
| BRIN | Large sorted tables | Range queries on naturally ordered data | PostgreSQL |
| Bloom Filter | Multi-column equality | Multi-column equality with many columns | PostgreSQL |
123456789101112131415161718192021222324
-- Full-text search indexCREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', title || ' ' || body)); -- Query using full-text indexSELECT * FROM articles WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', 'database & optimization'); -- GIN index on JSONB columnCREATE INDEX idx_metadata_gin ON products USING GIN (metadata); -- Query using GIN indexSELECT * FROM products WHERE metadata @> '{"category": "electronics"}'; -- BRIN index for time-series data (naturally ordered by timestamp)CREATE INDEX idx_events_brin ON events USING BRIN (created_at);-- BRIN is tiny (stores min/max per block range), efficient for sequential data -- Partial index for common query patternCREATE INDEX idx_pending_orders ON orders(customer_id, created_at) WHERE status = 'pending';-- Much smaller than full index, perfect for "show pending orders" queriesDefault to B-tree for general-purpose indexing. Use hash only for pure equality lookups in memory-optimized scenarios. Use bitmap for low-cardinality columns in analytical workloads. Use specialized indexes (GIN, spatial, full-text) only when the query pattern explicitly requires them—they add complexity and maintenance overhead.
Systematic index selection requires analyzing your workload—the mix of queries that run against the database—and designing indexes that accelerate the most important ones while minimizing write overhead.
The index selection framework:
Step 1: Query Workload Analysis
Identify and prioritize queries by:
Step 2: Query Decomposition
For each important query, identify:
Step 3: Candidate Index Generation
For each query, generate candidate indexes:
Query: SELECT name, email FROM users
WHERE status = 'active' AND country = 'US'
ORDER BY created_at DESC LIMIT 10;
Candidates:
1. INDEX(status)
2. INDEX(country)
3. INDEX(status, country)
4. INDEX(status, country, created_at)
5. INDEX(status, country, created_at) INCLUDE (name, email) -- covering
Step 4: Cost-Benefit Analysis
For each candidate:
Step 5: Index Consolidation
Identify indexes that serve multiple queries:
INDEX(customer_id, order_date) serves both:
WHERE customer_id = ?WHERE customer_id = ? AND order_date > ?Prefer consolidated indexes that cover multiple query patterns.
Most modern databases include index advisors that automate parts of this analysis. SQL Server's Database Tuning Advisor, MySQL's Performance Schema with sys schema, and PostgreSQL's pg_stat_statements combined with hypothetical indexes can suggest candidates. However, human judgment remains essential for understanding business priorities and workload nuances.
| Query Pattern | Recommended Index Strategy |
|---|---|
| Equality on single column | B-tree on that column |
| Equality on multiple columns | Composite B-tree (most selective first) |
| Range on single column | B-tree on that column |
| Equality + Range | Composite B-tree (equality columns first, then range) |
| ORDER BY frequently | Include ORDER BY column in index (typically last) |
| SELECT few columns | Consider covering index with INCLUDE clause |
| Low cardinality + OLAP | Bitmap index (if supported) |
| Full-text search | Full-text index |
| JSON/Array queries | GIN index (PostgreSQL) |
Even experienced engineers make indexing mistakes. Understanding common anti-patterns helps you avoid them.
123456789101112131415161718192021222324
-- Find indexes that are never or rarely usedSELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, pg_size_pretty(pg_relation_size(indexrelid)) as index_sizeFROM pg_stat_user_indexesWHERE idx_scan = 0 -- Never usedORDER BY pg_relation_size(indexrelid) DESC; -- Find duplicate/redundant indexesSELECT a.indexrelid::regclass AS index_a, b.indexrelid::regclass AS index_b, a.indkey AS keys_a, b.indkey AS keys_bFROM pg_index aJOIN pg_index b ON ( a.indrelid = b.indrelid AND a.indexrelid != b.indexrelid AND a.indkey::text LIKE b.indkey::text || '%');Schedule monthly index audits: (1) Identify unused indexes and consider dropping, (2) Find slow queries that might benefit from new indexes, (3) Check for redundant indexes that waste space, (4) Rebuild fragmented indexes in maintenance windows, (5) Update statistics after major data changes.
Index selection is one of the highest-impact physical design decisions. A well-chosen index strategy can improve query performance by orders of magnitude while poorly chosen indexes waste resources and slow writes.
What's next:
With storage structures and indexing understood, we turn to partitioning—the technique of dividing large tables into smaller, more manageable pieces. Partitioning improves query performance, simplifies maintenance, and enables efficient data lifecycle management.
You now possess a comprehensive understanding of index types and selection strategies. You can analyze workloads, identify indexing opportunities, and avoid common pitfalls. Combined with storage structure knowledge, you're equipped to make informed physical design decisions. Next: partitioning strategies for large-scale data management.