Throughout this module, we've explored the tremendous power of indexes to accelerate query performance. But indexes are not a universal solution. They come with significant limitations, costs, and constraints that every database professional must understand.
Thinking of indexes as 'more is better' leads to databases with too many indexes, inefficient write operations, and wasted storage. Understanding when indexes don't help—or actively hurt—is essential for balanced, effective database design.
By the end of this page, you will understand the fundamental limitations of B-tree indexes, query patterns that cannot leverage indexes effectively, the overhead indexes impose on write operations, storage and memory costs, and strategies for managing index-related trade-offs.
The most significant limitation of indexes is their impact on write operations. Every index added to a table increases the cost of INSERT, UPDATE, and DELETE operations.
The Write Amplification Problem
When you insert a row into a table with indexes, the database must:
```text
Table: orders
Columns: 15 columns
Indexes: 8 indexes (common in OLTP systems)

Single INSERT operation must perform:
=====================================
1. Insert row into table heap/clustered index ... 1 operation
2. Insert into idx_customer_id .................. 1 operation
3. Insert into idx_order_date ................... 1 operation
4. Insert into idx_status ....................... 1 operation
5. Insert into idx_product_id ................... 1 operation
6. Insert into idx_shipping_address ............. 1 operation
7. Insert into idx_payment_method ............... 1 operation
8. Insert into idx_region ....................... 1 operation
9. Insert into idx_priority ..................... 1 operation

Total: 9 separate B-tree modifications for 1 row!

At 100,000 inserts/day:
- Without indexes: 100,000 operations
- With 8 indexes: 900,000 operations (9x write amplification!)

Each B-tree modification may also trigger:
- Page splits (when pages are full)
- Tree rebalancing
- Lock acquisition and release
```
UPDATE Amplification
UPDATE operations can be even more expensive. If an UPDATE modifies an indexed column, the database must delete the old entry from that index and insert the new entry in its sorted position. This means updating a single indexed column requires TWO index operations per affected index:
```sql
-- Table with indexes on: customer_id, status, order_date, total

-- UPDATE that changes only 'status' column
UPDATE orders SET status = 'shipped' WHERE order_id = 12345;

-- Operations performed:
-- 1. Update row in table
-- 2. Delete old entry from idx_status (status='pending', order_id=12345)
-- 3. Insert new entry into idx_status (status='shipped', order_id=12345)
-- 4. Other indexes: unchanged (customer_id, order_date, total not modified)

-- UPDATE that changes multiple indexed columns
UPDATE orders
SET status = 'shipped',
    order_date = CURRENT_DATE,
    total = total * 0.9  -- 10% discount
WHERE order_id = 12345;

-- Now must update THREE indexes (status, order_date, total)
-- Each requires delete + insert = 6 index operations
-- Plus 1 table update = 7 total operations for 1 row!
```
Tables with many indexes—10, 15, or more—often suffer from 'index fatigue'. Write throughput degrades severely, batch operations become painfully slow, and the database spends more time maintaining indexes than serving queries. Audit your indexes regularly and remove those that aren't actively used.
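Write amplification is easy to observe directly. The sketch below is a rough illustration, not a tuned benchmark: it uses Python's built-in sqlite3 module as a stand-in engine, and the table, column, and index names are made up for the demo. Absolute times vary by engine and hardware, but extra secondary indexes reliably slow inserts.

```python
import sqlite3
import time

def bench(n_indexes: int, n_rows: int = 20_000) -> float:
    """Time n_rows inserts into a table carrying n_indexes secondary indexes."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d INT)"
    )
    for i in range(n_indexes):            # each index adds one B-tree to maintain
        col = "abcd"[i]
        con.execute(f"CREATE INDEX idx_{col} ON orders({col})")
    start = time.perf_counter()
    con.executemany(
        "INSERT INTO orders (a, b, c, d) VALUES (?, ?, ?, ?)",
        ((i, i % 97, i % 11, i % 5) for i in range(n_rows)),
    )
    con.commit()
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed

plain = bench(0)
indexed = bench(4)
print(f"no indexes: {plain:.3f}s")
print(f"4 indexes:  {indexed:.3f}s ({indexed / plain:.1f}x slower)")
```

Rerunning with more indexes widens the gap; the same experiment on a client-server database would also show lock and WAL overhead that SQLite hides.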
Indexes provide the greatest benefit for highly selective queries—those that return a small fraction of the table. When selectivity is low (many rows match), indexes may provide no benefit or even hurt performance.
The Selectivity Threshold
As a general rule, indexes become ineffective when a query returns more than 10-20% of the table (the exact threshold varies by database, hardware, and clustering). At high row counts, the overhead of index traversal plus table lookups exceeds the cost of a simple sequential scan.
| Selectivity | Matching Rows (1M table) | Index Benefit |
|---|---|---|
| 0.01% (1 in 10,000) | 100 rows | Excellent — Index is essential |
| 0.1% (1 in 1,000) | 1,000 rows | Very Good — Clear index advantage |
| 1.0% (1 in 100) | 10,000 rows | Good — Index still beneficial |
| 5% (1 in 20) | 50,000 rows | Moderate — Depends on clustering |
| 15% (1 in 7) | 150,000 rows | Marginal — Often similar to scan |
| 30%+ (1 in 3) | 300,000+ rows | Poor — Full scan often faster |
```sql
-- Example table: users (10 million rows)
-- Column: gender (2 values: 'M', 'F') - 50% each
-- Column: is_active (2 values: 1, 0) - 80% active, 20% inactive
-- Column: country (200 values, but USA has 30% of users)
-- Column: user_id (unique primary key)

-- Index on gender: USELESS
CREATE INDEX idx_gender ON users(gender);
SELECT * FROM users WHERE gender = 'F';
-- Returns 5 million rows (50%) - full scan is faster

-- Index on is_active: MARGINALLY USEFUL
CREATE INDEX idx_active ON users(is_active);
SELECT * FROM users WHERE is_active = 0;
-- Returns 2 million rows (20%) - might help, depends on access pattern
SELECT * FROM users WHERE is_active = 1;
-- Returns 8 million rows (80%) - full scan definitely faster

-- Index on country for common values: LIMITED
SELECT * FROM users WHERE country = 'USA';
-- Returns 3 million rows (30%) - index probably not used
SELECT * FROM users WHERE country = 'Luxembourg';
-- Returns 500 rows - index is very helpful!

-- Conclusion: Index on country helps for rare countries,
-- but not for common ones. Optimizer will choose appropriately.
```
Think twice before indexing columns with only 2-5 distinct values (boolean flags, status columns, type codes). These indexes are often useless because, with a roughly even distribution, even the most selective value matches 20-50% of rows. The exception: when combined with other columns in a composite index, or when one value is very rare (<1%).
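Where does the selectivity threshold come from? The toy cost model below makes the trade-off concrete. The constants loosely follow PostgreSQL's planner defaults (seq_page_cost=1.0, random_page_cost=4.0); the rows-per-page figure and the two clustering extremes are illustrative assumptions, not measurements from any real system.

```python
import math

SEQ_PAGE_COST = 1.0       # cost of one sequential page read
RANDOM_PAGE_COST = 4.0    # cost of one random page read
ROWS_PER_PAGE = 100       # illustrative heap density

def seq_scan_cost(total_rows: int) -> float:
    """Full scan: read every heap page once, sequentially."""
    return (total_rows / ROWS_PER_PAGE) * SEQ_PAGE_COST

def index_scan_cost(total_rows: int, selectivity: float, clustered: bool) -> float:
    """B-tree descent plus a heap fetch for each matching row."""
    matched = total_rows * selectivity
    if clustered:
        # Best case: matches packed into adjacent pages, near-sequential reads
        heap_cost = (matched / ROWS_PER_PAGE) * SEQ_PAGE_COST
    else:
        # Worst case: one random page read per matching row
        heap_cost = matched * RANDOM_PAGE_COST
    return math.log2(max(total_rows, 2)) + heap_cost

total = 1_000_000
scan = seq_scan_cost(total)
for pct in (0.01, 0.1, 1.0, 5.0, 15.0, 30.0):
    sel = pct / 100
    uncl = index_scan_cost(total, sel, clustered=False)
    cl = index_scan_cost(total, sel, clustered=True)
    print(f"{pct:5.2f}%: scan={scan:8.0f}  index(random)={uncl:10.0f}  "
          f"index(clustered)={cl:8.0f}")
```

The two branches bracket reality: with fully random row placement the break-even point is well below 1%, while with perfect clustering the index wins almost everywhere. Real tables sit in between, which is why the practical rule of thumb lands around 10-20%.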
Several common query patterns prevent effective index usage, even when an index exists on the relevant column.
1. Functions on Indexed Columns
Applying a function to an indexed column prevents B-tree index usage because the index stores original values, not transformed values:
```sql
-- Index exists on order_date column
CREATE INDEX idx_order_date ON orders(order_date);

-- ❌ Function on column - INDEX CANNOT BE USED
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
SELECT * FROM orders WHERE DATE(order_date) = '2024-03-15';
SELECT * FROM orders WHERE order_date + INTERVAL 7 DAY > CURRENT_DATE;

-- ✓ Rewrite to use index
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

SELECT * FROM orders
WHERE order_date >= '2024-03-15' AND order_date < '2024-03-16';

SELECT * FROM orders
WHERE order_date > CURRENT_DATE - INTERVAL 7 DAY;

-- Alternative: Functional index (if database supports it)
-- PostgreSQL, MySQL 8.0+, Oracle
CREATE INDEX idx_order_year ON orders((YEAR(order_date)));
```
2. Implicit Type Conversions
When data types don't match exactly, implicit conversions can prevent index usage:
```sql
-- Table has: phone_number VARCHAR(20), zipcode CHAR(5)

-- ❌ Type mismatch - may not use index
SELECT * FROM customers WHERE phone_number = 5551234567;  -- number, not string
SELECT * FROM customers WHERE zipcode = 90210;            -- number, not string

-- What happens: database converts column to number
-- WHERE CAST(phone_number AS INT) = 5551234567
-- This is a function on the column → no index!

-- ✓ Use correct types
SELECT * FROM customers WHERE phone_number = '5551234567';
SELECT * FROM customers WHERE zipcode = '90210';

-- Also watch for character set mismatches (less common but possible)
-- Column is utf8mb4, query parameter is latin1
-- May trigger conversion that defeats index
```
3. Leading Wildcards
LIKE patterns starting with % cannot use B-tree indexes:
```sql
-- Index on product_name (B-tree)
CREATE INDEX idx_product_name ON products(product_name);

-- ❌ Leading wildcard - INDEX CANNOT BE USED
SELECT * FROM products WHERE product_name LIKE '%widget%';
SELECT * FROM products WHERE product_name LIKE '%blue';

-- ✓ Trailing wildcard - INDEX CAN BE USED
SELECT * FROM products WHERE product_name LIKE 'widget%';
-- This is equivalent to: product_name >= 'widget' AND product_name < 'widgeu'

-- For substring/suffix searches, alternatives:

-- 1. Full-text search indexes
CREATE FULLTEXT INDEX idx_product_ft ON products(product_name);
SELECT * FROM products WHERE MATCH(product_name) AGAINST('widget');

-- 2. Trigram indexes (PostgreSQL)
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_product_trgm ON products USING GIN (product_name gin_trgm_ops);
SELECT * FROM products WHERE product_name LIKE '%widget%';  -- Now uses index!

-- 3. Reverse index for suffix searches (Oracle technique)
CREATE INDEX idx_product_rev ON products(REVERSE(product_name));
-- Search: WHERE REVERSE(product_name) LIKE REVERSE('%blue')
```
4. NOT and Negation Conditions
Negation conditions often have poor selectivity or cannot leverage index structure:
```sql
-- ❌ Negation conditions - usually cannot use index efficiently
SELECT * FROM orders WHERE status != 'cancelled';
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
SELECT * FROM products WHERE category_id <> 5;
SELECT * FROM users WHERE email NOT LIKE '%@test.com';

-- Why: Negation typically returns most of the table
-- If status has 5 values, status != 'cancelled' returns 80%+ rows

-- ✓ Alternatives when specific values are known
-- If you know the values you WANT:
SELECT * FROM orders WHERE status IN ('pending', 'shipped', 'delivered');

-- ✓ IS NOT NULL can sometimes use index
SELECT * FROM orders WHERE customer_notes IS NOT NULL;
-- Depends on NULL distribution; if few rows have values, helpful
```
Composite indexes (indexes on multiple columns) have a critical limitation: they can only be used efficiently when the query includes the leftmost columns of the index. This is called the leftmost prefix rule.
Understanding Why This Happens
A B-tree index on (a, b, c) sorts data first by a, then by b within each a value, then by c within each (a, b) pair. The index structure looks like a phone book: sorted by last name, then first name within each last name.
```sql
-- Composite index on (country, state, city)
CREATE INDEX idx_location ON customers(country, state, city);

-- Index entries sorted as:
-- ('Canada', 'Alberta', 'Calgary')
-- ('Canada', 'Alberta', 'Edmonton')
-- ('Canada', 'Ontario', 'Ottawa')
-- ('Canada', 'Ontario', 'Toronto')
-- ('USA', 'California', 'Los Angeles')
-- ('USA', 'California', 'San Francisco')
-- ('USA', 'New York', 'Buffalo')
-- ('USA', 'New York', 'New York City')
-- ('USA', 'Texas', 'Austin')
-- ...

-- ✓ CAN use index (leftmost prefix)
WHERE country = 'USA'                                          -- Uses (country)
WHERE country = 'USA' AND state = 'California'                 -- Uses (country, state)
WHERE country = 'USA' AND state = 'California' AND city = 'LA' -- Uses all three

-- ❌ CANNOT use index efficiently
WHERE state = 'California'                      -- Skips country
WHERE city = 'Los Angeles'                      -- Skips country, state
WHERE country = 'USA' AND city = 'Los Angeles'  -- Skips state

-- Why can't it search by state alone?
-- 'California' entries are scattered: some under 'USA', potentially under
-- other countries if data existed. Index can't jump to all 'California' entries.
```
Gaps in Index Columns
Even partial usage has limitations. If you skip a column, subsequent columns become less useful:
```sql
-- Index on (customer_id, order_date, status)
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date, status);

-- Query with gap (missing order_date)
SELECT * FROM orders
WHERE customer_id = 12345 AND status = 'pending';

-- Index usage: PARTIAL
-- Can use customer_id for seeking (first column)
-- BUT status condition is applied as FILTER, not seek
-- Why? Within customer 12345, 'pending' orders are scattered across all dates

-- Compare with no gap:
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date >= '2024-01-01'
  AND status = 'pending';

-- All three columns used for seeking/scanning!
-- Much more efficient scan
```
Some databases (Oracle, MySQL 8.0+, PostgreSQL 13+) support 'skip scan' or 'loose index scan' which can use an index even when leading columns are missing—IF those columns have few distinct values. It scans each distinct value of the leading column. However, this is less efficient than a proper leftmost prefix match.
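The leftmost prefix rule is easy to observe with SQLite's EXPLAIN QUERY PLAN output (a convenient stand-in engine; real optimizers differ in detail, and SQLite notably has no skip scan). The table and index names below mirror the earlier location example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY, country TEXT, state TEXT, city TEXT)""")
con.execute("CREATE INDEX idx_location ON customers(country, state, city)")

def plan(where: str) -> str:
    """Return the query plan text for a SELECT with the given WHERE clause."""
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM customers WHERE {where}"
    ).fetchall()
    return " ".join(r[-1] for r in rows)  # last column holds the plan detail

# Leftmost prefix present: expect SEARCH (an index seek)
print(plan("country = 'USA'"))
print(plan("country = 'USA' AND state = 'CA'"))

# Leading column missing: expect SCAN (leftmost prefix rule violated)
print(plan("state = 'CA'"))
```

Running this shows SEARCH (a seek into idx_location) for the prefix queries and SCAN for the state-only query, matching the phone-book intuition.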
Indexes consume disk space and memory. For large tables or tables with many indexes, these costs can be substantial.
Storage Consumption
Each index stores a copy of indexed columns plus row pointers. The total index storage for a table often exceeds the size of the table itself:
```sql
-- PostgreSQL: Check table and index sizes
SELECT
    relname AS table_name,
    pg_size_pretty(pg_table_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS indexes_size,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Example output:
-- table_name | table_size | indexes_size | total_size
-- -----------|------------|--------------|-----------
-- orders     | 5 GB       | 8 GB         | 13 GB
--
-- Indexes are 160% of table size!

-- MySQL: Check index sizes
SELECT
    table_name,
    ROUND(data_length / 1024 / 1024, 2) AS data_size_mb,
    ROUND(index_length / 1024 / 1024, 2) AS index_size_mb
FROM information_schema.tables
WHERE table_name = 'orders';

-- SQL Server: Index space usage
SELECT
    i.name AS index_name,
    SUM(ps.used_page_count) * 8 / 1024 AS size_mb
FROM sys.dm_db_partition_stats ps
JOIN sys.indexes i
  ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('orders')
GROUP BY i.name;
```
Memory Pressure
Indexes compete for buffer pool (memory cache) space with table data. Wide or numerous indexes can crowd table pages out of the cache, lowering buffer pool hit rates and forcing more disk I/O for every query.
The Working Set Challenge
Ideally, the 'hot' portions of your indexes (frequently accessed leaf pages) fit in the buffer pool. When they don't, each query triggers disk I/O.
| Scenario | Buffer Pool Hit Rate | Performance Impact |
|---|---|---|
| Hot indexes fit in memory | 95-99% | Excellent - minimal disk I/O |
| Most of indexes in memory | 80-95% | Good - occasional disk reads |
| Less than half in memory | 50-80% | Degraded - noticeable latency |
| Mostly on disk | <50% | Poor - heavy disk dependence |
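The hit-rate regimes in the table can be reproduced with a toy LRU cache simulation. This is an illustrative sketch only: the page counts are arbitrary, the access pattern is uniformly random over a fixed hot set, and real buffer pools use more sophisticated eviction (clock sweep, LRU-K, midpoint insertion).

```python
import random
from collections import OrderedDict

def hit_rate(pool_pages: int, hot_pages: int,
             accesses: int = 50_000, seed: int = 1) -> float:
    """Simulate an LRU buffer pool serving random reads over a hot set."""
    rng = random.Random(seed)
    pool = OrderedDict()          # page id -> None, in recency order
    hits = 0
    for _ in range(accesses):
        page = rng.randrange(hot_pages)
        if page in pool:
            hits += 1
            pool.move_to_end(page)          # mark as most recently used
        else:
            pool[page] = None               # fault the page in
            if len(pool) > pool_pages:
                pool.popitem(last=False)    # evict least recently used
    return hits / accesses

for pool in (1000, 500, 250, 100):
    rate = hit_rate(pool, hot_pages=1000)
    print(f"pool={pool:4d} pages, hot set=1000 pages -> hit rate {rate:.0%}")
```

With the whole hot set resident, only cold-start faults miss; once the pool is a fraction of the hot set, the hit rate collapses toward that fraction, which is why shrinking memory headroom degrades index performance so sharply.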
Over time, indexes can become 'bloated'—consuming more space than necessary due to page fragmentation, dead tuples (in MVCC databases), and historical page splits. Regular index maintenance (REINDEX, REBUILD) may be necessary to keep index sizes optimal.
Beyond write amplification, indexes impose ongoing maintenance burdens:
Statistics Maintenance
Optimizer decisions depend on accurate statistics. Indexes with outdated statistics lead to poor query plans. This means statistics must be refreshed regularly (for example, ANALYZE in PostgreSQL or UPDATE STATISTICS in SQL Server), especially after bulk loads or large data changes, and index usage and bloat should be monitored:
```sql
-- PostgreSQL: Check index health and usage
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC;  -- Least used first

-- Find unused indexes (never or rarely used)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan < 10  -- Used fewer than 10 times
  AND schemaname = 'public';

-- Index bloat check (PostgreSQL)
SELECT
    schemaname, tablename, indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    ROUND(100.0 * (1 - (index_bloat_ratio))) AS bloat_pct
FROM /* bloat monitoring view */;

-- Rebuild bloated index
REINDEX INDEX idx_customer_id;
-- Or concurrently (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_customer_id;
```
Schema Change Complications
Indexes complicate many schema operations; bulk loading is the classic example:
```sql
-- Bulk loading with indexes is SLOW
-- Each inserted row must update all indexes

-- Better approach: Drop indexes, load, recreate
BEGIN;

-- Drop indexes (save definitions first!)
DROP INDEX idx_customer_id;
DROP INDEX idx_order_date;
DROP INDEX idx_status;

-- Bulk load is now much faster
COPY orders FROM '/path/to/data.csv' WITH (FORMAT csv);
-- Or: LOAD DATA INFILE in MySQL
-- Or: BULK INSERT in SQL Server

-- Recreate indexes
CREATE INDEX idx_customer_id ON orders(customer_id);
CREATE INDEX idx_order_date ON orders(order_date);
CREATE INDEX idx_status ON orders(status);

COMMIT;

-- Time comparison (10 million rows):
-- With indexes during load: 45 minutes
-- Drop/load/create indexes: 8 minutes (6x faster!)
```
Regularly audit your indexes to identify: (1) unused indexes that can be dropped, (2) duplicate or overlapping indexes, (3) missing indexes suggested by the optimizer, and (4) indexes with high maintenance cost but low benefit. Many databases provide built-in views for this analysis.
Indexes interact with database concurrency controls in ways that can cause contention:
Index Hot Spots
Indexes on sequential values (auto-increment IDs, timestamps) create 'hot spots'—a small number of pages that receive all new entries. This causes contention: every concurrent insert competes for latches and locks on the same rightmost leaf page, serializing writes that could otherwise proceed in parallel.
```sql
-- Table with auto-increment primary key
CREATE TABLE events (
    event_id BIGINT AUTO_INCREMENT PRIMARY KEY,      -- Hot spot!
    event_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- Also hot spot!
    event_type VARCHAR(50),
    payload TEXT
);

-- Problem: All inserts go to the rightmost leaf page
-- 1000 concurrent inserts → all compete for same page

-- Index on event_time has same problem
CREATE INDEX idx_event_time ON events(event_time);
-- All recent inserts cluster at one end of the index

-- Mitigation strategies:

-- 1. Use UUIDs instead of sequential IDs (trade-off: random inserts
--    spread contention but hurt locality and inflate the index)
CREATE TABLE events_uuid (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid()
    -- Inserts distributed randomly across B-tree
);

-- 2. Prefix with random component (MySQL)
-- Application generates: CONCAT(FLOOR(RAND()*1000), auto_increment_value)

-- 3. Hash partitioning (distributes inserts across partitions)
CREATE TABLE events PARTITION BY HASH(event_id) PARTITIONS 16;

-- 4. Reverse key index (Oracle specific)
CREATE INDEX idx_event_id ON events(event_id) REVERSE;
```
Locking During Index Scans
Index scans acquire locks on accessed rows and pages. Large index scans can therefore block concurrent writers for extended periods, and in databases that use lock escalation, a scan touching many rows may be promoted to a table-level lock, amplifying contention and deadlock risk.
Creating an index traditionally blocks writes to the table for the duration of index creation. On large tables, this can mean hours of downtime. Modern databases offer concurrent/online index creation (PostgreSQL's CREATE INDEX CONCURRENTLY, MySQL's ALGORITHM=INPLACE, SQL Server's ONLINE option) but these come with their own limitations and increased resource usage.
Understanding index limitations is essential for effective database design. Indexes are powerful tools, but they are not free and not universally applicable. The key constraints: write amplification on every INSERT, UPDATE, and DELETE; poor payoff for low-selectivity queries; query patterns that defeat indexes (functions on columns, type mismatches, leading wildcards, negation); the leftmost prefix rule for composite indexes; storage and memory costs; ongoing statistics and bloat maintenance; and concurrency hot spots.
Module Complete
This concludes Module 3: Index Usage. You now have comprehensive knowledge of how to leverage indexes effectively—understanding their benefits, how the optimizer selects them, when to use hints, the power of covering indexes, and critically, their limitations. Apply this knowledge to design indexing strategies that balance read performance, write overhead, and maintenance costs for your specific workload.
You now understand the important limitations of database indexes. With this knowledge, you can make informed decisions about when indexes help, when they hurt, and how to design indexing strategies that optimize overall system performance rather than just individual read queries.