Throughout this module, we've explored the tremendous power of indexes to accelerate query performance. But indexes are not a universal solution. They come with significant limitations, costs, and constraints that every database professional must understand.
Thinking of indexes as 'more is better' leads to databases with too many indexes, inefficient write operations, and wasted storage. Understanding when indexes don't help—or actively hurt—is essential for balanced, effective database design.
By the end of this page, you will understand the fundamental limitations of B-tree indexes, query patterns that cannot leverage indexes effectively, the overhead indexes impose on write operations, storage and memory costs, and strategies for managing index-related trade-offs.
The most significant limitation of indexes is their impact on write operations. Every index added to a table increases the cost of INSERT, UPDATE, and DELETE operations.
The Write Amplification Problem
When you insert a row into a table with indexes, the database must:
```text
Table: orders
Columns: 15 columns
Indexes: 8 indexes (common in OLTP systems)

Single INSERT operation must perform:
=====================================
1. Insert row into table heap/clustered index ... 1 operation
2. Insert into idx_customer_id .................. 1 operation
3. Insert into idx_order_date ................... 1 operation
4. Insert into idx_status ....................... 1 operation
5. Insert into idx_product_id ................... 1 operation
6. Insert into idx_shipping_address ............. 1 operation
7. Insert into idx_payment_method ............... 1 operation
8. Insert into idx_region ....................... 1 operation
9. Insert into idx_priority ..................... 1 operation

Total: 9 separate B-tree modifications for 1 row!

At 100,000 inserts/day:
- Without indexes: 100,000 operations
- With 8 indexes: 900,000 operations (9x write amplification!)

Each B-tree modification may also trigger:
- Page splits (when pages are full)
- Tree rebalancing
- Lock acquisition and release
```
UPDATE Amplification
UPDATE operations can be even more expensive. If an UPDATE modifies an indexed column, the database must delete the old entry from that index and insert the new entry in its sorted position. This means updating a single indexed column requires TWO index operations per affected index:
```sql
-- Table with indexes on: customer_id, status, order_date, total

-- UPDATE that changes only 'status' column
UPDATE orders SET status = 'shipped' WHERE order_id = 12345;

-- Operations performed:
-- 1. Update row in table
-- 2. Delete old entry from idx_status (status='pending', order_id=12345)
-- 3. Insert new entry into idx_status (status='shipped', order_id=12345)
-- 4. Other indexes: unchanged (customer_id, order_date, total not modified)

-- UPDATE that changes multiple indexed columns
UPDATE orders
SET status = 'shipped',
    order_date = CURRENT_DATE,
    total = total * 0.9  -- 10% discount
WHERE order_id = 12345;

-- Now must update THREE indexes (status, order_date, total)
-- Each requires delete + insert = 6 index operations
-- Plus 1 table update = 7 total operations for 1 row!
```
Tables with many indexes—10, 15, or more—often suffer from 'index fatigue'. Write throughput degrades severely, batch operations become painfully slow, and the database spends more time maintaining indexes than serving queries. Audit your indexes regularly and remove those that aren't actively used.
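Write amplification is easy to observe directly. The sketch below is a rough illustration, not a tuned benchmark: it uses Python's built-in sqlite3 module as a stand-in engine, and the table, column, and index names are made up for the demo. Absolute times vary by engine and hardware, but extra secondary indexes reliably slow inserts.

```python
import sqlite3
import time

def bench(n_indexes: int, n_rows: int = 20_000) -> float:
    """Time n_rows inserts into a table carrying n_indexes secondary indexes."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d INT)"
    )
    for i in range(n_indexes):            # each index adds one B-tree to maintain
        col = "abcd"[i]
        con.execute(f"CREATE INDEX idx_{col} ON orders({col})")
    start = time.perf_counter()
    con.executemany(
        "INSERT INTO orders (a, b, c, d) VALUES (?, ?, ?, ?)",
        ((i, i % 97, i % 11, i % 5) for i in range(n_rows)),
    )
    con.commit()
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed

plain = bench(0)
indexed = bench(4)
print(f"no indexes: {plain:.3f}s")
print(f"4 indexes:  {indexed:.3f}s ({indexed / plain:.1f}x slower)")
```

Rerunning with more indexes widens the gap; the same experiment on a client-server database would also show lock and WAL overhead that SQLite hides.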
Indexes provide the greatest benefit for highly selective queries—those that return a small fraction of the table. When selectivity is low (many rows match), indexes may provide no benefit or even hurt performance.
The Selectivity Threshold
As a general rule, indexes become ineffective when a query returns more than 10-20% of the table (the exact threshold varies by database, hardware, and clustering). At high row counts, the overhead of index traversal plus table lookups exceeds the cost of a simple sequential scan.
| Selectivity | Matching Rows (1M table) | Index Benefit |
|---|---|---|
| 0.01% (1 in 10,000) | 100 rows | Excellent — Index is essential |
| 0.1% (1 in 1,000) | 1,000 rows | Very Good — Clear index advantage |
| 1.0% (1 in 100) | 10,000 rows | Good — Index still beneficial |
| 5% (1 in 20) | 50,000 rows | Moderate — Depends on clustering |
| 15% (1 in 7) | 150,000 rows | Marginal — Often similar to scan |
| 30%+ (1 in 3) | 300,000+ rows | Poor — Full scan often faster |
```sql
-- Example table: users (10 million rows)
-- Column: gender (2 values: 'M', 'F') - 50% each
-- Column: is_active (2 values: 1, 0) - 80% active, 20% inactive
-- Column: country (200 values, but USA has 30% of users)
-- Column: user_id (unique primary key)

-- Index on gender: USELESS
CREATE INDEX idx_gender ON users(gender);
SELECT * FROM users WHERE gender = 'F';
-- Returns 5 million rows (50%) - full scan is faster

-- Index on is_active: MARGINALLY USEFUL
CREATE INDEX idx_active ON users(is_active);
SELECT * FROM users WHERE is_active = 0;
-- Returns 2 million rows (20%) - might help, depends on access pattern
SELECT * FROM users WHERE is_active = 1;
-- Returns 8 million rows (80%) - full scan definitely faster

-- Index on country for common values: LIMITED
SELECT * FROM users WHERE country = 'USA';
-- Returns 3 million rows (30%) - index probably not used
SELECT * FROM users WHERE country = 'Luxembourg';
-- Returns 500 rows - index is very helpful!

-- Conclusion: Index on country helps for rare countries,
-- but not for common ones. Optimizer will choose appropriately.
```
Think twice before indexing columns with only 2-5 distinct values (boolean flags, status columns, type codes). These indexes are often useless because, with a roughly even distribution, even the most selective value matches 20-50% of rows. The exception: when combined with other columns in a composite index, or when one value is very rare (<1%).
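Where does the selectivity threshold come from? The toy cost model below makes the trade-off concrete. The constants loosely follow PostgreSQL's planner defaults (seq_page_cost=1.0, random_page_cost=4.0); the rows-per-page figure and the two clustering extremes are illustrative assumptions, not measurements from any real system.

```python
import math

SEQ_PAGE_COST = 1.0       # cost of one sequential page read
RANDOM_PAGE_COST = 4.0    # cost of one random page read
ROWS_PER_PAGE = 100       # illustrative heap density

def seq_scan_cost(total_rows: int) -> float:
    """Full scan: read every heap page once, sequentially."""
    return (total_rows / ROWS_PER_PAGE) * SEQ_PAGE_COST

def index_scan_cost(total_rows: int, selectivity: float, clustered: bool) -> float:
    """B-tree descent plus a heap fetch for each matching row."""
    matched = total_rows * selectivity
    if clustered:
        # Best case: matches packed into adjacent pages, near-sequential reads
        heap_cost = (matched / ROWS_PER_PAGE) * SEQ_PAGE_COST
    else:
        # Worst case: one random page read per matching row
        heap_cost = matched * RANDOM_PAGE_COST
    return math.log2(max(total_rows, 2)) + heap_cost

total = 1_000_000
scan = seq_scan_cost(total)
for pct in (0.01, 0.1, 1.0, 5.0, 15.0, 30.0):
    sel = pct / 100
    uncl = index_scan_cost(total, sel, clustered=False)
    cl = index_scan_cost(total, sel, clustered=True)
    print(f"{pct:5.2f}%: scan={scan:8.0f}  index(random)={uncl:10.0f}  "
          f"index(clustered)={cl:8.0f}")
```

The two branches bracket reality: with fully random row placement the break-even point is well below 1%, while with perfect clustering the index wins almost everywhere. Real tables sit in between, which is why the practical rule of thumb lands around 10-20%.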
Several common query patterns prevent effective index usage, even when an index exists on the relevant column.
1. Functions on Indexed Columns
Applying a function to an indexed column prevents B-tree index usage because the index stores original values, not transformed values:
```sql
-- Index exists on order_date column
CREATE INDEX idx_order_date ON orders(order_date);

-- ❌ Function on column - INDEX CANNOT BE USED
SELECT * FROM orders WHERE YEAR(order_date) = 2024;
SELECT * FROM orders WHERE DATE(order_date) = '2024-03-15';
SELECT * FROM orders WHERE order_date + INTERVAL 7 DAY > CURRENT_DATE;

-- ✓ Rewrite to use index
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

SELECT * FROM orders
WHERE order_date >= '2024-03-15' AND order_date < '2024-03-16';

SELECT * FROM orders
WHERE order_date > CURRENT_DATE - INTERVAL 7 DAY;

-- Alternative: Functional index (if database supports it)
-- PostgreSQL, MySQL 8.0+, Oracle
CREATE INDEX idx_order_year ON orders((YEAR(order_date)));
```
2. Implicit Type Conversions
When data types don't match exactly, implicit conversions can prevent index usage:
```sql
-- Table has: phone_number VARCHAR(20), zipcode CHAR(5)

-- ❌ Type mismatch - may not use index
SELECT * FROM customers WHERE phone_number = 5551234567;  -- number, not string
SELECT * FROM customers WHERE zipcode = 90210;            -- number, not string

-- What happens: database converts column to number
-- WHERE CAST(phone_number AS INT) = 5551234567
-- This is a function on the column → no index!

-- ✓ Use correct types
SELECT * FROM customers WHERE phone_number = '5551234567';
SELECT * FROM customers WHERE zipcode = '90210';

-- Also watch for character set mismatches (less common but possible)
-- Column is utf8mb4, query parameter is latin1
-- May trigger conversion that defeats index
```
3. Leading Wildcards
LIKE patterns starting with % cannot use B-tree indexes:
```sql
-- Index on product_name (B-tree)
CREATE INDEX idx_product_name ON products(product_name);

-- ❌ Leading wildcard - INDEX CANNOT BE USED
SELECT * FROM products WHERE product_name LIKE '%widget%';
SELECT * FROM products WHERE product_name LIKE '%blue';

-- ✓ Trailing wildcard - INDEX CAN BE USED
SELECT * FROM products WHERE product_name LIKE 'widget%';
-- This is equivalent to: product_name >= 'widget' AND product_name < 'widgeu'

-- For substring/suffix searches, alternatives:

-- 1. Full-text search indexes
CREATE FULLTEXT INDEX idx_product_ft ON products(product_name);
SELECT * FROM products WHERE MATCH(product_name) AGAINST('widget');

-- 2. Trigram indexes (PostgreSQL)
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_product_trgm ON products USING GIN (product_name gin_trgm_ops);
SELECT * FROM products WHERE product_name LIKE '%widget%';  -- Now uses index!

-- 3. Reverse index for suffix searches (Oracle technique)
CREATE INDEX idx_product_rev ON products(REVERSE(product_name));
-- Search: WHERE REVERSE(product_name) LIKE REVERSE('%blue')
```
4. NOT and Negation Conditions
Negation conditions often have poor selectivity or cannot leverage index structure:
```sql
-- ❌ Negation conditions - usually cannot use index efficiently
SELECT * FROM orders WHERE status != 'cancelled';
SELECT * FROM orders WHERE status NOT IN ('cancelled', 'refunded');
SELECT * FROM products WHERE category_id <> 5;
SELECT * FROM users WHERE email NOT LIKE '%@test.com';

-- Why: Negation typically returns most of the table
-- If status has 5 values, status != 'cancelled' returns 80%+ rows

-- ✓ Alternatives when specific values are known
-- If you know the values you WANT:
SELECT * FROM orders WHERE status IN ('pending', 'shipped', 'delivered');

-- ✓ IS NOT NULL can sometimes use index
SELECT * FROM orders WHERE customer_notes IS NOT NULL;
-- Depends on NULL distribution; if few rows have values, helpful
```
Composite indexes (indexes on multiple columns) have a critical limitation: they can only be used efficiently when the query includes the leftmost columns of the index. This is called the leftmost prefix rule.
Understanding Why This Happens
A B-tree index on (a, b, c) sorts data first by a, then by b within each a value, then by c within each (a, b) pair. The index structure looks like a phone book: sorted by last name, then first name within each last name.
```sql
-- Composite index on (country, state, city)
CREATE INDEX idx_location ON customers(country, state, city);

-- Index entries sorted as:
-- ('Canada', 'Alberta', 'Calgary')
-- ('Canada', 'Alberta', 'Edmonton')
-- ('Canada', 'Ontario', 'Ottawa')
-- ('Canada', 'Ontario', 'Toronto')
-- ('USA', 'California', 'Los Angeles')
-- ('USA', 'California', 'San Francisco')
-- ('USA', 'New York', 'Buffalo')
-- ('USA', 'New York', 'New York City')
-- ('USA', 'Texas', 'Austin')
-- ...

-- ✓ CAN use index (leftmost prefix)
WHERE country = 'USA'                                          -- Uses (country)
WHERE country = 'USA' AND state = 'California'                 -- Uses (country, state)
WHERE country = 'USA' AND state = 'California' AND city = 'LA' -- Uses all three

-- ❌ CANNOT use index efficiently
WHERE state = 'California'                      -- Skips country
WHERE city = 'Los Angeles'                      -- Skips country, state
WHERE country = 'USA' AND city = 'Los Angeles'  -- Skips state

-- Why can't it search by state alone?
-- 'California' entries are scattered: some under 'USA', potentially under
-- other countries if data existed. Index can't jump to all 'California' entries.
```
Gaps in Index Columns
Even partial usage has limitations. If you skip a column, subsequent columns become less useful:
```sql
-- Index on (customer_id, order_date, status)
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date, status);

-- Query with gap (missing order_date)
SELECT * FROM orders
WHERE customer_id = 12345 AND status = 'pending';

-- Index usage: PARTIAL
-- Can use customer_id for seeking (first column)
-- BUT status condition is applied as FILTER, not seek
-- Why? Within customer 12345, 'pending' orders are scattered across all dates

-- Compare with no gap:
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date >= '2024-01-01'
  AND status = 'pending';

-- All three columns used for seeking/scanning!
-- Much more efficient scan
```
Some databases (Oracle, MySQL 8.0+, PostgreSQL 13+) support 'skip scan' or 'loose index scan' which can use an index even when leading columns are missing—IF those columns have few distinct values. It scans each distinct value of the leading column. However, this is less efficient than a proper leftmost prefix match.
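The leftmost prefix rule is easy to observe with SQLite's EXPLAIN QUERY PLAN output (a convenient stand-in engine; real optimizers differ in detail, and SQLite notably has no skip scan). The table and index names below mirror the earlier location example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY, country TEXT, state TEXT, city TEXT)""")
con.execute("CREATE INDEX idx_location ON customers(country, state, city)")

def plan(where: str) -> str:
    """Return the query plan text for a SELECT with the given WHERE clause."""
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM customers WHERE {where}"
    ).fetchall()
    return " ".join(r[-1] for r in rows)  # last column holds the plan detail

# Leftmost prefix present: expect SEARCH (an index seek)
print(plan("country = 'USA'"))
print(plan("country = 'USA' AND state = 'CA'"))

# Leading column missing: expect SCAN (leftmost prefix rule violated)
print(plan("state = 'CA'"))
```

Running this shows SEARCH (a seek into idx_location) for the prefix queries and SCAN for the state-only query, matching the phone-book intuition.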
Indexes consume disk space and memory. For large tables or tables with many indexes, these costs can be substantial.
Storage Consumption
Each index stores a copy of indexed columns plus row pointers. The total index storage for a table often exceeds the size of the table itself:
```sql
-- PostgreSQL: Check table and index sizes
SELECT
    relname AS table_name,
    pg_size_pretty(pg_table_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS indexes_size,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Example output:
-- table_name | table_size | indexes_size | total_size
-- -----------|------------|--------------|-----------
-- orders     | 5 GB       | 8 GB         | 13 GB
--
-- Indexes are 160% of table size!

-- MySQL: Check index sizes
SELECT
    table_name,
    ROUND(data_length / 1024 / 1024, 2) AS data_size_mb,
    ROUND(index_length / 1024 / 1024, 2) AS index_size_mb
FROM information_schema.tables
WHERE table_name = 'orders';

-- SQL Server: Index space usage
SELECT
    i.name AS index_name,
    SUM(ps.used_page_count) * 8 / 1024 AS size_mb
FROM sys.dm_db_partition_stats ps
JOIN sys.indexes i
  ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('orders')
GROUP BY i.name;
```
Memory Pressure
Indexes compete for buffer pool (memory cache) space with table data. Wide or numerous indexes can crowd table pages out of the cache, lowering buffer pool hit rates and forcing more disk I/O for every query.
The Working Set Challenge
Ideally, the 'hot' portions of your indexes (frequently accessed leaf pages) fit in the buffer pool. When they don't, each query triggers disk I/O.
| Scenario | Buffer Pool Hit Rate | Performance Impact |
|---|---|---|
| Hot indexes fit in memory | 95-99% | Excellent - minimal disk I/O |
| Most of indexes in memory | 80-95% | Good - occasional disk reads |
| Less than half in memory | 50-80% | Degraded - noticeable latency |
| Mostly on disk | <50% | Poor - heavy disk dependence |
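The hit-rate regimes in the table can be reproduced with a toy LRU cache simulation. This is an illustrative sketch only: the page counts are arbitrary, the access pattern is uniformly random over a fixed hot set, and real buffer pools use more sophisticated eviction (clock sweep, LRU-K, midpoint insertion).

```python
import random
from collections import OrderedDict

def hit_rate(pool_pages: int, hot_pages: int,
             accesses: int = 50_000, seed: int = 1) -> float:
    """Simulate an LRU buffer pool serving random reads over a hot set."""
    rng = random.Random(seed)
    pool = OrderedDict()          # page id -> None, in recency order
    hits = 0
    for _ in range(accesses):
        page = rng.randrange(hot_pages)
        if page in pool:
            hits += 1
            pool.move_to_end(page)          # mark as most recently used
        else:
            pool[page] = None               # fault the page in
            if len(pool) > pool_pages:
                pool.popitem(last=False)    # evict least recently used
    return hits / accesses

for pool in (1000, 500, 250, 100):
    rate = hit_rate(pool, hot_pages=1000)
    print(f"pool={pool:4d} pages, hot set=1000 pages -> hit rate {rate:.0%}")
```

With the whole hot set resident, only cold-start faults miss; once the pool is a fraction of the hot set, the hit rate collapses toward that fraction, which is why shrinking memory headroom degrades index performance so sharply.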
Over time, indexes can become 'bloated'—consuming more space than necessary due to page fragmentation, dead tuples (in MVCC databases), and historical page splits. Regular index maintenance (REINDEX, REBUILD) may be necessary to keep index sizes optimal.
Beyond write amplification, indexes impose ongoing maintenance burdens:
Statistics Maintenance
Optimizer decisions depend on accurate statistics. Indexes with outdated statistics lead to poor query plans. This means statistics must be refreshed regularly (for example, ANALYZE in PostgreSQL or UPDATE STATISTICS in SQL Server), especially after bulk loads or large data changes, and index usage and bloat should be monitored:
```sql
-- PostgreSQL: Check index health and usage
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS times_used,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC;  -- Least used first

-- Find unused indexes (never or rarely used)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan < 10  -- Used fewer than 10 times
  AND schemaname = 'public';

-- Index bloat check (PostgreSQL)
SELECT
    schemaname, tablename, indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    ROUND(100.0 * (1 - (index_bloat_ratio))) AS bloat_pct
FROM /* bloat monitoring view */;

-- Rebuild bloated index
REINDEX INDEX idx_customer_id;
-- Or concurrently (PostgreSQL 12+):
REINDEX INDEX CONCURRENTLY idx_customer_id;
```
Schema Change Complications
Indexes complicate many schema operations; bulk loading is the classic example:
```sql
-- Bulk loading with indexes is SLOW
-- Each inserted row must update all indexes

-- Better approach: Drop indexes, load, recreate
BEGIN;

-- Drop indexes (save definitions first!)
DROP INDEX idx_customer_id;
DROP INDEX idx_order_date;
DROP INDEX idx_status;

-- Bulk load is now much faster
COPY orders FROM '/path/to/data.csv' WITH (FORMAT csv);
-- Or: LOAD DATA INFILE in MySQL
-- Or: BULK INSERT in SQL Server

-- Recreate indexes
CREATE INDEX idx_customer_id ON orders(customer_id);
CREATE INDEX idx_order_date ON orders(order_date);
CREATE INDEX idx_status ON orders(status);

COMMIT;

-- Time comparison (10 million rows):
-- With indexes during load: 45 minutes
-- Drop/load/create indexes: 8 minutes (6x faster!)
```
Regularly audit your indexes to identify: (1) unused indexes that can be dropped, (2) duplicate or overlapping indexes, (3) missing indexes suggested by the optimizer, and (4) indexes with high maintenance cost but low benefit. Many databases provide built-in views for this analysis.
Indexes interact with database concurrency controls in ways that can cause contention:
Index Hot Spots
Indexes on sequential values (auto-increment IDs, timestamps) create 'hot spots'—a small number of pages that receive all new entries. This causes contention: every concurrent insert competes for latches and locks on the same rightmost leaf page, serializing writes that could otherwise proceed in parallel.
```sql
-- Table with auto-increment primary key
CREATE TABLE events (
    event_id BIGINT AUTO_INCREMENT PRIMARY KEY,      -- Hot spot!
    event_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- Also hot spot!
    event_type VARCHAR(50),
    payload TEXT
);

-- Problem: All inserts go to the rightmost leaf page
-- 1000 concurrent inserts → all compete for same page

-- Index on event_time has same problem
CREATE INDEX idx_event_time ON events(event_time);
-- All recent inserts cluster at one end of the index

-- Mitigation strategies:

-- 1. Use UUIDs instead of sequential IDs (trade-off: random inserts
--    spread contention but hurt locality and inflate the index)
CREATE TABLE events_uuid (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid()
    -- Inserts distributed randomly across B-tree
);

-- 2. Prefix with random component (MySQL)
-- Application generates: CONCAT(FLOOR(RAND()*1000), auto_increment_value)

-- 3. Hash partitioning (distributes inserts across partitions)
CREATE TABLE events PARTITION BY HASH(event_id) PARTITIONS 16;

-- 4. Reverse key index (Oracle specific)
CREATE INDEX idx_event_id ON events(event_id) REVERSE;
```
Locking During Index Scans
Index scans acquire locks on accessed rows and pages. Large index scans can therefore block concurrent writers for extended periods, and in databases that use lock escalation, a scan touching many rows may be promoted to a table-level lock, amplifying contention and deadlock risk.
Creating an index traditionally blocks writes to the table for the duration of index creation. On large tables, this can mean hours of downtime. Modern databases offer concurrent/online index creation (PostgreSQL's CREATE INDEX CONCURRENTLY, MySQL's ALGORITHM=INPLACE, SQL Server's ONLINE option) but these come with their own limitations and increased resource usage.
Understanding index limitations is essential for effective database design. Indexes are powerful tools, but they are not free and not universally applicable. The key constraints: write amplification on every INSERT, UPDATE, and DELETE; poor payoff for low-selectivity queries; query patterns that defeat indexes (functions on columns, type mismatches, leading wildcards, negation); the leftmost prefix rule for composite indexes; storage and memory costs; ongoing statistics and bloat maintenance; and concurrency hot spots.
Module Complete
This concludes Module 3: Index Usage. You now have comprehensive knowledge of how to leverage indexes effectively—understanding their benefits, how the optimizer selects them, when to use hints, the power of covering indexes, and critically, their limitations. Apply this knowledge to design indexing strategies that balance read performance, write overhead, and maintenance costs for your specific workload.
You now understand the important limitations of database indexes. With this knowledge, you can make informed decisions about when indexes help, when they hurt, and how to design indexing strategies that optimize overall system performance rather than just individual read queries.