Physical Design - Learning Module


Index Selection

The Art and Science of Index Design

If storage structures form the foundation of physical design, indexes are the superstructure—the mechanisms that transform slow scans into lightning-fast lookups. A well-indexed database can answer queries in milliseconds that would otherwise require minutes. Conversely, poor indexing leads to either unnecessary overhead or query performance disasters.

Index selection is both art and science. The science involves understanding index data structures, their complexity characteristics, and how query optimizers utilize them. The art involves anticipating workload patterns, balancing read and write costs, and making strategic trade-offs when perfect indexing is impossible.

The fundamental promise: An index is an additional data structure that maintains sorted/hashed mappings from attribute values to record locations, enabling the database to locate data without scanning entire tables.

What You Will Learn

This page covers the major index types (B-tree, hash, bitmap, specialized indexes), their internal structures, when to use each, and the systematic methodology for index selection. You'll learn to analyze workloads and design index strategies that optimize the queries that matter most.

Why Indexes Matter

Consider a table with 10 million customer records requiring 100,000 disk blocks. Without an index, finding a customer by name requires:

Full table scan: 100,000 block reads
At 0.1ms per sequential read: ~10 seconds
At 10ms per random read: ~17 minutes

With a B-tree index on customer_name:

Index traversal: ~4 block reads (B-tree height)
Data fetch: 1-2 block reads
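Total: ~6 block reads → ~60 milliseconds at 10 ms per random read

That is roughly 16,000 times fewer block reads than the full scan, and more than 100x faster than even the purely sequential case: the difference between an unusable system and an excellent one.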
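The arithmetic in this example can be reproduced with a few lines of Python. This is a rough sketch that assumes only the figures quoted in this section (100,000 blocks, 0.1 ms per sequential read, 10 ms per random read, about 6 block reads through the index); the constant and function names are illustrative.

```python
# Back-of-the-envelope I/O cost model using the figures from this section.
TABLE_BLOCKS = 100_000      # blocks in the customer table
SEQ_READ_MS = 0.1           # assumed cost of one sequential block read
RANDOM_READ_MS = 10.0       # assumed cost of one random block read
INDEX_LOOKUP_BLOCKS = 6     # ~4 index levels plus 1-2 data blocks


def read_time_ms(blocks: int, ms_per_block: float) -> float:
    """Total time to read `blocks` blocks at a fixed per-block cost."""
    return blocks * ms_per_block


full_scan_seq_ms = read_time_ms(TABLE_BLOCKS, SEQ_READ_MS)
full_scan_random_ms = read_time_ms(TABLE_BLOCKS, RANDOM_READ_MS)
index_lookup_ms = read_time_ms(INDEX_LOOKUP_BLOCKS, RANDOM_READ_MS)

print(f"Full scan, sequential: {full_scan_seq_ms / 1000:.0f} s")
print(f"Full scan, random:     {full_scan_random_ms / 60_000:.1f} min")
print(f"Index lookup:          {index_lookup_ms:.0f} ms")
print(f"Block reads:           {TABLE_BLOCKS / INDEX_LOOKUP_BLOCKS:,.0f}x fewer with the index")
```

The model ignores caching and read-ahead, so real numbers will differ; its purpose is only to show how quickly the gap grows with table size.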

Query Performance: Index vs No Index
Table Size	Full Scan Time	B-tree Index Time	Speedup
10,000 rows	~100 ms	~5 ms	20x
1 million rows	~10 sec	~8 ms	1,250x
100 million rows	~15 min	~12 ms	75,000x
1 billion rows	~2.5 hrs	~15 ms	600,000x

The Index Trade-off

Indexes are not free. Every index: (1) consumes storage space (often 10-30% of table size per index), (2) slows INSERT, UPDATE, DELETE operations (each must update all affected indexes), (3) increases complexity for the query optimizer, and (4) requires maintenance (fragmentation, statistics updates). The goal is not 'add all indexes' but 'add the right indexes for the workload.'

B-Tree Indexes (The Workhorse)

The B-tree (and its variant, the B+ tree) is the dominant index structure in virtually all relational databases. It provides efficient support for both equality and range queries while remaining balanced under insertions and deletions.

B+ Tree structure:

Balanced tree — All leaf nodes at the same depth
High fanout — Each node holds many keys (typically 100-500)
Internal nodes — Contain only keys and child pointers (guide search)
Leaf nodes — Contain all keys plus data pointers or data itself
Leaf chain — Leaf nodes linked for efficient range scans

Why B+ Trees dominate:

Shallow depth: With fanout of 200, a tree covering 1 billion records is only about 4 levels deep
Disk-optimized: Node size matches disk block size, minimizing I/O
Range-friendly: Sorted leaves enable efficient range scans
Self-balancing: Maintains O(log n) height despite arbitrary insertions/deletions

B+ Tree Operation Costs
Operation	Cost	I/O Pattern
Equality search	O(log_f n) — 3-5 I/O	Random reads down tree
Range search (k results)	O(log_f n + k/f)	Traverse to start, sequential scan leaves
Insert	O(log_f n)	Find leaf, insert, possibly split
Delete	O(log_f n)	Find leaf, remove, possibly merge
Min/Max	O(log_f n)	Traverse to leftmost/rightmost leaf

Where f = fanout (typically 100-500), n = number of records

Practical implications:

A B+ tree on 1 billion records with fanout 200: height = log₂₀₀(10⁹) ≈ 4 levels
Each search requires ~4 disk reads (plus 1-2 to fetch data)
Modern databases cache upper tree levels in memory → often just 1-2 disk reads per query
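The height estimate can be checked numerically. The sketch below assumes a simplified model in which every node is completely full, so each level multiplies the number of reachable records by the fanout; real trees run partially full and may come out one level taller. The function name is illustrative.

```python
def btree_height(num_records: int, fanout: int) -> int:
    """Smallest number of levels such that fanout**levels >= num_records."""
    if num_records < 1 or fanout < 2:
        raise ValueError("need num_records >= 1 and fanout >= 2")
    levels, capacity = 1, fanout
    while capacity < num_records:
        capacity *= fanout   # one more level multiplies reach by the fanout
        levels += 1
    return levels


for fanout in (100, 200, 500):
    height = btree_height(10**9, fanout)
    print(f"fanout={fanout}: {height} levels for 1 billion records")
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of the fanout do not suffer from rounding.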

B-tree variations:

B+ Tree: Most common; all data in leaves, internal nodes are pure index
B* Tree: Maintains higher node utilization (2/3 full vs 1/2 full)
B-link Tree: Adds sibling pointers for better concurrency
Bw-Tree: Lock-free B-tree variant for in-memory databases

btree_index_examples.sql
-- Standard B-tree index (default in most databases)
CREATE INDEX idx_customer_name ON customers(name);
 
-- Composite B-tree index (multiple columns)
CREATE INDEX idx_order_customer_date ON orders(customer_id, order_date);
-- This index supports:
--   WHERE customer_id = ?                    (uses first column)
--   WHERE customer_id = ? AND order_date = ? (uses both columns)
--   WHERE customer_id = ? AND order_date > ? (uses both columns)
-- Does NOT efficiently support:
--   WHERE order_date = ?  (second column without first)
 
-- Unique B-tree index (enforces uniqueness + provides index)
CREATE UNIQUE INDEX idx_email ON customers(email);
 
-- Descending index (optimizes ORDER BY ... DESC; most valuable in multi-column indexes with mixed sort directions)
CREATE INDEX idx_created_desc ON orders(created_at DESC);
 
-- Partial index (indexes only subset of rows)
CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';
-- Smaller index, faster updates, but only covers active users

Composite Index Column Order

For composite (multi-column) B-tree indexes, column order matters critically. The index is sorted by the first column, then by the second within each first-column value, and so on. Lead with the column that queries filter on most often by itself, put equality-tested columns ahead of range-tested ones, and use selectivity to break ties. A query filtering only on the second column cannot efficiently use the index.
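Why the leading column matters can be seen by modelling the index as a sorted list of tuples. This is only a sketch: the sample data and function names are hypothetical, and a real B-tree stores the same ordering in pages rather than in a Python list.

```python
from bisect import bisect_left

# Entries of a hypothetical orders(customer_id, order_date) index,
# sorted by customer_id first and order_date second.
index_entries = sorted([
    (7, "2024-01-05"),
    (3, "2024-02-11"),
    (7, "2024-03-19"),
    (3, "2024-01-05"),
    (9, "2024-02-11"),
])


def find_by_customer(entries, customer_id):
    """Leading column: binary search straight to one contiguous slice."""
    start = bisect_left(entries, (customer_id,))
    end = bisect_left(entries, (customer_id + 1,))
    return entries[start:end]


def find_by_date(entries, order_date):
    """Second column only: matches are scattered, so every entry is checked."""
    return [entry for entry in entries if entry[1] == order_date]


print(find_by_customer(index_entries, 7))
print(find_by_date(index_entries, "2024-01-05"))
```

Both functions return the right rows, but only the first one can skip most of the index; the second degenerates into a scan of every entry.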

Hash Indexes

Hash indexes use a hash function to map key values directly to bucket locations. They provide O(1) average-case lookups—theoretically faster than B-trees—but with significant limitations.

Structure:

Hash function h(key) → bucket number
Each bucket contains entries: (key, pointer to data)
Collision resolution via chaining or probing

Performance characteristics:

Operation	Cost	Notes
Equality lookup	O(1) average	Single hash computation + bucket read
Range query	O(n)	Hash provides no ordering—must scan all
Insert	O(1) average	Hash + append to bucket
Delete	O(1) average	Hash + remove from bucket

Hash Index Advantages

•Theoretically O(1) equality lookups
•Simple implementation
•Excellent for primary key lookups
•Lower memory overhead than B-tree for simple cases
•Hashing also underlies hash joins, although those build temporary in-memory hash tables rather than using a stored index

Hash Index Limitations

•Cannot support range queries (>, <, BETWEEN)
•Cannot support ORDER BY (no sorting)
•Cannot support prefix matching (LIKE 'foo%')
•Bucket overflow degrades to O(n)
•Most RDBMS prefer B-trees even for equality
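The contrast between equality and range access can be illustrated with a toy hash index. This is a minimal sketch using chaining and a fixed bucket count; the class name, bucket count, and sample tokens are assumptions made for illustration and do not reflect how any particular database lays out its hash indexes.

```python
class ToyHashIndex:
    """Hash index sketch: fixed buckets holding (key, row_id) pairs."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key: str) -> list:
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key: str, row_id: int) -> None:
        self._bucket_for(key).append((key, row_id))

    def lookup(self, key: str) -> list:
        """Equality: hash once, then inspect a single bucket."""
        return [row_id for k, row_id in self._bucket_for(key) if k == key]

    def range_scan(self, low: str, high: str) -> list:
        """Range: hashing destroys key order, so every bucket is visited."""
        return sorted(
            row_id
            for bucket in self.buckets
            for k, row_id in bucket
            if low <= k <= high
        )


index = ToyHashIndex()
for row_id, token in enumerate(["abc123", "abd456", "xyz789"]):
    index.insert(token, row_id)

print(index.lookup("abd456"))
print(index.range_scan("abc", "abz"))
```

The equality lookup touches one bucket regardless of table size, while the range scan has to read all of them, which is the O(1) versus O(n) split shown in the table above.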

Hash Index Support Across Databases

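PostgreSQL supports explicit hash indexes (CREATE INDEX ... USING HASH). Before PostgreSQL 10 they were not WAL-logged and therefore not crash-safe; current versions are, but B-tree is still the usual recommendation because it also serves range and ordering queries. MySQL's InnoDB engine uses hashing only in its internal Adaptive Hash Index (automatic, not user-created), while the MEMORY engine does accept user-defined hash indexes. Oracle offers hash clusters rather than hash indexes. In practice, B-trees are preferred for nearly all OLTP index needs.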

hash_index_examples.sql
-- Explicit hash index in PostgreSQL
CREATE INDEX idx_session_token_hash 
ON sessions USING HASH (token);
 
-- Use case: exact match lookups on session tokens
-- SELECT * FROM sessions WHERE token = 'abc123def456';
 
-- Note: This CANNOT accelerate:
-- SELECT * FROM sessions WHERE token LIKE 'abc%';
-- SELECT * FROM sessions WHERE token > 'abc';
-- SELECT * FROM sessions ORDER BY token;
 
-- In most cases, B-tree is preferred:
CREATE INDEX idx_session_token_btree ON sessions(token);
-- B-tree handles all the above cases (in PostgreSQL, LIKE 'abc%' also needs the C collation or text_pattern_ops)

Bitmap Indexes

Bitmap indexes take a radically different approach from B-trees and hash indexes. Instead of mapping key values to record locations, they create bit vectors indicating which records contain each distinct value.

Structure:

For a column with distinct values {v₁, v₂, ..., vₖ}:

Create k bit vectors, each with n bits (n = row count)
Bit i in vector j is 1 if row i contains value vⱼ, else 0

Example:

Row	Gender	Region
1	M	East
2	F	West
3	M	East
4	F	East
5	M	West

Bitmap for Gender:

M: 10101
F: 01010

Bitmap for Region:

East: 10110
West: 01001

Query evaluation with bitmaps:

Bitmaps shine for complex predicates combining multiple conditions:

SELECT * FROM customers 
WHERE gender = 'M' AND region = 'East';

Execution:

Fetch bitmap for gender='M': 10101
Fetch bitmap for region='East': 10110
Bitwise AND: 10101 AND 10110 = 10100
Rows 1 and 3 match—fetch only those rows
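The execution steps above can be sketched in Python using plain integers as bit vectors. The rows are the five from the example table; the helper names are illustrative, and the leftmost bit stands for row 1 to match the notation used here.

```python
rows = [
    (1, "M", "East"),
    (2, "F", "West"),
    (3, "M", "East"),
    (4, "F", "East"),
    (5, "M", "West"),
]


def build_bitmap(rows, column, value):
    """Return an int whose most significant bit stands for the first row."""
    bits = 0
    for row in rows:
        bits = (bits << 1) | (1 if row[column] == value else 0)
    return bits


def matching_rows(bitmap, rows):
    """Translate set bits back into row numbers."""
    width = len(rows)
    return [
        row[0]
        for position, row in enumerate(rows)
        if (bitmap >> (width - 1 - position)) & 1
    ]


gender_m = build_bitmap(rows, 1, "M")
region_east = build_bitmap(rows, 2, "East")
both = gender_m & region_east   # one bitwise AND combines the predicates

print(format(gender_m, "05b"), format(region_east, "05b"), format(both, "05b"))
print(matching_rows(both, rows))
```

An OR predicate would use `|` in place of `&`, and a NOT would flip the bits within the table's row count.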

Performance characteristics:

Space efficiency: For low-cardinality columns, bitmaps are tiny compared to B-trees
Multi-predicate queries: Bitwise AND/OR operations are extraordinarily fast (single CPU instruction per 64 bits)
Compression: Run-length encoding can compress bitmaps dramatically
Aggregation: COUNT queries can simply count 1-bits (popcount operation)
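Two of these properties, compression and counting, are easy to demonstrate. The sketch below uses naive run-length encoding over a bit string; production systems use more elaborate word-aligned or hybrid encodings, and the 32-row bitmap is a made-up example.

```python
from itertools import groupby


def run_length_encode(bits: str) -> list:
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def count_matches(bits: str) -> int:
    """COUNT(*) over a bitmap is the number of 1-bits (a popcount)."""
    return bits.count("1")


# Hypothetical bitmap for a rare value in a 32-row table.
sparse = "0" * 20 + "111" + "0" * 9

print(run_length_encode(sparse))
print(count_matches(sparse))
```

The 32 bits collapse into three runs, which is why sparse bitmaps on low-cardinality columns stay small even for very large tables.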

Bitmap vs B-tree Index Comparison
Characteristic	Bitmap Index	B-tree Index
Best for cardinality	Low (< 100 distinct values)	High (many distinct values)
Multi-column AND/OR	Excellent (bitwise ops)	Requires index intersection
Space for low cardinality	Very small	Larger
Space for high cardinality	Explodes	Scales linearly
Insert/Update cost	High (rebuild bitmaps)	O(log n)
Typical use case	Data warehouses (OLAP)	Transactional systems (OLTP)

Bitmap Index Write Penalty

Bitmap indexes are catastrophic for write-heavy workloads. A single INSERT may require updating multiple bitmap vectors and their compression. Worse, bitmap operations typically lock entire bitmaps, creating severe concurrency bottlenecks. Use bitmap indexes only in read-heavy analytical environments where data is bulk-loaded periodically.

Ideal Bitmap Index Candidates

•Status columns — order_status ('pending', 'shipped', 'delivered')
•Category columns — product_category, department, region
•Boolean flags — is_active, is_verified, is_deleted
•Date parts — year, quarter, month (not full timestamps)
•Demographic attributes — gender, age_range, membership_tier

Specialized Index Types

Beyond the fundamental index types, modern databases offer specialized indexes for specific data types and query patterns.

Full-Text Indexes:

Optimized for text search operations (word matching, phrase search, relevance ranking):

Inverted index structure: maps terms → documents containing them
Supports stemming, stop words, relevance scoring
Enables queries like WHERE MATCH(content) AGAINST ('database optimization')

Spatial Indexes (R-Tree, Quad-Tree):

Optimized for geometric data (points, polygons, geographic coordinates):

Enable queries like 'find all points within this rectangle'
Support nearest-neighbor searches
Used in GIS applications, mapping, location-based services

GIN (Generalized Inverted Index):

PostgreSQL's index for composite/array types:

Indexes elements within arrays, JSONB documents, tsvector
Enables queries like WHERE tags @> ARRAY['urgent', 'important']

Specialized Index Types Summary
Index Type	Data Type	Query Pattern	Database Support
Full-Text	Text/VARCHAR	MATCH, CONTAINS, phrase search	All major RDBMS
R-Tree/Spatial	Geometry, Geography	ST_Contains, ST_Distance, bounding box	PostGIS, MySQL, Oracle Spatial
GIN	Array, JSONB, tsvector	Array containment, JSONB path	PostgreSQL
GiST	Custom types, ranges	Range overlap, geometric operators	PostgreSQL
BRIN	Large sorted tables	Range queries on naturally ordered data	PostgreSQL
Bloom Filter	Multi-column equality	Multi-column equality with many columns	PostgreSQL

specialized_index_examples.sql
-- Full-text search index
CREATE INDEX idx_articles_fts ON articles 
USING GIN (to_tsvector('english', title || ' ' || body));
 
-- Query using full-text index
SELECT * FROM articles 
WHERE to_tsvector('english', title || ' ' || body) 
   @@ to_tsquery('english', 'database & optimization');
 
-- GIN index on JSONB column
CREATE INDEX idx_metadata_gin ON products USING GIN (metadata);
 
-- Query using GIN index
SELECT * FROM products 
WHERE metadata @> '{"category": "electronics"}';
 
-- BRIN index for time-series data (naturally ordered by timestamp)
CREATE INDEX idx_events_brin ON events USING BRIN (created_at);
-- BRIN is tiny (stores min/max per block range), efficient for sequential data
 
-- Partial index for common query pattern
CREATE INDEX idx_pending_orders ON orders(customer_id, created_at) 
WHERE status = 'pending';
-- Much smaller than full index, perfect for "show pending orders" queries

Index Type Selection Heuristic

Default to B-tree for general-purpose indexing. Use hash only for pure equality lookups in memory-optimized scenarios. Use bitmap for low-cardinality columns in analytical workloads. Use specialized indexes (GIN, spatial, full-text) only when the query pattern explicitly requires them—they add complexity and maintenance overhead.

Index Selection Methodology

Systematic index selection requires analyzing your workload—the mix of queries that run against the database—and designing indexes that accelerate the most important ones while minimizing write overhead.

The index selection framework:

Step 1: Query Workload Analysis

Identify and prioritize queries by:

Frequency: How often does this query run?
Criticality: Is this user-facing with latency requirements?
Cost: How expensive is this query without indexes?

Step 2: Query Decomposition

For each important query, identify:

Selection predicates: WHERE clause columns
Join predicates: Columns used in joins
Ordering requirements: ORDER BY columns
Grouping requirements: GROUP BY columns
Covering potential: SELECT columns (for covering indexes)

Step 3: Candidate Index Generation

For each query, generate candidate indexes:

Query: SELECT name, email FROM users 
       WHERE status = 'active' AND country = 'US' 
       ORDER BY created_at DESC LIMIT 10;

Candidates:
1. INDEX(status)
2. INDEX(country)
3. INDEX(status, country)
4. INDEX(status, country, created_at)
5. INDEX(status, country, created_at) INCLUDE (name, email)  -- covering
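Candidate generation of this kind is mechanical enough to sketch in code. The function below is a simplified illustration, not a real advisor: it assumes the query has already been decomposed into equality, ordering, and selected columns, and all names are hypothetical.

```python
def candidate_indexes(equality_cols, order_cols, select_cols):
    """Enumerate candidates: single columns, equality prefix, + ordering, + covering."""
    candidates = [{"key": (col,), "include": ()} for col in equality_cols]
    key = tuple(equality_cols)
    if len(equality_cols) > 1:
        candidates.append({"key": key, "include": ()})
    if order_cols:
        key = key + tuple(order_cols)       # ordering column goes after equality columns
        candidates.append({"key": key, "include": ()})
    extra = tuple(col for col in select_cols if col not in key)
    if extra:
        candidates.append({"key": key, "include": extra})   # covering variant
    return candidates


def render(candidate):
    """Format a candidate in the INDEX(...) INCLUDE (...) notation used above."""
    text = f"INDEX({', '.join(candidate['key'])})"
    if candidate["include"]:
        text += f" INCLUDE ({', '.join(candidate['include'])})"
    return text


rendered = [
    render(c)
    for c in candidate_indexes(["status", "country"], ["created_at"], ["name", "email"])
]
for line in rendered:
    print(line)
```

For the example query this yields the same five candidates listed above; the next step decides which of them is worth its write cost.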

Step 4: Cost-Benefit Analysis

For each candidate:

Estimate read benefit (I/O saved per query × query frequency)
Estimate write cost (additional I/O per write × write frequency)
Estimate storage cost
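The three estimates can be combined into a simple score. The sketch below is one possible way to do it, under the assumption that read savings and write costs are both expressed in block I/Os per day; every number in it is invented for illustration and would come from workload measurements in practice.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    reads_saved_per_query: float   # block reads avoided each time the query runs
    queries_per_day: float
    extra_io_per_write: float      # additional block writes per INSERT/UPDATE/DELETE
    writes_per_day: float
    size_mb: float                 # storage cost, reported alongside the score

    def net_io_per_day(self) -> float:
        benefit = self.reads_saved_per_query * self.queries_per_day
        cost = self.extra_io_per_write * self.writes_per_day
        return benefit - cost


# Hypothetical workload figures, not measurements.
candidates = [
    Candidate("INDEX(status)", 200, 5_000, 1, 50_000, 40),
    Candidate("INDEX(status, country, created_at)", 990, 5_000, 2, 50_000, 120),
]

ranked = sorted(candidates, key=Candidate.net_io_per_day, reverse=True)
for c in ranked:
    print(f"{c.name}: net {c.net_io_per_day():,.0f} I/Os saved per day, {c.size_mb} MB")
```

A negative score means the index costs more on writes than it saves on reads, which is the signal to drop the candidate.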

Step 5: Index Consolidation

Identify indexes that serve multiple queries:

INDEX(customer_id, order_date) serves both:
- WHERE customer_id = ?
- WHERE customer_id = ? AND order_date > ?

Prefer consolidated indexes that cover multiple query patterns.
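Consolidation opportunities can be found by checking whether one index's column list is a leading prefix of another's. This is a minimal sketch over column-name tuples; the sample indexes are hypothetical, and the check ignores uniqueness constraints, partial indexes, and sort direction, any of which can make a shorter index worth keeping.

```python
def redundant_indexes(indexes):
    """Return (shorter, longer) pairs where shorter is a leading prefix of longer."""
    pairs = []
    for shorter in indexes:
        for longer in indexes:
            if len(shorter) < len(longer) and longer[:len(shorter)] == shorter:
                pairs.append((shorter, longer))
    return pairs


existing = [
    ("customer_id",),
    ("customer_id", "order_date"),
    ("order_date",),
]

print(redundant_indexes(existing))
```

Here only the single-column index on customer_id is flagged; the index on order_date is not a leading prefix of anything and still serves date-only queries.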

The Index Advisor

Most modern databases ship tools that automate parts of this analysis. SQL Server's Database Engine Tuning Advisor recommends indexes directly; MySQL's Performance Schema and sys schema expose statement and index usage statistics; in PostgreSQL, pg_stat_statements identifies expensive queries and hypothetical indexes (for example via the HypoPG extension) let you test a candidate without building it. However, human judgment remains essential for understanding business priorities and workload nuances.

Index Selection Decision Matrix
Query Pattern	Recommended Index Strategy
Equality on single column	B-tree on that column
Equality on multiple columns	Composite B-tree (most selective first)
Range on single column	B-tree on that column
Equality + Range	Composite B-tree (equality columns first, then range)
ORDER BY frequently	Include ORDER BY column in index (typically last)
SELECT few columns	Consider covering index with INCLUDE clause
Low cardinality + OLAP	Bitmap index (if supported)
Full-text search	Full-text index
JSON/Array queries	GIN index (PostgreSQL)

Common Indexing Mistakes

Even experienced engineers make indexing mistakes. Understanding common anti-patterns helps you avoid them.

Index Anti-Patterns to Avoid

•Over-indexing — Adding indexes for every query without considering write overhead. Every INSERT must also update each index, so a table with 15 indexes can turn one row write into 16 structure updates.
•Under-indexing — No indexes except primary key, forcing full table scans for every query. The other extreme.
•Wrong column order — Composite index on (date, customer_id) when queries filter by customer_id first. Column order must match query patterns.
•Indexing low-selectivity columns alone — An index on boolean column 'is_active' may not help—50% of table still matches.
•Ignoring covering indexes — Missing opportunity to avoid table access entirely by including SELECT columns in index.
•Duplicate indexes — INDEX(a) and INDEX(a, b) both exist; the first is usually redundant because the composite index serves single-column queries on a (unless INDEX(a) enforces uniqueness).
•Unused indexes — Indexes that no query uses, consuming space and slowing writes. Audit periodically.
•Not analyzing after bulk load — Statistics stale after major data load, causing optimizer to ignore good indexes.

find_unused_indexes.sql
-- Find indexes that have never been scanned since statistics were last reset
SELECT 
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_scan,
    idx_tup_read,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- Never used
ORDER BY pg_relation_size(indexrelid) DESC;
 
-- Find duplicate/redundant indexes: index_b's key columns are a leading
-- prefix of index_a's, so index_b is the candidate for removal
SELECT 
    a.indexrelid::regclass AS index_a,
    b.indexrelid::regclass AS index_b,
    a.indkey AS keys_a,
    b.indkey AS keys_b
FROM pg_index a
JOIN pg_index b ON (
    a.indrelid = b.indrelid 
    AND a.indexrelid != b.indexrelid
    -- Trailing space stops column 1 from matching columns 10, 11, ...
    AND (a.indkey::text || ' ') LIKE (b.indkey::text || ' %')
);

Index Audit Routine

Schedule monthly index audits: (1) Identify unused indexes and consider dropping, (2) Find slow queries that might benefit from new indexes, (3) Check for redundant indexes that waste space, (4) Rebuild fragmented indexes in maintenance windows, (5) Update statistics after major data changes.

Summary: Index Selection

Index selection is one of the highest-impact physical design decisions. A well-chosen index strategy can improve query performance by orders of magnitude while poorly chosen indexes waste resources and slow writes.

Key Takeaways

•B-tree indexes — The default choice; support equality, range, ordering, and prefix queries with O(log n) lookup
•Hash indexes — O(1) equality only; limited utility in most RDBMS; B-tree often preferred even for equality
•Bitmap indexes — Excellent for low-cardinality columns in OLAP; disastrous for write-heavy OLTP
•Specialized indexes — Full-text, spatial, GIN for specific data types and query patterns
•Composite indexes — Column order matters; leftmost prefix principle governs usability
•Covering indexes — Include SELECT columns to avoid table access entirely
•Index selection process — Analyze workload, generate candidates, evaluate cost-benefit, consolidate
•Avoid common mistakes — Over-indexing, wrong column order, duplicate indexes, stale statistics

What's next:

With storage structures and indexing understood, we turn to partitioning—the technique of dividing large tables into smaller, more manageable pieces. Partitioning improves query performance, simplifies maintenance, and enables efficient data lifecycle management.

Page Complete

You now possess a comprehensive understanding of index types and selection strategies. You can analyze workloads, identify indexing opportunities, and avoid common pitfalls. Combined with storage structure knowledge, you're equipped to make informed physical design decisions. Next: partitioning strategies for large-scale data management.
