Imagine a library containing 10 million books. A researcher needs to find all books written by a specific author. Without any organizational system, the only option is to examine every single book—a process that could take years. Now imagine the library has a card catalog, organized alphabetically by author name. Suddenly, finding all books by a particular author takes minutes, not years.
This is precisely what a database index does.
An index is one of the most powerful tools in a database administrator's arsenal. It is the difference between a query that completes in milliseconds and one that takes hours. It is the structure that enables databases to scale from thousands of records to billions while maintaining responsive performance. Understanding indexes thoroughly—not just how to create them, but what they fundamentally are—is essential for anyone who works with data at scale.
By the end of this page, you will understand the formal definition of a database index, comprehend its structural characteristics, recognize its relationship to the underlying data, and appreciate why indexes are fundamental to database system design—not merely performance enhancements, but architectural necessities.
A database index is an auxiliary data structure that maintains a mapping from search key values to the locations of records containing those values in the underlying data file. This deceptively simple definition contains several critical components that warrant deep examination.
The Auxiliary Nature:
An index is fundamentally auxiliary—it is not the data itself but a secondary structure derived from the data. This distinction is crucial. The index exists to accelerate access to data that resides elsewhere (in the base table or data file). If the index were destroyed, the data would still exist; only the fast path to that data would be lost.
This auxiliary nature creates a fundamental tension in database design: indexes must be kept synchronized with the underlying data. When data changes, indexes must change too. This synchronization overhead is the price we pay for faster read access.
Think of an index like the index at the back of a textbook. The textbook's content is the authoritative data; the index is a derived structure that helps you find content faster. If you update a chapter, you must update the corresponding index entries. The index never replaces the content—it merely points to it.
The Mapping Function:
At its core, an index implements a mapping—a function that takes a search key value as input and returns the location(s) of records containing that value. This mapping can be formalized as:
Index: SearchKeyValue → {RecordLocation₁, RecordLocation₂, ..., RecordLocationₙ}
The nature of this mapping has profound implications:
The efficiency of the index depends entirely on how quickly this mapping function can be evaluated. A naive implementation might store key-location pairs in an unsorted array, requiring O(n) time for each lookup. Sophisticated implementations use data structures that achieve O(log n) or even O(1) lookup times.
| Characteristic | Description | Example |
|---|---|---|
| Domain | Set of all possible search key values | All possible employee IDs (integers 1 to 2³¹) |
| Range | Set of all record locations (pointers/RIDs) | Page numbers and slot offsets within pages |
| Cardinality | Number of distinct values in the index | 100,000 unique department names |
| Selectivity | Average number of records per distinct key value (fewer means more selective lookups) | On average, 50 employees per department |
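The contrast between the naive O(n) scan and an efficient mapping can be made concrete with a minimal sketch. This is illustrative only: the records, RIDs, and function names are hypothetical, and a Python dict stands in for a real index structure.

```python
# A toy "table" of (rid, record) pairs; RIDs are just list positions here.
records = [
    (0, {"id": 1, "dept": "Sales"}),
    (1, {"id": 2, "dept": "Engineering"}),
    (2, {"id": 3, "dept": "Sales"}),
]

# Full scan: O(n) -- examine every record for a match.
def scan_by_dept(dept):
    return [rid for rid, rec in records if rec["dept"] == dept]

# Build an index: dept -> [rid, ...]. Average O(1) lookup thereafter.
index = {}
for rid, rec in records:
    index.setdefault(rec["dept"], []).append(rid)

def lookup_by_dept(dept):
    return index.get(dept, [])

print(scan_by_dept("Sales"))    # [0, 2]
print(lookup_by_dept("Sales"))  # [0, 2]
```

Both paths return the same RIDs; only the cost differs, which is exactly the equivalence-plus-efficiency relationship developed throughout this page.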
To truly understand indexes, we must distinguish between their logical and physical manifestations. This distinction is fundamental to database architecture and explains many nuances of index behavior.
Logical Structure:
Logically, an index is an ordered collection of (search key, pointer) pairs. The search key is the value (or combination of values) on which the index is built, and the pointer is a reference to the location of the corresponding data record.
This logical view is clean and abstract. It tells us what an index contains without specifying how that content is organized in storage. From a logical perspective, an index on the last_name column might look like:
```
("Adams", ptr_to_record_1)
("Baker", ptr_to_record_2)
("Chen",  ptr_to_record_3)
("Chen",  ptr_to_record_7)   // Multiple records with same key
("Davis", ptr_to_record_4)
...
```
Physical Structure:
Physically, an index occupies disk pages, just like the data itself. The index entries must be organized within these pages in a way that enables efficient lookup. Different physical organizations lead to different index types:
B-tree / B+-tree Indexes: Organize entries in a balanced tree structure across multiple pages. Each page is a tree node containing ordered entries and pointers to child pages.
Hash Indexes: Organize entries into buckets based on a hash function applied to the search key. Each bucket occupies one or more pages.
Bitmap Indexes: Use bit vectors to represent which records contain each key value. Each distinct value has an associated bit vector stored across pages.
The physical structure determines the index's performance characteristics: which operations are fast, which are slow, and how the index scales as data grows.
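The bitmap organization in particular is easy to sketch. The following is a minimal, illustrative model (the column values and function names are hypothetical): one integer serves as a bit vector per distinct value, with bit i set when record i holds that value.

```python
# Toy column of 5 records; record positions double as RIDs.
statuses = ["active", "inactive", "active", "active", "inactive"]

# One bit vector per distinct value: bit i set <=> record i has that value.
bitmaps = {}
for i, value in enumerate(statuses):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_records(value):
    bits, rid, out = bitmaps.get(value, 0), 0, []
    while bits:
        if bits & 1:
            out.append(rid)
        bits >>= 1
        rid += 1
    return out

print(matching_records("inactive"))  # [1, 4]
```

A real bitmap index stores compressed vectors across pages, but the key property survives the simplification: ANDing or ORing two vectors answers multi-predicate queries with cheap bitwise operations.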
The "pointer" in an index entry deserves careful examination. What exactly does it mean to "point to" a record in a database? Unlike in-memory pointers in programming languages, database pointers must work with data stored on disk, potentially across restarts, and must remain valid even as data is reorganized.
Record Identifier (RID):
Most database systems use a Record Identifier (RID), also called a Row ID or Tuple ID, as the pointer format. A RID typically consists of:
RID = (PageNumber, SlotNumber)
This two-level addressing scheme provides a good balance between directness and flexibility. The page number gives a direct path to the relevant disk block, while the slot number allows records within a page to be reorganized without invalidating external references.
Using raw byte offsets into data files would provide the most direct path to records but creates brittleness. Any insertion, deletion, or compaction that shifts bytes would invalidate all pointers. The page-slot indirection allows records within a page to move (during compaction) by updating only the slot directory—a single page operation—rather than updating every index entry pointing to that page.
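The slot-directory indirection described above can be sketched in a few lines. This is a simplified model, not a real storage engine: the 4 KB page size and the dict-based slot directory are assumptions for illustration.

```python
class Page:
    """A toy disk page: raw bytes plus a slot directory."""
    def __init__(self):
        self.data = bytearray(4096)  # assumed 4 KB page
        self.slots = {}              # slot number -> byte offset

    def read(self, slot, length):
        off = self.slots[slot]
        return bytes(self.data[off:off + length])

pages = {7: Page()}
pages[7].data[100:105] = b"Alice"   # record stored at offset 100
pages[7].slots[2] = 100

rid = (7, 2)                         # an index entry points here
page_no, slot = rid
print(pages[page_no].read(slot, 5))  # b'Alice'

# Compaction moves the record within the page; only the slot
# directory is rewritten, and the RID remains valid:
pages[7].data[0:5] = b"Alice"
pages[7].slots[2] = 0
print(pages[page_no].read(slot, 5))  # b'Alice'
```

The index entry `(7, 2)` never changes across the compaction, which is precisely the resilience the page-slot scheme buys.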
Primary Key as Pointer (Clustered Index Secondary Reference):
Some database systems, particularly those using clustered index organization (like MySQL's InnoDB), use the primary key value as the pointer in secondary indexes rather than a RID. This approach has interesting trade-offs:
Advantages:
- Secondary index entries survive physical reorganization: because they store the primary key rather than a physical location, page splits and row movement in the clustered index never invalidate them
- No secondary index maintenance is required when rows move on disk
Disadvantages:
- Every secondary-index lookup requires two traversals: one down the secondary index to obtain the primary key, then one down the clustered index to reach the row
- A large primary key inflates every secondary index, since each entry carries a full copy of the key value
This design choice illustrates a fundamental principle: there is no universally "best" pointer format. The optimal choice depends on workload characteristics, update patterns, and system architecture.
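The two-step lookup this design implies can be sketched with two dicts standing in for the clustered and secondary structures. Illustrative only: the rows, keys, and function name are hypothetical, and real systems traverse B+-trees rather than hash maps.

```python
clustered = {                 # primary key -> full row (clustered index)
    10: {"id": 10, "name": "Chen"},
    11: {"id": 11, "name": "Adams"},
}
secondary = {                 # name -> [primary keys] (secondary index)
    "Chen": [10],
    "Adams": [11],
}

def find_by_name(name):
    rows = []
    for pk in secondary.get(name, []):   # step 1: secondary index -> PK
        rows.append(clustered[pk])       # step 2: clustered index -> row
    return rows

print(find_by_name("Chen"))  # [{'id': 10, 'name': 'Chen'}]
```

The second step is the "two I/Os" cost in the comparison table below; what it buys is that `secondary` never needs rewriting when rows physically move.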
| Format | Size | Lookup Cost | Update Resilience | Common Use |
|---|---|---|---|---|
| RID (Page, Slot) | 6-8 bytes | One I/O (direct) | Low (broken by page splits) | Heap-organized tables |
| Primary Key Value | Variable (key-dependent) | Two I/Os (via clustered) | High (survives reorganization) | InnoDB secondary indexes |
| Physical Address | 4-8 bytes | Zero I/O (memory) | Very Low (any movement breaks) | In-memory databases only |
An index can be understood as a contract between the database system and its users. This contract specifies what guarantees the index provides and what obligations it incurs. Understanding this contract is essential for making informed decisions about when and how to use indexes.
The Guarantees (What the Index Promises):
Equivalence: The index will return the same results as a full scan would return. If a record matching the search criteria exists, the index will find it. The index never lies about data presence or absence.
Efficiency: Lookups via the index will be more efficient than full table scans, at least for selective queries. The index provides a sublinear path to the data.
Consistency: At any moment, the index accurately reflects the current state of the underlying data; in most implementations there is no stale view or delayed synchronization.
The Obligations (What the Index Requires):
Storage Space: The index consumes additional disk space. For large tables with many indexes, the combined index storage can exceed the table storage.
Write Overhead: Every insert, update, or delete on indexed columns must also update the index. This overhead can be substantial for write-heavy workloads.
Maintenance: Indexes can become fragmented over time, requiring periodic rebuilding or reorganization to maintain optimal performance.
Memory Pressure: Effective index usage requires keeping frequently-accessed index pages in the buffer pool, competing with data pages for limited memory.
Many developers think of indexes as "free" optimizations—create them anywhere queries are slow. This is dangerously wrong. Each index is a commitment. A table with 10 indexes means every INSERT touches 10 separate data structures. On write-heavy workloads, over-indexing can cause more harm than under-indexing. The contract cuts both ways.
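The write-amplification point can be made concrete with a small sketch. The table, column names, and dict-based indexes are illustrative assumptions; the point is only the count of structures touched per insert.

```python
table = []                              # base "heap" of rows
indexes = {                             # column -> {value: [rids]}
    "dept": {}, "city": {}, "title": {},
}

def insert(row):
    """Insert one row; return how many structures were written."""
    rid = len(table)
    table.append(row)                   # 1 write to the base table...
    for col, idx in indexes.items():    # ...plus one write per index
        idx.setdefault(row[col], []).append(rid)
    return 1 + len(indexes)

touched = insert({"dept": "Sales", "city": "Oslo", "title": "Rep"})
print(touched)  # 4
```

With 3 indexes, one logical insert becomes 4 structure updates; with 10 indexes it becomes 11, each of which in a real system also incurs logging and possible page splits.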
To fully appreciate what an index is, we must understand where it fits within the broader database architecture. Indexes do not exist in isolation—they interact with nearly every major component of a database management system.
The Storage Engine Layer:
At the lowest level, the storage engine is responsible for organizing data on disk and managing index structures. The storage engine implements the physical index organization (B+-tree, hash, etc.) and handles all I/O operations. Different storage engines may implement indexes differently—this is why MySQL's InnoDB and MyISAM have different indexing behaviors despite using the same SQL syntax.
The Query Optimizer Layer:
The query optimizer is perhaps the most critical consumer of index information. When generating an execution plan for a query, the optimizer must:
- Determine which indexes are applicable to the query's predicates
- Estimate the cost of each candidate access path
- Choose between index access and a full table scan
- Select join orders and methods that exploit the available indexes
The optimizer maintains statistics about each index—number of distinct values, data distribution, clustering factor—to make informed decisions. Poor statistics lead to poor index usage, which is why regular statistics updates are essential.
The Transaction Manager:
Indexes must participate in transactions. When a transaction modifies data, the corresponding index updates must be atomic with the data updates. If a transaction rolls back, index changes must roll back too. This integration adds complexity but is non-negotiable for data integrity.
| Component | Index Relationship | Key Consideration |
|---|---|---|
| Query Optimizer | Selects whether/how to use indexes | Accurate statistics essential |
| Buffer Manager | Caches index pages in memory | Index pages compete for buffer space |
| Transaction Manager | Coordinates index updates atomically | Index locks can cause contention |
| Recovery Manager | Logs index modifications for crash recovery | Write-ahead logging overhead |
| Concurrency Control | Manages concurrent index access | Index structure locking protocols |
Understanding what an index is becomes clearer when we explicitly address what an index is not. Many performance problems stem from misconceptions about index capabilities and behaviors.
An Index is NOT a Copy of the Data:
An index contains keys and pointers, not complete records. When you create an index on a column, you are not duplicating the column's data—you are creating a sorted structure of key values paired with record locators. The actual column data remains in the base table only.
Exception: Covering indexes (discussed later in this chapter) may include additional columns beyond the key to avoid table lookups, but even these are optimizations, not fundamental to what an index is.
An Index is NOT Always Faster:
For queries that return a large percentage of the table (low selectivity), a full table scan can be faster than an index lookup. This counterintuitive result occurs because:
- An index lookup incurs one random I/O per matching row, while a full scan reads pages sequentially at much higher throughput
- When matching rows are scattered across the table, the same data page may be fetched many times
- The index traversal itself adds overhead on top of the data page accesses
A commonly cited guideline: if a query will access more than 10-15% of the table's rows, a full table scan is often more efficient than using an index. This is why the query optimizer exists—it must make this determination for every query, not blindly use available indexes.
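A back-of-envelope version of the optimizer's decision can be sketched as follows. The cost constants, tree height, and table sizes are assumptions chosen for illustration; real optimizers also account for caching, clustering, and parallelism, so the crossover point varies widely.

```python
SEQ_PAGE_COST = 1.0      # assumed relative cost of a sequential page read
RANDOM_PAGE_COST = 4.0   # assumed relative cost of a random page read

def full_scan_cost(total_pages):
    return total_pages * SEQ_PAGE_COST

def index_cost(matching_rows, tree_height=3):
    # one random I/O per matching row, plus the index traversal itself
    return (tree_height + matching_rows) * RANDOM_PAGE_COST

# A 10,000-page table holding 1,000,000 rows (100 rows/page, assumed).
pages, rows = 10_000, 1_000_000
for fraction in (0.001, 0.01, 0.10):
    matching = int(rows * fraction)
    better = "index" if index_cost(matching) < full_scan_cost(pages) else "scan"
    print(f"{fraction:>5.1%} of rows -> {better}")
```

Under these particular constants the index wins only for very selective queries; the qualitative lesson, not the exact threshold, is what carries over to real systems.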
An Index is NOT Self-Maintaining:
Indexes do not automatically adapt to changing query patterns. They must be explicitly created, and their effectiveness depends on:
- How well the indexed columns match the predicates queries actually use
- The distribution of the data, which shifts as the data changes
- The freshness of the optimizer statistics describing the index
- The degree of fragmentation accumulated through updates
An Index is NOT Free of Overhead:
Every index incurs costs beyond storage space:
- Extra I/O and logging on every insert, update, and delete of indexed columns
- Buffer pool memory consumed by index pages
- Lock and latch contention on hot index pages
- Periodic statistics collection and reorganization work
The concept of indexing did not emerge fully formed. Understanding its evolution provides valuable perspective on why indexes work the way they do and where the field is heading.
Pre-Database Era (1950s-1960s):
Before databases, data was stored in sequential files. Finding a record meant reading from the beginning until it was found—O(n) performance. The first indexes were simple sorted files that could be binary-searched, reducing lookup to O(log n). These Indexed Sequential Access Method (ISAM) structures became foundational.
The B-Tree Revolution (1970s):
Rudolf Bayer and Edward McCreight invented the B-tree in 1970 while working at Boeing Scientific Research Laboratories. This self-balancing tree structure solved critical problems:
- It remains balanced under arbitrary insertions and deletions, guaranteeing logarithmic search, insert, and delete costs
- Its high fanout keeps the tree shallow, so a lookup touches only a handful of pages
- Its fixed-size nodes map naturally onto disk blocks
The B+-tree variant, where all data lives in leaf nodes, became the dominant index structure in relational databases and remains so today.
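The leaf-level behavior that makes the B+-tree so effective—ordered entries supporting both point lookups and range scans—can be sketched with a sorted list and binary search via the standard `bisect` module. This models only the leaf level; a real B+-tree pages these entries across tree nodes, and the entries here echo the hypothetical `last_name` example from earlier.

```python
import bisect

# (key, rid) pairs kept in key order, as B+-tree leaves are.
entries = sorted([("Davis", 4), ("Adams", 1), ("Chen", 3),
                  ("Chen", 7), ("Baker", 2)])

def lookup(key):
    # ("Chen",) sorts before ("Chen", 3), so this finds the first match.
    i = bisect.bisect_left(entries, (key,))
    out = []
    while i < len(entries) and entries[i][0] == key:
        out.append(entries[i][1])
        i += 1
    return out

def range_scan(lo, hi):
    # Keys in [lo, hi]: locate lo, then walk forward in key order --
    # the access pattern at which B+-trees excel.
    i = bisect.bisect_left(entries, (lo,))
    out = []
    while i < len(entries) and entries[i][0] <= hi:
        out.append(entries[i])
        i += 1
    return out

print(lookup("Chen"))               # [3, 7]
print(range_scan("Baker", "Chen"))  # [('Baker', 2), ('Chen', 3), ('Chen', 7)]
```

Hash indexes can match the point lookup but not the range scan; the ordered leaf chain is the B+-tree's distinguishing asset.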
The Relational Era (1980s-1990s):
As relational databases matured, index technology expanded:
- Hash indexes for fast equality lookups
- Bitmap indexes for low-cardinality columns in data warehouses
- Multi-column (composite) indexes supporting compound predicates
- Cost-based optimizers that choose among available indexes using statistics
Modern Era (2000s-Present):
Recent decades brought specialized indexes for new data types and workloads:
- Full-text indexes for document and keyword search
- Spatial indexes, such as R-trees, for geometric and geographic data
- Log-structured merge (LSM) trees for write-heavy workloads
- Experimental learned indexes that model the data distribution directly
Despite 50+ years of database research and countless alternative proposals, the B+-tree remains the default index structure in virtually every major relational database. Its elegant balance of read efficiency, write performance, concurrency support, and disk optimization has proven remarkably difficult to improve upon for general-purpose workloads.
We have now established a rigorous understanding of what a database index is. This foundational knowledge will inform everything that follows in this module. Let's consolidate the key concepts:
- An index is an auxiliary structure: derived from the data, never a replacement for it, and kept synchronized with it at write time
- Logically, it is a mapping from search key values to record locations; physically, it is pages organized as a B+-tree, hash table, bitmap, or other structure
- The "pointer" is typically a RID (page number, slot number), though clustered-table systems such as InnoDB store the primary key instead
- The index is a contract: guarantees of equivalence, efficiency, and consistency in exchange for storage, write overhead, maintenance, and memory pressure
- Indexes interact with the storage engine, query optimizer, buffer manager, transaction manager, and recovery components
What's Next:
Now that we understand what an index is, we must understand what drives its organization: the search key. The next page examines search keys in depth—what they are, how they are chosen, and how their characteristics determine index effectiveness.
You now possess a rigorous understanding of what a database index is—not just a vague notion of 'something that makes queries faster,' but a precise understanding of its structure, purpose, and role in database architecture. This foundational knowledge is essential for all that follows.