Imagine a library containing 10 million books. A researcher needs to find all books written by a specific author. Without any organizational system, the only option is to examine every single book—a process that could take years. Now imagine the library has a card catalog, organized alphabetically by author name. Suddenly, finding all books by a particular author takes minutes, not years.
This is precisely what a database index does.
An index is one of the most powerful tools in a database administrator's arsenal. It is the difference between a query that completes in milliseconds and one that takes hours. It is the structure that enables databases to scale from thousands of records to billions while maintaining responsive performance. Understanding indexes thoroughly—not just how to create them, but what they fundamentally are—is essential for anyone who works with data at scale.
By the end of this page, you will understand the formal definition of a database index, comprehend its structural characteristics, recognize its relationship to the underlying data, and appreciate why indexes are fundamental to database system design—not merely performance enhancements, but architectural necessities.
A database index is an auxiliary data structure that maintains a mapping from search key values to the locations of records containing those values in the underlying data file. This deceptively simple definition contains several critical components that warrant deep examination.
The Auxiliary Nature:
An index is fundamentally auxiliary—it is not the data itself but a secondary structure derived from the data. This distinction is crucial. The index exists to accelerate access to data that resides elsewhere (in the base table or data file). If the index were destroyed, the data would still exist; only the fast path to that data would be lost.
This auxiliary nature creates a fundamental tension in database design: indexes must be kept synchronized with the underlying data. When data changes, indexes must change too. This synchronization overhead is the price we pay for faster read access.
Think of an index like the index at the back of a textbook. The textbook's content is the authoritative data; the index is a derived structure that helps you find content faster. If you update a chapter, you must update the corresponding index entries. The index never replaces the content—it merely points to it.
The Mapping Function:
At its core, an index implements a mapping—a function that takes a search key value as input and returns the location(s) of records containing that value. This mapping can be formalized as:
Index: SearchKeyValue → {RecordLocation₁, RecordLocation₂, ..., RecordLocationₙ}
The nature of this mapping has profound implications:
The efficiency of the index depends entirely on how quickly this mapping function can be evaluated. A naive implementation might store key-location pairs in an unsorted array, requiring O(n) time for each lookup. Sophisticated implementations use data structures that achieve O(log n) or even O(1) lookup times.
| Characteristic | Description | Example |
|---|---|---|
| Domain | Set of all possible search key values | All possible employee IDs (integers 1 to 2³¹) |
| Range | Set of all record locations (pointers/RIDs) | Page numbers and slot offsets within pages |
| Cardinality | Number of distinct values in the index | 100,000 unique department names |
| Selectivity | Average number of records per distinct key value (fewer means more selective lookups) | On average, 50 employees per department |
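The contrast between the naive O(n) scan and an efficient mapping can be made concrete with a minimal sketch. This is illustrative only: the records, RIDs, and function names are hypothetical, and a Python dict stands in for a real index structure.

```python
# A toy "table" of (rid, record) pairs; RIDs are just list positions here.
records = [
    (0, {"id": 1, "dept": "Sales"}),
    (1, {"id": 2, "dept": "Engineering"}),
    (2, {"id": 3, "dept": "Sales"}),
]

# Full scan: O(n) -- examine every record for a match.
def scan_by_dept(dept):
    return [rid for rid, rec in records if rec["dept"] == dept]

# Build an index: dept -> [rid, ...]. Average O(1) lookup thereafter.
index = {}
for rid, rec in records:
    index.setdefault(rec["dept"], []).append(rid)

def lookup_by_dept(dept):
    return index.get(dept, [])

print(scan_by_dept("Sales"))    # [0, 2]
print(lookup_by_dept("Sales"))  # [0, 2]
```

Both paths return the same RIDs; only the cost differs, which is exactly the equivalence-plus-efficiency relationship developed throughout this page.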
To truly understand indexes, we must distinguish between their logical and physical manifestations. This distinction is fundamental to database architecture and explains many nuances of index behavior.
Logical Structure:
Logically, an index is an ordered collection of (search key, pointer) pairs. The search key is the value (or combination of values) on which the index is built, and the pointer is a reference to the location of the corresponding data record.
This logical view is clean and abstract. It tells us what an index contains without specifying how that content is organized in storage. From a logical perspective, an index on the last_name column might look like:
```
("Adams", ptr_to_record_1)
("Baker", ptr_to_record_2)
("Chen",  ptr_to_record_3)
("Chen",  ptr_to_record_7)   // Multiple records with same key
("Davis", ptr_to_record_4)
...
```
Physical Structure:
Physically, an index occupies disk pages, just like the data itself. The index entries must be organized within these pages in a way that enables efficient lookup. Different physical organizations lead to different index types:
B-tree / B+-tree Indexes: Organize entries in a balanced tree structure across multiple pages. Each page is a tree node containing ordered entries and pointers to child pages.
Hash Indexes: Organize entries into buckets based on a hash function applied to the search key. Each bucket occupies one or more pages.
Bitmap Indexes: Use bit vectors to represent which records contain each key value. Each distinct value has an associated bit vector stored across pages.
The physical structure determines the index's performance characteristics: which operations are fast, which are slow, and how the index scales as data grows.
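The bitmap organization in particular is easy to sketch. The following is a minimal, illustrative model (the column values and function names are hypothetical): one integer serves as a bit vector per distinct value, with bit i set when record i holds that value.

```python
# Toy column of 5 records; record positions double as RIDs.
statuses = ["active", "inactive", "active", "active", "inactive"]

# One bit vector per distinct value: bit i set <=> record i has that value.
bitmaps = {}
for i, value in enumerate(statuses):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_records(value):
    bits, rid, out = bitmaps.get(value, 0), 0, []
    while bits:
        if bits & 1:
            out.append(rid)
        bits >>= 1
        rid += 1
    return out

print(matching_records("inactive"))  # [1, 4]
```

A real bitmap index stores compressed vectors across pages, but the key property survives the simplification: ANDing or ORing two vectors answers multi-predicate queries with cheap bitwise operations.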
The "pointer" in an index entry deserves careful examination. What exactly does it mean to "point to" a record in a database? Unlike in-memory pointers in programming languages, database pointers must work with data stored on disk, potentially across restarts, and must remain valid even as data is reorganized.
Record Identifier (RID):
Most database systems use a Record Identifier (RID), also called a Row ID or Tuple ID, as the pointer format. A RID typically consists of:
RID = (PageNumber, SlotNumber)
This two-level addressing scheme provides a good balance between directness and flexibility. The page number gives a direct path to the relevant disk block, while the slot number allows records within a page to be reorganized without invalidating external references.
Using raw byte offsets into data files would provide the most direct path to records but creates brittleness. Any insertion, deletion, or compaction that shifts bytes would invalidate all pointers. The page-slot indirection allows records within a page to move (during compaction) by updating only the slot directory—a single page operation—rather than updating every index entry pointing to that page.
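The slot-directory indirection described above can be sketched in a few lines. This is a simplified model, not a real storage engine: the 4 KB page size and the dict-based slot directory are assumptions for illustration.

```python
class Page:
    """A toy disk page: raw bytes plus a slot directory."""
    def __init__(self):
        self.data = bytearray(4096)  # assumed 4 KB page
        self.slots = {}              # slot number -> byte offset

    def read(self, slot, length):
        off = self.slots[slot]
        return bytes(self.data[off:off + length])

pages = {7: Page()}
pages[7].data[100:105] = b"Alice"   # record stored at offset 100
pages[7].slots[2] = 100

rid = (7, 2)                         # an index entry points here
page_no, slot = rid
print(pages[page_no].read(slot, 5))  # b'Alice'

# Compaction moves the record within the page; only the slot
# directory is rewritten, and the RID remains valid:
pages[7].data[0:5] = b"Alice"
pages[7].slots[2] = 0
print(pages[page_no].read(slot, 5))  # b'Alice'
```

The index entry `(7, 2)` never changes across the compaction, which is precisely the resilience the page-slot scheme buys.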
Primary Key as Pointer (Clustered Index Secondary Reference):
Some database systems, particularly those using clustered index organization (like MySQL's InnoDB), use the primary key value as the pointer in secondary indexes rather than a RID. This approach has interesting trade-offs:
Advantages:
- Secondary index entries survive physical reorganization: because they store the primary key rather than a physical location, page splits and row movement in the clustered index never invalidate them
- No secondary index maintenance is required when rows move on disk
Disadvantages:
- Every secondary-index lookup requires two traversals: one down the secondary index to obtain the primary key, then one down the clustered index to reach the row
- A large primary key inflates every secondary index, since each entry carries a full copy of the key value
This design choice illustrates a fundamental principle: there is no universally "best" pointer format. The optimal choice depends on workload characteristics, update patterns, and system architecture.
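The two-step lookup this design implies can be sketched with two dicts standing in for the clustered and secondary structures. Illustrative only: the rows, keys, and function name are hypothetical, and real systems traverse B+-trees rather than hash maps.

```python
clustered = {                 # primary key -> full row (clustered index)
    10: {"id": 10, "name": "Chen"},
    11: {"id": 11, "name": "Adams"},
}
secondary = {                 # name -> [primary keys] (secondary index)
    "Chen": [10],
    "Adams": [11],
}

def find_by_name(name):
    rows = []
    for pk in secondary.get(name, []):   # step 1: secondary index -> PK
        rows.append(clustered[pk])       # step 2: clustered index -> row
    return rows

print(find_by_name("Chen"))  # [{'id': 10, 'name': 'Chen'}]
```

The second step is the "two I/Os" cost in the comparison table below; what it buys is that `secondary` never needs rewriting when rows physically move.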
| Format | Size | Lookup Cost | Update Resilience | Common Use |
|---|---|---|---|---|
| RID (Page, Slot) | 6-8 bytes | One I/O (direct) | Low (broken by page splits) | Heap-organized tables |
| Primary Key Value | Variable (key-dependent) | Two I/Os (via clustered) | High (survives reorganization) | InnoDB secondary indexes |
| Physical Address | 4-8 bytes | Zero I/O (memory) | Very Low (any movement breaks) | In-memory databases only |
An index can be understood as a contract between the database system and its users. This contract specifies what guarantees the index provides and what obligations it incurs. Understanding this contract is essential for making informed decisions about when and how to use indexes.
The Guarantees (What the Index Promises):
Equivalence: The index will return the same results as a full scan would return. If a record matching the search criteria exists, the index will find it. The index never lies about data presence or absence.
Efficiency: Lookups via the index will be more efficient than full table scans, at least for selective queries. The index provides a sublinear path to the data.
Consistency: At any moment, the index accurately reflects the current state of the underlying data; in most implementations there is no stale view or delayed synchronization.
The Obligations (What the Index Requires):
Storage Space: The index consumes additional disk space. For large tables with many indexes, the combined index storage can exceed the table storage.
Write Overhead: Every insert, update, or delete on indexed columns must also update the index. This overhead can be substantial for write-heavy workloads.
Maintenance: Indexes can become fragmented over time, requiring periodic rebuilding or reorganization to maintain optimal performance.
Memory Pressure: Effective index usage requires keeping frequently-accessed index pages in the buffer pool, competing with data pages for limited memory.
Many developers think of indexes as "free" optimizations—create them anywhere queries are slow. This is dangerously wrong. Each index is a commitment. A table with 10 indexes means every INSERT touches 10 separate data structures. On write-heavy workloads, over-indexing can cause more harm than under-indexing. The contract cuts both ways.
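The write-amplification point can be made concrete with a small sketch. The table, column names, and dict-based indexes are illustrative assumptions; the point is only the count of structures touched per insert.

```python
table = []                              # base "heap" of rows
indexes = {                             # column -> {value: [rids]}
    "dept": {}, "city": {}, "title": {},
}

def insert(row):
    """Insert one row; return how many structures were written."""
    rid = len(table)
    table.append(row)                   # 1 write to the base table...
    for col, idx in indexes.items():    # ...plus one write per index
        idx.setdefault(row[col], []).append(rid)
    return 1 + len(indexes)

touched = insert({"dept": "Sales", "city": "Oslo", "title": "Rep"})
print(touched)  # 4
```

With 3 indexes, one logical insert becomes 4 structure updates; with 10 indexes it becomes 11, each of which in a real system also incurs logging and possible page splits.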
To fully appreciate what an index is, we must understand where it fits within the broader database architecture. Indexes do not exist in isolation—they interact with nearly every major component of a database management system.
The Storage Engine Layer:
At the lowest level, the storage engine is responsible for organizing data on disk and managing index structures. The storage engine implements the physical index organization (B+-tree, hash, etc.) and handles all I/O operations. Different storage engines may implement indexes differently—this is why MySQL's InnoDB and MyISAM have different indexing behaviors despite using the same SQL syntax.
The Query Optimizer Layer:
The query optimizer is perhaps the most critical consumer of index information. When generating an execution plan for a query, the optimizer must:
- Determine which indexes are applicable to the query's predicates
- Estimate the cost of each candidate access path
- Choose between index access and a full table scan
- Select join orders and methods that exploit the available indexes
The optimizer maintains statistics about each index—number of distinct values, data distribution, clustering factor—to make informed decisions. Poor statistics lead to poor index usage, which is why regular statistics updates are essential.
The Transaction Manager:
Indexes must participate in transactions. When a transaction modifies data, the corresponding index updates must be atomic with the data updates. If a transaction rolls back, index changes must roll back too. This integration adds complexity but is non-negotiable for data integrity.
| Component | Index Relationship | Key Consideration |
|---|---|---|
| Query Optimizer | Selects whether/how to use indexes | Accurate statistics essential |
| Buffer Manager | Caches index pages in memory | Index pages compete for buffer space |
| Transaction Manager | Coordinates index updates atomically | Index locks can cause contention |
| Recovery Manager | Logs index modifications for crash recovery | Write-ahead logging overhead |
| Concurrency Control | Manages concurrent index access | Index structure locking protocols |
Understanding what an index is becomes clearer when we explicitly address what an index is not. Many performance problems stem from misconceptions about index capabilities and behaviors.
An Index is NOT a Copy of the Data:
An index contains keys and pointers, not complete records. When you create an index on a column, you are not duplicating the column's data—you are creating a sorted structure of key values paired with record locators. The actual column data remains in the base table only.
Exception: Covering indexes (discussed later in this chapter) may include additional columns beyond the key to avoid table lookups, but even these are optimizations, not fundamental to what an index is.
An Index is NOT Always Faster:
For queries that return a large percentage of the table (low selectivity), a full table scan can be faster than an index lookup. This counterintuitive result occurs because:
- An index lookup incurs one random I/O per matching row, while a full scan reads pages sequentially at much higher throughput
- When matching rows are scattered across the table, the same data page may be fetched many times
- The index traversal itself adds overhead on top of the data page accesses
A commonly cited guideline: if a query will access more than 10-15% of the table's rows, a full table scan is often more efficient than using an index. This is why the query optimizer exists—it must make this determination for every query, not blindly use available indexes.
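A back-of-envelope version of the optimizer's decision can be sketched as follows. The cost constants, tree height, and table sizes are assumptions chosen for illustration; real optimizers also account for caching, clustering, and parallelism, so the crossover point varies widely.

```python
SEQ_PAGE_COST = 1.0      # assumed relative cost of a sequential page read
RANDOM_PAGE_COST = 4.0   # assumed relative cost of a random page read

def full_scan_cost(total_pages):
    return total_pages * SEQ_PAGE_COST

def index_cost(matching_rows, tree_height=3):
    # one random I/O per matching row, plus the index traversal itself
    return (tree_height + matching_rows) * RANDOM_PAGE_COST

# A 10,000-page table holding 1,000,000 rows (100 rows/page, assumed).
pages, rows = 10_000, 1_000_000
for fraction in (0.001, 0.01, 0.10):
    matching = int(rows * fraction)
    better = "index" if index_cost(matching) < full_scan_cost(pages) else "scan"
    print(f"{fraction:>5.1%} of rows -> {better}")
```

Under these particular constants the index wins only for very selective queries; the qualitative lesson, not the exact threshold, is what carries over to real systems.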
An Index is NOT Self-Maintaining:
Indexes do not automatically adapt to changing query patterns. They must be explicitly created, and their effectiveness depends on:
- How well the indexed columns match the predicates queries actually use
- The distribution of the data, which shifts as the data changes
- The freshness of the optimizer statistics describing the index
- The degree of fragmentation accumulated through updates
An Index is NOT Free of Overhead:
Every index incurs costs beyond storage space:
- Extra I/O and logging on every insert, update, and delete of indexed columns
- Buffer pool memory consumed by index pages
- Lock and latch contention on hot index pages
- Periodic statistics collection and reorganization work
The concept of indexing did not emerge fully formed. Understanding its evolution provides valuable perspective on why indexes work the way they do and where the field is heading.
Pre-Database Era (1950s-1960s):
Before databases, data was stored in sequential files. Finding a record meant reading from the beginning until it was found—O(n) performance. The first indexes were simple sorted files that could be binary-searched, reducing lookup to O(log n). These Indexed Sequential Access Method (ISAM) structures became foundational.
The B-Tree Revolution (1970s):
Rudolf Bayer and Edward McCreight invented the B-tree in 1970 while working at Boeing Scientific Research Laboratories. This self-balancing tree structure solved critical problems:
- It remains balanced under arbitrary insertions and deletions, guaranteeing logarithmic search, insert, and delete costs
- Its high fanout keeps the tree shallow, so a lookup touches only a handful of pages
- Its fixed-size nodes map naturally onto disk blocks
The B+-tree variant, where all data lives in leaf nodes, became the dominant index structure in relational databases and remains so today.
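The leaf-level behavior that makes the B+-tree so effective—ordered entries supporting both point lookups and range scans—can be sketched with a sorted list and binary search via the standard `bisect` module. This models only the leaf level; a real B+-tree pages these entries across tree nodes, and the entries here echo the hypothetical `last_name` example from earlier.

```python
import bisect

# (key, rid) pairs kept in key order, as B+-tree leaves are.
entries = sorted([("Davis", 4), ("Adams", 1), ("Chen", 3),
                  ("Chen", 7), ("Baker", 2)])

def lookup(key):
    # ("Chen",) sorts before ("Chen", 3), so this finds the first match.
    i = bisect.bisect_left(entries, (key,))
    out = []
    while i < len(entries) and entries[i][0] == key:
        out.append(entries[i][1])
        i += 1
    return out

def range_scan(lo, hi):
    # Keys in [lo, hi]: locate lo, then walk forward in key order --
    # the access pattern at which B+-trees excel.
    i = bisect.bisect_left(entries, (lo,))
    out = []
    while i < len(entries) and entries[i][0] <= hi:
        out.append(entries[i])
        i += 1
    return out

print(lookup("Chen"))               # [3, 7]
print(range_scan("Baker", "Chen"))  # [('Baker', 2), ('Chen', 3), ('Chen', 7)]
```

Hash indexes can match the point lookup but not the range scan; the ordered leaf chain is the B+-tree's distinguishing asset.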
The Relational Era (1980s-1990s):
As relational databases matured, index technology expanded:
- Hash indexes for fast equality lookups
- Bitmap indexes for low-cardinality columns in data warehouses
- Multi-column (composite) indexes supporting compound predicates
- Cost-based optimizers that choose among available indexes using statistics
Modern Era (2000s-Present):
Recent decades brought specialized indexes for new data types and workloads:
- Full-text indexes for document and keyword search
- Spatial indexes, such as R-trees, for geometric and geographic data
- Log-structured merge (LSM) trees for write-heavy workloads
- Experimental learned indexes that model the data distribution directly
Despite 50+ years of database research and countless alternative proposals, the B+-tree remains the default index structure in virtually every major relational database. Its elegant balance of read efficiency, write performance, concurrency support, and disk optimization has proven remarkably difficult to improve upon for general-purpose workloads.
We have now established a rigorous understanding of what a database index is. This foundational knowledge will inform everything that follows in this module. Let's consolidate the key concepts:
- An index is an auxiliary structure: derived from the data, never a replacement for it, and kept synchronized with it at write time
- Logically, it is a mapping from search key values to record locations; physically, it is pages organized as a B+-tree, hash table, bitmap, or other structure
- The "pointer" is typically a RID (page number, slot number), though clustered-table systems such as InnoDB store the primary key instead
- The index is a contract: guarantees of equivalence, efficiency, and consistency in exchange for storage, write overhead, maintenance, and memory pressure
- Indexes interact with the storage engine, query optimizer, buffer manager, transaction manager, and recovery components
What's Next:
Now that we understand what an index is, we must understand what drives its organization: the search key. The next page examines search keys in depth—what they are, how they are chosen, and how their characteristics determine index effectiveness.
You now possess a rigorous understanding of what a database index is—not just a vague notion of 'something that makes queries faster,' but a precise understanding of its structure, purpose, and role in database architecture. This foundational knowledge is essential for all that follows.