Index Types - Learning Module

Primary Index: The Foundational Access Structure

The Index That Shapes Physical Storage

In the world of database indexing, not all indexes are created equal. While every index provides a fast path to data, primary indexes occupy a special position—they are intimately tied to how data is physically organized on disk. Understanding primary indexes is fundamental to understanding database performance because they represent the closest relationship an index can have with the underlying data.

When you create a primary index, you're not just creating a lookup structure—you're making a commitment about how your data will be stored and accessed. This commitment has profound implications for query performance, storage efficiency, and system design. A primary index doesn't just point to data; it determines where data lives.

What You Will Learn

By the end of this page, you will understand the precise definition and characteristics of primary indexes, how they relate to physical data ordering, why only one primary index can exist per table, their advantages and limitations, and how major database systems implement them. You'll gain the deep understanding needed to make informed decisions about primary index creation and usage.

What Is a Primary Index?

A primary index is an index built on the ordering field of a file—the field on which the file records are physically sorted on disk. This creates a one-to-one correspondence between the index ordering and the data file ordering, establishing the most efficient possible relationship between an index and its underlying data.

Formal definition:

A primary index is defined on a file that is ordered on the indexing field (called the ordering key field), where the indexing field is typically the primary key of the file. The index entries are sorted in the same order as the data records, and each block of data records has exactly one corresponding entry in the index.

To understand why this matters, consider what happens during a search:

Without any index: Full table scan—every record must be examined
With a secondary index: Index lookup → pointer → random disk access
With a primary index: Index lookup → pointer → sequential disk access (potentially)

The Ordering Key Concept

The ordering key field is the attribute on which the data file is physically ordered. When a primary index is built on this field, the index values and the data values are in the same sorted order. This alignment is what makes primary indexes exceptionally efficient for range queries and sequential access patterns.

Primary Index Characteristics
Characteristic	Description	Implication
Ordering Field	Index field matches file's physical ordering	Sequential reads are optimized
Sparse by Nature	Typically one entry per data block, not per record	Index size is much smaller than data
Anchor Records	Each index entry points to first record in a block	Efficient block-level navigation
Unique Keys	Usually built on primary key (unique values)	No duplicate handling complexity
Physical Coupling	Index order matches data order	Enables sequential I/O optimization

Anatomy of a Primary Index

A primary index consists of index entries, each containing two components:

Search Key Value: The value of the ordering field for the anchor record
Block Pointer: A pointer to the disk block containing that record

The anchor record (also called block anchor) is the first record in each disk block. Since the file is sorted on the primary key, and blocks contain consecutive records, every block's first record has the smallest key value in that block.

Index structure visualization:

primary_index_structure.txt
Primary Index (Sparse)          Data File (Ordered by Employee_ID)

┌────────┬───────────┐          ┌─────────────────────────────────────┐
│ Key    │ Block Ptr │          │ Block 1                             │
├────────┼───────────┤          │ Employee_ID: 101 │ Name: Alice      │ ← Anchor
│  101   │ → Block 1 │          │ Employee_ID: 102 │ Name: Bob        │
├────────┼───────────┤          │ Employee_ID: 103 │ Name: Carol      │
│  104   │ → Block 2 │          └─────────────────────────────────────┘
├────────┼───────────┤          ┌─────────────────────────────────────┐
│  107   │ → Block 3 │          │ Block 2                             │
├────────┼───────────┤          │ Employee_ID: 104 │ Name: David      │ ← Anchor
│  110   │ → Block 4 │          │ Employee_ID: 105 │ Name: Eve        │
├────────┼───────────┤          │ Employee_ID: 106 │ Name: Frank      │
│  113   │ → Block 5 │          └─────────────────────────────────────┘
└────────┴───────────┘          ┌─────────────────────────────────────┐
                                │ Block 3                             │
Key observations:               │ Employee_ID: 107 │ Name: Grace      │ ← Anchor
• Index has 5 entries           │ Employee_ID: 108 │ Name: Henry      │
• Data has about 15 records     │ Employee_ID: 109 │ Name: Iris       │
• Index is SPARSE               └─────────────────────────────────────┘
• One entry per data block      (Blocks 4 and 5 continue the same pattern)
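
As a rough illustration of the structure pictured above, the following Python sketch packs sorted records into fixed-size blocks and builds one index entry per block from each block's anchor record. The function names (`pack_into_blocks`, `build_sparse_index`) and the in-memory lists standing in for disk blocks are assumptions made for this example, and block numbers here start at 0 rather than 1.

```python
from typing import List, Tuple

Record = Tuple[int, str]   # (employee_id, name)
Block = List[Record]       # an in-memory stand-in for one disk block


def pack_into_blocks(records: List[Record], records_per_block: int) -> List[Block]:
    """Split records, already sorted by key, into fixed-size blocks."""
    return [records[i:i + records_per_block]
            for i in range(0, len(records), records_per_block)]


def build_sparse_index(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Create one (anchor_key, block_number) entry per block."""
    return [(block[0][0], block_no) for block_no, block in enumerate(blocks)]


names = ["Alice", "Bob", "Carol", "David", "Eve", "Frank", "Grace", "Henry", "Iris"]
records = [(101 + i, name) for i, name in enumerate(names)]

blocks = pack_into_blocks(records, records_per_block=3)
index = build_sparse_index(blocks)
print(index)  # [(101, 0), (104, 1), (107, 2)]
```

Nine records produce only three index entries, which is the sparsity the diagram shows.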

Why Sparsity Matters

A primary index is inherently sparse—it has one entry per block, not one per record. If a block holds 100 records, the index is 100x smaller than a dense index. This means the primary index often fits entirely in main memory, avoiding I/O for index traversal entirely.

Size Analysis:

Let's calculate the size advantage of a sparse primary index:

Suppose we have 1,000,000 employee records
Each disk block holds 50 records
Total data blocks: 1,000,000 / 50 = 20,000 blocks
Primary index entries: 20,000 (one per block)

If each index entry is 12 bytes (4 bytes for key + 8 bytes for pointer):

Index size = 20,000 × 12 = 240,000 bytes = 234 KB

Compare this to a dense index (one entry per record):

Dense index size = 1,000,000 × 12 = 12,000,000 bytes = 11.4 MB

The sparse primary index is approximately 50x smaller than a dense index on the same data!
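
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same calculation; the helper name `index_sizes` is made up for the example, and the 12-byte entry size is the figure assumed in the text.

```python
import math


def index_sizes(num_records: int, records_per_block: int, entry_bytes: int):
    """Return (data blocks, sparse index bytes, dense index bytes)."""
    num_blocks = math.ceil(num_records / records_per_block)
    sparse_bytes = num_blocks * entry_bytes    # one entry per block
    dense_bytes = num_records * entry_bytes    # one entry per record
    return num_blocks, sparse_bytes, dense_bytes


num_blocks, sparse_bytes, dense_bytes = index_sizes(1_000_000, 50, 12)
print(num_blocks)                           # 20000
print(sparse_bytes, sparse_bytes / 1024)    # 240000 234.375
print(dense_bytes // sparse_bytes)          # 50
```

The size ratio equals the blocking factor, so larger blocks (or smaller records) make the sparse index proportionally smaller.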

Search Operations on Primary Indexes

Primary indexes support highly efficient search operations because the index ordering matches the data ordering. This alignment enables binary search on the index followed by direct block access.

Point Query (Equality Search):

Searching for a specific key value (e.g., Employee_ID = 107):

Perform binary search on the index to find the largest key ≤ 107
Follow the block pointer to the data block
Scan within the block for the exact record

Algorithmic complexity:

Index search: O(log₂ i) where i is the number of index entries
Block access: 1 disk I/O
Block scan: O(bfr) where bfr is the blocking factor (records per block)

Since bfr is a small constant (typically 10-100) and the index is often cached in memory, the total cost is essentially O(log i) in-memory comparisons plus 1 disk I/O.

primary_index_search.sql
Search Algorithm
-- Searching for Employee_ID = 107 using Primary Index
 
-- Step 1: Binary search on index entries
-- Index entries: [101, 104, 107, 110, 113, ...]
--                                ↑
-- Find largest key ≤ 107, which is 107 itself
 
-- Step 2: Follow block pointer to Block 3
-- Block 3 contains: [107, 108, 109]
 
-- Step 3: Sequential scan within block
-- Found: Employee_ID = 107, Name = "Grace"
 
-- Performance Analysis:
-- Total index entries: 20,000 (for 1M records)
-- Binary search comparisons: log₂(20,000) ≈ 15
-- Block accesses: 1 (the data block)
-- Total: ~15 comparisons + 1 disk I/O
 
-- Compare to a linear scan of the data file (no index):
-- Up to 20,000 block accesses (about 10,000 on average)
-- Roughly 10,000x fewer block reads, assuming the index is cached in memory
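
The three search steps can be sketched in Python using the standard `bisect` module. This is a minimal in-memory model, not how any particular DBMS implements it: the `blocks` list stands in for disk blocks, and `find_record` is a name chosen for this example.

```python
from bisect import bisect_right


def find_record(index_keys, blocks, key):
    """Return the record with the given key, or None if it is absent."""
    # Step 1: binary search for the largest anchor key <= key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every anchor
    # Step 2: "read" the block (a single disk I/O in a real system).
    # Step 3: scan the records inside the block.
    for record in blocks[pos]:
        if record[0] == key:
            return record
    return None


blocks = [
    [(101, "Alice"), (102, "Bob"), (103, "Carol")],
    [(104, "David"), (105, "Eve"), (106, "Frank")],
    [(107, "Grace"), (108, "Henry"), (109, "Iris")],
]
index_keys = [block[0][0] for block in blocks]  # anchor key of each block

print(find_record(index_keys, blocks, 107))  # (107, 'Grace')
print(find_record(index_keys, blocks, 100))  # None
```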

Range Query:

Searching for a range of values (e.g., Employee_ID BETWEEN 107 AND 120):

Binary search for the lower bound (107) in the index
Follow pointer to the starting block
Sequentially read blocks until upper bound exceeded

Because data is physically ordered, range queries become largely sequential I/O, the most efficient form of disk access. When consecutive blocks are stored contiguously on disk, the disk head doesn't need to seek between them; it simply reads one block after the next.
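
A range scan over the same kind of in-memory model might look like the sketch below. The helper `range_scan` and the generated employee names are hypothetical; the point is that after one binary search the scan only moves forward through consecutive blocks and stops at the first anchor past the upper bound.

```python
from bisect import bisect_right


def range_scan(index_keys, blocks, low, high):
    """Return (matching records, number of blocks read) for low <= key <= high."""
    # Binary search for the block that may contain the lower bound.
    start = max(bisect_right(index_keys, low) - 1, 0)
    results, blocks_read = [], 0
    for block_no in range(start, len(blocks)):
        if index_keys[block_no] > high:
            break  # this block's anchor is past the upper bound
        blocks_read += 1  # consecutive blocks: sequential reads
        results.extend(r for r in blocks[block_no] if low <= r[0] <= high)
    return results, blocks_read


# Ten blocks of three records each, keys 101..130.
blocks = [[(k, f"emp{k}") for k in range(start, start + 3)]
          for start in range(101, 131, 3)]
index_keys = [block[0][0] for block in blocks]

rows, blocks_read = range_scan(index_keys, blocks, 107, 120)
print(len(rows), blocks_read)  # 14 5
```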

Primary Index Query Performance
Query Type	Index Operations	Disk I/O Pattern	Complexity
Point Query (=)	Binary search for key	1 random access	O(log i) + 1 I/O
Range Query (BETWEEN)	Binary search for lower bound	Sequential reads	O(log i) + k I/Os (k = result blocks)
Less Than (<)	Start from first index entry	Sequential reads	O(k) I/Os (k = qualifying blocks)
Greater Than (>)	Binary search for bound	Sequential reads	O(log i) + k I/Os
ORDER BY (asc)	Start from first entry	Full sequential read	O(b) I/Os (b = total data blocks)

Sequential I/O Advantage

On traditional HDDs, sequential reads can be on the order of 50-100x faster than random reads, because every random read pays for a seek and rotational delay. Even on SSDs, sequential reads are often faster (roughly 2-5x) due to read-ahead and reduced per-request overhead. Primary indexes convert many random access patterns into sequential patterns, providing significant performance benefits.
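
To see where a gap of this size can come from, here is a deliberately simple cost model. The 10 ms positioning time and 0.1 ms per-block transfer time are illustrative assumptions, not measurements of any real drive, and `hdd_read_ms` is a name invented for this sketch.

```python
def hdd_read_ms(num_blocks: int, sequential: bool,
                seek_ms: float = 10.0, transfer_ms: float = 0.1) -> float:
    """Estimate read time under a toy model (all timings are assumed values)."""
    if sequential:
        # Position the head once, then stream consecutive blocks.
        return seek_ms + num_blocks * transfer_ms
    # Random access: every block pays the positioning cost again.
    return num_blocks * (seek_ms + transfer_ms)


sequential_ms = hdd_read_ms(1000, sequential=True)
random_ms = hdd_read_ms(1000, sequential=False)
ratio = random_ms / sequential_ms
print(round(sequential_ms), round(random_ms), round(ratio))
```

Under these assumed timings the random pattern is roughly 90x slower for 1,000 blocks; different hardware parameters would give a different ratio.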

Why Only One Primary Index Per Table?

A fundamental constraint of primary indexes is that only one can exist per table. This isn't an arbitrary limitation—it's a logical necessity arising from the definition of a primary index.

Recall that a primary index requires the data file to be physically ordered on the indexing field. A file can only have one physical ordering at a time. You cannot simultaneously sort employees by ID and by name—the records must be in one sequence or another.

The physical ordering problem:

Imagine trying to create two primary indexes on an Employee table:

Primary Index on Employee_ID → records sorted by ID: [101, 102, 103, ...]
Primary Index on Hire_Date → records would need to be sorted by date: [2020-01-15, 2020-02-22, ...]

ordering_conflict.txt
Physical Reality: A file can only have ONE physical order
 
IMPOSSIBLE: Having data ordered by BOTH Employee_ID AND Hire_Date
 
Employee_ID ordering:              Hire_Date ordering:
┌───────────────────────────┐     ┌───────────────────────────┐
│ ID: 101 │ Date: 2020-05-12│     │ ID: 105 │ Date: 2019-01-10│
│ ID: 102 │ Date: 2021-03-15│     │ ID: 103 │ Date: 2019-06-20│
│ ID: 103 │ Date: 2019-06-20│     │ ID: 101 │ Date: 2020-05-12│
│ ID: 104 │ Date: 2022-08-01│     │ ID: 102 │ Date: 2021-03-15│
│ ID: 105 │ Date: 2019-01-10│     │ ID: 104 │ Date: 2022-08-01│
└───────────────────────────┘     └───────────────────────────┘
       ↑                                 ↑
These are DIFFERENT physical orderings of the SAME records.
A file can only exist in ONE ordering at a time.
 
→ Therefore: Only ONE primary index is possible.
→ Additional indexes must be SECONDARY indexes.

Design Decision

Choosing which field gets the primary index is a critical design decision. Once chosen, range queries on other fields lose the sequential I/O advantage. You must analyze your workload to determine which field benefits most from primary index ordering.
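
The cost of that choice can be illustrated with the five records from the ordering example. In this sketch the list `records` is the physical order (sorted by ID), and `positions_by_date` is a hypothetical stand-in for a secondary index on the hire date: it lists physical positions in date order.

```python
# Physical order of the file: sorted by Employee_ID.
records = [
    (101, "2020-05-12"),
    (102, "2021-03-15"),
    (103, "2019-06-20"),
    (104, "2022-08-01"),
    (105, "2019-01-10"),
]

# Physical positions visited when reading in Hire_Date order.
# ISO-formatted dates sort chronologically as plain strings.
positions_by_date = sorted(range(len(records)), key=lambda pos: records[pos][1])

print(positions_by_date)                           # [4, 2, 0, 1, 3]
print([records[p][0] for p in positions_by_date])  # [105, 103, 101, 102, 104]
```

Reading by ID visits positions 0, 1, 2, 3, 4 in order, while reading by date jumps around the file, which is why a range query on the non-ordering field cannot rely on sequential I/O.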

Criteria for Choosing Primary Index Field

•Frequent range queries — If most queries involve ranges on a particular field, that field is a strong candidate
•Natural ordering requirements — Time-series data naturally benefits from timestamp as primary index
•Join operations — If the field is frequently used in joins, primary index ordering accelerates merge joins
•Uniqueness — Primary indexes work best with unique fields to avoid complications
•Query frequency — The field accessed most frequently should receive this optimization

Primary Index vs. Clustered Index

The terms primary index and clustered index are often conflated, but they have subtle distinctions that are important to understand.

Primary Index (Classic Definition):

Built on the primary key (unique identifier) of a table
Data file is ordered on this primary key
Sparse index with one entry per block
Uniqueness is implied, since primary key values are distinct

Clustered Index (Modern DBMS Term):

An index where the data rows are stored in the same order as the index
Need not be the primary key—can be any field
Enforces physical ordering of table data
Still only one per table (same physical ordering constraint)

Primary Index vs. Clustered Index
Aspect	Primary Index (Classic)	Clustered Index (Modern)
Index Field	Must be primary key	Can be any field or combination
Uniqueness	Implicitly unique	May or may not be unique
Number Allowed	One per table	One per table
Physical Ordering	Data ordered by primary key	Data ordered by clustered key
Sparse/Dense	Typically sparse	Often implemented as B+-tree (dense leaf level)
Terminology Usage	Database theory	SQL Server, MySQL, PostgreSQL documentation

clustered_index_creation.sql
-- SQL Server: Explicit clustered index creation
CREATE CLUSTERED INDEX IX_Employee_HireDate
ON Employee (HireDate);
 
-- The primary key can be non-clustered if desired
ALTER TABLE Employee
ADD CONSTRAINT PK_Employee
PRIMARY KEY NONCLUSTERED (EmployeeID);

Practical Terminology

In practice, most modern database systems use 'clustered index' rather than 'primary index.' When reading documentation or designing schemas, remember: clustered index = physical ordering of data. Whether it's on the primary key or another field, the clustered index determines how rows are physically stored.

Implementation in Major Database Systems

Different database systems implement primary/clustered indexes with varying approaches. Understanding these differences is crucial for database architects and administrators.

MySQL (InnoDB):

InnoDB uses a structure called the clustered index (or primary key index). When you define a PRIMARY KEY, InnoDB physically orders the table data according to this key. The data rows are stored directly in the B+-tree leaf nodes of the clustered index—this is called an index-organized table.

innodb_structure.txt
InnoDB Clustered Index Structure (Index-Organized Table)
 
                    ┌─────────────────────────────────┐
                    │          Root Node              │
                    │   [50] ─────────────[100]       │
                    └─────────┬───────────────┬───────┘
                              │               │
              ┌───────────────┘               └───────────────┐
              ▼                                               ▼
┌─────────────────────────┐                 ┌─────────────────────────┐
│     Internal Node       │                 │     Internal Node       │
│ [10][25][40] ─► [50]    │                 │ [60][75][90] ─► [100]   │
└──┬──┬──┬──┬─────────────┘                 └──┬──┬──┬──┬─────────────┘
   │  │  │  │                                  │  │  │  │
   ▼  ▼  ▼  ▼                                  ▼  ▼  ▼  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          LEAF LEVEL                                   │
│  ┌─────────────────────┐  ┌─────────────────────┐                    │
│  │ Key:10 │ FULL ROW   │  │ Key:25 │ FULL ROW   │   ...              │
│  │ Name: Alice         │  │ Name: Bob           │                    │
│  │ Dept: Engineering   │  │ Dept: Sales         │                    │
│  │ Salary: 75000       │  │ Salary: 65000       │                    │
│  └─────────────────────┘  └─────────────────────┘                    │
└─────────────────────────────────────────────────────────────────────┘
              ↑
    DATA IS STORED IN THE INDEX LEAF NODES
    (No separate "heap" data file needed)

PostgreSQL:

PostgreSQL stores table data in a heap structure (unordered) by default. The primary key creates a unique B-tree index, but it's a secondary index pointing to heap locations. PostgreSQL provides a CLUSTER command that physically reorders a table according to an index, but this is a one-time operation: the ordering is not maintained as rows are later inserted or updated.

SQL Server:

SQL Server allows explicit separation of the primary key constraint from the clustered index. You can have a non-clustered primary key and a clustered index on a different column.

Oracle:

Oracle uses the term index-organized table (IOT) for structures similar to InnoDB's clustered index. Regular tables are heap-organized with separate index structures.

Primary/Clustered Index Behavior by DBMS
Database	Primary Key Behavior	Clustered Index Control	Data Storage
MySQL/InnoDB	Always clustered	PRIMARY KEY = clustered	Index-organized (IOT)
PostgreSQL	Unique B-tree index	CLUSTER command (manual)	Heap-organized
SQL Server	Clustered by default	CLUSTERED/NONCLUSTERED choice	Either IOT or heap
Oracle	Non-clustered by default	ORGANIZATION INDEX clause (index-organized table)	Heap by default
SQLite	ROWID or INTEGER PRIMARY KEY	Ordered by rowid (by PRIMARY KEY for WITHOUT ROWID tables)	B-tree organized

InnoDB Implicit Primary Key

If you don't define a PRIMARY KEY on an InnoDB table, InnoDB uses the first UNIQUE index whose key columns are all NOT NULL as the clustered index. If none exists, InnoDB generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column holding 6-byte row IDs. This hidden key can cause performance issues since you can't control or query it directly.

Advantages and Limitations of Primary Indexes

Primary indexes offer significant benefits but come with tradeoffs that must be understood for proper schema design.

Advantages

•Excellent range query performance — Physical ordering enables sequential I/O
•Compact index size — Sparse nature means fewer index entries
•Efficient ORDER BY — Sorted output without additional sorting step
•Faster merge joins — Pre-sorted data accelerates join operations
•Better cache utilization — Related records stored together
•Reduced I/O — Sequential reads minimize disk seeks

Limitations

•Only one per table — Must choose ordering field carefully
•Insert overhead — New records may require block reorganization
•Update complexity — Changing the key value causes record relocation
•Fragmentation over time — Insertions can create overflow chains or split pages
•Limited flexibility — Cannot optimize for multiple range query patterns
•Maintenance cost — Periodic reorganization may be needed

Insert Performance Impact

When inserting a new record, the database must place it in the correct physical position to maintain order. If the target block is full, the database must split the block or create an overflow page. This is why auto-increment primary keys are often recommended—new records always go at the end, minimizing reorganization.

The auto-increment pattern:

Using an auto-incrementing primary key (like SERIAL or AUTO_INCREMENT) with a clustered primary index provides the best of both worlds:

New records always have the highest key value
Insertions always append to the end of the data file
No existing blocks need reorganization
Sequential key generation = sequential physical storage

This is why the pattern of id INT AUTO_INCREMENT PRIMARY KEY is so prevalent—it optimizes for the most common insert-heavy workloads while maintaining efficient point lookups.
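
The difference between sequential and random keys can be demonstrated with a small simulation. This is a toy model, assuming blocks of four keys and a simple split-in-half policy; real storage engines use more refined strategies, and `simulate_inserts` is a name made up for this sketch.

```python
import random
from bisect import bisect_right, insort


def simulate_inserts(keys, capacity=4):
    """Insert keys into ordered blocks; return (block count, split count)."""
    blocks = []   # each block is a sorted list of keys
    splits = 0
    for key in keys:
        if not blocks:
            blocks.append([key])
            continue
        anchors = [block[0] for block in blocks]
        pos = max(bisect_right(anchors, key) - 1, 0)   # target block
        block = blocks[pos]
        if len(block) < capacity:
            insort(block, key)                 # room available: insert in order
        elif pos == len(blocks) - 1 and key > block[-1]:
            blocks.append([key])               # pure append: start a new block
        else:
            insort(block, key)                 # full block in the middle: split
            mid = len(block) // 2
            blocks[pos:pos + 1] = [block[:mid], block[mid:]]
            splits += 1
    return len(blocks), splits


sequential_keys = list(range(1, 1001))
random_keys = sequential_keys[:]
random.Random(0).shuffle(random_keys)

seq_blocks, seq_splits = simulate_inserts(sequential_keys)
rand_blocks, rand_splits = simulate_inserts(random_keys)
print(seq_blocks, seq_splits)   # 250 0
print(rand_splits > 0)          # True
```

In this model, ascending keys fill every block completely and never split one, while shuffled keys trigger splits and leave blocks partially filled.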

Summary: Primary Index Mastery

We've conducted a comprehensive exploration of primary indexes—the most fundamental index structure that establishes a direct relationship between index ordering and physical data storage.

Key Takeaways

•Definition: A primary index is built on the ordering field of a file, with index entries sorted in the same order as data records
•Sparsity: Primary indexes are inherently sparse—one entry per block, not per record—making them extremely compact
•Search efficiency: Binary search on index + single block access = O(log i) + 1 I/O for point queries
•Range query optimization: Physical ordering enables sequential I/O, which can be 50-100x faster than random I/O on HDDs
•One-per-table constraint: Only one primary index possible because a file can have only one physical ordering
•Modern terminology: 'Clustered index' is the contemporary term for this concept in most DBMS documentation
•Implementation varies: InnoDB always clusters on primary key; PostgreSQL heaps data with separate indexes; SQL Server allows choice
•Design consideration: Choose primary index field based on range query patterns and workload analysis

What's next:

Now that we understand primary indexes and their physical ordering relationship with data, we'll explore secondary indexes—the indexes that provide alternative access paths without controlling physical storage. Secondary indexes are essential for supporting multiple query patterns on a single table.

Page Complete

You now have a deep understanding of primary indexes: their structure, operations, constraints, and implementation across major database systems. This knowledge forms the foundation for understanding all other index types and making informed indexing decisions.