The choice between dense and sparse indexing fundamentally trades storage space for search simplicity. Dense indexes provide direct record access at the cost of a larger index; sparse indexes achieve dramatic space savings at the cost of an extra in-block scan during lookups.
This page develops a quantitative framework for analyzing this trade-off. By understanding the mathematical relationships between record count, block size, entry size, and index dimensions, you'll be able to make informed decisions about index design and accurately predict storage requirements.
By the end of this page, you will be able to calculate index sizes precisely, compare dense vs sparse storage requirements for any scenario, analyze the impact of block size and entry size on index dimensions, and understand how index size affects memory utilization and I/O performance.
Let's establish the mathematical foundation for index size analysis. We'll define precise notation and derive the core formulas.
Variable Definitions:
| Symbol | Definition | Typical Values |
|---|---|---|
| $n$ | Number of records in data file | 10³ to 10¹⁰ |
| $R$ | Record size (bytes) | 100 - 2000 |
| $B$ | Block/page size (bytes) | 4,096 - 65,536 |
| $b_f$ | Blocking factor (records per block) = $\lfloor B/R \rfloor$ | 20 - 500 |
| $b$ | Number of data blocks = $\lceil n/b_f \rceil$ | Derived |
| $K$ | Key size (bytes) | 4 - 128 |
| $P$ | Pointer size (bytes) | 4 - 8 |
| $E$ | Index entry size = $K + P$ | 8 - 136 |
| $i_f$ | Index blocking factor = $\lfloor B/E \rfloor$ | 50 - 1000 |
Dense Index Size:
A dense index contains exactly $n$ entries (one per record):
$$S_{dense} = n \times E \text{ bytes}$$
Number of index blocks:
$$b_{dense} = \lceil n / i_f \rceil$$
Sparse Index Size:
A sparse index contains exactly $b$ entries (one per data block):
$$S_{sparse} = b \times E = \lceil n / b_f \rceil \times E \text{ bytes}$$
Number of index blocks:
$$b_{sparse} = \lceil b / i_f \rceil = \left\lceil \frac{n}{b_f \times i_f} \right\rceil$$
The ratio of sparse to dense index size equals the inverse of the blocking factor:
$$\frac{S_{sparse}}{S_{dense}} = \frac{b}{n} = \frac{1}{b_f}$$
If each block holds 100 records, the sparse index is roughly 100× smaller (exactly so when $n$ is a multiple of $b_f$). This ratio is independent of total record count—it depends only on record size and block size.
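To make the formulas concrete, here is a minimal sketch, runnable in psql, that plugs illustrative values into the definitions above. The inputs (10 million 100-byte records, 8 KB blocks, 16-byte entries) are assumptions chosen for the example, not figures from the text:

```sql
-- Hypothetical inputs: n = 10M records, B = 8192, R = 100, E = 16
WITH p AS (SELECT 10000000::bigint AS n, 8192 AS blk, 100 AS rec, 16 AS entry)
SELECT blk / rec                                      AS bf,           -- b_f = floor(B/R) = 81
       ceil(n / (blk / rec)::numeric)::bigint         AS data_blocks,  -- b = ceil(n/b_f)
       n * entry                                      AS dense_bytes,  -- S_dense = n * E
       ceil(n / (blk / rec)::numeric)::bigint * entry AS sparse_bytes  -- S_sparse = b * E
FROM p;
```

The dense index comes out near 160 MB, the sparse one near 2 MB—a ratio of $b_f = 81$, as the formula predicts.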
Let's apply these formulas to a realistic e-commerce scenario to see the concrete implications.
Scenario: Order History Table
Table: orders
├── order_id (BIGINT): 8 bytes ← Primary Key
├── customer_id (BIGINT): 8 bytes
├── order_date (TIMESTAMP): 8 bytes
├── status (VARCHAR(20)): 21 bytes
├── total_amount (DECIMAL): 16 bytes
├── shipping_address: 100 bytes (avg)
├── billing_address: 100 bytes (avg)
├── metadata (JSON): 200 bytes (avg)
├── Row overhead: 20 bytes
└── TOTAL RECORD SIZE: 481 bytes ≈ 500 bytes
Configuration:
├── Record count (n): 50,000,000 (50 million orders)
├── Block size (B): 8,192 bytes (8 KB)
├── Key size (K): 8 bytes (BIGINT order_id)
├── Pointer size (P): 8 bytes
└── Entry size (E): 16 bytes
Calculations:
Step 1: Data File Metrics
Blocking factor: b_f = ⌊8192 / 500⌋ = 16 records/block
Data blocks: b = ⌈50,000,000 / 16⌉ = 3,125,000 blocks
Data file size: 3,125,000 × 8 KB ≈ 25.6 GB
Step 2: Dense Index Size
Dense entries: n = 50,000,000
Dense index size: 50,000,000 × 16 bytes = 800 MB
Index blocking factor: i_f = ⌊8192 / 16⌋ = 512 entries/block
Dense index blocks: ⌈50,000,000 / 512⌉ = 97,657 blocks
Step 3: Sparse Index Size
Sparse entries: b = 3,125,000
Sparse index size: 3,125,000 × 16 bytes = 50 MB
Sparse index blocks: ⌈3,125,000 / 512⌉ = 6,104 blocks
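As a sanity check, the same arithmetic can be reproduced directly in psql; the constants simply mirror the scenario above:

```sql
-- Orders scenario: n = 50M, B = 8192 bytes, R = 500 bytes, E = 16 bytes
WITH p AS (SELECT 50000000::numeric AS n, 8192 AS blk, 500 AS rec, 16 AS entry)
SELECT blk / rec                                  AS bf,                -- 16 records/block
       ceil(n / (blk / rec))                      AS data_blocks,       -- 3,125,000
       round(n * entry / 1e6)                     AS dense_mb,          -- 800
       ceil(n / (blk / entry))                    AS dense_idx_blocks,  -- 97,657
       round(ceil(n / (blk / rec)) * entry / 1e6) AS sparse_mb,         -- 50
       ceil(n / ((blk / rec) * (blk / entry)))    AS sparse_idx_blocks  -- 6,104
FROM p;
```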
| Metric | Dense Index | Sparse Index | Ratio |
|---|---|---|---|
| Entries | 50,000,000 | 3,125,000 | 16:1 |
| Raw Size | 800 MB | 50 MB | 16:1 |
| Index Blocks | 97,657 | 6,104 | 16:1 |
| % of Data Size | 3.1% | 0.2% | 16:1 |
An 800 MB dense index likely exceeds typical buffer pool allocations and will require disk I/O for many lookups. A 50 MB sparse index fits comfortably in memory on almost any server, enabling pure in-memory index traversal. This 16× difference directly impacts query latency.
The blocking factor ($b_f$ = records per block) is the key determinant of sparse index efficiency. Let's analyze how record size and block size interact to affect this crucial ratio.
Blocking Factor Formula:
$$b_f = \left\lfloor \frac{B}{R} \right\rfloor$$
where $B$ is the block size and $R$ is the record size, both in bytes.
Space Compression Ratio:
$$\text{Compression} = \frac{\text{Dense Size}}{\text{Sparse Size}} = b_f = \left\lfloor \frac{B}{R} \right\rfloor$$
Higher blocking factors yield greater sparse index advantages; the table below quantifies this across common record and block sizes, and a short query after the analysis regenerates it.
| Record Size | Block Size 4KB | Block Size 8KB | Block Size 16KB |
|---|---|---|---|
| 50 bytes | 81 records/block | 163 records/block | 327 records/block |
| 100 bytes | 40 records/block | 81 records/block | 163 records/block |
| 200 bytes | 20 records/block | 40 records/block | 81 records/block |
| 500 bytes | 8 records/block | 16 records/block | 32 records/block |
| 1000 bytes | 4 records/block | 8 records/block | 16 records/block |
| 2000 bytes | 2 records/block | 4 records/block | 8 records/block |
Analysis Insights:
Small Records Favor Sparse: With 50-byte records and 8KB blocks, sparse indexes are 163× smaller than dense. The space savings are dramatic.
Large Records Diminish Advantage: With 2000-byte records and 4KB blocks, only 2 records fit per block, making sparse indexes only 2× smaller. The advantage becomes marginal.
Block Size Scaling: Larger blocks increase blocking factor linearly, amplifying sparse index advantages proportionally.
Threshold Consideration: When $b_f < 10$, the sparse index advantage may not justify the two-phase search overhead in some workloads.
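The blocking-factor table is easy to regenerate for other configurations; a small query (the record and block sizes are those from the table) does it in one pass:

```sql
-- b_f = floor(B / R); integer division performs the flooring
SELECT r AS record_bytes,
       4096 / r  AS bf_4kb,
       8192 / r  AS bf_8kb,
       16384 / r AS bf_16kb
FROM unnest(ARRAY[50, 100, 200, 500, 1000, 2000]) AS r;
```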
With variable-length records, the blocking factor varies across blocks. Use the average record size for estimation, but expect actual sparse index sizes to deviate by ±10-20% from predictions. Some blocks will be densely packed (small records), others sparse (large records).
While the dense-to-sparse ratio depends only on blocking factor, the absolute size of both indexes scales with key size. This has significant practical implications. The comparison below assumes 10 million records, 80 records per block, and an 8-byte pointer; a query reproducing it follows the key observations.
| Key Type | Key Size | Entry Size | Dense Size | Sparse Size |
|---|---|---|---|---|
| INT | 4 bytes | 12 bytes | 120 MB | 1.5 MB |
| BIGINT | 8 bytes | 16 bytes | 160 MB | 2.0 MB |
| UUID (binary) | 16 bytes | 24 bytes | 240 MB | 3.0 MB |
| UUID (string) | 36 bytes | 44 bytes | 440 MB | 5.5 MB |
| VARCHAR(50) avg | 50 bytes | 58 bytes | 580 MB | 7.25 MB |
| Composite (3 INTs) | 12 bytes | 20 bytes | 200 MB | 2.5 MB |
Key Observations:
String UUIDs Cost Nearly 4× Integer Keys: A table using string UUIDs (36 bytes) instead of integers (4 bytes) sees its dense index grow from 120 MB to 440 MB—a 3.7× increase.
Index Fanout Degrades: Larger keys mean fewer entries per index block. With 44-byte entries in 8KB blocks, fanout drops to 186 (vs 682 for 12-byte entries). This increases B-tree height.
Composite Keys Compound: Multi-column keys add sizes together. A composite key on (customer_id, order_id, product_id) with three BIGINTs creates 24-byte keys.
Sparse Still Proportional: Despite larger absolute sizes, the sparse-to-dense ratio remains constant at 1:80 (for 80 records/block). Sparse indexing benefits persist across key sizes.
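A sketch of the same comparison as a query; the key types, the 10-million-row count, and the 8-byte pointer are assumptions carried over from the table above:

```sql
-- Entry size E = K + P with an assumed 8-byte pointer; dense size for 10M rows
SELECT key_type,
       key_bytes + 8                           AS entry_bytes,
       8192 / (key_bytes + 8)                  AS fanout_8kb,    -- entries per 8 KB block
       round(10000000 * (key_bytes + 8) / 1e6) AS dense_mb_10m
FROM (VALUES ('INT', 4), ('BIGINT', 8), ('UUID binary', 16),
             ('UUID string', 36), ('VARCHAR(50) avg', 50),
             ('Composite (3 INTs)', 12)) AS k(key_type, key_bytes);
```

Note how the fanout column falls from 682 (INT) to 186 (string UUID), matching the fanout degradation described above.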
When storage or memory is constrained, prefer compact key types. Using BIGINT surrogate keys instead of natural VARCHAR keys can reduce index size by 50% or more—often the difference between fitting in memory and spilling to disk.
Real index structures like B-trees are multi-level. Understanding the total storage across all levels provides a complete picture of index overhead.
Multi-Level Size Formula:
For a multi-level sparse index with fanout $f$ (entries per index block):
$$\text{Total Index Blocks} = \sum_{i=1}^{h} \left\lceil \frac{b}{f^i} \right\rceil$$
This is a geometric series; for large $f$ it is well approximated by:
$$\text{Total Blocks} \approx \frac{b}{f-1}$$
Example Calculation:
Scenario:
├── Data blocks (b): 1,000,000
├── Index fanout (f): 500 entries/block
└── Block size: 8 KB
Level 1: ⌈1,000,000 / 500⌉ = 2,000 blocks
Level 2: ⌈2,000 / 500⌉ = 4 blocks
Level 3: ⌈4 / 500⌉ = 1 block (root)
Total index blocks: 2,005 blocks
Total index size: 2,005 × 8 KB ≈ 16 MB
Overhead ratio: 16 MB / (1M × 8 KB) = 16 MB / 8 GB = 0.2%
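The level-by-level computation maps naturally onto a recursive CTE. This sketch hard-codes the example's 1,000,000 data blocks and fanout of 500:

```sql
-- Each index level holds ceil(previous_level / f) blocks; stop at the single root
WITH RECURSIVE lvl(level, blocks) AS (
  SELECT 1, ceil(1000000 / 500.0)::bigint           -- level 1: one entry per data block
  UNION ALL
  SELECT level + 1, ceil(blocks / 500.0)::bigint
  FROM lvl
  WHERE blocks > 1
)
SELECT level, blocks, sum(blocks) OVER () AS total_blocks   -- 2,000 + 4 + 1 = 2,005
FROM lvl;
```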
Dense vs Sparse Multi-Level Comparison:
Same 1M block scenario:
SPARSE (indexing blocks):
├── Base entries: 1,000,000
├── Level 1 entries: 2,000
├── Level 2 entries: 4
├── Total entries: 1,002,004
└── Total size: ~16 MB
DENSE (indexing records with b_f = 100):
├── Base entries: 100,000,000 (n = b × b_f)
├── Level 1 entries: 200,000
├── Level 2 entries: 400
├── Level 3 entries: 1
├── Total entries: 100,200,401
└── Total size: ~1.6 GB
Ratio: 1.6 GB / 16 MB = 100:1
The 100:1 ratio matches the blocking factor ($b_f = 100$), confirming multi-level overhead is proportionally distributed.
In production B+trees, add 10-30% overhead for: (1) partial block utilization after splits, (2) per-entry metadata, (3) free space for future inserts, (4) node headers. Multiply calculated sizes by 1.2-1.3 for realistic estimates.
Index size directly impacts memory utilization. When indexes exceed available buffer pool memory, performance degrades dramatically. This section analyzes the memory implications of the dense/sparse choice.
The Memory Threshold Effect:
Performance Curve:

 Response │
Time (ms) │
       50 │                           ╭─────────────────
          │                           │
       20 │                           │
          │      index fits           │    index spills
        1 │──────in memory────────────╯    to disk
          └───────────────────────────┼────────────────→
                                 buffer pool       Index Size
                                 threshold
When the index fits in memory: sub-millisecond lookups. When it doesn't: disk I/O adds 5-20 ms per access. The table below assumes 16-byte entries, 80 records per block, and a 4 GB buffer pool.
| Scenario | Dense Size | Sparse Size | Dense Fits? | Sparse Fits? |
|---|---|---|---|---|
| 1M records | 16 MB | 0.2 MB | ✓ Yes | ✓ Yes |
| 10M records | 160 MB | 2 MB | ✓ Yes | ✓ Yes |
| 100M records | 1.6 GB | 20 MB | ✓ Yes | ✓ Yes |
| 500M records | 8 GB | 100 MB | ✗ No | ✓ Yes |
| 1B records | 16 GB | 200 MB | ✗ No | ✓ Yes |
| 10B records | 160 GB | 2 GB | ✗ No | ✓ Yes |
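The fit/no-fit columns follow mechanically from the sizes; here is a sketch under the same assumptions (16-byte entries, 80 records per block, a 4 GB buffer pool):

```sql
-- dense = n * 16 bytes; sparse = ceil(n / 80) * 16 bytes; pool = 4 GiB
SELECT n,
       n * 16                                        AS dense_bytes,
       ceil(n / 80.0)::bigint * 16                   AS sparse_bytes,
       n * 16 <= 4 * 1024.0 ^ 3                      AS dense_fits_4gb,
       ceil(n / 80.0)::bigint * 16 <= 4 * 1024.0 ^ 3 AS sparse_fits_4gb
FROM unnest(ARRAY[1000000, 10000000, 100000000, 500000000,
                  1000000000, 10000000000]::bigint[]) AS n;
```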
Analysis:
At 500M records, the dense index (8 GB) exceeds the 4 GB buffer pool. Index operations now require disk I/O, increasing latency 100-1000×.
The sparse index (100 MB) fits comfortably, maintaining in-memory performance even at 10 billion records.
Working Set Consideration:
Not all index entries are accessed equally. An index has a 'hot' working set of frequently accessed entries. If this working set fits in memory, performance remains acceptable even with a larger total index:
Working Set Size ≈ (Hot Key Fraction) × (Index Size)
Example:
├── 500M records, dense index = 8 GB
├── 1% of keys are frequently accessed
├── Working set = 80 MB (fits!)
└── Acceptable performance for hot queries
When access patterns are uniformly distributed (no hot set), the entire index becomes the working set. This is common in analytics workloads, batch processing, and testing scenarios. Sparse indexes provide crucial resilience against this worst case.
Beyond storage size, we must analyze the I/O cost of searches. Sparse indexes trade smaller size for additional search steps. Let's quantify this trade-off.
I/O Cost Models:
Dense Index Point Query:
I/O = (Index Height) + 1 (data block)
= ⌈log_f(n)⌉ + 1
Where f = index fanout (entries per block)
Sparse Index Point Query:
I/O = (Index Height) + 1 (data block); the final in-block scan adds no extra I/O, since the block is already in memory once it has been read
= ⌈log_f(b)⌉ + 1
Where b = number of data blocks = n / b_f
Since $b = n / b_f$:
Dense: ⌈log_f(n)⌉ + 1
Sparse: ⌈log_f(n/b_f)⌉ + 1 = ⌈log_f(n) - log_f(b_f)⌉ + 1
The difference is approximately $log_f(b_f)$, which is typically less than 1 for reasonable fanouts and blocking factors.
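Plugging in the orders example (n = 50M, $b_f$ = 16, and the 16-byte-entry fanout f = 512, all values from earlier) shows the gap can round away entirely:

```sql
-- Height ≈ ceil(log_f(entries)); log_f(x) = ln(x) / ln(f)
SELECT ceil(ln(50000000)        / ln(512)) AS dense_height,   -- ceil(2.84) = 3
       ceil(ln(50000000 / 16.0) / ln(512)) AS sparse_height,  -- ceil(2.40) = 3
       ln(16) / ln(512)                    AS height_gap;     -- log_512(16) ≈ 0.44
```

Here both indexes have height 3, so the sparse lookup costs no additional index I/O at all.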
| Records (n) | Dense Height | Sparse Height | Extra I/O |
|---|---|---|---|
| 1M | 3 | 2-3 | 0-1 |
| 100M | 4 | 3 | 1 |
| 10B | 5 | 4 | 1 |
| 1T | 6 | 5 | 1 |
Key Insight:
The I/O difference between dense and sparse indexes is typically at most 1 block read for the index traversal, plus the in-block scan time (microseconds in memory).
Given that the height difference is at most one level and the in-block scan costs only microseconds, the 'extra cost' of sparse indexes is negligible—often invisible in latency measurements.
The Real Trade-off:
The true benefit of sparse indexes isn't saving one index I/O—it's keeping the entire index in memory when a dense index would spill to disk. This can mean the difference between sub-millisecond in-memory lookups and multi-millisecond disk-bound ones.
Sparse indexes provide maximum benefit when: (1) the sparse index fits in memory while the dense equivalent wouldn't, AND (2) data is sorted on the index key. In this scenario, you get both the memory benefit AND the clustering benefit for range queries.
Based on our analysis, here are practical guidelines for estimating and planning index sizes in real systems.
```sql
-- Estimate index sizes before creation

-- Step 1: Get table statistics
SELECT relname AS table_name,
       n_live_tup AS row_count,
       pg_relation_size(relid) AS table_bytes
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Step 2: Estimate dense index size
-- Formula: entries × entry_size × overhead_factor
SELECT n_live_tup AS rows,
       n_live_tup * 16 AS raw_dense_bytes,             -- 8-byte key + 8-byte pointer
       n_live_tup * 16 * 1.3 AS estimated_dense,       -- With 30% overhead
       pg_size_pretty((n_live_tup * 16 * 1.3)::bigint) AS dense_readable
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Step 3: Actual index sizes (after creation)
SELECT indexrelname AS index_name,
       pg_relation_size(indexrelid) AS index_bytes,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Step 4: Index-to-table ratio
SELECT t.relname AS table_name,
       pg_size_pretty(pg_relation_size(t.relid)) AS table_size,
       pg_size_pretty(pg_indexes_size(t.relid)) AS total_index_size,
       round(100.0 * pg_indexes_size(t.relid)
             / NULLIF(pg_relation_size(t.relid), 0), 1) AS index_pct
FROM pg_stat_user_tables t
WHERE t.relname = 'orders';
```

Set up alerts for index size growth. If indexes are growing faster than expected (e.g., >20% per month), investigate potential causes: missing archival policies, unexpected data growth, or index bloat from fragmentation.
We've developed a quantitative framework for analyzing the storage trade-offs between dense and sparse indexes. The key results: $S_{dense} = n \times E$; $S_{sparse} = \lceil n/b_f \rceil \times E$; the sparse-to-dense size ratio is $1/b_f$; and a multi-level index totals roughly $b/(f-1)$ blocks. In practice, the decisive question is whether the index fits in the buffer pool.
Looking Ahead:
With the space trade-off quantified, we now turn to search efficiency—analyzing the time complexity and practical performance characteristics of each index type. Understanding both storage AND performance is essential for optimal index design.
You now have the quantitative tools to calculate and compare index sizes, predict memory requirements, and understand the storage implications of index design decisions. Next, we'll complete the analysis with search efficiency considerations.