After completing logical design, you possess a normalized relational schema—a mathematically clean representation of your data. But logical schemas don't run on disk. They exist as abstractions. The critical question becomes: How do we physically organize data so queries execute efficiently?
Physical design is where theory meets reality. It's where decisions about storage structures, file organizations, and access paths determine whether your database responds in milliseconds or minutes. A brilliantly normalized schema implemented with poor physical design will perform terribly; conversely, thoughtful physical choices can rescue even suboptimal logical designs.
The fundamental constraint: all data ultimately resides on persistent storage—disk blocks that must be read into memory for processing. Every query translates into I/O operations. Physical design aims to minimize those operations by organizing data to match anticipated access patterns.
This page covers the foundational storage structures available for organizing database files. You'll understand heap files, sorted files, hash files, and clustered organizations—learning when each excels and when each fails. By the end, you'll possess the vocabulary and analytical framework to reason about storage decisions in any database system.
Before examining specific storage structures, we must understand why physical organization matters. The answer lies in the stark performance gap between memory and disk access.
The memory hierarchy reality:
Even with modern SSDs, the gap between memory and disk is enormous. A query that requires 1,000 random disk reads will spend virtually all its time waiting for I/O—CPU processing becomes negligible.
The block/page model:
Database systems don't read individual records; they read blocks (also called pages)—fixed-size chunks typically ranging from 4KB to 64KB. When you request a single 100-byte record, the system reads the entire block containing it. This has profound implications:
| Operation | HDD | SSD | Implication |
|---|---|---|---|
| Sequential Read (1 block) | 1x | 1x | Baseline operation cost |
| Random Read (1 block) | ~100x | ~10x | Seek/latency dominates |
| Sequential Scan (1000 blocks) | ~1000x | ~1000x | Scales linearly |
| Random Reads (1000 blocks) | ~100,000x | ~10,000x | Catastrophically slow |
Physical design optimizes for minimal I/O operations—specifically, minimal random I/O. Every storage structure decision should be evaluated against this metric: How many disk blocks must be read to answer typical queries?
Cost model basics:
Database query optimizers estimate costs primarily in terms of block I/Os. Throughout this analysis we'll use B for the number of blocks in a file, r for the number of records, and bfr (the blocking factor) for the number of records per block, so B = ⌈r / bfr⌉.
These values determine the efficiency of different storage structures for different operations. We'll use this notation throughout our analysis.
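To make the notation concrete, here is a small Python sketch (the function names such as `blocks` and `heap_equality_search` are ours, for illustration only) that computes the block I/O counts this cost model implies for the file organizations analyzed below:

```python
# Illustrative cost model: estimate block I/Os for common operations.
# Notation (as assumed in the text): r = record count, bfr = records
# per block, B = ceil(r / bfr) = number of blocks in the file.
import math

def blocks(r: int, bfr: int) -> int:
    """Number of blocks needed to hold r records, bfr per block."""
    return math.ceil(r / bfr)

def heap_equality_search(B: int) -> float:
    """Average blocks read to find one matching record in a heap file."""
    return B / 2

def sorted_equality_search(B: int) -> int:
    """Block-level binary search on a sorted file."""
    return math.ceil(math.log2(B))

def hash_equality_search() -> int:
    """Static hashing with no overflow chain: one bucket read."""
    return 1

B = blocks(1_000_000, 100)          # 1M records, 100 per block
print(B)                            # 10000
print(heap_equality_search(B))      # 5000.0
print(sorted_equality_search(B))    # 14
```

Running the numbers this way before choosing a structure is exactly the kind of back-of-the-envelope analysis query optimizers perform internally.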
The simplest file organization is the heap file (also called an unordered file or pile file). Records are stored in the order they were inserted—no sorting, no clustering, no special organization.
Structure: a heap file is simply a sequence of blocks; each new record is appended to the last block, or placed in any block with free space.
Implementation details:
Most heap file implementations maintain a file header pointing to the first and last blocks, plus a free-space map (or free list) so inserts can locate a block with room without scanning the entire file.
| Operation | Cost | Explanation |
|---|---|---|
| Insert | O(1) — 2 I/O | Read last block, write updated block |
| Delete (with RID) | O(1) — 2 I/O | Read block, mark record deleted, write |
| Search (equality) | O(B) — B/2 avg | Must scan until found (or entire file) |
| Search (range) | O(B) | Must scan entire file |
| Full scan | O(B) | Sequential read of all blocks |
Analysis:
Heap files excel at inserts and full scans but suffer terribly for searches. Consider a table with 1 million records stored at 100 records per block, giving B = 10,000 blocks. An equality search reads 5,000 blocks on average, and all 10,000 when no match exists; at a few milliseconds per block, a single lookup takes on the order of 50 seconds.
This is clearly unacceptable for most query patterns.
Use heap files when: (1) the table is small enough that full scans are acceptable, (2) inserts vastly outnumber searches, (3) you'll always access data through an index anyway, or (4) you're building a staging/temporary table. Many OLTP systems use heap storage as the underlying table organization, relying entirely on indexes for efficient access.
```sql
-- PostgreSQL default table storage is heap-based
-- This creates a heap-organized table
CREATE TABLE sensor_readings (
    id SERIAL PRIMARY KEY,
    sensor_id INTEGER NOT NULL,
    reading_value DECIMAL(10,4),
    recorded_at TIMESTAMP DEFAULT NOW()
);

-- Without indexes, any search requires a full table scan
-- This query must scan ALL blocks in the table
EXPLAIN ANALYZE
SELECT * FROM sensor_readings WHERE sensor_id = 42;
-- Output: Seq Scan on sensor_readings (cost=0.00..10000.00 rows=...)

-- With a B-tree index, searches become O(log n)
CREATE INDEX idx_sensor_id ON sensor_readings(sensor_id);

-- Now the same query uses the index
EXPLAIN ANALYZE
SELECT * FROM sensor_readings WHERE sensor_id = 42;
-- Output: Index Scan using idx_sensor_id (cost=0.42..8.44 rows=...)
```

A sorted file (or ordered file) maintains records in sorted order based on one or more ordering attributes. This organization trades insert performance for dramatically improved search and range query performance.
Structure: records are stored in blocks in sorted order of the ordering attribute (the sort key), with blocks laid out sequentially so that a binary search over blocks is possible.
| Operation | Cost | Explanation |
|---|---|---|
| Search (equality on sort key) | O(log₂ B) | Binary search on blocks |
| Search (range on sort key) | O(log₂ B + m) | Find start, scan m matching blocks |
| Search (non-sort-key) | O(B) | Must scan entire file (no ordering help) |
| Insert | O(B) | Find position, shift all subsequent records |
| Delete | O(B) | Find record, shift to close gap (or mark) |
| Full scan (sorted order) | O(B) | Sequential read, already ordered |
The binary search advantage:
For our 1 million record example (10,000 blocks), binary search locates any record in ⌈log₂ 10,000⌉ = 14 block reads instead of 5,000 on average.
This is transformative: from roughly 50 seconds of scanning to milliseconds.
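The block-level binary search can be sketched as follows; the `find_block` helper and the simulated 10,000-block file are illustrative, and each block inspected counts as one I/O:

```python
# Sketch: block-level binary search over a sorted file.
# Blocks are simulated as lists of sorted keys; we count block reads.

def find_block(file_blocks, key):
    """Binary search for the block whose key range covers `key`.
    Returns (block_index, reads); each block inspected is one I/O."""
    lo, hi, reads = 0, len(file_blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = file_blocks[mid]      # one block read
        reads += 1
        if key < block[0]:
            hi = mid - 1              # key sorts before this block
        elif key > block[-1]:
            lo = mid + 1              # key sorts after this block
        else:
            return mid, reads         # key falls in this block's range
    return -1, reads

# 10,000 blocks of 100 keys each: keys 0..999,999 in sorted order
file_blocks = [list(range(b * 100, (b + 1) * 100)) for b in range(10_000)]
idx, reads = find_block(file_blocks, 424_242)
print(idx, reads)   # block 4242, at most 14 block reads
```

Note that the search never reads more than ⌈log₂ B⌉ blocks, regardless of which key is requested.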
The insert penalty:
However, maintaining sorted order is expensive. Inserting a record in the middle requires locating the insertion point (cheap, via binary search), then shifting every subsequent record to make room: on average, half the file's blocks must be read and rewritten, an O(B) cost.
For high-insert workloads, this is prohibitive.
To mitigate insert costs, systems employ overflow handling: (1) Overflow blocks — linked chains of unsorted blocks for new records, periodically merged; (2) Reserved space — leave gaps in blocks for future insertions; (3) Periodic reorganization — accept temporary disorder, rebuild file during maintenance windows. These techniques trade perfect ordering for practical insert performance.
Hash files use a hash function to map attribute values directly to storage locations (buckets). This enables O(1) access for equality queries—a dramatic improvement over both heap and sorted files.
Structure: the file is divided into buckets, each consisting of one or more blocks; a hash function applied to the hash key determines the bucket in which a record is stored.
Static hashing:
In the simplest form, the hash function maps to a fixed number of buckets:
bucket_number = h(key) mod M
where M is the total number of buckets allocated initially.
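A minimal sketch of a static hash file, with Python lists standing in for buckets and their overflow chains (`StaticHashFile` and its methods are illustrative names, not any real system's API):

```python
# Sketch of a static hash file: M fixed buckets, each a list standing
# in for a block plus its overflow chain.

M = 8  # number of buckets, fixed at file-creation time

def bucket_number(key: int) -> int:
    """bucket_number = h(key) mod M, as in the text."""
    return hash(key) % M

class StaticHashFile:
    def __init__(self):
        self.buckets = [[] for _ in range(M)]

    def insert(self, key, record):
        # O(1): hash to bucket, append (overflow chain grows silently)
        self.buckets[bucket_number(key)].append((key, record))

    def search(self, key):
        # O(1) expected: read exactly one bucket and its overflow chain
        return [rec for k, rec in self.buckets[bucket_number(key)] if k == key]

f = StaticHashFile()
for i in range(100):
    f.insert(i, f"record-{i}")
print(f.search(42))   # ['record-42']
```

With 100 records in 8 fixed buckets, each bucket already holds a dozen or so entries; this is exactly the silent chain growth the next section describes.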
| Operation | Cost | Explanation |
|---|---|---|
| Search (equality on hash key) | O(1) — 1-2 I/O | Hash to bucket, read bucket (may have overflow) |
| Search (range on hash key) | O(B) | Hash provides no ordering—full scan required |
| Search (non-hash key) | O(B) | Must scan entire file |
| Insert | O(1) | Hash to bucket, append to bucket |
| Delete | O(1) | Hash to bucket, remove from bucket |
The bucket overflow problem:
Static hashing faces a critical challenge: bucket overflow. When a bucket fills beyond its allocated space, the system creates overflow chains, linked lists of additional blocks. As these chains grow, the promised 1-2 I/O lookup degrades into a linear walk of the chain, and insert and delete costs degrade with it.
Causes of overflow: skewed key distributions, a poorly chosen hash function, or simply more records than the initial allocation anticipated.
Static hashing requires knowing the data size in advance. If you allocate too few buckets, overflow chains destroy performance. If you allocate too many, you waste space and may have empty buckets scattered across disk. This rigidity makes static hashing unsuitable for dynamic datasets.
Dynamic hashing schemes:
To address static hashing limitations, dynamic hashing techniques grow and shrink gracefully:
Extendible Hashing: maintains a directory of bucket pointers indexed by bits of the hash value; an overflowing bucket splits, and the directory doubles only when the splitting bucket is already at maximum depth.
Linear Hashing: splits buckets one at a time in a fixed round-robin order as the file grows, using a family of hash functions and no directory at all.
Both maintain O(1) average-case performance while handling growth dynamically.
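A compact sketch of the extendible hashing scheme described above; `ExtendibleHash`, the 4-record bucket capacity, and the low-bit directory convention are illustrative choices, not a specific system's implementation:

```python
# Sketch of extendible hashing: a directory of 2^global_depth pointers
# into buckets; an overfull bucket splits and, when needed, the
# directory doubles. No long overflow chains ever form.

BUCKET_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # low global_depth bits of the hash select the directory slot
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._dir_index(key)].items.get(key)

    def put(self, key, value):
        bucket = self.directory[self._dir_index(key)]
        bucket.items[key] = value
        while len(bucket.items) > BUCKET_CAPACITY:
            if bucket.local_depth == self.global_depth:
                self.directory = self.directory * 2   # double the directory
                self.global_depth += 1
            # split: redistribute entries on the next hash bit
            bucket.local_depth += 1
            sibling = Bucket(bucket.local_depth)
            high_bit = 1 << (bucket.local_depth - 1)
            moved = {k: v for k, v in bucket.items.items()
                     if hash(k) & high_bit}
            for k in moved:
                del bucket.items[k]
            sibling.items = moved
            for i, b in enumerate(self.directory):
                if b is bucket and (i & high_bit):
                    self.directory[i] = sibling
            bucket = self.directory[self._dir_index(key)]

h = ExtendibleHash()
for i in range(64):
    h.put(i, i * 10)
print(h.get(42))   # 420
```

The key property to notice: every lookup is still one directory probe plus one bucket read, even after the file has grown far beyond its initial two buckets.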
Beyond basic file organization, clustering addresses how related records are physically grouped. This concept is orthogonal to sorting—you can have clustering with various underlying organizations.
Clustered organization:
Records are physically ordered on disk according to some clustering attribute. Only one clustering order is possible per table (a table has only one physical arrangement).
Types of clustering: single-table clustering, where one table's records are physically ordered on a clustering attribute, and multi-table clustering, where records from related tables are interleaved on disk so that rows joined together sit in the same blocks.
Example of multi-table clustering:
Consider Department and Employee tables. In a clustered organization:
[Dept1 Header] [Emp1_Dept1] [Emp2_Dept1] [Emp3_Dept1]
[Dept2 Header] [Emp1_Dept2] [Emp2_Dept2]
[Dept3 Header] [Emp1_Dept3] ...
A query joining Department and Employee on department_id reads contiguous blocks—no random I/O to fetch related employees.
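The I/O difference can be made concrete with a small sketch that counts how many distinct blocks the employee fetch touches under each placement (the 100-records-per-block figure and the record positions are assumed for illustration):

```python
# Sketch: count distinct blocks touched when fetching one department's
# employees under clustered vs. scattered placement.

RECORDS_PER_BLOCK = 100

def blocks_touched(record_positions):
    """Distinct block reads needed to fetch records at the given file
    positions (position // RECORDS_PER_BLOCK = block number)."""
    return len({pos // RECORDS_PER_BLOCK for pos in record_positions})

# 500 employees of one department, clustered: stored contiguously
clustered = range(20_000, 20_500)
# the same 500 employees scattered across a 1,000,000-record heap file
scattered = range(0, 1_000_000, 2_000)

print(blocks_touched(clustered))   # 5 sequential block reads
print(blocks_touched(scattered))   # 500 random block reads
```

Five sequential reads versus five hundred random ones is exactly the gap the comparison table below summarizes.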
| Query Type | Clustered (on query attribute) | Non-Clustered |
|---|---|---|
| Point query (unique key) | Same — 1 block read | Same — 1 block read |
| Range query (many matching) | Sequential I/O over contiguous blocks | Random I/O, matching blocks scattered |
| Join on clustering key | Merge join highly efficient | May require sorting or hash |
| Query on non-cluster attribute | Full scan required | Full scan required |
Choose the clustering key based on your most critical access pattern. Since you can only have ONE clustered order per table, this decision has significant impact. Common choices: primary key (for point lookups with index), foreign key (for joins), timestamp (for time-series queries), or composite key matching common query filters.
Clustered indexes:
In many database systems, the distinction manifests through clustered indexes:
InnoDB (MySQL's default engine) always maintains a clustered index: table rows are stored in the leaf level of a B+tree ordered by the primary key, and every secondary index stores primary-key values rather than physical row addresses.
PostgreSQL tables are heap-organized by default. The CLUSTER command can physically reorder a table by an index, but this ordering degrades as new rows are inserted.
```sql
-- InnoDB automatically clusters by PRIMARY KEY
CREATE TABLE orders (
    order_id INT PRIMARY KEY,   -- Clustered index key
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10,2)
) ENGINE=InnoDB;

-- Data is physically ordered by order_id
-- Range scans on order_id are sequential I/O

-- Secondary (non-clustered) index on customer_id
CREATE INDEX idx_customer ON orders(customer_id);
-- This index stores (customer_id, order_id) pairs
-- Lookups require: index traverse → retrieve order_id → fetch from clustered index
```

Understanding the distinction between primary and secondary file organizations is crucial for physical design decisions.
Primary organization:
The primary organization determines how the actual data records are stored on disk. Every table has exactly one primary organization. Options include the structures covered above: heap (unordered), sorted, hashed, and clustered organizations.
Secondary organization (indexes):
Secondary access paths provide efficient access via attributes other than the primary organization key. A table can have many secondary indexes. Each index is an additional data structure that maps key values to record locations, consumes extra storage, and must be maintained on every insert, delete, and relevant update.
The trade-off matrix:
Consider a table with a heap primary organization and three secondary indexes, on columns A, B, and C (column D is unindexed):
| Operation | Primary Impact | Secondary Impact |
|---|---|---|
| INSERT | Fast (append) | Slower (update 3 indexes) |
| DELETE | Fast (mark) | Slower (update 3 indexes) |
| UPDATE col A | Fast | Medium (update 1 index) |
| Search on A | Slow (scan) | Fast (use index) |
| Search on D | Slow (scan) | Slow (no index) |
Every index accelerates specific read patterns at the cost of write overhead.
Each index typically adds 10-30% overhead to write operations. A table with 5 indexes may see writes take 2-3x longer than an unindexed table. This is why physical design requires understanding workload characteristics—not just query patterns, but the read/write ratio and update frequency for each column.
Modern database systems have evolved sophisticated storage structures beyond the classical models. Understanding these advances helps you leverage contemporary database features effectively.
Log-Structured Merge Trees (LSM Trees):
Used by RocksDB, LevelDB, Cassandra, and many NoSQL systems. Writes are buffered in an in-memory memtable; when it fills, it is flushed to disk as a sorted, immutable run (an SSTable), and background compaction merges runs into progressively larger levels.
Trade-off: Exceptional write throughput (sequential I/O), but read amplification (may read multiple files). Bloom filters mitigate read costs.
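A toy sketch of the LSM write and read paths (memtable, flush to sorted runs, compaction); `TinyLSM` is a deliberately simplified model, not how RocksDB or Cassandra are actually implemented:

```python
# Sketch of an LSM tree: writes land in an in-memory memtable; full
# memtables are flushed as sorted immutable runs; reads check the
# memtable first, then runs from newest to oldest (read amplification).

MEMTABLE_LIMIT = 4

class TinyLSM:
    def __init__(self):
        self.memtable = {}
        self.runs = []          # newest first; each run is a sorted dict

    def put(self, key, value):
        self.memtable[key] = value            # in-memory write
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        run = dict(sorted(self.memtable.items()))  # sorted immutable run
        self.runs.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                 # may probe several runs
            if key in run:
                return run[key]
        return None

    def compact(self):
        """Merge all on-disk runs into one, newest value per key wins."""
        merged = {}
        for run in reversed(self.runs):       # apply oldest first
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]

db = TinyLSM()
for i in range(10):
    db.put(i, f"v{i}")
db.put(3, "v3-updated")
db.compact()
print(db.get(3))    # v3-updated
```

Every flush and compaction writes data sequentially, which is the source of the high write throughput; the price is that a read may have to probe multiple runs before finding (or missing) a key.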
| Characteristic | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random I/O | Sequential I/O |
| Write throughput | Moderate | High |
| Read latency | Predictable | Variable (level-dependent) |
| Space amplification | Low (~1.5x) | Moderate (~2-10x) |
| Write amplification | Low | High (compaction) |
| Typical use case | OLTP, mixed workloads | Write-heavy, time-series, logs |
Columnar storage:
Traditional row-oriented storage (NSM — N-ary Storage Model) stores all attributes of a record together. Columnar storage (DSM — Decomposition Storage Model) stores each column separately:
Row-oriented (NSM):
Block 1: [Row1: id, name, salary] [Row2: id, name, salary] ...
Column-oriented (DSM):
Block 1 (ids): [1, 2, 3, 4, 5, ...]
Block 2 (names): ["Alice", "Bob", "Carol", ...]
Block 3 (salaries): [50000, 60000, 55000, ...]
Columnar advantages: queries read only the columns they reference; values within a column compress exceptionally well (uniform type, often similar values); and the layout enables vectorized execution over column chunks.
Columnar disadvantages: reconstructing full rows requires stitching values from many separate locations, and point inserts and updates must touch every column, making pure columnar storage poorly suited to OLTP workloads.
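A small sketch of the two layouts, showing why an aggregate over one column touches far less data under DSM (the table contents and the query are illustrative):

```python
# Sketch contrasting row (NSM) and column (DSM) layouts for the query
# SELECT AVG(salary): the row layout must touch every record, while
# the column layout touches only the salary column.

rows = [            # NSM: all attributes of a record stored together
    (1, "Alice", 50000),
    (2, "Bob",   60000),
    (3, "Carol", 55000),
]

columns = {         # DSM: each column stored contiguously
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Carol"],
    "salary": [50000, 60000, 55000],
}

# Row-oriented: scan every record, extract one field from each
avg_row = sum(r[2] for r in rows) / len(rows)

# Column-oriented: read just the salary column, nothing else
avg_col = sum(columns["salary"]) / len(columns["salary"])

print(avg_row, avg_col)   # 55000.0 55000.0
```

Both paths compute the same answer; the difference is how many bytes (and therefore blocks) each one must read to get there, which grows with row width in the NSM case but not in the DSM case.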
Modern systems often use hybrid storage: columnar for analytical workloads, row-oriented for transactional. Some systems (SAP HANA, Oracle Database In-Memory) maintain both representations. Others (DuckDB, Snowflake) use columnar universally but optimize for different query patterns.
Storage structure selection is the foundation of physical database design. Every subsequent decision—indexing, partitioning, denormalization—builds upon this foundation.
What's next:
With storage structures understood, we turn to index selection—the most impactful physical design decision for most workloads. The next page explores B-tree indexes, hash indexes, bitmap indexes, and the analytical framework for choosing which indexes to create.
You now understand the fundamental storage structures underlying database systems. This knowledge enables you to reason about why certain queries are fast or slow, and to make informed decisions about primary file organization. Next, we'll explore how indexes—secondary access structures—dramatically accelerate queries on any attribute.