After completing logical design, you possess a normalized relational schema—a mathematically clean representation of your data. But logical schemas don't run on disk. They exist as abstractions. The critical question becomes: How do we physically organize data so queries execute efficiently?
Physical design is where theory meets reality. It's where decisions about storage structures, file organizations, and access paths determine whether your database responds in milliseconds or minutes. A brilliantly normalized schema implemented with poor physical design will perform terribly; conversely, thoughtful physical choices can rescue even suboptimal logical designs.
The fundamental constraint: all data ultimately resides on persistent storage—disk blocks that must be read into memory for processing. Every query translates into I/O operations. Physical design aims to minimize those operations by organizing data to match anticipated access patterns.
This page covers the foundational storage structures available for organizing database files. You'll understand heap files, sorted files, hash files, and clustered organizations—learning when each excels and when each fails. By the end, you'll possess the vocabulary and analytical framework to reason about storage decisions in any database system.
Before examining specific storage structures, we must understand why physical organization matters. The answer lies in the stark performance gap between memory and disk access.
The memory hierarchy reality:
Even with modern SSDs, the gap between memory and disk is enormous. A query that requires 1,000 random disk reads will spend virtually all its time waiting for I/O—CPU processing becomes negligible.
The block/page model:
Database systems don't read individual records; they read blocks (also called pages)—fixed-size chunks typically ranging from 4KB to 64KB. When you request a single 100-byte record, the system reads the entire block containing it. This has profound implications:
| Operation | HDD | SSD | Implication |
|---|---|---|---|
| Sequential Read (1 block) | 1x | 1x | Baseline operation cost |
| Random Read (1 block) | ~100x | ~10x | Seek/latency dominates |
| Sequential Scan (1000 blocks) | ~1000x | ~1000x | Scales linearly |
| Random Reads (1000 blocks) | ~100,000x | ~10,000x | Catastrophically slow |
Physical design optimizes for minimal I/O operations—specifically, minimal random I/O. Every storage structure decision should be evaluated against this metric: How many disk blocks must be read to answer typical queries?
Cost model basics:
Database query optimizers estimate costs primarily in terms of block I/Os. Throughout this analysis we'll use B for the number of blocks in a file, r for the number of records, and bfr (the blocking factor) for the number of records per block, so B = ⌈r / bfr⌉.
These values determine the efficiency of different storage structures for different operations. We'll use this notation throughout our analysis.
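To make the notation concrete, here is a small Python sketch (the function names such as `blocks` and `heap_equality_search` are ours, for illustration only) that computes the block I/O counts this cost model implies for the file organizations analyzed below:

```python
# Illustrative cost model: estimate block I/Os for common operations.
# Notation (as assumed in the text): r = record count, bfr = records
# per block, B = ceil(r / bfr) = number of blocks in the file.
import math

def blocks(r: int, bfr: int) -> int:
    """Number of blocks needed to hold r records, bfr per block."""
    return math.ceil(r / bfr)

def heap_equality_search(B: int) -> float:
    """Average blocks read to find one matching record in a heap file."""
    return B / 2

def sorted_equality_search(B: int) -> int:
    """Block-level binary search on a sorted file."""
    return math.ceil(math.log2(B))

def hash_equality_search() -> int:
    """Static hashing with no overflow chain: one bucket read."""
    return 1

B = blocks(1_000_000, 100)          # 1M records, 100 per block
print(B)                            # 10000
print(heap_equality_search(B))      # 5000.0
print(sorted_equality_search(B))    # 14
```

Running the numbers this way before choosing a structure is exactly the kind of back-of-the-envelope analysis query optimizers perform internally.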
The simplest file organization is the heap file (also called an unordered file or pile file). Records are stored in the order they were inserted—no sorting, no clustering, no special organization.
Structure: a heap file is simply a sequence of blocks; each new record is appended to the last block, or placed in any block with free space.
Implementation details:
Most heap file implementations maintain a file header pointing to the first and last blocks, plus a free-space map (or free list) so inserts can locate a block with room without scanning the entire file.
| Operation | Cost | Explanation |
|---|---|---|
| Insert | O(1) — 2 I/O | Read last block, write updated block |
| Delete (with RID) | O(1) — 2 I/O | Read block, mark record deleted, write |
| Search (equality) | O(B) — B/2 avg | Must scan until found (or entire file) |
| Search (range) | O(B) | Must scan entire file |
| Full scan | O(B) | Sequential read of all blocks |
Analysis:
Heap files excel at inserts and full scans but suffer terribly for searches. Consider a table with 1 million records stored at 100 records per block, giving B = 10,000 blocks. An equality search reads 5,000 blocks on average, and all 10,000 when no match exists; at a few milliseconds per block, a single lookup takes on the order of 50 seconds.
This is clearly unacceptable for most query patterns.
Use heap files when: (1) the table is small enough that full scans are acceptable, (2) inserts vastly outnumber searches, (3) you'll always access data through an index anyway, or (4) you're building a staging/temporary table. Many OLTP systems use heap storage as the underlying table organization, relying entirely on indexes for efficient access.
```sql
-- PostgreSQL default table storage is heap-based
-- This creates a heap-organized table
CREATE TABLE sensor_readings (
    id SERIAL PRIMARY KEY,
    sensor_id INTEGER NOT NULL,
    reading_value DECIMAL(10,4),
    recorded_at TIMESTAMP DEFAULT NOW()
);

-- Without indexes, any search requires a full table scan
-- This query must scan ALL blocks in the table
EXPLAIN ANALYZE
SELECT * FROM sensor_readings WHERE sensor_id = 42;
-- Output: Seq Scan on sensor_readings (cost=0.00..10000.00 rows=...)

-- With a B-tree index, searches become O(log n)
CREATE INDEX idx_sensor_id ON sensor_readings(sensor_id);

-- Now the same query uses the index
EXPLAIN ANALYZE
SELECT * FROM sensor_readings WHERE sensor_id = 42;
-- Output: Index Scan using idx_sensor_id (cost=0.42..8.44 rows=...)
```

A sorted file (or ordered file) maintains records in sorted order based on one or more ordering attributes. This organization trades insert performance for dramatically improved search and range query performance.
Structure: records are stored in blocks in sorted order of the ordering attribute (the sort key), with blocks laid out sequentially so that a binary search over blocks is possible.
| Operation | Cost | Explanation |
|---|---|---|
| Search (equality on sort key) | O(log₂ B) | Binary search on blocks |
| Search (range on sort key) | O(log₂ B + m) | Find start, scan m matching blocks |
| Search (non-sort-key) | O(B) | Must scan entire file (no ordering help) |
| Insert | O(B) | Find position, shift all subsequent records |
| Delete | O(B) | Find record, shift to close gap (or mark) |
| Full scan (sorted order) | O(B) | Sequential read, already ordered |
The binary search advantage:
For our 1 million record example (10,000 blocks), binary search locates any record in ⌈log₂ 10,000⌉ = 14 block reads instead of 5,000 on average.
This is transformative: from roughly 50 seconds of scanning to milliseconds.
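The block-level binary search can be sketched as follows; the `find_block` helper and the simulated 10,000-block file are illustrative, and each block inspected counts as one I/O:

```python
# Sketch: block-level binary search over a sorted file.
# Blocks are simulated as lists of sorted keys; we count block reads.

def find_block(file_blocks, key):
    """Binary search for the block whose key range covers `key`.
    Returns (block_index, reads); each block inspected is one I/O."""
    lo, hi, reads = 0, len(file_blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = file_blocks[mid]      # one block read
        reads += 1
        if key < block[0]:
            hi = mid - 1              # key sorts before this block
        elif key > block[-1]:
            lo = mid + 1              # key sorts after this block
        else:
            return mid, reads         # key falls in this block's range
    return -1, reads

# 10,000 blocks of 100 keys each: keys 0..999,999 in sorted order
file_blocks = [list(range(b * 100, (b + 1) * 100)) for b in range(10_000)]
idx, reads = find_block(file_blocks, 424_242)
print(idx, reads)   # block 4242, at most 14 block reads
```

Note that the search never reads more than ⌈log₂ B⌉ blocks, regardless of which key is requested.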
The insert penalty:
However, maintaining sorted order is expensive. Inserting a record in the middle requires locating the insertion point (cheap, via binary search), then shifting every subsequent record to make room: on average, half the file's blocks must be read and rewritten, an O(B) cost.
For high-insert workloads, this is prohibitive.
To mitigate insert costs, systems employ overflow handling: (1) Overflow blocks — linked chains of unsorted blocks for new records, periodically merged; (2) Reserved space — leave gaps in blocks for future insertions; (3) Periodic reorganization — accept temporary disorder, rebuild file during maintenance windows. These techniques trade perfect ordering for practical insert performance.
Hash files use a hash function to map attribute values directly to storage locations (buckets). This enables O(1) access for equality queries—a dramatic improvement over both heap and sorted files.
Structure: the file is divided into buckets, each consisting of one or more blocks; a hash function applied to the hash key determines the bucket in which a record is stored.
Static hashing:
In the simplest form, the hash function maps to a fixed number of buckets:
bucket_number = h(key) mod M
where M is the total number of buckets allocated initially.
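A minimal sketch of a static hash file, with Python lists standing in for buckets and their overflow chains (`StaticHashFile` and its methods are illustrative names, not any real system's API):

```python
# Sketch of a static hash file: M fixed buckets, each a list standing
# in for a block plus its overflow chain.

M = 8  # number of buckets, fixed at file-creation time

def bucket_number(key: int) -> int:
    """bucket_number = h(key) mod M, as in the text."""
    return hash(key) % M

class StaticHashFile:
    def __init__(self):
        self.buckets = [[] for _ in range(M)]

    def insert(self, key, record):
        # O(1): hash to bucket, append (overflow chain grows silently)
        self.buckets[bucket_number(key)].append((key, record))

    def search(self, key):
        # O(1) expected: read exactly one bucket and its overflow chain
        return [rec for k, rec in self.buckets[bucket_number(key)] if k == key]

f = StaticHashFile()
for i in range(100):
    f.insert(i, f"record-{i}")
print(f.search(42))   # ['record-42']
```

With 100 records in 8 fixed buckets, each bucket already holds a dozen or so entries; this is exactly the silent chain growth the next section describes.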
| Operation | Cost | Explanation |
|---|---|---|
| Search (equality on hash key) | O(1) — 1-2 I/O | Hash to bucket, read bucket (may have overflow) |
| Search (range on hash key) | O(B) | Hash provides no ordering—full scan required |
| Search (non-hash key) | O(B) | Must scan entire file |
| Insert | O(1) | Hash to bucket, append to bucket |
| Delete | O(1) | Hash to bucket, remove from bucket |
The bucket overflow problem:
Static hashing faces a critical challenge: bucket overflow. When a bucket fills beyond its allocated space, the system creates overflow chains, linked lists of additional blocks. As these chains grow, the promised 1-2 I/O lookup degrades into a linear walk of the chain, and insert and delete costs degrade with it.
Causes of overflow: skewed key distributions, a poorly chosen hash function, or simply more records than the initial allocation anticipated.
Static hashing requires knowing the data size in advance. If you allocate too few buckets, overflow chains destroy performance. If you allocate too many, you waste space and may have empty buckets scattered across disk. This rigidity makes static hashing unsuitable for dynamic datasets.
Dynamic hashing schemes:
To address static hashing limitations, dynamic hashing techniques grow and shrink gracefully:
Extendible Hashing: maintains a directory of bucket pointers indexed by bits of the hash value; an overflowing bucket splits, and the directory doubles only when the splitting bucket is already at maximum depth.
Linear Hashing: splits buckets one at a time in a fixed round-robin order as the file grows, using a family of hash functions and no directory at all.
Both maintain O(1) average-case performance while handling growth dynamically.
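A compact sketch of the extendible hashing scheme described above; `ExtendibleHash`, the 4-record bucket capacity, and the low-bit directory convention are illustrative choices, not a specific system's implementation:

```python
# Sketch of extendible hashing: a directory of 2^global_depth pointers
# into buckets; an overfull bucket splits and, when needed, the
# directory doubles. No long overflow chains ever form.

BUCKET_CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # low global_depth bits of the hash select the directory slot
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._dir_index(key)].items.get(key)

    def put(self, key, value):
        bucket = self.directory[self._dir_index(key)]
        bucket.items[key] = value
        while len(bucket.items) > BUCKET_CAPACITY:
            if bucket.local_depth == self.global_depth:
                self.directory = self.directory * 2   # double the directory
                self.global_depth += 1
            # split: redistribute entries on the next hash bit
            bucket.local_depth += 1
            sibling = Bucket(bucket.local_depth)
            high_bit = 1 << (bucket.local_depth - 1)
            moved = {k: v for k, v in bucket.items.items()
                     if hash(k) & high_bit}
            for k in moved:
                del bucket.items[k]
            sibling.items = moved
            for i, b in enumerate(self.directory):
                if b is bucket and (i & high_bit):
                    self.directory[i] = sibling
            bucket = self.directory[self._dir_index(key)]

h = ExtendibleHash()
for i in range(64):
    h.put(i, i * 10)
print(h.get(42))   # 420
```

The key property to notice: every lookup is still one directory probe plus one bucket read, even after the file has grown far beyond its initial two buckets.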
Beyond basic file organization, clustering addresses how related records are physically grouped. This concept is orthogonal to sorting—you can have clustering with various underlying organizations.
Clustered organization:
Records are physically ordered on disk according to some clustering attribute. Only one clustering order is possible per table (a table has only one physical arrangement).
Types of clustering: single-table clustering, where one table's records are physically ordered on a clustering attribute, and multi-table clustering, where records from related tables are interleaved on disk so that rows joined together sit in the same blocks.
Example of multi-table clustering:
Consider Department and Employee tables. In a clustered organization:
[Dept1 Header] [Emp1_Dept1] [Emp2_Dept1] [Emp3_Dept1]
[Dept2 Header] [Emp1_Dept2] [Emp2_Dept2]
[Dept3 Header] [Emp1_Dept3] ...
A query joining Department and Employee on department_id reads contiguous blocks—no random I/O to fetch related employees.
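The I/O difference can be made concrete with a small sketch that counts how many distinct blocks the employee fetch touches under each placement (the 100-records-per-block figure and the record positions are assumed for illustration):

```python
# Sketch: count distinct blocks touched when fetching one department's
# employees under clustered vs. scattered placement.

RECORDS_PER_BLOCK = 100

def blocks_touched(record_positions):
    """Distinct block reads needed to fetch records at the given file
    positions (position // RECORDS_PER_BLOCK = block number)."""
    return len({pos // RECORDS_PER_BLOCK for pos in record_positions})

# 500 employees of one department, clustered: stored contiguously
clustered = range(20_000, 20_500)
# the same 500 employees scattered across a 1,000,000-record heap file
scattered = range(0, 1_000_000, 2_000)

print(blocks_touched(clustered))   # 5 sequential block reads
print(blocks_touched(scattered))   # 500 random block reads
```

Five sequential reads versus five hundred random ones is exactly the gap the comparison table below summarizes.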
| Query Type | Clustered (on query attribute) | Non-Clustered |
|---|---|---|
| Point query (unique key) | Same — 1 block read | Same — 1 block read |
| Range query (many matching) | Sequential I/O over contiguous blocks | Random I/O, matching blocks scattered |
| Join on clustering key | Merge join highly efficient | May require sorting or hash |
| Query on non-cluster attribute | Full scan required | Full scan required |
Choose the clustering key based on your most critical access pattern. Since you can only have ONE clustered order per table, this decision has significant impact. Common choices: primary key (for point lookups with index), foreign key (for joins), timestamp (for time-series queries), or composite key matching common query filters.
Clustered indexes:
In many database systems, the distinction manifests through clustered indexes:
InnoDB (MySQL's default engine) always maintains a clustered index: table rows are stored in the leaf level of a B+tree ordered by the primary key, and every secondary index stores primary-key values rather than physical row addresses.
PostgreSQL tables are heap-organized by default. The CLUSTER command can physically reorder a table by an index, but this ordering degrades as new rows are inserted.
```sql
-- InnoDB automatically clusters by PRIMARY KEY
CREATE TABLE orders (
    order_id INT PRIMARY KEY,   -- Clustered index key
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10,2)
) ENGINE=InnoDB;

-- Data is physically ordered by order_id
-- Range scans on order_id are sequential I/O

-- Secondary (non-clustered) index on customer_id
CREATE INDEX idx_customer ON orders(customer_id);
-- This index stores (customer_id, order_id) pairs
-- Lookups require: index traverse → retrieve order_id → fetch from clustered index
```

Understanding the distinction between primary and secondary file organizations is crucial for physical design decisions.
Primary organization:
The primary organization determines how the actual data records are stored on disk. Every table has exactly one primary organization. Options include the structures covered above: heap (unordered), sorted, hashed, and clustered organizations.
Secondary organization (indexes):
Secondary access paths provide efficient access via attributes other than the primary organization key. A table can have many secondary indexes. Each index is an additional data structure that maps key values to record locations, consumes extra storage, and must be maintained on every insert, delete, and relevant update.
The trade-off matrix:
Consider a table with a heap primary organization and three secondary indexes, on columns A, B, and C (column D is unindexed):
| Operation | Primary Impact | Secondary Impact |
|---|---|---|
| INSERT | Fast (append) | Slower (update 3 indexes) |
| DELETE | Fast (mark) | Slower (update 3 indexes) |
| UPDATE col A | Fast | Medium (update 1 index) |
| Search on A | Slow (scan) | Fast (use index) |
| Search on D | Slow (scan) | Slow (no index) |
Every index accelerates specific read patterns at the cost of write overhead.
Each index typically adds 10-30% overhead to write operations. A table with 5 indexes may see writes take 2-3x longer than an unindexed table. This is why physical design requires understanding workload characteristics—not just query patterns, but the read/write ratio and update frequency for each column.
Modern database systems have evolved sophisticated storage structures beyond the classical models. Understanding these advances helps you leverage contemporary database features effectively.
Log-Structured Merge Trees (LSM Trees):
Used by RocksDB, LevelDB, Cassandra, and many NoSQL systems. Writes are buffered in an in-memory memtable; when it fills, it is flushed to disk as a sorted, immutable run (an SSTable), and background compaction merges runs into progressively larger levels.
Trade-off: Exceptional write throughput (sequential I/O), but read amplification (may read multiple files). Bloom filters mitigate read costs.
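A toy sketch of the LSM write and read paths (memtable, flush to sorted runs, compaction); `TinyLSM` is a deliberately simplified model, not how RocksDB or Cassandra are actually implemented:

```python
# Sketch of an LSM tree: writes land in an in-memory memtable; full
# memtables are flushed as sorted immutable runs; reads check the
# memtable first, then runs from newest to oldest (read amplification).

MEMTABLE_LIMIT = 4

class TinyLSM:
    def __init__(self):
        self.memtable = {}
        self.runs = []          # newest first; each run is a sorted dict

    def put(self, key, value):
        self.memtable[key] = value            # in-memory write
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        run = dict(sorted(self.memtable.items()))  # sorted immutable run
        self.runs.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                 # may probe several runs
            if key in run:
                return run[key]
        return None

    def compact(self):
        """Merge all on-disk runs into one, newest value per key wins."""
        merged = {}
        for run in reversed(self.runs):       # apply oldest first
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]

db = TinyLSM()
for i in range(10):
    db.put(i, f"v{i}")
db.put(3, "v3-updated")
db.compact()
print(db.get(3))    # v3-updated
```

Every flush and compaction writes data sequentially, which is the source of the high write throughput; the price is that a read may have to probe multiple runs before finding (or missing) a key.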
| Characteristic | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random I/O | Sequential I/O |
| Write throughput | Moderate | High |
| Read latency | Predictable | Variable (level-dependent) |
| Space amplification | Low (~1.5x) | Moderate (~2-10x) |
| Write amplification | Low | High (compaction) |
| Typical use case | OLTP, mixed workloads | Write-heavy, time-series, logs |
Columnar storage:
Traditional row-oriented storage (NSM — N-ary Storage Model) stores all attributes of a record together. Columnar storage (DSM — Decomposition Storage Model) stores each column separately:
Row-oriented (NSM):
Block 1: [Row1: id, name, salary] [Row2: id, name, salary] ...
Column-oriented (DSM):
Block 1 (ids): [1, 2, 3, 4, 5, ...]
Block 2 (names): ["Alice", "Bob", "Carol", ...]
Block 3 (salaries): [50000, 60000, 55000, ...]
Columnar advantages: queries read only the columns they reference; values within a column compress exceptionally well (uniform type, often similar values); and the layout enables vectorized execution over column chunks.
Columnar disadvantages: reconstructing full rows requires stitching values from many separate locations, and point inserts and updates must touch every column, making pure columnar storage poorly suited to OLTP workloads.
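A small sketch of the two layouts, showing why an aggregate over one column touches far less data under DSM (the table contents and the query are illustrative):

```python
# Sketch contrasting row (NSM) and column (DSM) layouts for the query
# SELECT AVG(salary): the row layout must touch every record, while
# the column layout touches only the salary column.

rows = [            # NSM: all attributes of a record stored together
    (1, "Alice", 50000),
    (2, "Bob",   60000),
    (3, "Carol", 55000),
]

columns = {         # DSM: each column stored contiguously
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Carol"],
    "salary": [50000, 60000, 55000],
}

# Row-oriented: scan every record, extract one field from each
avg_row = sum(r[2] for r in rows) / len(rows)

# Column-oriented: read just the salary column, nothing else
avg_col = sum(columns["salary"]) / len(columns["salary"])

print(avg_row, avg_col)   # 55000.0 55000.0
```

Both paths compute the same answer; the difference is how many bytes (and therefore blocks) each one must read to get there, which grows with row width in the NSM case but not in the DSM case.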
Modern systems often use hybrid storage: columnar for analytical workloads, row-oriented for transactional. Some systems (SAP HANA, Oracle Database In-Memory) maintain both representations. Others (DuckDB, Snowflake) use columnar universally but optimize for different query patterns.
Storage structure selection is the foundation of physical database design. Every subsequent decision—indexing, partitioning, denormalization—builds upon this foundation.
What's next:
With storage structures understood, we turn to index selection—the most impactful physical design decision for most workloads. The next page explores B-tree indexes, hash indexes, bitmap indexes, and the analytical framework for choosing which indexes to create.
You now understand the fundamental storage structures underlying database systems. This knowledge enables you to reason about why certain queries are fast or slow, and to make informed decisions about primary file organization. Next, we'll explore how indexes—secondary access structures—dramatically accelerate queries on any attribute.