When you commit a transaction, you expect the data to survive power failures, system crashes, and hardware replacements. This permanence—the fundamental promise of a database—is delivered by the Storage Manager. It's the component that bridges the ephemeral world of memory with the permanent world of disk storage.
But storage management is far more than simply writing bytes to disk. The Storage Manager must organize data efficiently for both reads and writes, maintain complex index structures, coordinate with the buffer manager for caching, and ensure data integrity even when the system fails mid-operation.
This page takes you deep into the Storage Manager, revealing how databases organize, locate, and protect persistent data.
By the end of this page, you will understand the storage hierarchy that databases navigate, file organization strategies, record and page formats, index structures (B+ trees, hash indexes), and how the Storage Manager coordinates with other components to deliver fast, reliable data access.
The Storage Manager operates within a fundamental constraint: there is no storage medium that is simultaneously fast, large, and cheap. This is the storage hierarchy problem, and understanding it is essential for database design.
Every storage technology involves trade-offs between speed, capacity, cost, and volatility. The Storage Manager must navigate this hierarchy intelligently, keeping hot data in fast (but expensive) tiers while relegating cold data to slower (but cheaper) storage.
| Storage Type | Access Latency | Capacity | Cost ($/GB) | Volatility |
|---|---|---|---|---|
| CPU Registers | ~0.3 ns | ~1 KB | N/A | Volatile |
| L1 Cache | ~1 ns | 32-64 KB | N/A | Volatile |
| L2 Cache | ~4 ns | 256-512 KB | N/A | Volatile |
| L3 Cache | ~12 ns | 8-64 MB | N/A | Volatile |
| Main Memory (DRAM) | ~100 ns | 64-512 GB | $3-5 | Volatile |
| NVMe SSD | ~10-100 μs | 1-16 TB | $0.10-0.30 | Non-volatile |
| SATA SSD | ~100-500 μs | 1-8 TB | $0.08-0.20 | Non-volatile |
| HDD (7200 RPM) | ~5-10 ms | 4-20 TB | $0.02-0.04 | Non-volatile |
| Tape Storage | ~seconds-minutes | Petabytes | $0.004 | Non-volatile |
The key insight: For random access, memory is roughly 50,000-100,000x faster than HDD and 100-1,000x faster than SSD. This single fact shapes almost every decision in database storage design.
Why this matters for Storage Managers:
Minimize I/O Operations — Every disk access is expensive. The Storage Manager organizes data to reduce the number of I/O operations needed.
Sequential vs. Random Access — Sequential reads are 100-1000x faster than random reads on spinning disks, and still 10-100x faster on SSDs. The Storage Manager exploits this by organizing related data contiguously.
Block-Based Access — Disks read and write in blocks (typically 4KB-64KB). Reading 1 byte costs the same as reading an entire block. The Storage Manager aligns data structures to block boundaries.
Caching is Critical — Given the speed gap, caching frequently accessed data in memory is essential. The Storage Manager works closely with the Buffer Manager to maximize cache hit rates.
A useful mental model: in 1 second of elapsed time, a DBMS can perform roughly 10 million memory operations, or 10,000-100,000 SSD random reads, or only 100-200 HDD random reads. Optimizing disk access patterns therefore offers orders of magnitude more improvement potential than micro-optimizing CPU code.
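To make the gap concrete, here is a back-of-the-envelope cost model. The latency constants are the rough, order-of-magnitude figures from the table above, not measurements:

```typescript
// Back-of-the-envelope I/O cost model using the rough latencies
// from the table above (all values are order-of-magnitude estimates).
const LATENCY_NS = {
  dram: 100,            // ~100 ns per random memory access
  nvmeSsd: 50_000,      // ~50 µs per random SSD read
  hdd: 7_500_000,       // ~7.5 ms per random HDD read (seek + rotation)
};

function randomReadTimeMs(reads: number, medium: keyof typeof LATENCY_NS): number {
  return (reads * LATENCY_NS[medium]) / 1_000_000; // ns -> ms
}

// Fetching 10,000 records scattered across distinct pages:
console.log(randomReadTimeMs(10_000, "dram"));    // 1 ms
console.log(randomReadTimeMs(10_000, "nvmeSsd")); // 500 ms
console.log(randomReadTimeMs(10_000, "hdd"));     // 75,000 ms (~75 s)
```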
At the lowest level, a database is stored as a collection of files on the operating system's file system. Each file contains a sequence of pages (also called blocks), and each page contains records (rows). The Storage Manager must decide how to organize records within pages and pages within files.
The fundamental question: Given a query that needs certain records, how do we find them quickly?
Different file organizations optimize for different access patterns. The choice profoundly affects query performance.
Heap File Organization stores records in no particular order. New records go wherever there's space—typically at the end of the file or in the first page with free space.
Characteristics:
Inserts are fast: a new record goes into the first page with enough free space, or a freshly appended page.
Point lookups without an index require a full scan: O(n) pages read.
Deletes leave holes; free space is tracked (for example, in a free space map) and reused by later inserts.
Records keep stable locations until compaction or vacuum-style maintenance moves them.
When to use heap files:
Bulk loading, where records are appended and ordered access is not needed.
Small tables where a full scan is cheap anyway.
Tables accessed almost exclusively through secondary indexes.
Insert-heavy workloads where maintaining a sorted physical order would be expensive.
The practice reality:
Most modern DBMS implementations use heap files with indexes as the standard organization. Tables are heap-organized (fast inserts, updates), and B+ tree indexes provide efficient access paths for various query patterns. This hybrid approach combines the insert efficiency of heap files with the search efficiency of sorted structures.
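A minimal sketch of heap file insertion, assuming fixed-size pages and a naive linear search for free space (real engines consult a free space map rather than scanning); all names here are illustrative:

```typescript
// Minimal heap file model: pages with a byte budget. Inserts go to
// the first page with room, else a new page is appended at the end.
const PAGE_SIZE = 8192;

interface HeapPage {
  pageId: number;
  freeBytes: number;
  records: Uint8Array[];
}

class HeapFile {
  private pages: HeapPage[] = [];

  insert(record: Uint8Array): { pageId: number; slot: number } {
    // Linear scan for simplicity; a real engine uses a free space map.
    let page = this.pages.find((p) => p.freeBytes >= record.length);
    if (!page) {
      page = { pageId: this.pages.length, freeBytes: PAGE_SIZE, records: [] };
      this.pages.push(page);
    }
    page.records.push(record);
    page.freeBytes -= record.length;
    return { pageId: page.pageId, slot: page.records.length - 1 }; // the TID
  }
}
```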
Within each file, data is organized into pages (typically 4KB-16KB). The page is the unit of I/O—when the Storage Manager reads data from disk, it reads entire pages. Understanding page structure is essential for understanding DBMS performance.
Page components:
Page header: metadata such as the page ID, a checksum, the LSN of the last log record that modified the page, and free space pointers.
Slot directory (line pointer array): grows from the front of the page; each slot records the offset and length of one record.
Free space: the gap in the middle that both the slot directory and the record area grow into.
Record data: the records themselves, typically packed from the end of the page backward.
Record Identification: The TID (Tuple ID)
Every record has a unique identifier called a TID (or RID, Record ID). The TID is typically a pair: (PageID, SlotNumber). This enables:
Secondary indexes to point at records without storing the record data themselves.
Records to move within a page during compaction without invalidating TIDs, because the slot directory adds a level of indirection.
Direct record fetch: read one page, then follow one slot entry.
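A minimal sketch of a TID lookup through a slotted page, under the layout described in this section; the types are illustrative, not any engine's actual format:

```typescript
// Slotted page: the slot directory indirects from slot number to the
// record's byte offset, so records can be moved during compaction
// without changing any TID that points at them.
interface Slot { offset: number; length: number }

interface SlottedPage {
  pageId: number;
  slots: Slot[];     // slot directory
  data: Uint8Array;  // raw page bytes
}

type TID = { pageId: number; slot: number };

function fetchRecord(getPage: (id: number) => SlottedPage, tid: TID): Uint8Array {
  const page = getPage(tid.pageId); // one page read (or buffer pool hit)
  const { offset, length } = page.slots[tid.slot];
  return page.data.subarray(offset, offset + length);
}
```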
Conceptually, a variable-length record looks like this:

```typescript
// Conceptual variable-length record format
interface RecordHeader {
  recordLength: number;    // Total bytes including header
  nullBitmap: Uint8Array;  // Which fields are NULL
  fieldCount: number;      // Number of fields
}

interface VariableLengthRecord {
  header: RecordHeader;
  fixedFields: Buffer;     // INT, DATE, etc. at known offsets
  variableFields: Buffer;  // VARCHAR data
  fieldOffsets: number[];  // Where each var field starts
}

// Example: Employee record
// Schema: (id INT, name VARCHAR(100), dept VARCHAR(50), salary DECIMAL)
//
// Record bytes:
// [RecordLen: 4 bytes][NullBitmap: 1 byte][FieldCount: 2 bytes]
// [id: 4 bytes][salary: 8 bytes]               <- fixed fields
// [name_offset: 2 bytes][dept_offset: 2 bytes] <- var field directory
// [name_data: variable][dept_data: variable]   <- var field data
//
// Total: ~30+ bytes depending on name/dept length
```

The page formats described here are for row-oriented (N-ary) storage, where entire records are stored together. Columnar databases (like ClickHouse, Snowflake) store each column separately, enabling much better compression and cache efficiency for analytical queries that touch few columns of many rows.
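To make the row layout above concrete, here is a hedged sketch that serializes one employee record following the byte-layout comments in that block; the field sizes come from those comments, and everything else is illustrative:

```typescript
// Serialize one employee record per the sketched layout:
// header, fixed fields, var-field offset directory, var-field data.
function encodeEmployee(id: number, name: string, dept: string, salary: number): Uint8Array {
  const nameBytes = new TextEncoder().encode(name);
  const deptBytes = new TextEncoder().encode(dept);

  const headerLen = 4 + 1 + 2; // recordLen + nullBitmap + fieldCount
  const fixedLen  = 4 + 8;     // id (int32) + salary (float64)
  const dirLen    = 2 + 2;     // two 2-byte var-field offsets
  const total = headerLen + fixedLen + dirLen + nameBytes.length + deptBytes.length;

  const buf = new Uint8Array(total);
  const view = new DataView(buf.buffer);
  let p = 0;
  view.setUint32(p, total); p += 4;    // recordLength
  view.setUint8(p, 0); p += 1;         // nullBitmap: no NULL fields
  view.setUint16(p, 4); p += 2;        // fieldCount
  view.setInt32(p, id); p += 4;        // fixed field: id
  view.setFloat64(p, salary); p += 8;  // fixed field: salary
  const nameOff = headerLen + fixedLen + dirLen;
  view.setUint16(p, nameOff); p += 2;                     // name data offset
  view.setUint16(p, nameOff + nameBytes.length); p += 2;  // dept data offset
  buf.set(nameBytes, p); p += nameBytes.length;
  buf.set(deptBytes, p);
  return buf;
}
```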
Without indexes, finding a specific record requires scanning the entire table—O(n) for every lookup. Indexes provide alternative access paths that dramatically reduce lookup costs, often to O(log n) or even O(1).
The Storage Manager maintains index structures and coordinates with the Query Processor to use them effectively. Understanding index internals reveals why some queries fly while others crawl.
B+ Trees are the dominant index structure in relational databases, used for PRIMARY KEY, UNIQUE, and most secondary indexes. They're optimized for disk-based access patterns.
Key Properties:
Balanced: every leaf sits at the same depth, so every lookup costs the same number of page reads.
High fanout: each node is a full page holding hundreds of keys, keeping the tree shallow (3-4 levels even for billions of rows).
Sorted keys: interior nodes only route searches; the actual data (or TIDs) lives in the leaves.
Linked leaves: leaf pages form a linked list, making range scans a sequential walk.
Operations:
Search: descend from the root to a leaf, choosing a child at each node by comparing keys — O(log n) page reads (see the sketch below).
Insert: find the target leaf; if it is full, split it and push the separator key up, possibly splitting ancestors.
Delete: remove from the leaf; underfull nodes borrow from or merge with siblings.
Range scan: search for the lower bound, then follow leaf links until the upper bound.
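A minimal in-memory sketch of B+ tree search and range scan, as referenced in the operations list above. Real nodes are disk pages holding hundreds of keys; these simplified types show only the routing logic:

```typescript
// Simplified B+ tree nodes: interior nodes route by key; leaves hold
// values and link to the next leaf to support range scans.
interface Interior { kind: "interior"; keys: number[]; children: Node[] }
interface Leaf     { kind: "leaf"; keys: number[]; values: string[]; next: Leaf | null }
type Node = Interior | Leaf;

// Descend to the leaf that would contain `key`.
function findLeaf(node: Node, key: number): Leaf {
  while (node.kind === "interior") {
    // Pick the first child whose separator key exceeds the search key.
    let i = 0;
    while (i < node.keys.length && key >= node.keys[i]) i++;
    node = node.children[i];
  }
  return node;
}

// Range scan: one root-to-leaf descent, then a walk along the leaf chain.
function rangeScan(root: Node, lo: number, hi: number): string[] {
  const out: string[] = [];
  for (let leaf: Leaf | null = findLeaf(root, lo); leaf; leaf = leaf.next) {
    for (let i = 0; i < leaf.keys.length; i++) {
      if (leaf.keys[i] > hi) return out;
      if (leaf.keys[i] >= lo) out.push(leaf.values[i]);
    }
  }
  return out;
}
```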
The shallow depth that high fanout produces is easy to quantify:

```sql
-- B+ tree depth analysis
-- Assumptions:
--   8KB pages, 8-byte keys, 8-byte pointers
--   Fanout ≈ 8192 / 16 ≈ 500 children per node

-- Depth 2: 500 * 500 = 250,000 records
-- Depth 3: 500^3 = 125,000,000 records (125 million)
-- Depth 4: 500^4 = 62,500,000,000 records (62 billion)

-- Even for 1 billion records, only 3-4 disk reads needed!
-- And root + upper levels are always cached in memory.

-- PostgreSQL: check index size and usage statistics
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public';
```

Clustered vs. Non-Clustered Indexes
A critical distinction:
Clustered Index: The table data itself is stored in index order. Only one clustered index per table (the data can only be sorted one way). Leaf nodes ARE the data pages. Range scans are blazing fast—sequential disk access.
Non-Clustered Index: Index is separate from data. Leaf nodes contain pointers (TIDs) to data pages. Range scans may require many random I/Os if data isn't naturally clustered.
In SQL Server and MySQL InnoDB, the PRIMARY KEY defines the clustered index by default. PostgreSQL heap tables are non-clustered; you can CLUSTER a table by an index, but it's a one-time physical reorder, not maintained automatically.
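A rough sketch of why this distinction matters for range scans, using illustrative page-count arithmetic rather than any real optimizer's cost model:

```typescript
// Rough page-read estimates for a range scan returning `rows` rows.
function clusteredScanPages(rows: number, rowsPerPage: number): number {
  // Matching rows are physically contiguous: read just enough pages,
  // and the reads are sequential.
  return Math.ceil(rows / rowsPerPage);
}

function nonClusteredScanPages(rows: number, pagesInTable: number): number {
  // Each TID may land on a different page; worst case is one random
  // page read per row, capped at the table size.
  return Math.min(rows, pagesInTable);
}

// 10,000 matching rows, 100 rows/page, table of 100,000 pages:
console.log(clusteredScanPages(10_000, 100));        // 100 sequential reads
console.log(nonClusteredScanPages(10_000, 100_000)); // up to 10,000 random reads
```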
The Storage Manager doesn't interact with disk directly for every operation—that would be prohibitively slow. Instead, it interfaces with the Buffer Manager, which maintains a pool of database pages in memory.
The relationship:
The Storage Manager requests a page by ID; the Buffer Manager returns it immediately on a cache hit, or reads it from disk into a free frame on a miss, evicting another page if the pool is full.
Modifications are made to the in-memory copy; the page is marked dirty and written back later, subject to recovery ordering rules (write-ahead logging).
Key Storage Manager responsibilities in this interface:
Pinning pages while they are in use so they cannot be evicted mid-operation, and unpinning them promptly afterward.
Marking modified pages dirty so the Buffer Manager knows they must eventually be written back.
Respecting write-ahead logging: a dirty page must not be flushed before the log records describing its changes.
Issuing prefetch hints for predictable access patterns such as sequential scans.
A minimal buffer pool sketch appears after the note below.
Most database workloads exhibit strong locality: roughly 20% of the pages receive 80% of the accesses. With sufficient buffer pool memory, these hot pages stay cached, and the Storage Manager rarely waits for disk I/O. This is why DBMS memory tuning often focuses on buffer pool size—it directly determines cache hit rates.
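Here is a minimal buffer pool sketch with LRU eviction, pin counts, and hit-rate tracking. It is illustrative only; production pools use more refined policies such as clock sweep or LRU-K:

```typescript
// Minimal buffer pool: fixed number of frames, LRU eviction,
// pin counts to protect in-use pages, and a hit-rate counter.
class BufferPool {
  private frames = new Map<number, { data: Uint8Array; pinCount: number; dirty: boolean }>();
  private hits = 0;
  private misses = 0;

  constructor(
    private capacity: number,
    private readPage: (pageId: number) => Uint8Array,
    private writePage: (pageId: number, data: Uint8Array) => void,
  ) {}

  pin(pageId: number): Uint8Array {
    let frame = this.frames.get(pageId);
    if (frame) {
      this.hits++;
      this.frames.delete(pageId);  // re-insert to mark most recently used
      this.frames.set(pageId, frame);
    } else {
      this.misses++;
      if (this.frames.size >= this.capacity) this.evict();
      frame = { data: this.readPage(pageId), pinCount: 0, dirty: false };
      this.frames.set(pageId, frame);
    }
    frame.pinCount++;
    return frame.data;
  }

  unpin(pageId: number, dirty: boolean): void {
    const frame = this.frames.get(pageId)!;
    frame.pinCount--;
    frame.dirty ||= dirty;
  }

  private evict(): void {
    // Map iteration follows insertion order, so the first unpinned
    // entry is the least recently used evictable page.
    for (const [pageId, frame] of this.frames) {
      if (frame.pinCount === 0) {
        if (frame.dirty) this.writePage(pageId, frame.data); // flush before evicting
        this.frames.delete(pageId);
        return;
      }
    }
    throw new Error("all pages pinned");
  }

  hitRate(): number {
    return this.hits / Math.max(1, this.hits + this.misses);
  }
}
```

Note how eviction skips pinned frames and flushes dirty pages first; in a real engine, that flush would also be ordered against the write-ahead log.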
The Authorization Manager (sometimes part of the Security subsystem) ensures that every data access is permitted according to the database's security policy. It intercepts operations from the Query Processor, checking privileges before the Storage Manager retrieves data.
Authorization model components:
Principals: the users and roles that can hold privileges; roles can be granted to other roles.
Privileges: the operations permitted, such as SELECT, INSERT, UPDATE, and DELETE, plus administrative rights like GRANT OPTION.
Objects: what privileges attach to, including tables, columns, views, schemas, and functions.
Policies: finer-grained rules such as row-level security predicates, evaluated at query execution time.
```sql
-- Grant table-level privileges
GRANT SELECT ON employees TO analyst_role;
GRANT INSERT, UPDATE ON employees TO hr_role;
GRANT ALL PRIVILEGES ON employees TO admin_role;

-- Column-level privileges (restrict salary visibility)
GRANT SELECT (employee_id, name, department) ON employees TO intern_role;
-- Interns cannot see the salary column

-- Row-level security (PostgreSQL)
ALTER TABLE employees ENABLE ROW LEVEL SECURITY;

CREATE POLICY manager_sees_department ON employees
    FOR SELECT TO manager_role
    USING (department_id = current_setting('app.department_id')::int);
-- Managers only see employees in their department

-- Revoke with cascade
REVOKE SELECT ON employees FROM analyst_role CASCADE;
-- Also revokes privileges that analyst_role had granted onward to others

-- Check current privileges
SELECT * FROM information_schema.role_table_grants
WHERE table_name = 'employees';
```

The Query Processor checks authorization during semantic analysis (can this user reference this table?), but the Authorization Manager may also check during execution for row-level policies, dynamic privilege changes, or fine-grained access control. View-based security adds another layer—users may access underlying tables only through authorized views.
The Storage Manager is the guardian of persistent data, responsible for organizing information on disk and providing efficient access paths through carefully designed structures.
Key takeaways:
The storage hierarchy's speed gaps drive nearly every design decision; the Storage Manager exists largely to minimize and sequentialize disk I/O.
Data is organized as records within pages within files; the page is the unit of I/O, and slotted pages with TIDs give records stable, indirectable addresses.
Heap files plus B+ tree indexes are the standard organization, combining fast inserts with O(log n) access paths.
Whether an index is clustered or non-clustered determines whether range scans are sequential or random I/O.
The Storage Manager works through the Buffer Manager rather than touching disk directly; buffer pool hit rates largely determine overall performance.
The Authorization Manager checks every access against the security policy before data is retrieved.
What's next:
The Storage Manager handles where data lives; the Transaction Manager handles how concurrent access works correctly. The next page explores transaction management—ensuring ACID properties even when multiple users modify data simultaneously.
You now understand how the Storage Manager organizes data on disk, from high-level file organizations down to page and record formats. You can appreciate why B+ trees dominate database indexing and how the Storage Manager coordinates with the Buffer Manager for efficient data access.