We've seen that heap files provide O(1) inserts but O(n) searches, while sorted files offer O(log n) searches at the cost of expensive maintenance. But what if we could achieve O(1) search for equality queries—finding a record with a specific key value in constant time, regardless of file size?
Hash file organization makes this possible by using a hash function to compute the exact location of each record based on its key. Instead of searching through pages sequentially or using binary search, we simply calculate where a record belongs and go directly there.
This approach is magnificently efficient for point queries but comes with important limitations that we must thoroughly understand.
By the end of this page, you will understand how hash functions distribute records across buckets, master static hashing and its implementation details, analyze overflow handling strategies and their impact on performance, and recognize the fundamental limitation of hash files for range queries.
A hash file (also called a hashed file or direct file) organizes records using a hash function that maps key values to bucket addresses. Instead of storing records in arrival order (heap) or key order (sorted), records are placed in buckets determined by applying a mathematical function to their key.
Core Concept:
Given a key K and a hash function h(), the bucket address is:
bucket_address = h(K) mod N
where N is the number of buckets. All records with keys that hash to the same bucket address are stored together in that bucket.
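As a minimal sketch of this mapping (plain Python; the keys and bucket count are chosen for illustration, and Python's built-in hash() stands in for h()):

```python
# Place records into buckets by computing bucket = h(K) mod N.

def bucket_address(key: int, num_buckets: int) -> int:
    """Compute the bucket a record belongs to from its integer key."""
    return hash(key) % num_buckets   # built-in hash() stands in for h()

N = 8
for key in [100, 300, 700, 7, 107, 207]:
    print(f"key {key:4d} -> bucket {bucket_address(key, N)}")

# Keys that map to the same bucket (collisions) are stored together in that
# bucket; numerically adjacent keys may land in unrelated buckets.
```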
Formal Definition:
A hash file H consists of: a hash function h that maps key values to bucket numbers, a fixed set of N buckets (each occupying one or more pages on disk), and an assignment of every record with key K to bucket h(K) mod N, with overflow pages absorbing records that exceed a bucket's capacity.
A good hash function for file organization should be: (1) Deterministic — same key always produces same hash, (2) Uniform — distributes keys evenly across buckets, (3) Fast — computable in O(k) where k is key length, (4) Avalanche — small changes in key produce very different hash values.
```
/*
 * Common Hash Functions for File Organization
 */

// Simple modulo hash (for integer keys)
function hash_mod(key: int, num_buckets: int) -> int {
    return key % num_buckets;
}

// Division method with prime number
// Using a prime number for num_buckets improves distribution
function hash_prime_mod(key: int, num_buckets: int) -> int {
    // Choose num_buckets as a prime not close to powers of 2
    // Good primes: 97, 997, 9973, 99991, 999983
    return key % num_buckets;
}

// Multiplication method (Knuth's suggestion)
function hash_multiply(key: int, num_buckets: int) -> int {
    const A = 0.6180339887;  // (sqrt(5) - 1) / 2, golden ratio
    double fractional_part = (key * A) - floor(key * A);
    return floor(num_buckets * fractional_part);
}

// String hashing (polynomial rolling hash)
function hash_string(key: string, num_buckets: int) -> int {
    const int BASE = 31;          // Small prime
    const int MOD = 1000000007;   // Large prime
    int64 hash_value = 0;
    int64 power = 1;
    for (char c in key) {
        hash_value = (hash_value + (c - 'a' + 1) * power) % MOD;
        power = (power * BASE) % MOD;
    }
    return hash_value % num_buckets;
}

// FNV-1a hash (commonly used in practice)
function fnv1a_hash(key: bytes, num_buckets: int) -> int {
    const int64 FNV_OFFSET = 14695981039346656037;
    const int64 FNV_PRIME = 1099511628211;
    int64 hash = FNV_OFFSET;
    for (byte b in key) {
        hash = hash XOR b;
        hash = hash * FNV_PRIME;
    }
    return hash % num_buckets;
}
```

Key Characteristics of Hash Files:
Direct Access: Given a key, compute bucket address immediately—no searching
No Ordering: Records with consecutive keys may be in completely different buckets; there's no physical ordering
Uniform Distribution Goal: A good hash function distributes records evenly, but perfect uniformity is not guaranteed
Collision Inevitable: Different keys may hash to the same bucket (collisions); handling collisions is crucial
Fixed Structure (Static Hashing): The number of buckets is fixed at file creation; this limits flexibility
Static hashing uses a fixed number of buckets determined at file creation. The structure consists of three main components:
1. Bucket Directory
A fixed-size array that maps bucket numbers to page addresses:
```
/*
 * Static Hash File Structure
 *
 * ┌─────────────────────────────────────────────────────┐
 * │                     FILE HEADER                     │
 * │  - Number of buckets (N)                            │
 * │  - Hash function identifier                         │
 * │  - Records per bucket                               │
 * │  - Total record count                               │
 * └─────────────────────────────────────────────────────┘
 *
 * ┌─────────────────────────────────────────────────────┐
 * │                  BUCKET DIRECTORY                   │
 * │  Bucket 0   → Page 10                               │
 * │  Bucket 1   → Page 23                               │
 * │  Bucket 2   → Page 5                                │
 * │  Bucket 3   → Page 17                               │
 * │  ...                                                │
 * │  Bucket N-1 → Page 42                               │
 * └─────────────────────────────────────────────────────┘
 *
 *   Bucket 0                      Bucket 1
 *   ┌─────────────────────┐       ┌─────────────────────┐
 *   │ Primary Page        │       │ Primary Page        │
 *   │ Record (key=100)    │       │ Record (key=7)      │
 *   │ Record (key=300)    │       │ Record (key=107)    │
 *   │ Record (key=700)    │       │ Record (key=207)    │
 *   │ ...                 │       │ ...                 │
 *   │ Overflow ptr → ─────────┐   │ Overflow ptr → NULL │
 *   └─────────────────────┘   │   └─────────────────────┘
 *                             ▼
 *                ┌─────────────────────┐
 *                │ Overflow Page       │
 *                │ Record (key=500)    │
 *                │ Record (key=900)    │
 *                │ Overflow ptr → NULL │
 *                └─────────────────────┘
 */

struct HashFileHeader {
    int32  magic_number;
    int32  version;
    int64  num_buckets;          // N - fixed at creation
    int64  records_per_bucket;   // Capacity before overflow
    int64  total_records;
    int64  overflow_records;     // Records in overflow chains
    int32  hash_function_id;     // Which hash function to use
    int32  key_offset;           // Offset of key within record
    int32  key_length;           // Length of key field
    byte   reserved[200];
}

struct BucketDirectory {
    PageID entries[NUM_BUCKETS];  // Direct mapping: bucket # → page ID
}

struct BucketPage {
    int32  bucket_id;       // Which bucket this page belongs to
    int32  record_count;    // Records currently in this page
    int32  free_space;      // Available space
    PageID overflow_ptr;    // Link to next page in chain (or NULL)
    Record records[];       // Actual record data
}
```

2. Primary Buckets
Each bucket has a primary page that stores the records hashing to that bucket (see the BucketPage structure above).
3. Overflow Handling
When a bucket's primary page is full and more records hash to it, an overflow page is allocated and linked from the last page in the bucket's chain; subsequent searches must follow this chain.
Choosing the Number of Buckets:
The choice of N (number of buckets) is critical:
| Factor | Fewer Buckets | More Buckets |
|---|---|---|
| Records per bucket | Many | Few |
| Collision probability | High | Low |
| Directory size | Small | Large |
| Space utilization | High | May have empty buckets |
| Overflow likelihood | High | Low |
Rule of Thumb: Choose N such that each bucket is 50-80% full on average:
N ≈ (Expected Records) / (Records per Page × Target Fill Factor)
Using a prime number for N improves hash distribution, especially when key values have patterns. For example, if keys are multiples of 10 and N=100, only 10% of buckets would be used! With N=97 (prime), keys distribute more evenly.
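As a rough illustration of this sizing rule, the sketch below (plain Python; the record count, page capacity, and fill factor are assumed values, not figures from the text) computes a bucket count and rounds it up to the next prime:

```python
import math

# Hypothetical sizing helper: apply N ≈ records / (records_per_page × fill_factor),
# then round up to a prime to improve distribution for patterned keys.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n: int) -> int:
    while not is_prime(n):
        n += 1
    return n

def choose_num_buckets(expected_records: int, records_per_page: int,
                       target_fill: float = 0.7) -> int:
    raw = expected_records / (records_per_page * target_fill)
    return next_prime(math.ceil(raw))

# Assumed workload: 1,000,000 records, 40 records per page, 70% target fill.
print(choose_num_buckets(1_000_000, 40))   # next prime at or above ~35,715
```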
Let's examine the fundamental operations on hash files with detailed algorithms and complexity analysis.
SEARCH Operation (Equality Only)
Searching for a record with key K:
```
function hash_search(HashFile file, Key target) -> Record? {
    // Step 1: Compute bucket address
    int bucket_num = hash(target) % file.header.num_buckets;

    // Step 2: Get primary page for this bucket
    PageID page_id = file.directory.entries[bucket_num];

    // Step 3: Search within bucket (including overflow chain)
    while (page_id != NULL) {
        Page page = buffer_pool.get_page(file.file_id, page_id);

        // Linear search within page (records are unordered in bucket)
        for (int i = 0; i < page.record_count; i++) {
            Record record = page.get_record(i);
            if (extract_key(record) == target) {
                return record;  // Found!
            }
        }

        // Move to overflow page if exists
        page_id = page.overflow_ptr;
    }

    return null;  // Not found in this bucket
}

/*
 * Complexity Analysis:
 *
 * Best case (no overflow, found in first check):
 *   - Hash computation: O(1)
 *   - Directory lookup: O(1)
 *   - Page read: 1 I/O
 *   - Search within page: O(records per page)
 *   - Total: O(1) I/O, O(r) comparisons
 *
 * Average case (short overflow chain):
 *   - Pages in chain: 1 + (overflow_records / bucket_capacity)
 *   - Total: O(1 + m) I/Os where m = overflow chain length
 *
 * Worst case (severe overflow, all records in one bucket):
 *   - Degenerates to O(n) if all records hash to same bucket
 *   - This is why good hash functions matter!
 */
```

INSERT Operation
Inserting a record with key K:
```
function hash_insert(HashFile file, Record new_record) -> RID {
    Key key = extract_key(new_record);

    // Step 1: Compute bucket
    int bucket_num = hash(key) % file.header.num_buckets;

    // Step 2: Get primary page for bucket
    PageID page_id = file.directory.entries[bucket_num];
    Page page = buffer_pool.get_page(file.file_id, page_id);

    // Step 3: Find page with space (follow chain if needed)
    PageID last_page_id = page_id;
    while (page.is_full() && page.overflow_ptr != NULL) {
        last_page_id = page.overflow_ptr;
        page = buffer_pool.get_page(file.file_id, page.overflow_ptr);
    }

    // Step 4: Insert into page or create overflow
    if (page.has_space_for(new_record)) {
        int slot = page.add_record(new_record);
        buffer_pool.mark_dirty(page);
        file.header.total_records++;
        return RID(page.page_id, slot);
    }

    // Step 5: Need overflow page
    PageID new_page_id = allocate_overflow_page(file, bucket_num);
    Page new_page = buffer_pool.get_page(file.file_id, new_page_id);

    // Link from last page in chain
    page.overflow_ptr = new_page_id;
    buffer_pool.mark_dirty(page);

    // Insert into new overflow page
    int slot = new_page.add_record(new_record);
    buffer_pool.mark_dirty(new_page);
    file.header.total_records++;
    file.header.overflow_records++;

    return RID(new_page_id, slot);
}

/*
 * Complexity:
 *   - Best case: Hash + 1 I/O to read, 1 I/O to write = O(1) I/O
 *   - Average case: O(1 + chain_length) I/Os
 *   - Worst case: O(chain_length) to traverse + 1 to write
 */
```

DELETE Operation
Deleting a record follows similar logic to search:
```
function hash_delete(HashFile file, Key target) -> bool {
    // Step 1: Find the bucket
    int bucket_num = hash(target) % file.header.num_buckets;
    PageID page_id = file.directory.entries[bucket_num];

    // Step 2: Search and delete
    PageID prev_page_id = NULL;
    while (page_id != NULL) {
        Page page = buffer_pool.get_page(file.file_id, page_id);

        for (int i = 0; i < page.record_count; i++) {
            Record record = page.get_record(i);
            if (extract_key(record) == target) {
                // Found! Delete the record
                page.delete_record(i);
                buffer_pool.mark_dirty(page);
                file.header.total_records--;

                // Optional: Coalesce empty overflow pages
                if (page.is_empty() && prev_page_id != NULL) {
                    coalesce_overflow_chain(file, bucket_num, prev_page_id, page_id);
                }

                return true;
            }
        }

        prev_page_id = page_id;
        page_id = page.overflow_ptr;
    }

    return false;  // Key not found
}

function coalesce_overflow_chain(HashFile file, int bucket, PageID prev_id, PageID empty_id) {
    Page prev_page = buffer_pool.get_page(file.file_id, prev_id);
    Page empty_page = buffer_pool.get_page(file.file_id, empty_id);

    // Skip over the empty page
    prev_page.overflow_ptr = empty_page.overflow_ptr;
    buffer_pool.mark_dirty(prev_page);

    // Return empty page to free list
    file.free_page(empty_id);
}
```

The Overflow Problem
Overflow is the primary challenge in static hashing. When more records hash to a bucket than it can hold, we must handle the excess. The degree of overflow significantly impacts performance.
Overflow Causes: a bucket overflows when more records hash to it than its primary page can hold, typically because the file has grown beyond its planned capacity (load factor α > 1), because the hash function distributes keys unevenly, or because many records share the same key value.

Overflow Handling Strategies:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Open Chaining | Overflow pages linked to bucket | Simple, unlimited overflow | Chain traversal adds I/O |
| Linear Probing | Check next bucket if full | Good cache locality | Clustering, primary bucket pollution |
| Quadratic Probing | Check bucket+1², +2², etc. | Reduces clustering | May not find empty slot |
| Double Hashing | Use second hash for probe sequence | Best distribution | More complex, two hash computations |
Open Chaining (Most Common for Files)
Open chaining links overflow pages into a chain from each bucket. This is the standard approach for file-level hashing because overflow pages can be allocated on demand with no limit on chain length, records never move out of their home bucket, and other buckets' primary pages are never polluted with foreign records (as happens with probing schemes).
Performance Impact of Overflow:
```
/*
 * Overflow Chain Analysis
 *
 * Given:
 *   - N = number of buckets
 *   - n = number of records
 *   - b = records per page (bucket capacity)
 *   - α = load factor = n / (N × b)
 *
 * Expected chain length (with good hash function):
 *   E[chain length] ≈ max(1, α)
 *
 * Example:
 *   N = 1000 buckets
 *   b = 40 records per page
 *   Capacity = 40,000 records
 *
 *   If n = 40,000:  α = 1.0,  average 1 page per bucket (ideal)
 *   If n = 80,000:  α = 2.0,  average 2 pages per bucket
 *   If n = 400,000: α = 10.0, average 10 pages per bucket (severe overflow!)
 *
 * Search cost with overflow chains:
 *   Average pages read = 1 + (α - 1)/2 = (α + 1)/2 if α > 1
 *   For α = 10: average 5.5 page reads per search
 *
 * This is why static hashing struggles with growth!
 */

// Overflow statistics collector
function analyze_overflow(HashFile file) -> OverflowStats {
    int total_buckets = file.header.num_buckets;
    int total_overflow_pages = 0;
    int max_chain_length = 0;
    int empty_buckets = 0;

    for (int i = 0; i < total_buckets; i++) {
        int chain_length = 0;
        PageID page_id = file.directory.entries[i];

        while (page_id != NULL) {
            chain_length++;
            Page page = buffer_pool.get_page(file.file_id, page_id);
            if (chain_length == 1 && page.record_count == 0) {
                empty_buckets++;
            }
            page_id = page.overflow_ptr;
        }

        total_overflow_pages += max(0, chain_length - 1);
        max_chain_length = max(max_chain_length, chain_length);
    }

    return OverflowStats {
        avg_chain_length:    (total_buckets + total_overflow_pages) / total_buckets,
        max_chain_length:    max_chain_length,
        empty_bucket_ratio:  empty_buckets / total_buckets,
        overflow_page_ratio: total_overflow_pages / total_buckets
    };
}
```

When the load factor α exceeds 1, performance degrades rapidly. At α = 10, hash file performance approaches that of a heap file—you're essentially doing linear search within large buckets. This is why static hashing is limited: you must know your data size upfront and cannot handle significant growth without rebuilding.
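To make the degradation concrete, here is a small calculator (plain Python) that applies the chain-length approximation above to a few illustrative load factors:

```python
# Expected pages read per equality search, using the approximation from the
# analysis above: about 1 page when α <= 1, otherwise (α + 1) / 2 on average.

def expected_pages_per_search(load_factor: float) -> float:
    if load_factor <= 1.0:
        return 1.0
    return (load_factor + 1.0) / 2.0

for alpha in [0.5, 1.0, 2.0, 4.0, 10.0]:
    pages = expected_pages_per_search(alpha)
    print(f"alpha = {alpha:4.1f} -> ~{pages:.1f} page reads per equality search")
# At alpha = 10 the average is ~5.5 page reads, versus 1 for a well-sized file.
```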
The Fundamental Trade-off
Hash files provide O(1) equality search at the cost of completely sacrificing range queries. This is not an implementation limitation—it's fundamental to how hashing works.
Why Range Queries Fail:
Consider a query: WHERE price BETWEEN 100 AND 200
With a sorted file: binary search finds the first page containing price ≥ 100, then a sequential scan reads forward until price exceeds 200; only the pages holding matches are touched.
With a hash file: h(100), h(101), ..., h(200) bear no relationship to one another, so matching records are scattered across arbitrary buckets and the hash function gives no clue which buckets to read.
To answer a range query on a hash file, you must scan EVERY bucket—no better than a heap file.
```
/*
 * Range Query: WHERE key BETWEEN low AND high
 */

// Sorted File: Efficient
function range_query_sorted(SortedFile file, Key low, Key high) -> List<Record> {
    PageID start = binary_search_pages(file, low);
    List<Record> results;
    for (page in pages from start until max_key > high) {
        for (record in page where low <= key <= high) {
            results.add(record);
        }
    }
    // Cost: log(B) + number of pages containing matching records
    return results;
}

// Hash File: Must scan everything
function range_query_hash(HashFile file, Key low, Key high) -> List<Record> {
    List<Record> results;

    // Must check EVERY bucket - no way to know which contain matching keys
    for (bucket in 0 to file.header.num_buckets - 1) {
        PageID page_id = file.directory.entries[bucket];

        while (page_id != NULL) {
            Page page = buffer_pool.get_page(file.file_id, page_id);
            for (int i = 0; i < page.record_count; i++) {
                Record record = page.get_record(i);
                Key key = extract_key(record);
                if (key >= low && key <= high) {
                    results.add(record);
                }
            }
            page_id = page.overflow_ptr;
        }
    }

    // Cost: ALL pages in the file (worse than sorted, same as heap)
    return results;
}

/*
 * Performance Comparison for Range Query (10% selectivity):
 *
 * File: 250,000 pages, 10 million records
 * Range: 1 million matching records (10%)
 *
 * Sorted File: ~25,000 page reads (matching pages only)
 * Hash File:   ~250,000 page reads (all pages)
 * Heap File:   ~250,000 page reads (all pages)
 *
 * Hash file is NO BETTER than heap for range queries!
 */
```

Choose hash file organization only when your workload is dominated by equality queries. If you need both equality and range queries on the same attribute, use a B+-tree index instead—it provides O(log n) for both query types, which is nearly as good as hashing for equality and far better for ranges.
Let's consolidate the performance characteristics of hash files with precise cost formulas.
Notation: B = number of pages in the file, n = number of records, b = records per page (bucket capacity), m = average overflow chain length (overflow pages per bucket), α = load factor = n / (N × b).
| Operation | Best Case | Average Case | Worst Case | Condition |
|---|---|---|---|---|
| Equality Search | 1 I/O | (1 + m)/2 + 1 I/Os | n/b I/Os | All in one bucket |
| Insert | 2 I/Os | 2 + m I/Os | n/b I/Os | All in one bucket |
| Delete | 2 I/Os | 2 + m I/Os | n/b I/Os | All in one bucket |
| Range Query | B I/Os | B I/Os | B I/Os | Always full scan |
| Full Scan | B I/Os | B I/Os | B I/Os | Must read all pages |
Comparison with Other Organizations:
| Operation | Heap | Sorted | Hash (α ≤ 1) | Hash (α > 1) |
|---|---|---|---|---|
| Equality Search | O(B) | O(log B) | O(1) | O(α) |
| Range Search | O(B) | O(log B + k) | O(B) | O(B) |
| Insert | O(1) | O(B) | O(1) | O(α) |
| Delete (by key) | O(B) | O(B) | O(1) | O(α) |
| Sorted Output | O(B log B) | O(B) | O(B log B) | O(B log B) |
Concrete Example:
User lookup table:
Scenario A: Properly sized hash file (α = 0.8)
Scenario B: Undersized hash file (α = 4.0)
Scenario C: Sorted file
Key Insight: Hash files only outperform sorted files when well-provisioned (α < 1). With overflow, the advantage diminishes rapidly.
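The scenarios above can be made concrete with the formulas already introduced. The sketch below (plain Python) plugs in assumed parameters borrowed from the earlier range-query comparison (10 million records, 40 records per page, so roughly 250,000 data pages); the lookup table's exact figures are not specified here.

```python
import math

# Assumed parameters, taken from the range-query comparison earlier:
RECORDS = 10_000_000
RECORDS_PER_PAGE = 40
DATA_PAGES = RECORDS // RECORDS_PER_PAGE          # 250,000 data pages

def hash_search_ios(load_factor: float) -> float:
    """Expected page reads for an equality search in a static hash file."""
    return 1.0 if load_factor <= 1.0 else (load_factor + 1.0) / 2.0

def sorted_search_ios(pages: int) -> float:
    """Binary search over a sorted file reads about log2(B) pages."""
    return math.log2(pages)

print(f"Scenario A, hash file, load factor 0.8: ~{hash_search_ios(0.8):.1f} I/O per lookup")
print(f"Scenario B, hash file, load factor 4.0: ~{hash_search_ios(4.0):.1f} I/Os per lookup")
print(f"Scenario C, sorted file ({DATA_PAGES:,} pages): ~{sorted_search_ios(DATA_PAGES):.1f} I/Os per lookup")
```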
Hash files provide only a constant factor improvement over B+-trees for equality searches (1 I/O vs ~3-4 I/Os). Given the range query limitation and growth constraints, hash file organization is rarely used for primary storage in modern databases. However, hashing remains crucial for: hash indexes, hash joins, hash partitioning, and in-memory hash tables.
While pure hash file organization is less common today, hashing concepts appear throughout database systems:
1. Hash Indexes
Most databases support hash indexes as an alternative to B+-trees:
`CREATE INDEX ... USING hash`. Hash indexes provide O(1) lookup, but only for equality queries.
```sql
-- PostgreSQL: Create a hash index
CREATE INDEX idx_users_email_hash ON users USING hash (email);

-- This index accelerates:
SELECT * FROM users WHERE email = 'user@example.com';  -- Fast!

-- But NOT:
SELECT * FROM users WHERE email LIKE 'user%';  -- Full scan
SELECT * FROM users WHERE email > 'a@';        -- Full scan
SELECT * FROM users ORDER BY email;            -- No help

-- When to use hash index:
-- 1. Column used ONLY for equality comparisons
-- 2. Cardinality too high for bitmap index
-- 3. Value distribution is uniform
-- 4. Saving a few I/Os matters (very high throughput)

-- Hash index limitations in PostgreSQL:
-- - Not WAL-logged (before v10), possible data loss
-- - Cannot be used for unique constraints
-- - Cannot include additional columns (covering index)
-- - Generally, B-tree is preferred unless proven otherwise
```

2. Hash Joins
Hash-based algorithms are fundamental to query processing: a hash join builds an in-memory hash table on the join key of the smaller input, then probes it with each row of the larger input, replacing repeated comparisons with constant-time lookups; a sketch follows below.
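A minimal in-memory hash join sketch (plain Python; the tables and rows are hypothetical, and a real executor would also handle spilling to disk) showing the build and probe phases:

```python
# Build-and-probe hash join sketch with made-up row data.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the smaller input on its join key.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: look up each row of the larger input in constant time.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]

for joined in hash_join(users, orders, "user_id", "user_id"):
    print(joined)
```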
3. Hash Partitioning
Distributed databases use hash partitioning: each row is routed to a partition (and hence a node) by hashing its partition key, for example partition = h(key) mod P, so data spreads evenly across nodes without maintaining key ranges; a sketch follows below.
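A minimal hash-partitioning sketch (plain Python; the node names and keys are hypothetical) that routes each key to a node by hashing the partition key:

```python
import hashlib

# Route each key to one of P partitions. A stable hash (not Python's
# per-process randomized str hash) keeps placement consistent across restarts.

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    node = NODES[partition_for(user_id, len(NODES))]
    print(f"{user_id} -> {node}")
```

Note that simple modulo placement reshuffles most keys when the partition count changes, which is why systems such as Cassandra use consistent hashing to limit data movement.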
| System | Hash Feature | Purpose |
|---|---|---|
| PostgreSQL | Hash indexes, hash joins | Point lookups, join processing |
| MySQL/InnoDB | Adaptive hash index (internal) | Hot page acceleration |
| Oracle | Hash clusters, hash partitioning | Grouped access, distribution |
| Cassandra | Consistent hashing | Data distribution across nodes |
| Redis | Hash data type | Field-level access within objects |
| MongoDB | Hashed shard keys | Even data distribution |
Static hash files were more common in earlier systems (1970s-1980s) when workloads were more predictable. Dynamic hashing schemes (extendible hashing, linear hashing) addressed the growth problem but added complexity. Today, B+-trees dominate for primary storage, with hashing used strategically for specific access patterns.
We've thoroughly examined hash file organization—understanding how hashing enables constant-time access while imposing significant constraints. The key insights: hash files deliver O(1) equality search by computing a record's bucket directly from its key; overflow chains erode that advantage as the load factor climbs past 1; static hashing cannot grow gracefully without rebuilding; and range queries degenerate to full file scans, which is why modern systems use hashing for indexes, joins, and partitioning rather than for primary storage.
What's Next:
We've now examined all three fundamental file organizations: heap, sorted, and hash. In the next page, we'll conduct a direct comparison of these approaches, analyzing when each excels and providing decision frameworks for choosing the right organization for your workload.
You now have a comprehensive understanding of hash file organization—its O(1) lookup capability, the overflow challenge, the range query limitation, and its modern applications. This knowledge is essential for understanding hash indexes, hash joins, and distributed system partitioning strategies.