We've seen that heap files provide O(1) inserts but O(n) searches, while sorted files offer O(log n) searches at the cost of expensive maintenance. But what if we could achieve O(1) search for equality queries—finding a record with a specific key value in constant time, regardless of file size?
Hash file organization makes this possible by using a hash function to compute the exact location of each record based on its key. Instead of searching through pages sequentially or using binary search, we simply calculate where a record belongs and go directly there.
This approach is magnificently efficient for point queries but comes with important limitations that we must thoroughly understand.
By the end of this page, you will understand how hash functions distribute records across buckets, master static hashing and its implementation details, analyze overflow handling strategies and their impact on performance, and recognize the fundamental limitation of hash files for range queries.
A hash file (also called a hashed file or direct file) organizes records using a hash function that maps key values to bucket addresses. Instead of storing records in arrival order (heap) or key order (sorted), records are placed in buckets determined by applying a mathematical function to their key.
Core Concept:
Given a key K and a hash function h(), the bucket address is:
bucket_address = h(K) mod N
where N is the number of buckets. All records with keys that hash to the same bucket address are stored together in that bucket.
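As a minimal sketch of this mapping (plain Python; the keys and bucket count are chosen for illustration, and Python's built-in hash() stands in for h()):

```python
# Place records into buckets by computing bucket = h(K) mod N.

def bucket_address(key: int, num_buckets: int) -> int:
    """Compute the bucket a record belongs to from its integer key."""
    return hash(key) % num_buckets   # built-in hash() stands in for h()

N = 8
for key in [100, 300, 700, 7, 107, 207]:
    print(f"key {key:4d} -> bucket {bucket_address(key, N)}")

# Keys that map to the same bucket (collisions) are stored together in that
# bucket; numerically adjacent keys may land in unrelated buckets.
```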
Formal Definition:
A hash file H consists of: a hash function h that maps key values to bucket numbers, a fixed set of N buckets (each occupying one or more pages on disk), and an assignment of every record with key K to bucket h(K) mod N, with overflow pages absorbing records that exceed a bucket's capacity.
A good hash function for file organization should be: (1) Deterministic — same key always produces same hash, (2) Uniform — distributes keys evenly across buckets, (3) Fast — computable in O(k) where k is key length, (4) Avalanche — small changes in key produce very different hash values.
```
/*
 * Common Hash Functions for File Organization
 */

// Simple modulo hash (for integer keys)
function hash_mod(key: int, num_buckets: int) -> int {
    return key % num_buckets;
}

// Division method with prime number
// Using a prime number for num_buckets improves distribution
function hash_prime_mod(key: int, num_buckets: int) -> int {
    // Choose num_buckets as a prime not close to powers of 2
    // Good primes: 97, 997, 9973, 99991, 999983
    return key % num_buckets;
}

// Multiplication method (Knuth's suggestion)
function hash_multiply(key: int, num_buckets: int) -> int {
    const A = 0.6180339887;  // (sqrt(5) - 1) / 2, golden ratio
    double fractional_part = (key * A) - floor(key * A);
    return floor(num_buckets * fractional_part);
}

// String hashing (polynomial rolling hash)
function hash_string(key: string, num_buckets: int) -> int {
    const int BASE = 31;          // Small prime
    const int MOD = 1000000007;   // Large prime
    int64 hash_value = 0;
    int64 power = 1;
    for (char c in key) {
        hash_value = (hash_value + (c - 'a' + 1) * power) % MOD;
        power = (power * BASE) % MOD;
    }
    return hash_value % num_buckets;
}

// FNV-1a hash (commonly used in practice)
function fnv1a_hash(key: bytes, num_buckets: int) -> int {
    const int64 FNV_OFFSET = 14695981039346656037;
    const int64 FNV_PRIME = 1099511628211;
    int64 hash = FNV_OFFSET;
    for (byte b in key) {
        hash = hash XOR b;
        hash = hash * FNV_PRIME;
    }
    return hash % num_buckets;
}
```

Key Characteristics of Hash Files:
Direct Access: Given a key, compute bucket address immediately—no searching
No Ordering: Records with consecutive keys may be in completely different buckets; there's no physical ordering
Uniform Distribution Goal: A good hash function distributes records evenly, but perfect uniformity is not guaranteed
Collision Inevitable: Different keys may hash to the same bucket (collisions); handling collisions is crucial
Fixed Structure (Static Hashing): The number of buckets is fixed at file creation; this limits flexibility
Static hashing uses a fixed number of buckets determined at file creation. The structure consists of three main components:
1. Bucket Directory
A fixed-size array that maps bucket numbers to page addresses:
```
/*
 * Static Hash File Structure
 *
 * ┌─────────────────────────────────────────────────────┐
 * │                     FILE HEADER                     │
 * │  - Number of buckets (N)                            │
 * │  - Hash function identifier                         │
 * │  - Records per bucket                               │
 * │  - Total record count                               │
 * └─────────────────────────────────────────────────────┘
 *
 * ┌─────────────────────────────────────────────────────┐
 * │                  BUCKET DIRECTORY                   │
 * │  Bucket 0   → Page 10                               │
 * │  Bucket 1   → Page 23                               │
 * │  Bucket 2   → Page 5                                │
 * │  Bucket 3   → Page 17                               │
 * │  ...                                                │
 * │  Bucket N-1 → Page 42                               │
 * └─────────────────────────────────────────────────────┘
 *
 *   Bucket 0                      Bucket 1
 *   ┌─────────────────────┐       ┌─────────────────────┐
 *   │ Primary Page        │       │ Primary Page        │
 *   │ Record (key=100)    │       │ Record (key=7)      │
 *   │ Record (key=300)    │       │ Record (key=107)    │
 *   │ Record (key=700)    │       │ Record (key=207)    │
 *   │ ...                 │       │ ...                 │
 *   │ Overflow ptr → ─────────┐   │ Overflow ptr → NULL │
 *   └─────────────────────┘   │   └─────────────────────┘
 *                             ▼
 *                ┌─────────────────────┐
 *                │ Overflow Page       │
 *                │ Record (key=500)    │
 *                │ Record (key=900)    │
 *                │ Overflow ptr → NULL │
 *                └─────────────────────┘
 */

struct HashFileHeader {
    int32  magic_number;
    int32  version;
    int64  num_buckets;          // N - fixed at creation
    int64  records_per_bucket;   // Capacity before overflow
    int64  total_records;
    int64  overflow_records;     // Records in overflow chains
    int32  hash_function_id;     // Which hash function to use
    int32  key_offset;           // Offset of key within record
    int32  key_length;           // Length of key field
    byte   reserved[200];
}

struct BucketDirectory {
    PageID entries[NUM_BUCKETS];  // Direct mapping: bucket # → page ID
}

struct BucketPage {
    int32  bucket_id;       // Which bucket this page belongs to
    int32  record_count;    // Records currently in this page
    int32  free_space;      // Available space
    PageID overflow_ptr;    // Link to next page in chain (or NULL)
    Record records[];       // Actual record data
}
```

2. Primary Buckets
Each bucket has a primary page that stores the records hashing to that bucket (see the BucketPage structure above).
3. Overflow Handling
When a bucket's primary page is full and more records hash to it, an overflow page is allocated and linked from the last page in the bucket's chain; subsequent searches must follow this chain.
Choosing the Number of Buckets:
The choice of N (number of buckets) is critical:
| Factor | Fewer Buckets | More Buckets |
|---|---|---|
| Records per bucket | Many | Few |
| Collision probability | High | Low |
| Directory size | Small | Large |
| Space utilization | High | May have empty buckets |
| Overflow likelihood | High | Low |
Rule of Thumb: Choose N such that each bucket is 50-80% full on average:
N ≈ (Expected Records) / (Records per Page × Target Fill Factor)
Using a prime number for N improves hash distribution, especially when key values have patterns. For example, if keys are multiples of 10 and N=100, only 10% of buckets would be used! With N=97 (prime), keys distribute more evenly.
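As a rough illustration of this sizing rule, the sketch below (plain Python; the record count, page capacity, and fill factor are assumed values, not figures from the text) computes a bucket count and rounds it up to the next prime:

```python
import math

# Hypothetical sizing helper: apply N ≈ records / (records_per_page × fill_factor),
# then round up to a prime to improve distribution for patterned keys.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n: int) -> int:
    while not is_prime(n):
        n += 1
    return n

def choose_num_buckets(expected_records: int, records_per_page: int,
                       target_fill: float = 0.7) -> int:
    raw = expected_records / (records_per_page * target_fill)
    return next_prime(math.ceil(raw))

# Assumed workload: 1,000,000 records, 40 records per page, 70% target fill.
print(choose_num_buckets(1_000_000, 40))   # next prime at or above ~35,715
```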
Let's examine the fundamental operations on hash files with detailed algorithms and complexity analysis.
SEARCH Operation (Equality Only)
Searching for a record with key K:
```
function hash_search(HashFile file, Key target) -> Record? {
    // Step 1: Compute bucket address
    int bucket_num = hash(target) % file.header.num_buckets;

    // Step 2: Get primary page for this bucket
    PageID page_id = file.directory.entries[bucket_num];

    // Step 3: Search within bucket (including overflow chain)
    while (page_id != NULL) {
        Page page = buffer_pool.get_page(file.file_id, page_id);

        // Linear search within page (records are unordered in bucket)
        for (int i = 0; i < page.record_count; i++) {
            Record record = page.get_record(i);
            if (extract_key(record) == target) {
                return record;  // Found!
            }
        }

        // Move to overflow page if exists
        page_id = page.overflow_ptr;
    }

    return null;  // Not found in this bucket
}

/*
 * Complexity Analysis:
 *
 * Best case (no overflow, found in first check):
 *   - Hash computation: O(1)
 *   - Directory lookup: O(1)
 *   - Page read: 1 I/O
 *   - Search within page: O(records per page)
 *   - Total: O(1) I/O, O(r) comparisons
 *
 * Average case (short overflow chain):
 *   - Pages in chain: 1 + (overflow_records / bucket_capacity)
 *   - Total: O(1 + m) I/Os where m = overflow chain length
 *
 * Worst case (severe overflow, all records in one bucket):
 *   - Degenerates to O(n) if all records hash to same bucket
 *   - This is why good hash functions matter!
 */
```

INSERT Operation
Inserting a record with key K:
```
function hash_insert(HashFile file, Record new_record) -> RID {
    Key key = extract_key(new_record);

    // Step 1: Compute bucket
    int bucket_num = hash(key) % file.header.num_buckets;

    // Step 2: Get primary page for bucket
    PageID page_id = file.directory.entries[bucket_num];
    Page page = buffer_pool.get_page(file.file_id, page_id);

    // Step 3: Find page with space (follow chain if needed)
    PageID last_page_id = page_id;
    while (page.is_full() && page.overflow_ptr != NULL) {
        last_page_id = page.overflow_ptr;
        page = buffer_pool.get_page(file.file_id, page.overflow_ptr);
    }

    // Step 4: Insert into page or create overflow
    if (page.has_space_for(new_record)) {
        int slot = page.add_record(new_record);
        buffer_pool.mark_dirty(page);
        file.header.total_records++;
        return RID(page.page_id, slot);
    }

    // Step 5: Need overflow page
    PageID new_page_id = allocate_overflow_page(file, bucket_num);
    Page new_page = buffer_pool.get_page(file.file_id, new_page_id);

    // Link from last page in chain
    page.overflow_ptr = new_page_id;
    buffer_pool.mark_dirty(page);

    // Insert into new overflow page
    int slot = new_page.add_record(new_record);
    buffer_pool.mark_dirty(new_page);
    file.header.total_records++;
    file.header.overflow_records++;

    return RID(new_page_id, slot);
}

/*
 * Complexity:
 *   - Best case: Hash + 1 I/O to read, 1 I/O to write = O(1) I/O
 *   - Average case: O(1 + chain_length) I/Os
 *   - Worst case: O(chain_length) to traverse + 1 to write
 */
```

DELETE Operation
Deleting a record follows similar logic to search:
```
function hash_delete(HashFile file, Key target) -> bool {
    // Step 1: Find the bucket
    int bucket_num = hash(target) % file.header.num_buckets;
    PageID page_id = file.directory.entries[bucket_num];

    // Step 2: Search and delete
    PageID prev_page_id = NULL;
    while (page_id != NULL) {
        Page page = buffer_pool.get_page(file.file_id, page_id);

        for (int i = 0; i < page.record_count; i++) {
            Record record = page.get_record(i);
            if (extract_key(record) == target) {
                // Found! Delete the record
                page.delete_record(i);
                buffer_pool.mark_dirty(page);
                file.header.total_records--;

                // Optional: Coalesce empty overflow pages
                if (page.is_empty() && prev_page_id != NULL) {
                    coalesce_overflow_chain(file, bucket_num, prev_page_id, page_id);
                }

                return true;
            }
        }

        prev_page_id = page_id;
        page_id = page.overflow_ptr;
    }

    return false;  // Key not found
}

function coalesce_overflow_chain(HashFile file, int bucket, PageID prev_id, PageID empty_id) {
    Page prev_page = buffer_pool.get_page(file.file_id, prev_id);
    Page empty_page = buffer_pool.get_page(file.file_id, empty_id);

    // Skip over the empty page
    prev_page.overflow_ptr = empty_page.overflow_ptr;
    buffer_pool.mark_dirty(prev_page);

    // Return empty page to free list
    file.free_page(empty_id);
}
```

The Overflow Problem
Overflow is the primary challenge in static hashing. When more records hash to a bucket than it can hold, we must handle the excess. The degree of overflow significantly impacts performance.
Overflow Causes: a bucket overflows when more records hash to it than its primary page can hold, typically because the file has grown beyond its planned capacity (load factor α > 1), because the hash function distributes keys unevenly, or because many records share the same key value.

Overflow Handling Strategies:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Open Chaining | Overflow pages linked to bucket | Simple, unlimited overflow | Chain traversal adds I/O |
| Linear Probing | Check next bucket if full | Good cache locality | Clustering, primary bucket pollution |
| Quadratic Probing | Check bucket+1², +2², etc. | Reduces clustering | May not find empty slot |
| Double Hashing | Use second hash for probe sequence | Best distribution | More complex, two hash computations |
Open Chaining (Most Common for Files)
Open chaining links overflow pages into a chain from each bucket. This is the standard approach for file-level hashing because overflow pages can be allocated on demand with no limit on chain length, records never move out of their home bucket, and other buckets' primary pages are never polluted with foreign records (as happens with probing schemes).
Performance Impact of Overflow:
```
/*
 * Overflow Chain Analysis
 *
 * Given:
 *   - N = number of buckets
 *   - n = number of records
 *   - b = records per page (bucket capacity)
 *   - α = load factor = n / (N × b)
 *
 * Expected chain length (with good hash function):
 *   E[chain length] ≈ max(1, α)
 *
 * Example:
 *   N = 1000 buckets
 *   b = 40 records per page
 *   Capacity = 40,000 records
 *
 *   If n = 40,000:  α = 1.0,  average 1 page per bucket (ideal)
 *   If n = 80,000:  α = 2.0,  average 2 pages per bucket
 *   If n = 400,000: α = 10.0, average 10 pages per bucket (severe overflow!)
 *
 * Search cost with overflow chains:
 *   Average pages read = 1 + (α - 1)/2 = (α + 1)/2 if α > 1
 *   For α = 10: average 5.5 page reads per search
 *
 * This is why static hashing struggles with growth!
 */

// Overflow statistics collector
function analyze_overflow(HashFile file) -> OverflowStats {
    int total_buckets = file.header.num_buckets;
    int total_overflow_pages = 0;
    int max_chain_length = 0;
    int empty_buckets = 0;

    for (int i = 0; i < total_buckets; i++) {
        int chain_length = 0;
        PageID page_id = file.directory.entries[i];

        while (page_id != NULL) {
            chain_length++;
            Page page = buffer_pool.get_page(file.file_id, page_id);
            if (chain_length == 1 && page.record_count == 0) {
                empty_buckets++;
            }
            page_id = page.overflow_ptr;
        }

        total_overflow_pages += max(0, chain_length - 1);
        max_chain_length = max(max_chain_length, chain_length);
    }

    return OverflowStats {
        avg_chain_length:    (total_buckets + total_overflow_pages) / total_buckets,
        max_chain_length:    max_chain_length,
        empty_bucket_ratio:  empty_buckets / total_buckets,
        overflow_page_ratio: total_overflow_pages / total_buckets
    };
}
```

When the load factor α exceeds 1, performance degrades rapidly. At α = 10, hash file performance approaches that of a heap file—you're essentially doing linear search within large buckets. This is why static hashing is limited: you must know your data size upfront and cannot handle significant growth without rebuilding.
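To make the degradation concrete, here is a small calculator (plain Python) that applies the chain-length approximation above to a few illustrative load factors:

```python
# Expected pages read per equality search, using the approximation from the
# analysis above: about 1 page when α <= 1, otherwise (α + 1) / 2 on average.

def expected_pages_per_search(load_factor: float) -> float:
    if load_factor <= 1.0:
        return 1.0
    return (load_factor + 1.0) / 2.0

for alpha in [0.5, 1.0, 2.0, 4.0, 10.0]:
    pages = expected_pages_per_search(alpha)
    print(f"alpha = {alpha:4.1f} -> ~{pages:.1f} page reads per equality search")
# At alpha = 10 the average is ~5.5 page reads, versus 1 for a well-sized file.
```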
The Fundamental Trade-off
Hash files provide O(1) equality search at the cost of completely sacrificing range queries. This is not an implementation limitation—it's fundamental to how hashing works.
Why Range Queries Fail:
Consider a query: WHERE price BETWEEN 100 AND 200
With a sorted file: binary search finds the first page containing price ≥ 100, then a sequential scan reads forward until price exceeds 200; only the pages holding matches are touched.
With a hash file: h(100), h(101), ..., h(200) bear no relationship to one another, so matching records are scattered across arbitrary buckets and the hash function gives no clue which buckets to read.
To answer a range query on a hash file, you must scan EVERY bucket—no better than a heap file.
```
/*
 * Range Query: WHERE key BETWEEN low AND high
 */

// Sorted File: Efficient
function range_query_sorted(SortedFile file, Key low, Key high) -> List<Record> {
    PageID start = binary_search_pages(file, low);
    List<Record> results;
    for (page in pages from start until max_key > high) {
        for (record in page where low <= key <= high) {
            results.add(record);
        }
    }
    // Cost: log(B) + number of pages containing matching records
    return results;
}

// Hash File: Must scan everything
function range_query_hash(HashFile file, Key low, Key high) -> List<Record> {
    List<Record> results;

    // Must check EVERY bucket - no way to know which contain matching keys
    for (bucket in 0 to file.header.num_buckets - 1) {
        PageID page_id = file.directory.entries[bucket];

        while (page_id != NULL) {
            Page page = buffer_pool.get_page(file.file_id, page_id);
            for (int i = 0; i < page.record_count; i++) {
                Record record = page.get_record(i);
                Key key = extract_key(record);
                if (key >= low && key <= high) {
                    results.add(record);
                }
            }
            page_id = page.overflow_ptr;
        }
    }

    // Cost: ALL pages in the file (worse than sorted, same as heap)
    return results;
}

/*
 * Performance Comparison for Range Query (10% selectivity):
 *
 * File: 250,000 pages, 10 million records
 * Range: 1 million matching records (10%)
 *
 * Sorted File: ~25,000 page reads (matching pages only)
 * Hash File:   ~250,000 page reads (all pages)
 * Heap File:   ~250,000 page reads (all pages)
 *
 * Hash file is NO BETTER than heap for range queries!
 */
```

Choose hash file organization only when your workload is dominated by equality queries. If you need both equality and range queries on the same attribute, use a B+-tree index instead—it provides O(log n) for both query types, which is nearly as good as hashing for equality and far better for ranges.
Let's consolidate the performance characteristics of hash files with precise cost formulas.
Notation: B = number of pages in the file, n = number of records, b = records per page (bucket capacity), m = average overflow chain length (overflow pages per bucket), α = load factor = n / (N × b).
| Operation | Best Case | Average Case | Worst Case | Condition |
|---|---|---|---|---|
| Equality Search | 1 I/O | (1 + m)/2 + 1 I/Os | n/b I/Os | All in one bucket |
| Insert | 2 I/Os | 2 + m I/Os | n/b I/Os | All in one bucket |
| Delete | 2 I/Os | 2 + m I/Os | n/b I/Os | All in one bucket |
| Range Query | B I/Os | B I/Os | B I/Os | Always full scan |
| Full Scan | B I/Os | B I/Os | B I/Os | Must read all pages |
Comparison with Other Organizations:
| Operation | Heap | Sorted | Hash (α ≤ 1) | Hash (α > 1) |
|---|---|---|---|---|
| Equality Search | O(B) | O(log B) | O(1) | O(α) |
| Range Search | O(B) | O(log B + k) | O(B) | O(B) |
| Insert | O(1) | O(B) | O(1) | O(α) |
| Delete (by key) | O(B) | O(B) | O(1) | O(α) |
| Sorted Output | O(B log B) | O(B) | O(B log B) | O(B log B) |
Concrete Example:
User lookup table:
Scenario A: Properly sized hash file (α = 0.8)
Scenario B: Undersized hash file (α = 4.0)
Scenario C: Sorted file
Key Insight: Hash files only outperform sorted files when well-provisioned (α < 1). With overflow, the advantage diminishes rapidly.
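The scenarios above can be made concrete with the formulas already introduced. The sketch below (plain Python) plugs in assumed parameters borrowed from the earlier range-query comparison (10 million records, 40 records per page, so roughly 250,000 data pages); the lookup table's exact figures are not specified here.

```python
import math

# Assumed parameters, taken from the range-query comparison earlier:
RECORDS = 10_000_000
RECORDS_PER_PAGE = 40
DATA_PAGES = RECORDS // RECORDS_PER_PAGE          # 250,000 data pages

def hash_search_ios(load_factor: float) -> float:
    """Expected page reads for an equality search in a static hash file."""
    return 1.0 if load_factor <= 1.0 else (load_factor + 1.0) / 2.0

def sorted_search_ios(pages: int) -> float:
    """Binary search over a sorted file reads about log2(B) pages."""
    return math.log2(pages)

print(f"Scenario A, hash file, load factor 0.8: ~{hash_search_ios(0.8):.1f} I/O per lookup")
print(f"Scenario B, hash file, load factor 4.0: ~{hash_search_ios(4.0):.1f} I/Os per lookup")
print(f"Scenario C, sorted file ({DATA_PAGES:,} pages): ~{sorted_search_ios(DATA_PAGES):.1f} I/Os per lookup")
```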
Hash files provide only a constant factor improvement over B+-trees for equality searches (1 I/O vs ~3-4 I/Os). Given the range query limitation and growth constraints, hash file organization is rarely used for primary storage in modern databases. However, hashing remains crucial for: hash indexes, hash joins, hash partitioning, and in-memory hash tables.
While pure hash file organization is less common today, hashing concepts appear throughout database systems:
1. Hash Indexes
Most databases support hash indexes as an alternative to B+-trees:
`CREATE INDEX ... USING hash`. Hash indexes provide O(1) lookup, but only for equality queries.
```sql
-- PostgreSQL: Create a hash index
CREATE INDEX idx_users_email_hash ON users USING hash (email);

-- This index accelerates:
SELECT * FROM users WHERE email = 'user@example.com';  -- Fast!

-- But NOT:
SELECT * FROM users WHERE email LIKE 'user%';  -- Full scan
SELECT * FROM users WHERE email > 'a@';        -- Full scan
SELECT * FROM users ORDER BY email;            -- No help

-- When to use hash index:
-- 1. Column used ONLY for equality comparisons
-- 2. Cardinality too high for bitmap index
-- 3. Value distribution is uniform
-- 4. Saving a few I/Os matters (very high throughput)

-- Hash index limitations in PostgreSQL:
-- - Not WAL-logged (before v10), possible data loss
-- - Cannot be used for unique constraints
-- - Cannot include additional columns (covering index)
-- - Generally, B-tree is preferred unless proven otherwise
```

2. Hash Joins
Hash-based algorithms are fundamental to query processing: a hash join builds an in-memory hash table on the join key of the smaller input, then probes it with each row of the larger input, replacing repeated comparisons with constant-time lookups; a sketch follows below.
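A minimal in-memory hash join sketch (plain Python; the tables and rows are hypothetical, and a real executor would also handle spilling to disk) showing the build and probe phases:

```python
# Build-and-probe hash join sketch with made-up row data.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the smaller input on its join key.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: look up each row of the larger input in constant time.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [{"order_id": 10, "user_id": 1}, {"order_id": 11, "user_id": 2}]

for joined in hash_join(users, orders, "user_id", "user_id"):
    print(joined)
```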
3. Hash Partitioning
Distributed databases use hash partitioning: each row is routed to a partition (and hence a node) by hashing its partition key, for example partition = h(key) mod P, so data spreads evenly across nodes without maintaining key ranges; a sketch follows below.
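A minimal hash-partitioning sketch (plain Python; the node names and keys are hypothetical) that routes each key to a node by hashing the partition key:

```python
import hashlib

# Route each key to one of P partitions. A stable hash (not Python's
# per-process randomized str hash) keeps placement consistent across restarts.

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    node = NODES[partition_for(user_id, len(NODES))]
    print(f"{user_id} -> {node}")
```

Note that simple modulo placement reshuffles most keys when the partition count changes, which is why systems such as Cassandra use consistent hashing to limit data movement.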
| System | Hash Feature | Purpose |
|---|---|---|
| PostgreSQL | Hash indexes, hash joins | Point lookups, join processing |
| MySQL/InnoDB | Adaptive hash index (internal) | Hot page acceleration |
| Oracle | Hash clusters, hash partitioning | Grouped access, distribution |
| Cassandra | Consistent hashing | Data distribution across nodes |
| Redis | Hash data type | Field-level access within objects |
| MongoDB | Hashed shard keys | Even data distribution |
Static hash files were more common in earlier systems (1970s-1980s) when workloads were more predictable. Dynamic hashing schemes (extendible hashing, linear hashing) addressed the growth problem but added complexity. Today, B+-trees dominate for primary storage, with hashing used strategically for specific access patterns.
We've thoroughly examined hash file organization—understanding how hashing enables constant-time access while imposing significant constraints. The key insights: hash files deliver O(1) equality search by computing a record's bucket directly from its key; overflow chains erode that advantage as the load factor climbs past 1; static hashing cannot grow gracefully without rebuilding; and range queries degenerate to full file scans, which is why modern systems use hashing for indexes, joins, and partitioning rather than for primary storage.
What's Next:
We've now examined all three fundamental file organizations: heap, sorted, and hash. In the next page, we'll conduct a direct comparison of these approaches, analyzing when each excels and providing decision frameworks for choosing the right organization for your workload.
You now have a comprehensive understanding of hash file organization—its O(1) lookup capability, the overflow challenge, the range query limitation, and its modern applications. This knowledge is essential for understanding hash indexes, hash joins, and distributed system partitioning strategies.