Fixed-length records are elegant but limited. Real-world data is messy: user bios range from empty to thousands of characters, JSON documents vary from {} to megabytes, and images span bytes to gigabytes. Variable-length data is the norm, not the exception.
Handling variable-length data introduces fundamental challenges: how to encode each value's length, how to locate a given field quickly, and what to do when a value outgrows its page.
This page explores the techniques databases use to handle these challenges efficiently.
By the end of this page, you will understand length-prefixing and offset arrays for variable fields, TOAST and overflow page mechanisms, inline vs out-of-line storage decisions, compression strategies, and how different databases implement variable-length storage. You'll be able to reason about storage tradeoffs for schema design.
There are two primary strategies for encoding variable-length fields: length-prefixing and offset arrays.
Length-Prefixing
Each variable-length value is preceded by its length:
Field format: [length][data bytes]
Example VARCHAR values:
'Hello' → [5][H][e][l][l][o]
'' → [0]
'A longer string...' → [18][A][ ][l]...
Sequential access:
Read length → read that many bytes → repeat
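The read loop can be sketched in a few lines of Python. This is a minimal illustration that assumes 2-byte little-endian length prefixes; it is not any particular database's on-disk format:

```python
import struct

def encode_fields(values):
    """Length-prefix each value: [length][data bytes], repeated."""
    out = bytearray()
    for v in values:
        out += struct.pack("<H", len(v))   # 2-byte length caps a value at 65,535 bytes
        out += v
    return bytes(out)

def decode_fields(buf):
    """Sequential access: read length, read that many bytes, repeat."""
    fields, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<H", buf, pos)
        fields.append(bytes(buf[pos + 2:pos + 2 + n]))
        pos += 2 + n
    return fields

record = encode_fields([b"Hello", b"", b"A longer string..."])
assert decode_fields(record) == [b"Hello", b"", b"A longer string..."]
```

Note that reaching the last field required walking past every earlier one; avoiding that sequential cost is exactly what offset arrays buy.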
Advantages:
- Minimal space overhead: one length field per value
- Simple to encode and decode; a natural fit for sequential scans
Disadvantages:
- Accessing field n requires skipping fields 0 through n-1 (O(n) access)
- A corrupted length field garbles everything after it
Offset Arrays
A separate array stores the ending offset (or starting offset) of each field:
Record structure:
[offset_array][data region]
Offset array: [12][25][30]
Field 0: bytes 0-11 (length 12)
Field 1: bytes 12-24 (length 13)
Field 2: bytes 25-29 (length 5)
Random access:
offset[n-1] to offset[n] gives field n
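The same record can be sketched in Python using 2-byte end offsets (an illustrative layout, not a specific engine's):

```python
import struct

def encode_record(values):
    """[offset array][data region]; each 2-byte entry is a field's end offset."""
    offsets, data, end = bytearray(), bytearray(), 0
    for v in values:
        end += len(v)
        offsets += struct.pack("<H", end)
        data += v
    return len(values), bytes(offsets + data)

def get_field(nfields, record, n):
    """O(1) random access: field n spans offset[n-1]..offset[n]."""
    data_start = nfields * 2                      # data region follows the array
    end = struct.unpack_from("<H", record, n * 2)[0]
    start = struct.unpack_from("<H", record, (n - 1) * 2)[0] if n else 0
    return record[data_start + start:data_start + end]

# Field lengths 12, 13, 5 produce the offset array [12][25][30] shown above
nfields, record = encode_record([b"twelve chars", b"thirteen char", b"five5"])
assert get_field(nfields, record, 2) == b"five5"   # no scan of earlier fields
```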
Advantages:
- O(1) random access to any field
- Field lengths fall out of subtracting adjacent offsets
Disadvantages:
- Fixed per-field overhead, paid even for fields a query never touches
- The offset width bounds record size (2-byte offsets cap records at 64KB)
| Database | Strategy | Implementation Details |
|---|---|---|
| PostgreSQL | Length-prefix (varlena) | 1 or 4 byte length; short varlenas use 1-byte header |
| MySQL InnoDB | Length-prefix + list | Lengths stored before record header in reverse order |
| SQL Server | Offset array | Variable column offset array after NULL bitmap |
| SQLite | Length-prefix (varint) | Variable-length integer encoding for lengths |
| Oracle | Length-prefix | 1-3 byte length depending on value size |
Many systems use hybrid approaches. InnoDB stores variable-field lengths before the record (enabling quick size calculation) but the actual data uses length-prefix encoding. SQL Server's offset array gives O(1) access to any variable column, ideal for SELECT queries accessing specific columns.
The length field itself must balance overhead against maximum representable size.
| Length Field Size | Max Value Size | Overhead for 10-byte value | Overhead for 1KB value |
|---|---|---|---|
| 1 byte | 255 bytes | 10% | 0.1% |
| 2 bytes | 65,535 bytes (~64KB) | 20% | 0.2% |
| 4 bytes | 4,294,967,295 bytes (~4GB) | 40% | 0.4% |
| Variable (1-9 bytes) | Up to 2^63 bytes | 10-40% | 0.1-0.4% |
PostgreSQL's varlena Format
PostgreSQL uses a clever variable-length header (varlena) that adapts to value size:
Varlena header interpretation (first byte):
┌─────────────────────────────────────────────────────────────┐
│ If first byte has high bit = 0: │
│ - 4-byte header: bits 0-29 = length including header │
│ - bits 30-31 = compression/TOAST flags │
│ - Max untoasted size: ~1GB │
├─────────────────────────────────────────────────────────────┤
│ If first byte has high bit = 1: │
│ - 1-byte header (short varlena) │
│ - bits 1-7 = length including header │
│ - Max size: 127 bytes │
│ - Saves 3 bytes per small string! │
├─────────────────────────────────────────────────────────────┤
│ Special encodings for TOAST pointers, compressed data, etc. │
└─────────────────────────────────────────────────────────────┘
Examples:
'Hi' → 0x83 'H' 'i' (short varlena: 0x80 | 3 = header+data length)
100-char string → 0x00000068 ... (4-byte header: 104 total bytes)
This optimization saves 3 bytes per small string—significant for tables with many short VARCHAR columns.
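The header logic can be sketched in Python, using the big-endian bit layout shown in the diagram (an assumption made for readability; PostgreSQL's actual macros depend on platform byte order):

```python
def varlena_size(header):
    """Total stored size in bytes; varlena lengths include the header itself."""
    if header[0] & 0x80:                        # high bit set: 1-byte short header
        return header[0] & 0x7F                 # low 7 bits, max 127
    word = int.from_bytes(header[:4], "big")    # high bit clear: 4-byte header
    return word & 0x3FFFFFFF                    # low 30 bits; top 2 bits are flags

assert varlena_size(bytes([0x83])) == 3                      # 'Hi': 1 header + 2 data
assert varlena_size(bytes([0x00, 0x00, 0x00, 0x68])) == 104  # 100 chars + 4-byte header
```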
Variable-Length Integer Encoding (Varint)
SQLite and Protocol Buffers use varint encoding where smaller numbers use fewer bytes:
Varint encoding (SQLite style):
Values 0-127: 1 byte (high bit = 0)
Values 128-16383: 2 bytes (first byte high bit = 1, continues)
Values 16384+: 3+ bytes
Example: Length = 200
Binary: 11001000 = 0xC8
Varint: 0x81 0x48 (10000001 01001000)
First byte: 0x81 -> high bit set, continue; value bits = 0000001
Second byte: 0x48 -> high bit clear, done; value bits = 1001000
Combined: 0000001_1001000 = 11001000 = 200
Varint is space-optimal when most lengths are small, but parsing is slightly more complex than fixed-size lengths.
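The scheme above in runnable form. This is a sketch of the 7-bit idea only; real SQLite varints additionally let a ninth byte carry all eight of its bits, which is omitted here:

```python
def varint_encode(n):
    """Big-endian base-128: 7 value bits per byte, high bit = 'more bytes follow'."""
    out = bytearray([n & 0x7F])              # final byte: high bit clear
    n >>= 7
    while n:
        out.insert(0, 0x80 | (n & 0x7F))     # earlier bytes carry higher-order bits
        n >>= 7
    return bytes(out)

def varint_decode(buf):
    """Return (value, bytes consumed)."""
    value = 0
    for i, b in enumerate(buf):
        value = (value << 7) | (b & 0x7F)
        if not b & 0x80:                     # high bit clear: final byte
            return value, i + 1
    raise ValueError("truncated varint")

assert varint_encode(200) == bytes([0x81, 0x48])   # matches the worked example
assert varint_decode(varint_encode(200)) == (200, 2)
assert varint_encode(127) == b"\x7f"               # small lengths cost one byte
```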
Length fields may include just the data, or the header + data. PostgreSQL's varlena length includes the header itself. This affects offset calculations and must be handled consistently. Always verify whether documented 'length' values include headers.
What happens when a value is too large to fit on a single page? PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) provides a comprehensive solution.
The Problem
PostgreSQL pages are 8KB, and a single tuple must fit on one page (minus header overhead). But TEXT, BYTEA, and JSONB values can easily run to megabytes, far beyond what any page can hold.
TOAST Solution
When a row exceeds the TOAST threshold (about 2KB by default), TOAST kicks in: the largest variable-length attributes are compressed first; if the row is still too large, they are moved out of line into a separate TOAST table in roughly 2KB chunks, leaving only an 18-byte pointer in the main tuple.
TOAST Strategies
| Strategy | Column Directive | Behavior | Best For |
|---|---|---|---|
| PLAIN | n/a (for fixed types) | Never compress or out-of-line | Fixed-length types |
| EXTENDED | Default for varlena | Compress then out-of-line if needed | Most variable types |
| EXTERNAL | ALTER COLUMN SET STORAGE | Out-of-line but no compression | Pre-compressed data, images |
| MAIN | ALTER COLUMN SET STORAGE | Compress, but prefer keeping inline | Values rarely TOASTed |
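The EXTENDED path can be sketched as a decision function. The threshold, chunk size, and use of zlib here are illustrative stand-ins for PostgreSQL's internal constants and its pglz/LZ4 compressors:

```python
import zlib

TOAST_THRESHOLD = 2000   # stand-in for PostgreSQL's ~2KB tuple threshold
CHUNK_SIZE = 1996        # stand-in for the TOAST table's chunk size

def toast_extended(value):
    """EXTENDED strategy: keep inline if small, else compress, else go out-of-line."""
    if len(value) <= TOAST_THRESHOLD:
        return "inline", value
    compressed = zlib.compress(value)
    if len(compressed) <= TOAST_THRESHOLD:
        return "inline-compressed", compressed
    chunks = [compressed[i:i + CHUNK_SIZE]
              for i in range(0, len(compressed), CHUNK_SIZE)]
    return "out-of-line", chunks     # the inline slot keeps only a small pointer
```

EXTERNAL would skip the compression step entirely; MAIN would raise the effective threshold to keep values inline longer.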
TOAST Pointer Structure
typedef struct varatt_external {
int32 va_rawsize; // Original uncompressed size
int32 va_extsize; // External stored size (after compression)
Oid va_valueid; // OID of TOAST value
Oid va_toastrelid; // OID of TOAST table
} varatt_external;
// Struct: 16 bytes; with the 1-byte varlena header and tag byte, 18 bytes on disk
// Replaces a potentially multi-megabyte value
Accessing TOASTed values requires additional I/O to fetch chunks from the TOAST table. Queries selecting large TEXT/JSONB columns may have hidden costs. Use SELECT with limited columns to avoid fetching unneeded TOASTed data. Monitor pg_stat_user_tables.n_tup_hot_upd and TOAST table sizes.
Every database system has its own mechanism for handling values that exceed page capacity:
InnoDB Row Overflow Handling
InnoDB row formats handle large values differently:
COMPACT/REDUNDANT Format: keeps the first 768 bytes of each long column inline; the remainder spills to overflow pages via a 20-byte pointer.
DYNAMIC Format (default since MySQL 5.7): when a column overflows, the entire value moves off-page and only a 20-byte pointer stays inline.
COMPRESSED Format: same off-page behavior as DYNAMIC, with page contents additionally zlib-compressed (KEY_BLOCK_SIZE sets the compressed page size).
DYNAMIC format overflow:
┌─────────────────────────────────────────┐
│ Main page row: │
│ [header][fixed cols][off-page pointer] │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Overflow page 1 │──────┼──► Overflow page 2 ...
│ │ (up to 16KB data) │ │
│ └───────────────────┘ │
└─────────────────────────────────────────┘
Inline Thresholds: PostgreSQL considers TOASTing once a tuple exceeds roughly 2KB; InnoDB moves columns off-page when a row cannot fit in half a page (about 8KB with the default 16KB pages); SQL Server pushes variable-length columns to row-overflow pages when a row exceeds the 8,060-byte in-row limit.
Monitor for chained/migrated rows: Oracle: SELECT chain_cnt FROM dba_tables. SQL Server: sys.dm_db_index_physical_stats shows avg_record_size. PostgreSQL: pg_stat_user_tables.n_live_tup vs table size indicates bloat. High ratios of overflow to inline can indicate schema design issues.
Variable-length data often contains redundancy that compression can exploit. Databases offer multiple compression strategies:
| Database | Compression Type | Algorithm | Scope |
|---|---|---|---|
| PostgreSQL | TOAST compression | pglz (custom LZ) | Per-value, transparent |
| PostgreSQL 14+ | TOAST compression | LZ4 option | Per-value, faster than pglz |
| InnoDB | Page compression | zlib, LZ4, zstd | Per-page, transparent |
| InnoDB | Table compression | zlib | KEY_BLOCK_SIZE pages |
| SQL Server | Row compression | Dictionary + RLE | Fixed values, NULLs compacted |
| SQL Server | Page compression | Column prefix + dictionary | Page-level patterns |
| Oracle | Basic compression | Deduplicate repeated values | Per-block, read-only friendly |
| Oracle | Advanced/HCC | Columnar + algorithms | Exadata, data warehouse |
Compression Tradeoffs
Compression Decision Matrix:
                High Compression Ratio ──────────────►
          ┌────────────────────────┬────────────────────────┐
   Fast   │ LZ4, Snappy            │ zstd (balanced)        │
Decompress│ Good for OLTP,         │ Good general purpose,  │
    │     │ hot data, real-time    │ configurable levels    │
    │     ├────────────────────────┼────────────────────────┤
    ▼     │ No compression         │ zlib, gzip             │
   Slow   │ Simple, CPU-free       │ Best ratio, slow       │
Decompress│ Max speed needed       │ Cold data, archival    │
          └────────────────────────┴────────────────────────┘
When to Use Compression:
- Large, infrequently updated values: logs, documents, archived JSON
- Highly redundant text, where ratios are best
- Skip already-compressed data (JPEG, gzip archives, encrypted blobs): it rarely shrinks and wastes CPU
- For hot, latency-sensitive data, weigh decompression CPU against the I/O saved
PostgreSQL 14 added LZ4 as an alternative TOAST compression algorithm. LZ4 decompresses significantly faster than pglz (often 10x) with slightly lower compression ratio. Set default_toast_compression = 'lz4' or specify per-column: ALTER TABLE t ALTER COLUMN c SET COMPRESSION lz4;
Index pages must also handle variable-length keys, but under different constraints than heap pages:
Why Index Keys Are Special
Index keys are compared on every tree traversal, and many keys must fit on each internal page to keep fanout high. Long variable-length keys shrink fanout, deepen the tree, and add I/O to every lookup, so indexes work harder than heap pages to keep keys short.
Key Truncation and Prefix Compression
B-tree indexes can use key abbreviation in internal pages:
Leaf level: Full keys for exact matching
['Alice', 'Bob', 'Charlie', 'David', 'Eve']
Internal level: Truncated separator keys
['C', 'E'] // Minimal keys that separate children
Prefix compression (InnoDB):
Page stores common prefix once
Individual entries store suffix only
Original: ['product_12345', 'product_12346', 'product_12347']
Prefix: 'product_1234'
Entries: ['5', '6', '7']
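The transformation above is nearly a one-liner with Python's standard library (illustrative only; InnoDB's on-page encoding is more involved):

```python
import os

def compress_page_keys(keys):
    """Store the shared prefix once; each entry keeps only its suffix."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

prefix, entries = compress_page_keys(
    ["product_12345", "product_12346", "product_12347"])
assert (prefix, entries) == ("product_1234", ["5", "6", "7"])
```

Here 39 characters of key data shrink to 12 (prefix) + 3 (suffixes) = 15.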
Maximum Key Length
| Database | Max Key Size | Notes |
|---|---|---|
| PostgreSQL | ~2712 bytes | 1/3 of page, practical limit lower due to overhead |
| MySQL InnoDB | 3072 bytes | With innodb_large_prefix=ON and DYNAMIC/COMPRESSED |
| SQL Server | 900 bytes | Clustered index; 1700 for nonclustered with INCLUDE |
| SQLite | ~1/4 page | Default 1024 bytes for 4KB pages |
Strategies When Keys Are Too Long:
- Index a fixed-size hash of the value and recheck the full value on match
- Use prefix indexes (MySQL: INDEX (col(20)) indexes only the first 20 bytes)
- Index an expression over a shortened form (PostgreSQL expression indexes)
- Restructure the schema so the long value is data, not a key
Long keys reduce index fanout, increasing tree height and I/O. A B-tree with 100-entry pages has height log₁₀₀(N). With only 10-entry pages (due to large keys), height is log₁₀(N)—can double required I/O for lookups. Monitor index sizes relative to data sizes.
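The fanout arithmetic is easy to check: the height needed is the smallest h with fanout^h ≥ N (integer math avoids floating-point edge cases):

```python
def btree_height(n_entries, fanout):
    """Smallest number of levels h such that fanout**h >= n_entries."""
    height = 1
    while fanout ** height < n_entries:
        height += 1
    return height

# 100 million entries: small keys (fanout 100) vs. large keys (fanout 10)
assert btree_height(100_000_000, 100) == 4
assert btree_height(100_000_000, 10) == 8   # twice the levels, roughly twice the I/O
```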
Updates to variable-length fields present unique challenges because the new value may differ in size from the old value.
Update Scenarios
Scenario 1: New value is SMALLER
Old: 'The quick brown fox jumps over the lazy dog' (45 bytes)
New: 'Hello' (5 bytes)
Options:
a) In-place update, leaving 40 bytes as internal fragmentation
b) Delete old + insert new at different location
c) Compact page after update
Scenario 2: New value is SAME SIZE
Old: 'Hello' (5 bytes)
New: 'World' (5 bytes)
Simple in-place overwrite
Scenario 3: New value is LARGER
Old: 'Hi' (2 bytes)
New: 'Hello World' (11 bytes)
Options:
a) Extend in-place if contiguous free space available
b) Relocate within same page if total space available
c) Move to different page entirely
d) Original slot becomes forward pointer (Oracle migration)
e) MVCC: Keep old version, insert new version (PostgreSQL)
Scenario 4: Value exceeds TOAST threshold
Old: 100-byte string (inline)
New: 50KB document (must TOAST)
Old inline space freed (or kept for MVCC)
New value goes to TOAST table
Pointer replaces inline value
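The four scenarios collapse into a small decision function (the threshold and strategy names here are illustrative, not any engine's actual code path):

```python
TOAST_THRESHOLD = 2000   # illustrative stand-in for the out-of-line cutoff

def plan_update(old_len, new_len, free_after_slot):
    """Choose an update strategy based on the new value's size."""
    if new_len > TOAST_THRESHOLD:
        return "out-of-line"           # scenario 4: value moves to TOAST/overflow
    if new_len <= old_len:
        return "in-place"              # scenarios 1-2: overwrite, maybe fragment
    if new_len - old_len <= free_after_slot:
        return "extend-in-place"       # scenario 3a: grow into adjacent free space
    return "relocate"                  # scenarios 3b-3e: move, forward, or version

assert plan_update(45, 5, 0) == "in-place"          # smaller: simple overwrite
assert plan_update(2, 11, 64) == "extend-in-place"
assert plan_update(2, 11, 0) == "relocate"
assert plan_update(100, 50_000, 0) == "out-of-line"
```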
Setting a fill factor (e.g., 70%) leaves free space on each page for in-place growth. UPDATE-heavy tables with growing variable columns benefit from lower fill factors. OLAP/append-only tables can use 100% fill factor. PostgreSQL: CREATE TABLE ... WITH (fillfactor=70)
Variable-length data is the norm in real-world databases, requiring sophisticated encoding and overflow mechanisms. Key takeaways:
- Length-prefixing is compact and scan-friendly; offset arrays buy O(1) field access for a small fixed overhead
- Values that outgrow a page move out of line (TOAST, overflow pages), leaving a small pointer behind
- Compression trades CPU for I/O; match the algorithm to the data's temperature (LZ4 for hot, zlib/zstd for cold)
- Long index keys shrink fanout and deepen B-trees; truncation and prefix compression push back
- Updates that grow a value may force in-page relocation, forwarding pointers, or new MVCC row versions
What's Next:
We've explored page structure, slot directories, record layouts, and variable-length handling. The final page brings everything together with page management—the policies and mechanisms that coordinate page allocation, free space tracking, and lifecycle management across the database.
You now understand how databases handle the fundamental challenge of variable-length data—from encoding strategies through TOAST/overflow to compression. This knowledge is crucial for schema design, storage optimization, and understanding query performance characteristics. Next, we'll complete our exploration with page management.