To the operating system, every file is just a sequence of bytes—an undifferentiated stream of 0s and 1s with no inherent meaning. Yet when you open a PDF, your reader shows formatted text, images, and interactive elements. When you play an MP3, you hear music. When you run an executable, code executes.
The magic happens through file structure—the conventions, formats, and organization that transform raw bytes into meaningful data. This page explores the spectrum of file structures, from the OS-level view of unstructured streams to the rich, complex formats that applications use.
Understanding file structure is essential for:
By the end of this page, you will understand how operating systems view file structure (or the lack thereof), the difference between text and binary files, how file formats impose structure on raw bytes, record-oriented vs stream-oriented files, and how modern structured formats work.
Modern operating systems (Unix, Linux, Windows, macOS) treat files as unstructured sequences of bytes. The OS provides:
This is the stream model of files—the OS sees only a tape of bytes that can be read in any order.
What the OS Stores:
File "document.pdf":
[0x25][0x50][0x44][0x46][0x2D][0x31][0x2E][0x37]... (raw bytes)
% P D F - 1 . 7 ... (ASCII interpretation)
The OS sees: bytes 0-7 contain 0x25, 0x50, 0x44, ...
The OS does NOT see: "This is a PDF version 1.7"
Why Unstructured?
Early operating systems (especially mainframes) had more structured file models—files composed of fixed-length records, with OS-managed fields. This seems helpful, but it created problems:
The Unix designers made a radical choice: files are just bytes. Structure is the application's responsibility. This simplicity enabled the explosion of diverse file formats we have today.
Internally, file systems store data in fixed-size blocks (typically 4 KB). But this is hidden from applications, which see a smooth byte stream. A request to read bytes 4000-5000 spans two 4 KB blocks (block 0 holds bytes 0-4095, block 1 holds bytes 4096-8191), but the OS handles this transparently.
Text files contain human-readable characters organized into lines. They're the simplest structured files, with only one structural element: the line break.
Character Encoding:
Text files store characters as numeric codes. The encoding determines which number represents which character:
| Encoding | Bytes/Char | Range | Notes |
|---|---|---|---|
| ASCII | 1 | 0-127 | English letters, digits, symbols |
| Latin-1 (ISO-8859-1) | 1 | 0-255 | Western European languages |
| UTF-8 | 1-4 | All Unicode | Variable-width, ASCII-compatible |
| UTF-16 | 2 or 4 | All Unicode | Windows internal, Java strings |
| UTF-32 | 4 | All Unicode | Fixed-width, rarely used in files |
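The byte counts in the table can be checked with Python's built-in codecs. This is a small sketch; the sample characters and the helper name `encoded_length` are chosen here for illustration only.

```python
# Encode the same characters under several encodings and compare byte counts.
samples = ["A", "é", "€", "😀"]
encodings = ["latin-1", "utf-8", "utf-16-le", "utf-32-le"]

def encoded_length(ch, encoding):
    """Return the number of bytes ch occupies, or None if it is not representable."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    sizes = {enc: encoded_length(ch, enc) for enc in encodings}
    print(f"U+{ord(ch):04X}", sizes)
```

The `-le` variants are used so that no byte order mark is added to the counts.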
Line Endings—The Eternal Problem:
Different systems use different bytes to mark line ends:
| System | Line Ending | Hex | Name |
|---|---|---|---|
| Unix/Linux/macOS | `\n` | 0x0A | LF (Line Feed) |
| Windows | `\r\n` | 0x0D 0x0A | CR+LF |
| Classic Mac (pre-OS X) | `\r` | 0x0D | CR (Carriage Return) |
This causes real problems:
# A Windows text file inspected on Linux (-v makes control characters visible):
$ cat -v windows.txt
Line 1^M
Line 2^M
# The ^M is how cat -v displays the CR (0x0D) byte
# Fix with: sed -i 's/\r$//' windows.txt
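The same detection and repair can be sketched in Python by working on raw bytes, which avoids any automatic newline translation. The function names below are illustrative, not part of any library. Unlike the sed command, this version also converts lone CR characters.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF, working on raw bytes."""
    # Replace CRLF first so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def detect_newline_style(data: bytes) -> str:
    """Guess the dominant line-ending convention in a byte string."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of a CRLF pair
    counts = {"CRLF": crlf, "LF": lf, "CR": cr}
    return max(counts, key=counts.get)

windows_text = b"Line 1\r\nLine 2\r\n"
print(detect_newline_style(windows_text))
print(normalize_newlines(windows_text))
```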
Text File Structure:
Line 1: H e l l o
0x48 0x65 0x6C 0x6C 0x6F 0x0A
Line 2: W o r l d
0x57 0x6F 0x72 0x6C 0x64 0x0A
(No special header, no length fields, just characters and line breaks)
Common Text Formats:
UTF-8 has essentially won the encoding wars. It handles all world languages, is backward-compatible with ASCII, and is the default for web pages, JSON, and most modern systems. When creating new text files, always use UTF-8 unless you have a specific reason not to.
Binary files store data in formats that are not human-readable text. They contain raw bytes that must be interpreted according to a specific format—integers, floating-point numbers, structured records, compressed data, or any combination.
Why Binary Instead of Text?
| Consideration | Text Encoding | Binary Encoding |
|---|---|---|
| Size of integer 1000000 | 7 bytes ("1000000") | 4 bytes (binary int) |
| Precision of π | Variable ("3.14159...") | 8 bytes (IEEE 754 double) |
| Parsing speed | Slow (string parsing) | Fast (direct memory copy) |
| Human readability | ✓ Yes | ✗ No |
| Text editor safe | ✓ Yes | ✗ No (will corrupt) |
| Version control | ✓ Diff-friendly | ✗ Binary diff only |
Byte Order (Endianness):
Multi-byte values can be stored in two orders:
The integer 0x12345678 (4 bytes):
Big-endian ("network byte order"):
Address: 0 1 2 3
Bytes: [12] [34] [56] [78]
MSB LSB
Little-endian (x86, most ARM configurations):
Address: 0 1 2 3
Bytes: [78] [56] [34] [12]
LSB MSB
This affects how programs read binary data:
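As a minimal sketch, Python's `struct` module makes the byte order explicit: `>` selects big-endian and `<` selects little-endian. The example reuses the value 0x12345678 from the diagram above.

```python
import struct

value = 0x12345678
big = struct.pack(">I", value)      # big-endian 32-bit unsigned integer
little = struct.pack("<I", value)   # little-endian 32-bit unsigned integer
print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading with the wrong byte order silently yields a different number.
misread = struct.unpack("<I", big)[0]
print(hex(misread))     # 0x78563412
```

A file format therefore has to state its byte order, and a reader should always decode with an explicit order rather than relying on the host CPU.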
Binary File Anatomy:
Well-designed binary formats typically include:
┌─────────────────────────────────────────────┐
│ Magic Number (format identifier) │
│ e.g., 0x89PNG for PNG, 0x7fELF for ELF │
├─────────────────────────────────────────────┤
│ Header (metadata, version, sizes) │
│ - File version │
│ - Size fields │
│ - Offsets to sections │
├─────────────────────────────────────────────┤
│ Data Sections │
│ - Actual content in structured format │
│ - May be compressed │
├─────────────────────────────────────────────┤
│ Optional: Index/Directory │
│ - Fast lookup tables │
├─────────────────────────────────────────────┤
│ Optional: Trailer/Footer │
│ - Checksums, end marker │
└─────────────────────────────────────────────┘
Opening a binary file as text (or vice versa) causes corruption. Text mode may convert line endings, drop NUL bytes, or misinterpret encodings. Binary mode reads/writes bytes exactly. In C, use 'rb'/'wb' for binary; in Python, always specify 'b' mode for binary data.
Magic numbers are specific byte sequences at the beginning of files that identify their format. Since the OS doesn't care about file contents, applications need a reliable way to detect file types.
Common Magic Numbers:
| Format | Magic Bytes (Hex) | ASCII | Offset |
|---|---|---|---|
| PDF | 25 50 44 46 | `%PDF` | 0 |
| PNG | 89 50 4E 47 0D 0A 1A 0A | `\x89PNG\r\n\x1a\n` | 0 |
| JPEG | FF D8 FF | (not printable) | 0 |
| GIF | 47 49 46 38 | `GIF8` | 0 |
| ZIP/DOCX/JAR | 50 4B 03 04 | `PK\x03\x04` | 0 |
| ELF (Linux binary) | 7F 45 4C 46 | `\x7fELF` | 0 |
| PE (Windows .exe) | 4D 5A | `MZ` | 0 |
| SQLite | 53 51 4C 69 74 65 | `SQLite` | 0 |
| Gzip | 1F 8B | (not printable) | 0 |
The file Command:
Unix's file command uses a database of magic numbers (typically /usr/share/misc/magic) to identify file types:
$ file mystery
mystery: ELF 64-bit LSB executable, x86-64
$ file document.pdf
document.pdf: PDF document, version 1.7
$ file image.png
image.png: PNG image data, 1920 x 1080, 8-bit/color RGBA
$ file renamed.xyz
renamed.xyz: JPEG image data, JFIF standard 1.01
# Extension is .xyz but content is JPEG!
The file command examines content, not extensions—far more reliable for security.
Programmatic Magic Number Checking:
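#include <stdio.h>
#include <stdint.h>
#include <string.h>  /* memcmp */

/* Returns 1 if the file starts with the 8-byte PNG signature, 0 otherwise. */
int is_png(const char *filepath) {
    FILE *f = fopen(filepath, "rb");   /* binary mode: read bytes exactly */
    if (!f) return 0;
    uint8_t header[8];
    if (fread(header, 1, 8, f) != 8) { /* file shorter than the signature */
        fclose(f);
        return 0;
    }
    fclose(f);
    // PNG magic: 89 50 4E 47 0D 0A 1A 0A
    static const uint8_t png_magic[] = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
    return memcmp(header, png_magic, 8) == 0;
}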
Some formats have magic numbers at non-zero offsets. ISO images have 'CD001' at offset 32769 (within the first volume descriptor). Some container formats have the magic after a variable-length header. Robust identification may require checking multiple locations.
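One way to handle several formats and non-zero offsets is a table-driven check. The sketch below is illustrative only: the `SIGNATURES` list and the `identify` function are names invented here, and the table covers just a handful of the signatures mentioned in this section.

```python
import io

# (name, offset, magic bytes): a few signatures from this section
SIGNATURES = [
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("PDF", 0, b"%PDF"),
    ("JPEG", 0, b"\xff\xd8\xff"),
    ("GIF", 0, b"GIF8"),
    ("ZIP", 0, b"PK\x03\x04"),
    ("ELF", 0, b"\x7fELF"),
    ("Gzip", 0, b"\x1f\x8b"),
    ("ISO 9660", 32769, b"CD001"),
]

def identify(stream):
    """Return the first matching format name for a seekable binary stream, else None."""
    for name, offset, magic in SIGNATURES:
        stream.seek(offset)
        if stream.read(len(magic)) == magic:
            return name
    return None

print(identify(io.BytesIO(b"%PDF-1.7\n")))
print(identify(io.BytesIO(b"plain text")))
```

For a real file, pass an object opened with `open(path, "rb")`. A production tool such as `file` uses a much larger database and more elaborate rules.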
Some applications organize file data into records—discrete units of related data. While the OS sees only bytes, the application imposes record structure on top.
Fixed-Length Records:
All records are the same size—simple to navigate but potentially wasteful:
Record 0: [Name: 30 bytes ][Age: 4 bytes][Salary: 8 bytes]
Offset 0 Offset 30 Offset 34
Record 1: [Name: 30 bytes ][Age: 4 bytes][Salary: 8 bytes]
Offset 42 Offset 72 Offset 76
To read record N: seek to offset (N × 42)
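The 42-byte layout above can be sketched with Python's `struct` module. Two assumptions are made here that the diagram does not state: the age is a little-endian 32-bit integer and the salary is an 8-byte IEEE 754 double. The helper names are illustrative.

```python
import io
import struct

# "<" disables padding: 30-byte name, 4-byte int age, 8-byte double salary
RECORD = struct.Struct("<30sid")
assert RECORD.size == 42

def write_record(f, index, name, age, salary):
    f.seek(index * RECORD.size)
    # Names shorter than 30 bytes are padded with NUL bytes by struct.
    f.write(RECORD.pack(name.encode("utf-8"), age, salary))

def read_record(f, index):
    f.seek(index * RECORD.size)            # record N lives at offset N * 42
    raw_name, age, salary = RECORD.unpack(f.read(RECORD.size))
    return raw_name.rstrip(b"\x00").decode("utf-8"), age, salary

f = io.BytesIO()                           # stands in for a file opened in binary mode
write_record(f, 0, "Alice", 30, 85000.0)
write_record(f, 1, "Bob", 25, 62000.0)
print(read_record(f, 1))                   # jumps straight to the second record
```

The padding bytes after each short name are the wasted space that fixed-length records trade for direct seeking.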
Advantages:
Disadvantages:
Variable-Length Records:
Records vary in size—more space-efficient but harder to navigate:
[Length][Record 0 data...][Length][Record 1 data......][Length][Record 2]
↓ ↓
Describes next record size Must read sequentially (or build index)
Common Approaches:
| Strategy | Description | Example |
|---|---|---|
| Length prefix | Each record starts with size | Protocol buffers |
| Delimiter | Special byte marks record end | CSV (newline) |
| Index/directory | Separate table of offsets | ZIP central directory |
| Chunked | Fixed chunks, records span chunks | Database pages |
Example: Length-Prefixed Records
[05 00] [H e l l o] [08 00] [W o r l d ! ! !]
└─ 5 bytes of data ─┘ └─ 8 bytes of data ─┘
(each length is a 2-byte little-endian integer)
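A minimal reader and writer for this kind of layout might look like the following, assuming a 2-byte little-endian length prefix. The function names are illustrative.

```python
import io
import struct

def write_records(f, records):
    for payload in records:
        f.write(struct.pack("<H", len(payload)))  # 2-byte little-endian length
        f.write(payload)

def read_records(f):
    """Yield each payload in order; the stream has to be read sequentially."""
    while True:
        prefix = f.read(2)
        if not prefix:
            return                                # clean end of file
        if len(prefix) < 2:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<H", prefix)
        payload = f.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload

buf = io.BytesIO()
write_records(buf, [b"Hello", b"World!!!"])
buf.seek(0)
decoded = list(read_records(buf))
print(decoded)
```

Finding record N requires reading the N prefixes before it, which is why formats that need random access add an index.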
Database files (like those used by PostgreSQL, SQLite, or MySQL) use complex page-based structures with slotted pages, free space maps, overflow pages for large data, and B-tree indexes. These are far more sophisticated than simple record files but follow the same principles.
Structured text formats are text files with defined syntax for representing complex data. They combine human readability with machine parseability.
JSON (JavaScript Object Notation):
{
"name": "Alice",
"age": 30,
"active": true,
"roles": ["admin", "developer"],
"address": {
"city": "Seattle",
"zip": "98101"
}
}
Characteristics:
XML (eXtensible Markup Language):
<?xml version="1.0" encoding="UTF-8"?>
<person>
<name>Alice</name>
<age>30</age>
<roles>
<role>admin</role>
<role>developer</role>
</roles>
</person>
CSV (Comma-Separated Values):
name,age,city
Alice,30,Seattle
Bob,25,Portland
YAML (YAML Ain't Markup Language):
name: Alice
age: 30
roles:
- admin
- developer
| Format | Best For | Weaknesses |
|---|---|---|
| JSON | APIs, config, data exchange | No comments, limited types |
| XML | Documents, complex schemas | Verbose, complex parsing |
| CSV | Tabular data, spreadsheets | No nesting, escaping issues |
| YAML | Configuration files | Whitespace sensitivity, security |
| TOML | Configuration files | Less expressive for complex data |
YAML parsers in many languages support arbitrary object instantiation, which has led to security vulnerabilities. Always use 'safe' loading modes (e.g., yaml.safe_load() in Python) for untrusted input. JSON is safer in this respect because it can only represent plain data: objects, arrays, strings, numbers, booleans, and null.
For efficiency-critical applications, binary serialization formats provide compact encoding with fast parsing.
Protocol Buffers (Protobuf) - Google:
Schema definition:
message Person {
string name = 1;
int32 age = 2;
repeated string roles = 3;
}
The binary encoding uses field numbers (1, 2, 3) rather than names, saving space. A Person with name="Alice", age=30 and no roles encodes to 9 bytes (7 for the name field, 2 for the age field), compared with 25 bytes for the compact JSON {"name":"Alice","age":30}.
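To make the wire format concrete, the sketch below hand-encodes the name and age fields of the Person message without using the protobuf library. It handles only non-negative integers and strings, and the helper names are invented for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    tag = (field_number << 3) | 2     # wire type 2: length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data

def encode_int_field(field_number: int, value: int) -> bytes:
    tag = (field_number << 3) | 0     # wire type 0: varint
    return encode_varint(tag) + encode_varint(value)

person = encode_string_field(1, "Alice") + encode_int_field(2, 30)
print(person.hex(" "), len(person))   # 0a 05 41 6c 69 63 65 10 1e 9
```

Each field costs one tag byte here because the field number and wire type fit in a single varint; the field name never appears in the output.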
MessagePack:
Binary JSON-like format—no schema needed:
JSON (25 bytes): {"name":"Alice","age":30}
MsgPack (17 bytes): [0x82, 0xA4, n,a,m,e, 0xA5, A,l,i,c,e, 0xA3, a,g,e, 0x1E]
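The byte layout can be reproduced with a tiny hand-written encoder. This sketch covers only the three MessagePack types this example needs (fixmap, fixstr, positive fixint) and is not a substitute for a real MessagePack library; `pack` is a name chosen here.

```python
import json

def pack(obj) -> bytes:
    """Encode a tiny MessagePack subset: small maps, short strings, ints 0-127."""
    if isinstance(obj, dict) and len(obj) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body           # fixmap: 0x80 + entry count
    if isinstance(obj, str) and len(obj.encode("utf-8")) < 32:
        data = obj.encode("utf-8")
        return bytes([0xA0 | len(data)]) + data          # fixstr: 0xA0 + byte length
    if isinstance(obj, int) and not isinstance(obj, bool) and 0 <= obj < 128:
        return bytes([obj])                              # positive fixint
    raise ValueError(f"unsupported value: {obj!r}")

doc = {"name": "Alice", "age": 30}
packed = pack(doc)
compact_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")
print(len(packed), len(compact_json))   # 17 25
```

The saving comes from replacing quotes, colons, commas, and braces with one-byte type and length markers; the key names are still stored in full.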
| Format | Schema | Size | Speed | Use Case |
|---|---|---|---|---|
| Protobuf | Required | Smallest | Fastest | gRPC, internal services |
| FlatBuffers | Required | Small | Zero-copy | Games, mobile apps |
| Cap'n Proto | Required | Small | Zero-copy | High-performance IPC |
| MessagePack | None | Medium | Fast | JSON replacement |
| CBOR | None | Medium | Fast | IoT, web tokens |
| BSON | None | Larger | Fast | MongoDB documents |
Zero-Copy Formats:
Formats like FlatBuffers and Cap'n Proto are designed for zero-copy access—you can read data directly from the file buffer without parsing into intermediate structures:
// Traditional: parse the entire message into a separate in-memory object
// (parseFromBytes is an illustrative name, not a specific library's API)
Person p = Person::parseFromBytes(buffer);
printf("Name: %s\n", p.name().c_str());
// Zero-copy: access data in place
auto person = flatbuffers::GetRoot<Person>(buffer);
printf("Name: %s\n", person->name()->c_str()); // Points into buffer!
This is crucial for performance in games, embedded systems, and high-frequency trading.
Use JSON for APIs and config (human-readable). Use Protobuf for services (compact, fast, versioned schemas). Use FlatBuffers when parsing overhead matters (games, realtime). Use MessagePack for a simple, schema-free binary encoding of JSON-like data.
Container formats encapsulate multiple data streams or files within a single file. They provide organization while the actual data uses format-specific encodings.
ZIP - The Universal Container:
ZIP File Structure:
┌─────────────────────────────────────────────┐
│ [Local File Header 1] [File 1 Data] │
│ [Local File Header 2] [File 2 Data] │
│ ... │
│ [Central Directory] │
│ - List of all files with offsets │
│ - File attributes, sizes, CRCs │
│ [End of Central Directory Record] │
└─────────────────────────────────────────────┘
Fun fact: DOCX, XLSX, JAR, APK, and EPUB are all ZIP files with specific internal structures!
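Python's standard `zipfile` module exposes the central directory entries directly. The sketch below builds a small archive in memory (the two member names imitate a DOCX layout but the contents are placeholders) and then lists what the central directory records about each entry.

```python
import io
import zipfile

# Build a small archive in memory so the example needs no files on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("[Content_Types].xml", "<Types/>")

raw = buf.getvalue()
print(raw[:4])   # b'PK\x03\x04', the first local file header signature

# Opening an archive reads the central directory at the end of the file.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    entries = [(i.filename, i.header_offset, i.file_size, i.CRC) for i in zf.infolist()]
for entry in entries:
    print(entry)
```

Because the directory holds each entry's offset, a reader can extract one member without scanning the members before it.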
Media Containers:
Video files are containers holding multiple streams:
| Container | Extension | Typical Contents |
|---|---|---|
| MP4/M4V | .mp4, .m4v | H.264/H.265 video + AAC audio |
| MKV (Matroska) | .mkv | Any codec + subtitles + chapters |
| AVI | .avi | Legacy video + audio streams |
| WebM | .webm | VP8/VP9 video + Vorbis/Opus audio |
MP4 Container Structure:
┌──────────────────────────────────────────────┐
│ ftyp box (file type) │
│ moov box (movie metadata) │
│ └── trak box (track 1 - video) │
│ └── trak box (track 2 - audio) │
│ mdat box (actual media data) │
└──────────────────────────────────────────────┘
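Each MP4 box starts with a 4-byte big-endian size followed by a 4-byte type code, so the top-level layout can be walked with a short loop. The sketch below runs on a synthetic byte string rather than a real video; `iter_boxes` and `make_box` are illustrative names, and the box payloads are dummy data.

```python
import io
import struct

def iter_boxes(stream, end):
    """Yield (type, payload_offset, payload_size) for boxes up to byte offset end."""
    while stream.tell() < end:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:                          # 64-bit size follows the type
            (size,) = struct.unpack(">Q", stream.read(8))
            header_len = 16
        elif size == 0:                        # box extends to the end of the data
            size = end - start
        if size < header_len:
            raise ValueError("corrupt box size")
        yield box_type.decode("latin-1"), start + header_len, size - header_len
        stream.seek(start + size)              # skip the payload to the next box

def make_box(box_type, payload=b""):
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

moov = make_box(b"moov", make_box(b"trak") + make_box(b"trak"))
data = make_box(b"ftyp", b"isom\x00\x00\x02\x00") + moov + make_box(b"mdat", b"\x00" * 16)
boxes = list(iter_boxes(io.BytesIO(data), len(data)))
print([box_type for box_type, _, _ in boxes])
```

Nested boxes such as trak use the same header, so calling `iter_boxes` again on the payload range of moov lists its children.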
Document Containers:
| Format | Actually Is | Contents |
|---|---|---|
| DOCX | ZIP | XML documents + media |
| PDF | Custom | Objects, streams, dictionaries |
| ODF (.odt) | ZIP | XML + resources |
| EPUB | ZIP | XHTML chapters + metadata |
Exploring Containers:
# DOCX is just ZIP
$ unzip -l document.docx
Length Date Time Name
--------- ---------- ----- ----
1234 2024-01-15 10:00 [Content_Types].xml
5678 2024-01-15 10:00 word/document.xml
2345 2024-01-15 10:00 word/styles.xml
...
Container formats separate the 'what's inside' from 'how it's packaged.' An MKV file can hold video encoded with H.264, VP9, or AV1—the container is the same. This separation allows codecs and containers to evolve independently.
Executable files have highly structured formats that allow the OS loader to set up a process for execution. They contain code, data, and metadata about memory layout and dynamic linking.
ELF (Executable and Linkable Format) - Unix/Linux:
ELF File Structure:
┌─────────────────────────────────────────────┐
│ ELF Header (magic: 0x7F 'E' 'L' 'F') │
│ - Machine type (x86-64, ARM, etc.) │
│ - Entry point address │
│ - Program header offset │
│ - Section header offset │
├─────────────────────────────────────────────┤
│ Program Headers (for loading) │
│ - Segment addresses and sizes │
│ - Memory permissions (RX, RW, R) │
├─────────────────────────────────────────────┤
│ Sections │
│ .text - Executable code │
│ .data - Initialized data │
│ .bss - Uninitialized data │
│ .rodata - Read-only data (strings) │
│ .dynamic- Dynamic linking info │
│ .symtab - Symbol table │
├─────────────────────────────────────────────┤
│ Section Headers (for linking/debugging) │
└─────────────────────────────────────────────┘
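The fixed-size ELF header can be decoded with one `struct` format string. This sketch handles only 64-bit little-endian files and is exercised on a synthetic header with made-up addresses, not a real binary; the function name and returned keys are chosen here for illustration.

```python
import struct

# e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
# e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # 64 bytes in total

def parse_elf64_header(data: bytes) -> dict:
    """Decode the header of a little-endian 64-bit ELF file."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:        # EI_CLASS = 64-bit, EI_DATA = little-endian
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    (ident, e_type, machine, version, entry, phoff, shoff,
     flags, ehsize, phentsize, phnum, shentsize, shnum, shstrndx) = \
        ELF64_HEADER.unpack(data[:ELF64_HEADER.size])
    return {"machine": machine, "entry": entry, "phoff": phoff,
            "shoff": shoff, "phnum": phnum, "shnum": shnum}

# Synthetic header for demonstration: machine 62 is x86-64, addresses are invented.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0x2000, 0, 64, 56, 3, 64, 10, 9)
header = parse_elf64_header(sample)
print(header)
```

On a Linux system the same function can be applied to the first 64 bytes of a real executable read in binary mode, and the result compared with `readelf -h`.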
PE (Portable Executable) - Windows:
PE File Structure:
┌─────────────────────────────────────────────┐
│ DOS Header (magic: 'MZ') │
│ - Offset to PE header │
├─────────────────────────────────────────────┤
│ DOS Stub (legacy: "This program cannot...) │
├─────────────────────────────────────────────┤
│ PE Header (magic: 'PE\0\0') │
│ - Machine type │
│ - Number of sections │
│ - Optional header (image size, entry) │
├─────────────────────────────────────────────┤
│ Section Headers │
├─────────────────────────────────────────────┤
│ Sections │
│ .text - Code │
│ .data - Data │
│ .rdata - Read-only data, imports │
│ .rsrc - Resources (icons, dialogs) │
└─────────────────────────────────────────────┘
| Format | Platforms | Tools |
|---|---|---|
| ELF | Linux, BSD, Solaris, Android | readelf, objdump, nm |
| PE/COFF | Windows | dumpbin, PE Explorer |
| Mach-O | macOS, iOS | otool, nm, MachOView |
| a.out | Legacy Unix (obsolete) | Historical interest only |
The readelf and objdump commands reveal executable structure: 'readelf -h binary' shows the header, 'readelf -l binary' shows program headers (for loading), 'readelf -S binary' shows sections. Understanding these is essential for reverse engineering and security research.
Images, audio, and video use specialized formats optimized for their data characteristics—particularly compression.
Image Format Structures:
PNG (Portable Network Graphics):
PNG Structure:
┌─────────────────────────────────────────────┐
│ Signature: 89 50 4E 47 0D 0A 1A 0A │
├─────────────────────────────────────────────┤
│ IHDR Chunk (header) │
│ - Width, height, bit depth, color type │
├─────────────────────────────────────────────┤
│ [Optional chunks: tEXt, gAMA, cHRM...] │
├─────────────────────────────────────────────┤
│ IDAT Chunks (compressed image data) │
│ - zlib-compressed filtered pixel data │
├─────────────────────────────────────────────┤
│ IEND Chunk (end marker) │
└─────────────────────────────────────────────┘
PNG is lossless—every pixel is preserved exactly.
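Every PNG chunk has the same framing: a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over the type and data. The sketch below builds a minimal 1x1 grayscale image in memory and walks its chunks; `make_chunk` and `iter_chunks` are names invented for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def iter_chunks(png: bytes):
    """Yield (type, data) for each chunk, verifying the CRC as it goes."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(chunk_type + data) != crc:
            raise ValueError(f"bad CRC in {chunk_type!r} chunk")
        yield chunk_type.decode("ascii"), data
        pos += 12 + length                  # 4 length + 4 type + data + 4 CRC

# IHDR: width 1, height 1, bit depth 8, color type 0 (grayscale), then three zero bytes
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
# IDAT: one scanline = filter byte 0 followed by a single white pixel
png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
       + make_chunk(b"IDAT", zlib.compress(b"\x00\xff"))
       + make_chunk(b"IEND", b""))
chunks = list(iter_chunks(png))
width, height = struct.unpack(">II", chunks[0][1][:8])
print([name for name, _ in chunks], width, height)
```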
JPEG Structure (Lossy):
JPEG Structure:
┌─────────────────────────────────────────────┐
│ SOI Marker (FF D8) │
├─────────────────────────────────────────────┤
│ APP0 (JFIF metadata) │
├─────────────────────────────────────────────┤
│ DQT (Quantization tables) │
│ - These determine quality level │
├─────────────────────────────────────────────┤
│ SOF (Start of Frame) │
│ - Dimensions, components │
├─────────────────────────────────────────────┤
│ DHT (Huffman tables) │
├─────────────────────────────────────────────┤
│ SOS (Start of Scan) + Compressed Data │
│ - DCT-compressed image blocks │
├─────────────────────────────────────────────┤
│ EOI Marker (FF D9) │
└─────────────────────────────────────────────┘
JPEG uses lossy compression—some information is discarded for smaller size.
| Format | Compression | Best For | Limitations |
|---|---|---|---|
| PNG | Lossless | Screenshots, graphics, transparency | Large for photos |
| JPEG | Lossy | Photographs | Artifacts, no transparency |
| GIF | Lossless (limited palette) | Simple animations | 256 colors max |
| WebP | Both modes | Web images | Less universal support |
| AVIF | Both modes | Modern web images | Newer, less support |
| TIFF | Various | Archival, professional | Large, complex |
JPEG, TIFF, and other formats can contain EXIF metadata: camera settings, GPS coordinates, timestamps, and more. This is valuable for photographers but a privacy concern—always strip EXIF from images before sharing online if location privacy matters.
Database files contain highly structured data optimized for random access, concurrent modification, and crash recovery.
SQLite File Format:
SQLite stores an entire database in a single file:
SQLite File Structure:
┌─────────────────────────────────────────────┐
│ Header (100 bytes) │
│ - Magic: "SQLite format 3\0" │
│ - Page size (typically 4096) │
│ - File format versions │
│ - Database size in pages │
├─────────────────────────────────────────────┤
│ Page 1: Database Schema │
│ - sqlite_master table │
│ - CREATE TABLE statements │
├─────────────────────────────────────────────┤
│ Pages 2-N: B-tree nodes, overflow, freelist │
│ - Table B-trees (data storage) │
│ - Index B-trees (for fast lookup) │
│ - Overflow pages (for large rows) │
│ - Free pages (deleted, reusable) │
└─────────────────────────────────────────────┘
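The header fields listed above can be read back from a real database file. The sketch below creates a throwaway database with Python's `sqlite3` module and decodes three header fields by hand; the byte offsets used (page size at offset 16, database size in pages at offset 28) follow the published SQLite file format, and the table contents are placeholders.

```python
import os
import sqlite3
import struct
import tempfile

def read_sqlite_header(path):
    """Return (magic, page_size, page_count) from the 100-byte file header."""
    with open(path, "rb") as f:
        header = f.read(100)
    magic = header[:16]
    (page_size,) = struct.unpack(">H", header[16:18])   # big-endian; 1 means 65536
    if page_size == 1:
        page_size = 65536
    (page_count,) = struct.unpack(">I", header[28:32])  # database size in pages
    return magic, page_size, page_count

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.execute("INSERT INTO people VALUES ('Alice', 30)")
    conn.commit()
    conn.close()
    magic, page_size, page_count = read_sqlite_header(path)
    file_size = os.path.getsize(path)

print(magic, page_size, page_count, file_size)
```

The file size is a whole number of pages: even a one-row table occupies a full page for the schema and another for the table's B-tree root.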
Page-Based Storage:
Nearly all databases use pages (fixed-size blocks, typically 4-16 KB):
Typical Database Page:
┌─────────────────────────────────────────────┐
│ Page Header │
│ - Page type (leaf, interior, overflow) │
│ - Number of cells │
│ - Free space offset │
├─────────────────────────────────────────────┤
│ Cell Pointer Array │
│ [offset1][offset2][offset3]... │
├─────────────────────────────────────────────┤
│ Free Space │
│ │
├─────────────────────────────────────────────┤
│ Cell Content Area (grows upward) │
│ [Cell 3 data] │
│ [Cell 2 data] │
│ [Cell 1 data] │
└─────────────────────────────────────────────┘
This "slotted page" design allows efficient insertion without moving all data.
Many modern databases don't write changes directly to data files. Changes go to a Write-Ahead Log first, then are periodically checkpointed to data files. This provides crash recovery: after a crash, committed changes are restored by replaying the log, and changes from incomplete transactions are ignored. SQLite's WAL mode stores this log in a separate file named after the database with a -wal suffix.
We've explored the vast landscape of file structures—from the OS view of undifferentiated bytes to the complex formats that encode our data. Let's consolidate:
Module Complete:
You have now completed Module 1: File Concepts. You understand what files are, their attributes, the operations performed on them, the different types of files, and how file data is structured. This foundation prepares you for the next module on Access Methods—how applications read and write files sequentially, randomly, and through indexing.
Congratulations! You now have comprehensive knowledge of file concepts—the fundamental abstraction for persistent storage in operating systems. From this foundation, you can understand file system implementation, debug file format issues, design efficient data storage, and appreciate the elegance of the simple 'file as bytes' model.