Operating Systems / File System Concepts

File Concepts

Level: Beginner

Duration: 60 mins

Topic: File System Concepts

5 / 5

File Structure

Raw Bytes vs. Meaningful Data

To the operating system, every file is just a sequence of bytes—an undifferentiated stream of 0s and 1s with no inherent meaning. Yet when you open a PDF, your reader shows formatted text, images, and interactive elements. When you play an MP3, you hear music. When you run an executable, code executes.

The magic happens through file structure—the conventions, formats, and organization that transform raw bytes into meaningful data. This page explores the spectrum of file structures, from the OS-level view of unstructured streams to the rich, complex formats that applications use.

Understanding file structure is essential for:

Parsing and generating file formats
Reverse engineering and forensics
Data recovery and corruption handling
Designing efficient data storage

What You Will Learn

By the end of this page, you will understand how operating systems view file structure (or the lack thereof), the difference between text and binary files, how file formats impose structure on raw bytes, record-oriented vs stream-oriented files, and how modern structured formats work.

The OS Perspective: Unstructured Byte Streams

Modern operating systems (Unix, Linux, Windows, macOS) treat files as unstructured sequences of bytes. The OS provides:

A linear address space: bytes numbered 0, 1, 2, ... N-1
Operations to read and write arbitrary byte ranges
No interpretation of byte contents
No record boundaries, field definitions, or data types

This is the stream model of files: the OS sees only a linear sequence of bytes that can be read or written at any offset.

What the OS Stores:

File "document.pdf":

[0x25][0x50][0x44][0x46][0x2D][0x31][0x2E][0x37]...  (raw bytes)
 %     P     D     F     -     1     .     7   ...   (ASCII interpretation)

The OS sees: bytes 0-7 contain 0x25, 0x50, 0x44, ...
The OS does NOT see: "This is a PDF version 1.7"
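The byte-stream model is visible directly from user code. The following is a minimal sketch, assuming a scratch file holding the eight bytes from the example above; the helper names `read_range` and `demo` are illustrative, not part of any OS API.

```python
import os
import tempfile


def read_range(path, offset, length):
    """Return up to `length` raw bytes starting at byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)          # jump to any position in the linear byte space
        return f.read(length)   # returns fewer bytes if the file ends first


def demo():
    # Write the first eight bytes of the PDF example to a scratch file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes([0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37]))
        path = tmp.name
    try:
        return read_range(path, 1, 3), read_range(path, 5, 100)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The OS hands back whatever bytes sit at the requested offsets; deciding that bytes 1-3 spell "PDF" is left entirely to the caller.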

Why Unstructured?

Early operating systems (especially mainframes) had more structured file models—files composed of fixed-length records, with OS-managed fields. This seems helpful, but it created problems:

Rigidity: Applications couldn't define custom structures
Overhead: OS enforced record sizes even when not needed
Complexity: File systems needed to understand many formats
Portability: Different systems had incompatible record formats

The Unix designers made a radical choice: files are just bytes. Structure is the application's responsibility. This simplicity enabled the explosion of diverse file formats we have today.

Block Storage vs. Byte Interface

Internally, file systems store data in fixed-size blocks (typically 4 KB). But this is hidden from applications, which see a smooth byte stream. With 4 KB blocks, a request to read bytes 4000-5000 spans two blocks (block 0 holds bytes 0-4095, block 1 holds bytes 4096-8191), but the OS handles this transparently.
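Which blocks a byte range touches comes down to integer division. A small sketch, assuming 4096-byte blocks and inclusive byte offsets (the function name is illustrative):

```python
BLOCK_SIZE = 4096  # assumed block size; real file systems vary


def blocks_for_range(first_byte, last_byte, block_size=BLOCK_SIZE):
    """Return the block numbers touched by the inclusive byte range."""
    first_block = first_byte // block_size
    last_block = last_byte // block_size
    return list(range(first_block, last_block + 1))


if __name__ == "__main__":
    print(blocks_for_range(4000, 5000))
```

With 4096-byte blocks, bytes 4000-5000 touch blocks 0 and 1, while any range that stays below offset 4096 touches only block 0.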

Text Files

Text files contain human-readable characters organized into lines. They're the simplest structured files, with only one structural element: the line break.

Character Encoding:

Text files store characters as numeric codes. The encoding determines which number represents which character:

Encoding	Bytes/Char	Range	Notes
ASCII	1	0-127	English letters, digits, symbols
Latin-1 (ISO-8859-1)	1	0-255	Western European languages
UTF-8	1-4	All Unicode	Variable-width, ASCII-compatible
UTF-16	2 or 4	All Unicode	Windows internal, Java strings
UTF-32	4	All Unicode	Fixed-width, rarely used in files
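The byte counts in the table can be checked with Python's built-in codecs. This sketch uses the `-le` codec variants so that no byte-order mark is added to the count; the function name is an illustrative choice.

```python
def encoded_sizes(text):
    """Map encoding name -> bytes needed for `text` (None if not encodable)."""
    sizes = {}
    for name in ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # character not representable in this encoding
    return sizes


if __name__ == "__main__":
    for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(repr(sample), encoded_sizes(sample))
```

A plain "A" takes one byte in ASCII, Latin-1, and UTF-8, while a character outside the Basic Multilingual Plane takes four bytes in UTF-8, UTF-16, and UTF-32 alike.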

Line Endings—The Eternal Problem:

Different systems use different bytes to mark line ends:

System	Line Ending	Hex	Name
Unix/Linux/macOS	`\n`	`0x0A`	LF (Line Feed)
Windows	`\r\n`	`0x0D 0x0A`	CR+LF
Classic Mac (pre-OS X)	`\r`	`0x0D`	CR (Carriage Return)

This causes real problems:

# A Windows text file viewed on Linux (cat -v makes control characters visible):
$ cat -v windows.txt
Line 1^M
Line 2^M

# The ^M is the CR (0x0D) left over from each CR+LF pair
# Fix with: sed -i 's/\r$//' windows.txt
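The same normalization can be done programmatically on raw bytes. A minimal sketch (the function name is illustrative); it converts both CR+LF and lone CR endings to LF:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line endings to LF."""
    # Replace CR+LF first so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    print(normalize_newlines(b"Line 1\r\nLine 2\r\n"))
```

Working on bytes in binary mode avoids any newline translation the runtime might otherwise apply while reading.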

Text File Structure:

Line 1: H    e    l    l    o    \n
        0x48 0x65 0x6C 0x6C 0x6F 0x0A

Line 2: W    o    r    l    d    \n
        0x57 0x6F 0x72 0x6C 0x64 0x0A

(No special header, no length fields, just characters and line breaks)

Common Text Formats:

Source code: C, Python, JavaScript, etc.
Configuration: INI, YAML, TOML, .env files
Data exchange: CSV, JSON (text-based)
Markup: HTML, XML, Markdown
Logs: System and application logs

UTF-8 Is Now Standard

UTF-8 has essentially won the encoding wars. It handles all world languages, is backward-compatible with ASCII, and is the default for web pages, JSON, and most modern systems. When creating new text files, always use UTF-8 unless you have a specific reason not to.

Binary Files

Binary files store data in formats that are not human-readable text. They contain raw bytes that must be interpreted according to a specific format—integers, floating-point numbers, structured records, compressed data, or any combination.

Why Binary Instead of Text?

Consideration	Text Encoding	Binary Encoding
Size of integer 1000000	7 bytes ("1000000")	4 bytes (binary int)
Precision of π	Variable ("3.14159...")	8 bytes (IEEE 754 double)
Parsing speed	Slow (string parsing)	Fast (direct memory copy)
Human readability	✓ Yes	✗ No
Text editor safe	✓ Yes	✗ No (will corrupt)
Version control	✓ Diff-friendly	✗ Binary diff only

Byte Order (Endianness):

Multi-byte values can be stored in two orders:

The integer 0x12345678 (4 bytes):

Big-endian ("network byte order"):
  Address:  0    1    2    3
  Bytes:   [12] [34] [56] [78]
           MSB              LSB

Little-endian (x86, ARM):
  Address:  0    1    2    3
  Bytes:   [78] [56] [34] [12]
           LSB              MSB

This affects how programs read binary data:

x86/x64/most ARM: Little-endian
Network protocols: Big-endian (by convention)
Java .class files: Big-endian
Most file formats: Specify one or include marker
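Python's `struct` module makes the two byte orders easy to compare. The sketch below packs the value 0x12345678 both ways and shows what happens when the bytes are read back with the wrong order; `read_u32` is an illustrative helper.

```python
import struct

VALUE = 0x12345678

BIG = struct.pack(">I", VALUE)     # big-endian 4-byte unsigned integer
LITTLE = struct.pack("<I", VALUE)  # little-endian 4-byte unsigned integer


def read_u32(data, byteorder):
    """Interpret the first four bytes as an unsigned integer."""
    return int.from_bytes(data[:4], byteorder)


if __name__ == "__main__":
    print(BIG.hex(), LITTLE.hex())
    print(hex(read_u32(LITTLE, "little")), hex(read_u32(LITTLE, "big")))
```

Reading little-endian bytes as big-endian does not fail; it silently yields a different number, which is why formats must state their byte order.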

Binary File Anatomy:

Well-designed binary formats typically include:

┌─────────────────────────────────────────────┐
│ Magic Number (format identifier)             │
│ e.g., 0x89PNG for PNG, 0x7fELF for ELF      │
├─────────────────────────────────────────────┤
│ Header (metadata, version, sizes)            │
│ - File version                               │
│ - Size fields                                │
│ - Offsets to sections                        │
├─────────────────────────────────────────────┤
│ Data Sections                                │
│ - Actual content in structured format        │
│ - May be compressed                          │
├─────────────────────────────────────────────┤
│ Optional: Index/Directory                    │
│ - Fast lookup tables                         │
├─────────────────────────────────────────────┤
│ Optional: Trailer/Footer                     │
│ - Checksums, end marker                      │
└─────────────────────────────────────────────┘
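To make the anatomy concrete, here is a sketch of a deliberately tiny, made-up format. The magic `DEMO`, the field layout, and the function names are all assumptions for illustration: a 10-byte little-endian header (magic, version, payload length), the payload, and a CRC-32 trailer.

```python
import struct
import zlib

MAGIC = b"DEMO"                   # hypothetical 4-byte magic number
HEADER = struct.Struct("<4sHI")   # magic, version (u16), payload length (u32)
TRAILER = struct.Struct("<I")     # CRC-32 of the payload


def pack_file(payload: bytes, version: int = 1) -> bytes:
    """Serialize header + payload + checksum trailer."""
    header = HEADER.pack(MAGIC, version, len(payload))
    trailer = TRAILER.pack(zlib.crc32(payload))
    return header + payload + trailer


def unpack_file(blob: bytes):
    """Validate magic, length, and checksum; return (version, payload)."""
    magic, version, length = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO file")
    start = HEADER.size
    payload = blob[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    (crc,) = TRAILER.unpack_from(blob, start + length)
    if crc != zlib.crc32(payload):
        raise ValueError("checksum mismatch")
    return version, payload


if __name__ == "__main__":
    print(unpack_file(pack_file(b"hello", version=2)))
```

The reader checks the magic first, then trusts the header's length field only as far as the data actually extends, and finally verifies the checksum.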

Never Open Binary Files in Text Mode

Opening a binary file as text (or vice versa) can corrupt data. Text mode may translate line endings, treat byte 0x1A as end-of-file on Windows, or fail on byte sequences that are invalid in the assumed encoding. Binary mode reads and writes bytes exactly. In C, use "rb"/"wb" for binary; in Python, always specify 'b' mode for binary data.

Magic Numbers and File Identification

Magic numbers are specific byte sequences at the beginning of files that identify their format. Since the OS doesn't care about file contents, applications need a reliable way to detect file types.

Common Magic Numbers:

Format	Magic Bytes (Hex)	ASCII	Offset
PDF	`25 50 44 46`	`%PDF`	0
PNG	`89 50 4E 47 0D 0A 1A 0A`	`\x89PNG\r\n\x1a\n`	0
JPEG	`FF D8 FF`	(not printable)	0
GIF	`47 49 46 38`	`GIF8`	0
ZIP/DOCX/JAR	`50 4B 03 04`	`PK\x03\x04`	0
ELF (Linux binary)	`7F 45 4C 46`	`\x7fELF`	0
PE (Windows .exe)	`4D 5A`	`MZ`	0
SQLite	`53 51 4C 69 74 65`	`SQLite`	0
Gzip	`1F 8B`	(not printable)	0

The file Command:

Unix's file command uses a database of magic numbers (typically /usr/share/misc/magic) to identify file types:

$ file mystery
mystery: ELF 64-bit LSB executable, x86-64

$ file document.pdf
document.pdf: PDF document, version 1.7

$ file image.png
image.png: PNG image data, 1920 x 1080, 8-bit/color RGBA

$ file renamed.xyz
renamed.xyz: JPEG image data, JFIF standard 1.01
# Extension is .xyz but content is JPEG!

The file command examines content, not extensions—far more reliable for security.

Programmatic Magic Number Checking:

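#include <stdio.h>
#include <stdint.h>
#include <string.h>  /* memcmp */

/* Returns 1 if the file starts with the 8-byte PNG signature, 0 otherwise. */
int is_png(const char *filepath) {
    FILE *f = fopen(filepath, "rb");  /* binary mode: no newline translation */
    if (!f) return 0;

    uint8_t header[8];
    if (fread(header, 1, sizeof header, f) != sizeof header) {
        fclose(f);
        return 0;  /* file is shorter than the signature */
    }
    fclose(f);

    // PNG magic: 89 50 4E 47 0D 0A 1A 0A
    static const uint8_t png_magic[8] = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
    return memcmp(header, png_magic, sizeof png_magic) == 0;
}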

Magic at Non-Zero Offsets

Some formats have magic numbers at non-zero offsets. ISO images have 'CD001' at offset 32769 (within the first volume descriptor). Some container formats have the magic after a variable-length header. Robust identification may require checking multiple locations.
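Checking several locations can be expressed as a table of (offset, magic) pairs. The sketch below is an illustration, not a replacement for the file command's database: it knows only four signatures, and the names `SIGNATURES` and `identify` are assumptions.

```python
SIGNATURES = [
    # (format name, offset, magic bytes)
    ("PDF", 0, b"%PDF"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
    ("GZIP", 0, b"\x1f\x8b"),
    ("ISO 9660", 32769, b"CD001"),
]


def identify(data: bytes):
    """Return the first format whose magic matches at its offset, else None."""
    for name, offset, magic in SIGNATURES:
        # Slicing past the end yields a short result, so short inputs are safe.
        if data[offset:offset + len(magic)] == magic:
            return name
    return None


if __name__ == "__main__":
    print(identify(b"%PDF-1.7"))
    print(identify(bytes(32769) + b"CD001"))
```

Because each entry carries its own offset, adding a format whose signature sits deep inside the file needs no change to the matching loop.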

Record-Oriented Files

Some applications organize file data into records—discrete units of related data. While the OS sees only bytes, the application imposes record structure on top.

Fixed-Length Records:

All records are the same size—simple to navigate but potentially wasteful:

Record 0:  [Name: 30 bytes    ][Age: 4 bytes][Salary: 8 bytes]
           Offset 0           Offset 30      Offset 34

Record 1:  [Name: 30 bytes    ][Age: 4 bytes][Salary: 8 bytes]
           Offset 42          Offset 72      Offset 76

To read record N: seek to offset (N × 42)

Advantages:

Direct access to any record: O(1) seek
Simple implementation
Easy update in place

Disadvantages:

Wasted space for short data
Cannot store data exceeding field size
Schema changes require restructuring entire file
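The 42-byte layout above maps directly onto a `struct` format string. This is a sketch under stated assumptions: the salary is stored as an 8-byte signed integer, names are UTF-8 padded with NUL bytes, and the helper names are illustrative.

```python
import io
import struct

# 30-byte name, 4-byte age, 8-byte salary; "<" disables padding, so size is 42.
RECORD = struct.Struct("<30siq")


def write_record(f, index, name, age, salary):
    """Store a record in slot `index`, overwriting whatever was there."""
    f.seek(index * RECORD.size)
    # struct pads the name with NUL bytes, or truncates it, to exactly 30 bytes.
    f.write(RECORD.pack(name.encode("utf-8"), age, salary))


def read_record(f, index):
    """Fetch record `index` with a single seek to offset index * 42."""
    f.seek(index * RECORD.size)
    raw_name, age, salary = RECORD.unpack(f.read(RECORD.size))
    return raw_name.rstrip(b"\x00").decode("utf-8"), age, salary


if __name__ == "__main__":
    f = io.BytesIO()
    write_record(f, 0, "Alice", 30, 85000)
    write_record(f, 1, "Bob", 25, 62000)
    print(read_record(f, 1))
```

The same code works on a real file opened in "r+b" mode; an in-memory buffer is used here only to keep the example self-contained.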

Variable-Length Records:

Records vary in size—more space-efficient but harder to navigate:

[Length][Record 0 data...][Length][Record 1 data......][Length][Record 2]
   ↓                          ↓
   Describes next record size  Must read sequentially (or build index)

Common Approaches:

Strategy	Description	Example
Length prefix	Each record starts with size	Protocol buffers
Delimiter	Special byte marks record end	CSV (newline)
Index/directory	Separate table of offsets	ZIP central directory
Chunked	Fixed chunks, records span chunks	Database pages

Example: Length-Prefixed Records

[05 00] [H  e  l  l  o  ] [08 00] [W  o  r  l  d  !  !  !  ]
  └─ 5 bytes of data─┘      └─ 8 bytes of data──────────┘

(Each length prefix is a 16-bit little-endian integer.)
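A length-prefixed stream takes only a few lines to write and read. The sketch below assumes a 2-byte little-endian length prefix; the function names are illustrative.

```python
import struct


def encode_records(records):
    """Prefix each record with its length as a 2-byte little-endian integer."""
    out = bytearray()
    for rec in records:
        out += struct.pack("<H", len(rec)) + rec
    return bytes(out)


def decode_records(blob):
    """Walk the blob sequentially, using each prefix to find the next record."""
    records, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        records.append(blob[pos:pos + length])
        pos += length
    return records


if __name__ == "__main__":
    blob = encode_records([b"Hello", b"World!!!"])
    print(blob.hex(" "))
    print(decode_records(blob))
```

Encoding b"Hello" produces the prefix 05 00, since the record is five bytes long. Reaching record N requires reading the N prefixes before it, which is the sequential-access cost that an index avoids.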

Databases Use Sophisticated Record Structures

Database files (like those used by PostgreSQL, SQLite, or MySQL) use complex page-based structures with slotted pages, free space maps, overflow pages for large data, and B-tree indexes. These are far more sophisticated than simple record files but follow the same principles.

Structured Text Formats

Structured text formats are text files with defined syntax for representing complex data. They combine human readability with machine parseability.

JSON (JavaScript Object Notation):

{
  "name": "Alice",
  "age": 30,
  "active": true,
  "roles": ["admin", "developer"],
  "address": {
    "city": "Seattle",
    "zip": "98101"
  }
}

Characteristics:

Human-readable
Maps directly to programming language data structures
No schema required (self-describing)
Universal support across languages
Not binary-efficient (integers stored as text)
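The direct mapping to language data structures is easy to see with Python's standard `json` module. A small sketch using a shortened version of the document above (the constant and function names are illustrative):

```python
import json

TEXT = '{"name": "Alice", "age": 30, "active": true, "roles": ["admin", "developer"]}'


def load_person(text=TEXT):
    """Parse JSON text into plain Python objects (dict, list, str, int, bool)."""
    return json.loads(text)


if __name__ == "__main__":
    person = load_person()
    print(type(person).__name__, person["roles"][0], person["active"])
```

No schema is consulted: the parser infers objects, arrays, numbers, and booleans from the syntax alone.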

XML (eXtensible Markup Language):

<?xml version="1.0" encoding="UTF-8"?>
<person>
    <name>Alice</name>
    <age>30</age>
    <roles>
        <role>admin</role>
        <role>developer</role>
    </roles>
</person>

CSV (Comma-Separated Values):

name,age,city
Alice,30,Seattle
Bob,25,Portland
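CSV looks trivial to split on commas, but quoted fields make a real parser worthwhile. This sketch uses Python's standard `csv` module; the function name is illustrative.

```python
import csv
import io


def parse_csv(text):
    """Return one dict per data row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    rows = parse_csv("name,age,city\nAlice,30,Seattle\nBob,25,Portland\n")
    print(rows[1]["city"], repr(rows[0]["age"]))
```

Every value comes back as a string, since CSV carries no type information, and a quoted field such as "Smith, J" is kept whole rather than split at its comma.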

YAML (YAML Ain't Markup Language):

name: Alice
age: 30
roles:
  - admin
  - developer

Structured Text Format Comparison
Format	Best For	Weaknesses
JSON	APIs, config, data exchange	No comments, limited types
XML	Documents, complex schemas	Verbose, complex parsing
CSV	Tabular data, spreadsheets	No nesting, escaping issues
YAML	Configuration files	Whitespace sensitivity, security
TOML	Configuration files	Less expressive for complex data

YAML Security Concerns

YAML parsers in many languages support arbitrary object instantiation, which has led to security vulnerabilities. Always use 'safe' loading modes (e.g., yaml.safe_load() in Python) for untrusted input. JSON is safer in this respect because it can only express plain data (objects, arrays, strings, numbers, booleans, and null), never language-specific objects.

Binary Serialization Formats

For efficiency-critical applications, binary serialization formats provide compact encoding with fast parsing.

Protocol Buffers (Protobuf) - Google:

Schema definition:

message Person {
  string name = 1;
  int32 age = 2;
  repeated string roles = 3;
}

The binary encoding uses field numbers (1, 2, 3) rather than names, saving space. A Person with name="Alice", age=30 and no roles encodes to 9 bytes (7 for the name field, 2 for the age field), compared with 25 bytes for the compact JSON {"name":"Alice","age":30}.
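The wire format is simple enough to encode by hand for this message. The following is a hand-rolled sketch, not the protobuf library: it handles only non-negative integers and the two fields used here, and the function names are illustrative.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_person(name: str, age: int) -> bytes:
    raw = name.encode("utf-8")
    return (
        tag(1, 2) + encode_varint(len(raw)) + raw   # field 1: length-delimited
        + tag(2, 0) + encode_varint(age)            # field 2: varint
    )


if __name__ == "__main__":
    data = encode_person("Alice", 30)
    print(len(data), data.hex(" "))
```

For name="Alice" and age=30 this produces 9 bytes: one key byte, one length byte, and five name bytes for field 1, then one key byte and one value byte for field 2. Field names never appear in the output.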

MessagePack:

Binary JSON-like format—no schema needed:

JSON (25 bytes):    {"name":"Alice","age":30}
MsgPack (17 bytes): [0x82, 0xA4, n,a,m,e, 0xA5, A,l,i,c,e, 0xA3, a,g,e, 0x1E]
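The byte sequence above can be reproduced with a few lines of code. This sketch implements only a tiny subset of MessagePack (small non-negative integers, strings up to 31 bytes, and maps with up to 15 entries) and is not a substitute for a real library; `pack_small` is an illustrative name.

```python
def pack_small(obj) -> bytes:
    """Encode a tiny MessagePack subset: fixint, fixstr, and fixmap."""
    if isinstance(obj, int) and not isinstance(obj, bool) and 0 <= obj <= 0x7F:
        return bytes([obj])                       # positive fixint: the value itself
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: 0xA0 + length
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])             # fixmap: 0x80 + entry count
        for key, value in obj.items():
            out += pack_small(key) + pack_small(value)
        return out
    raise ValueError("value outside the subset handled by this sketch")


if __name__ == "__main__":
    packed = pack_small({"name": "Alice", "age": 30})
    print(len(packed), packed.hex(" "))
```

Unlike Protobuf, the key strings "name" and "age" are stored in the output, which is why no schema is needed to decode it.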

Binary Format Comparison
Format	Schema	Size	Speed	Use Case
Protobuf	Required	Smallest	Fastest	gRPC, internal services
FlatBuffers	Required	Small	Zero-copy	Games, mobile apps
Cap'n Proto	Required	Small	Zero-copy	High-performance IPC
MessagePack	None	Medium	Fast	JSON replacement
CBOR	None	Medium	Fast	IoT, web tokens
BSON	None	Larger	Fast	MongoDB documents

Zero-Copy Formats:

Formats like FlatBuffers and Cap'n Proto are designed for zero-copy access—you can read data directly from the file buffer without parsing into intermediate structures:

// Traditional: parse the entire message into an in-memory object
Person p;
p.ParseFromArray(buffer, size);  // size: length of buffer in bytes
printf("Name: %s\n", p.name().c_str());

// Zero-copy: access data in place
auto fb_person = flatbuffers::GetRoot<Person>(buffer);
printf("Name: %s\n", fb_person->name()->c_str());  // Points into buffer!

This is crucial for performance in games, embedded systems, and high-frequency trading.

Choose Format Based on Requirements

Use JSON for APIs and config (human-readable). Use Protobuf for services (compact, fast, versioned schemas). Use FlatBuffers when parsing overhead matters (games, realtime). Use MessagePack for a simple, schema-free binary encoding of JSON-like data.

Container Formats

Container formats encapsulate multiple data streams or files within a single file. They provide organization while the actual data uses format-specific encodings.

ZIP - The Universal Container:

ZIP File Structure:
┌─────────────────────────────────────────────┐
│ [Local File Header 1] [File 1 Data]          │
│ [Local File Header 2] [File 2 Data]          │
│ ...                                          │
│ [Central Directory]                          │
│   - List of all files with offsets          │
│   - File attributes, sizes, CRCs            │
│ [End of Central Directory Record]            │
└─────────────────────────────────────────────┘

Fun fact: DOCX, XLSX, JAR, APK, and EPUB are all ZIP files with specific internal structures!
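Python's standard `zipfile` module exposes this layout. The sketch below builds a small archive in memory and lists what the central directory records about each entry; the helper names and file names are illustrative.

```python
import io
import zipfile


def build_zip(files):
    """Create a ZIP archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def list_entries(blob):
    """Return (name, uncompressed size, local header offset) per entry."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return [(i.filename, i.file_size, i.header_offset) for i in zf.infolist()]


if __name__ == "__main__":
    blob = build_zip({"a.txt": b"hello", "b.txt": b"world!"})
    print(blob[:4], list_entries(blob))
```

The archive begins with a local file header (PK\x03\x04), but a reader starts from the end-of-central-directory record near the end of the file and follows the stored offsets back to each entry.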

Media Containers:

Video files are containers holding multiple streams:

Container	Extension	Typical Contents
MP4/M4V	`.mp4`, `.m4v`	H.264/H.265 video + AAC audio
MKV (Matroska)	`.mkv`	Any codec + subtitles + chapters
AVI	`.avi`	Legacy video + audio streams
WebM	`.webm`	VP8/VP9 video + Vorbis/Opus audio

MP4 Container Structure:
┌──────────────────────────────────────────────┐
│ ftyp box (file type)                          │
│ moov box (movie metadata)                     │
│   └── trak box (track 1 - video)             │
│   └── trak box (track 2 - audio)             │
│ mdat box (actual media data)                  │
└──────────────────────────────────────────────┘
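Each box starts with a 4-byte big-endian size (which includes the header itself) followed by a 4-character type, so the top-level structure can be walked without understanding any codec. A sketch over synthetic data; `iter_boxes` and `make_box` are illustrative names, and real files deserve more validation.

```python
import struct


def iter_boxes(data: bytes):
    """Yield (type, payload) for each box found at this nesting level."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:                      # 64-bit size follows the type
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:                    # box extends to the end of the data
            size = len(data) - pos
        if size < header:
            raise ValueError("corrupt box size")
        yield box_type.decode("latin-1"), data[pos + header:pos + size]
        pos += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box with a 32-bit size field (test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


if __name__ == "__main__":
    movie = (make_box(b"ftyp", b"isom")
             + make_box(b"moov", make_box(b"trak", b"") + make_box(b"trak", b""))
             + make_box(b"mdat", bytes(10)))
    for kind, payload in iter_boxes(movie):
        print(kind, len(payload))
```

Container boxes such as moov simply hold more boxes in their payload, so calling `iter_boxes` on that payload descends one level.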

Document Containers:

Format	Actually Is	Contents
DOCX	ZIP	XML documents + media
PDF	Custom	Objects, streams, dictionaries
ODF (.odt)	ZIP	XML + resources
EPUB	ZIP	XHTML chapters + metadata

Exploring Containers:

# DOCX is just ZIP
$ unzip -l document.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1234  2024-01-15 10:00   [Content_Types].xml
     5678  2024-01-15 10:00   word/document.xml
     2345  2024-01-15 10:00   word/styles.xml
...

Containers Enable Flexibility

Container formats separate the 'what's inside' from 'how it's packaged.' An MKV file can hold video encoded with H.264, VP9, or AV1—the container is the same. This separation allows codecs and containers to evolve independently.

Executable Formats

Executable files have highly structured formats that allow the OS loader to set up a process for execution. They contain code, data, and metadata about memory layout and dynamic linking.

ELF (Executable and Linkable Format) - Unix/Linux:

ELF File Structure:
┌─────────────────────────────────────────────┐
│ ELF Header (magic: 0x7F 'E' 'L' 'F')         │
│   - Machine type (x86-64, ARM, etc.)        │
│   - Entry point address                      │
│   - Program header offset                    │
│   - Section header offset                    │
├─────────────────────────────────────────────┤
│ Program Headers (for loading)                │
│   - Segment addresses and sizes              │
│   - Memory permissions (RX, RW, R)          │
├─────────────────────────────────────────────┤
│ Sections                                     │
│   .text   - Executable code                  │
│   .data   - Initialized data                 │
│   .bss    - Uninitialized data               │
│   .rodata - Read-only data (strings)         │
│   .dynamic- Dynamic linking info             │
│   .symtab - Symbol table                     │
├─────────────────────────────────────────────┤
│ Section Headers (for linking/debugging)      │
└─────────────────────────────────────────────┘
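The fields listed in the ELF header sit at fixed offsets, so they can be read with one `struct` call. The sketch below handles only 64-bit ELF and returns a handful of fields; the function name and dictionary keys are illustrative choices.

```python
import struct


def parse_elf_header(data: bytes):
    """Extract a few fields from a 64-bit ELF header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]   # 2 = 64-bit; 1 = little-endian
    if ei_class != 2:
        raise ValueError("this sketch handles 64-bit ELF only")
    order = "<" if ei_data == 1 else ">"
    # Fields after the 16-byte e_ident: type, machine, version, entry,
    # program header offset, section header offset.
    e_type, e_machine, _version, e_entry, e_phoff, e_shoff = struct.unpack_from(
        order + "HHIQQQ", data, 16)
    return {
        "little_endian": ei_data == 1,
        "type": e_type,
        "machine": e_machine,      # 62 means x86-64
        "entry": e_entry,
        "phoff": e_phoff,
        "shoff": e_shoff,
    }


if __name__ == "__main__":
    fake = (b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
            + struct.pack("<HHIQQQ", 2, 62, 1, 0x401000, 64, 0))
    print(parse_elf_header(fake))
```

The synthetic header in the usage example stands in for the first bytes of a real binary; on a Linux system the same function can be fed the first 64 bytes of any 64-bit executable.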

PE (Portable Executable) - Windows:

PE File Structure:
┌─────────────────────────────────────────────┐
│ DOS Header (magic: 'MZ')                     │
│   - Offset to PE header                      │
├─────────────────────────────────────────────┤
│ DOS Stub (legacy: "This program cannot...)   │
├─────────────────────────────────────────────┤
│ PE Header (magic: 'PE\0\0')                  │
│   - Machine type                             │
│   - Number of sections                       │
│   - Optional header (image size, entry)      │
├─────────────────────────────────────────────┤
│ Section Headers                              │
├─────────────────────────────────────────────┤
│ Sections                                     │
│   .text   - Code                             │
│   .data   - Data                             │
│   .rdata  - Read-only data, imports          │
│   .rsrc   - Resources (icons, dialogs)       │
└─────────────────────────────────────────────┘
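Finding the PE header means following the offset stored in the DOS header at byte 0x3C. A minimal sketch over synthetic bytes; the function name and returned keys are illustrative, and only the first two fields after the signature are read.

```python
import struct


def parse_pe(data: bytes):
    """Locate the PE header via the DOS header and read two fields."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS header")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)   # offset to PE header
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    machine, num_sections = struct.unpack_from("<HH", data, pe_offset + 4)
    return {"pe_offset": pe_offset, "machine": machine, "sections": num_sections}


def fake_pe(machine=0x8664, sections=3):
    """Build just enough bytes to exercise parse_pe (test helper)."""
    dos = bytearray(0x80)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x80)
    return bytes(dos) + b"PE\x00\x00" + struct.pack("<HH", machine, sections) + bytes(16)


if __name__ == "__main__":
    print(parse_pe(fake_pe()))
```

The machine value 0x8664 used in the example is the code for x86-64; the indirection through the DOS header is what lets the legacy stub sit in front of the real header.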

Executable Format Comparison
Format	Platforms	Tools
ELF	Linux, BSD, Solaris, Android	`readelf`, `objdump`, `nm`
PE/COFF	Windows	`dumpbin`, PE Explorer
Mach-O	macOS, iOS	`otool`, `nm`, `MachOView`
a.out	Legacy Unix (obsolete)	Historical interest only

Analyzing Executables

The readelf and objdump commands reveal executable structure: 'readelf -h binary' shows the header, 'readelf -l binary' shows program headers (for loading), 'readelf -S binary' shows sections. Understanding these is essential for reverse engineering and security research.

Image and Media Formats

Images, audio, and video use specialized formats optimized for their data characteristics—particularly compression.

Image Format Structures:

PNG (Portable Network Graphics):

PNG Structure:
┌─────────────────────────────────────────────┐
│ Signature: 89 50 4E 47 0D 0A 1A 0A          │
├─────────────────────────────────────────────┤
│ IHDR Chunk (header)                          │
│   - Width, height, bit depth, color type    │
├─────────────────────────────────────────────┤
│ [Optional chunks: tEXt, gAMA, cHRM...]       │
├─────────────────────────────────────────────┤
│ IDAT Chunks (compressed image data)          │
│   - zlib-compressed filtered pixel data      │
├─────────────────────────────────────────────┤
│ IEND Chunk (end marker)                      │
└─────────────────────────────────────────────┘

PNG is lossless—every pixel is preserved exactly.
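Every PNG chunk has the same shape: a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over the type and data. The sketch below builds a 1x1 grayscale image in memory and walks its chunks; the helper names are illustrative.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for each chunk, verifying every CRC."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        length, chunk_type = struct.unpack_from(">I4s", png, pos)
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from(">I", png, pos + 8 + length)
        if crc != zlib.crc32(chunk_type + data):
            raise ValueError("bad CRC in chunk %r" % chunk_type)
        yield chunk_type, data
        pos += 12 + length          # length + type + data + CRC


def tiny_png() -> bytes:
    """Build a 1x1, 8-bit grayscale PNG."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    raw = b"\x00\x80"               # filter type 0, then one gray pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", zlib.compress(raw)) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    for kind, data in iter_chunks(tiny_png()):
        print(kind, len(data))
```

Because every chunk declares its own length, a reader can skip chunk types it does not understand and still find the next one.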

JPEG Structure (Lossy):

JPEG Structure:
┌─────────────────────────────────────────────┐
│ SOI Marker (FF D8)                           │
├─────────────────────────────────────────────┤
│ APP0 (JFIF metadata)                         │
├─────────────────────────────────────────────┤
│ DQT (Quantization tables)                    │
│   - These determine quality level            │
├─────────────────────────────────────────────┤
│ SOF (Start of Frame)                         │
│   - Dimensions, components                   │
├─────────────────────────────────────────────┤
│ DHT (Huffman tables)                         │
├─────────────────────────────────────────────┤
│ SOS (Start of Scan) + Compressed Data        │
│   - DCT-compressed image blocks              │
├─────────────────────────────────────────────┤
│ EOI Marker (FF D9)                           │
└─────────────────────────────────────────────┘

JPEG uses lossy compression—some information is discarded for smaller size.
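JPEG header segments each start with 0xFF and a marker byte, followed by a 2-byte big-endian length that counts itself. This sketch walks the segments up to Start of Scan over synthetic data; it ignores fill bytes and restart markers, and the helper names are illustrative.

```python
import struct


def iter_segments(jpeg: bytes):
    """Yield (marker, payload) for header segments up to Start of Scan."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError("expected a marker")
        marker = jpeg[pos + 1]
        (length,) = struct.unpack_from(">H", jpeg, pos + 2)  # includes these 2 bytes
        yield marker, jpeg[pos + 4:pos + 2 + length]
        if marker == 0xDA:          # SOS: entropy-coded data follows, stop here
            break
        pos += 2 + length


def make_segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (test helper)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


if __name__ == "__main__":
    sof = struct.pack(">BHHB", 8, 1080, 1920, 3)   # precision, height, width, components
    data = (b"\xff\xd8" + make_segment(0xE0, b"JFIF\x00") + make_segment(0xC0, sof)
            + make_segment(0xDA, b"") + b"\x12\x34" + b"\xff\xd9")
    for marker, payload in iter_segments(data):
        print(hex(marker), len(payload))
```

The Start of Frame payload stores height before width, so image dimensions can be read without decoding any pixel data.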

Image Format Comparison
Format	Compression	Best For	Limitations
PNG	Lossless	Screenshots, graphics, transparency	Large for photos
JPEG	Lossy	Photographs	Artifacts, no transparency
GIF	Lossless (limited palette)	Simple animations	256 colors max
WebP	Both modes	Web images	Less universal support
AVIF	Both modes	Modern web images	Newer, less support
TIFF	Various	Archival, professional	Large, complex

EXIF Metadata in Images

JPEG, TIFF, and other formats can contain EXIF metadata: camera settings, GPS coordinates, timestamps, and more. This is valuable for photographers but a privacy concern—always strip EXIF from images before sharing online if location privacy matters.

Database File Formats

Database files contain highly structured data optimized for random access, concurrent modification, and crash recovery.

SQLite File Format:

SQLite stores an entire database in a single file:

SQLite File Structure:
┌─────────────────────────────────────────────┐
│ Header (100 bytes)                           │
│   - Magic: "SQLite format 3\0"               │
│   - Page size (typically 4096)               │
│   - File format versions                     │
│   - Database size in pages                   │
├─────────────────────────────────────────────┤
│ Page 1: Database Schema                      │
│   - sqlite_master table                      │
│   - CREATE TABLE statements                  │
├─────────────────────────────────────────────┤
│ Pages 2-N: B-tree nodes, overflow, freelist  │
│   - Table B-trees (data storage)             │
│   - Index B-trees (for fast lookup)          │
│   - Overflow pages (for large rows)          │
│   - Free pages (deleted, reusable)           │
└─────────────────────────────────────────────┘
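The header fields above can be read from any SQLite file with plain byte access. The sketch below creates a scratch database with Python's `sqlite3` module and then reads it as raw bytes; the table name and helper names are illustrative.

```python
import os
import sqlite3
import struct
import tempfile


def read_sqlite_header(path):
    """Return (magic, page size, page count) from the 100-byte header."""
    with open(path, "rb") as f:
        header = f.read(100)
    magic = header[:16]
    (page_size,) = struct.unpack_from(">H", header, 16)   # big-endian, offset 16
    if page_size == 1:            # the value 1 encodes a 65536-byte page
        page_size = 65536
    (page_count,) = struct.unpack_from(">I", header, 28)  # size in pages, offset 28
    return magic, page_size, page_count


def demo():
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
        con.commit()
        con.close()
        return read_sqlite_header(path), os.path.getsize(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The file size is a whole number of pages, which follows from the page-based layout: the database only ever grows or shrinks one page at a time.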

Page-Based Storage:

Nearly all databases use pages (fixed-size blocks, typically 4-16 KB):

Typical Database Page:
┌─────────────────────────────────────────────┐
│ Page Header                                  │
│   - Page type (leaf, interior, overflow)    │
│   - Number of cells                          │
│   - Free space offset                        │
├─────────────────────────────────────────────┤
│ Cell Pointer Array                           │
│   [offset1][offset2][offset3]...             │
├─────────────────────────────────────────────┤
│ Free Space                                   │
│                                              │
├─────────────────────────────────────────────┤
│ Cell Content Area (grows upward)             │
│   [Cell 3 data]                              │
│   [Cell 2 data]                              │
│   [Cell 1 data]                              │
└─────────────────────────────────────────────┘

This "slotted page" design allows efficient insertion without moving all data.
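A slotted page fits in a short class. This is a simplified sketch, not any particular database's layout: the header holds only a cell count and the start of the cell content area, each slot stores an (offset, length) pair, and deletion and compaction are left out.

```python
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HH")   # cell count, offset where cell content starts
SLOT = struct.Struct("<HH")     # per-cell (offset, length)


class SlottedPage:
    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        HEADER.pack_into(self.buf, 0, 0, PAGE_SIZE)

    def _header(self):
        return HEADER.unpack_from(self.buf, 0)

    def insert(self, cell: bytes) -> int:
        """Store a cell at the back of the page and return its slot number."""
        count, content_start = self._header()
        slots_end = HEADER.size + (count + 1) * SLOT.size
        new_start = content_start - len(cell)
        if new_start < slots_end:
            raise ValueError("page full")
        self.buf[new_start:content_start] = cell   # content grows toward the front
        SLOT.pack_into(self.buf, HEADER.size + count * SLOT.size, new_start, len(cell))
        HEADER.pack_into(self.buf, 0, count + 1, new_start)
        return count

    def read(self, slot: int) -> bytes:
        count, _ = self._header()
        if not 0 <= slot < count:
            raise IndexError(slot)
        offset, length = SLOT.unpack_from(self.buf, HEADER.size + slot * SLOT.size)
        return bytes(self.buf[offset:offset + length])


if __name__ == "__main__":
    page = SlottedPage()
    page.insert(b"alpha")
    page.insert(b"bravo!")
    print(page.read(1), page._header())
```

The slot array and the cell content grow toward each other, so the free space in the middle serves both, and a cell can be moved within the page by updating one slot entry.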

Write-Ahead Logging (WAL)

Many modern databases do not apply changes to their data files immediately. Changes are first appended to a Write-Ahead Log and later checkpointed into the data files. This provides crash recovery: after a crash, committed changes are recovered by replaying the log, and log records from incomplete transactions are ignored. SQLite's WAL mode keeps this log in a separate file named after the database with a -wal suffix.
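SQLite's WAL mode can be observed from Python's standard `sqlite3` module. The sketch below assumes a local file system that supports WAL mode; the database name, table name, and function name are illustrative.

```python
import os
import shutil
import sqlite3
import tempfile


def wal_demo():
    """Switch a scratch database to WAL mode and check for the log file."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.db")
    con = sqlite3.connect(path)
    try:
        # The pragma returns the journal mode now in effect.
        mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        con.execute("CREATE TABLE t (x INTEGER)")
        con.execute("INSERT INTO t VALUES (1)")
        con.commit()
        # While the connection is open, committed changes live in the log file.
        wal_exists = os.path.exists(path + "-wal")
        return mode, wal_exists
    finally:
        con.close()
        shutil.rmtree(tmpdir, ignore_errors=True)


if __name__ == "__main__":
    print(wal_demo())
```

The log file sits next to the database file and shares its name plus a suffix, so both must be kept together when copying a database that is in use.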

Summary: Understanding File Structure

We've explored the vast landscape of file structures—from the OS view of undifferentiated bytes to the complex formats that encode our data. Let's consolidate:

Key Takeaways

• The OS sees files as unstructured byte streams: all structure is imposed by applications, not the operating system.
• Text files store human-readable characters: encoding (UTF-8) and line endings (LF vs CRLF) are critical details.
• Binary files encode data efficiently: endianness, alignment, and format specifications matter.
• Magic numbers identify file types reliably: content inspection beats filename extensions for security.
• Record-oriented files organize data systematically: fixed-length for random access, variable-length for efficiency.
• Structured text formats (JSON, XML) balance readability and structure: use them for configuration and data exchange.
• Binary serialization (Protobuf, FlatBuffers) maximizes efficiency: use it when performance matters.
• Container formats hold multiple streams: ZIP, MP4, DOCX are all containers with different contents.
• Executable formats enable program loading: ELF, PE, Mach-O define how code becomes a running process.
• Database files use page-based structures: B-trees, slotted pages, and WAL enable efficient data management.

Module Complete:

You have now completed Module 1: File Concepts. You understand what files are, their attributes, the operations performed on them, the different types of files, and how file data is structured. This foundation prepares you for the next module on Access Methods—how applications read and write files sequentially, randomly, and through indexing.

Module Complete

Congratulations! You now have comprehensive knowledge of file concepts—the fundamental abstraction for persistent storage in operating systems. From this foundation, you can understand file system implementation, debug file format issues, design efficient data storage, and appreciate the elegance of the simple 'file as bytes' model.

5 / 5

Loading learning content...

Operating SystemsFile System Concepts

File Concepts

LevelBeginner

Duration60 mins

TopicFile System Concepts

5 / 5

File Structure

Raw Bytes vs. Meaningful Data

Understanding file structure is essential for:

Parsing and generating file formats
Reverse engineering and forensics
Data recovery and corruption handling
Designing efficient data storage

What You Will Learn

The OS Perspective: Unstructured Byte Streams

Modern operating systems (Unix, Linux, Windows, macOS) treat files as unstructured sequences of bytes. The OS provides:

A linear address space: bytes numbered 0, 1, 2, ... N-1
Operations to read and write arbitrary byte ranges
No interpretation of byte contents
No record boundaries, field definitions, or data types

This is the stream model of files—the OS sees only a tape of bytes that can be read in any order.

What the OS Stores:

File "document.pdf":

[0x25][0x50][0x44][0x46][0x2D][0x31][0x2E][0x37]...  (raw bytes)
 %     P     D     F     -     1     .     7   ...   (ASCII interpretation)

The OS sees: bytes 0-7 contain 0x25, 0x50, 0x44, ...
The OS does NOT see: "This is a PDF version 1.7"

Why Unstructured?

Early operating systems (especially mainframes) had more structured file models—files composed of fixed-length records, with OS-managed fields. This seems helpful, but it created problems:

Rigidity: Applications couldn't define custom structures
Overhead: OS enforced record sizes even when not needed
Complexity: File systems needed to understand many formats
Portability: Different systems had incompatible record formats

The Unix designers made a radical choice: files are just bytes. Structure is the application's responsibility. This simplicity enabled the explosion of diverse file formats we have today.

Block Storage vs. Byte Interface

Text Files

Text files contain human-readable characters organized into lines. They're the simplest structured files, with only one structural element: the line break.

Character Encoding:

Text files store characters as numeric codes. The encoding determines which number represents which character:

Encoding	Bytes/Char	Range	Notes
ASCII	1	0-127	English letters, digits, symbols
Latin-1 (ISO-8859-1)	1	0-255	Western European languages
UTF-8	1-4	All Unicode	Variable-width, ASCII-compatible
UTF-16	2 or 4	All Unicode	Windows internal, Java strings
UTF-32	4	All Unicode	Fixed-width, rarely used in files

Line Endings—The Eternal Problem:

Different systems use different bytes to mark line ends:

System	Line Ending	Hex	Name
Unix/Linux/macOS	`
`	`0x0A`	LF (Line Feed)
Windows	`\r
`	`0x0D 0x0A`	CR+LF
Classic Mac (pre-OS X)	`\r`	`0x0D`	CR (Carriage Return)

This causes real problems:

# A Windows text file opened on Linux:
$ cat windows.txt
Line 1^M
Line 2^M

# The ^M is the CR (0x0D) that Linux displays
# Fix with: sed -i 's/\r$//' windows.txt

Text File Structure:

Line 1: H  e  l  l  o  

        0x48 0x65 0x6C 0x6C 0x6F 0x0A

Line 2: W  o  r  l  d  

        0x57 0x6F 0x72 0x6C 0x64 0x0A

(No special header, no length fields, just characters and line breaks)

Common Text Formats:

Source code: C, Python, JavaScript, etc.
Configuration: INI, YAML, TOML, .env files
Data exchange: CSV, JSON (text-based)
Markup: HTML, XML, Markdown
Logs: System and application logs

UTF-8 Is Now Standard

Binary Files

Why Binary Instead of Text?

Consideration	Text Encoding	Binary Encoding
Size of integer 1000000	7 bytes ("1000000")	4 bytes (binary int)
Precision of π	Variable ("3.14159...")	8 bytes (IEEE 754 double)
Parsing speed	Slow (string parsing)	Fast (direct memory copy)
Human readability	✓ Yes	✗ No
Text editor safe	✓ Yes	✗ No (will corrupt)
Version control	✓ Diff-friendly	✗ Binary diff only

Byte Order (Endianness):

Multi-byte values can be stored in two orders:

The integer 0x12345678 (4 bytes):

Big-endian ("network byte order"):
  Address:  0    1    2    3
  Bytes:   [12] [34] [56] [78]
           MSB              LSB

Little-endian (x86, ARM):
  Address:  0    1    2    3
  Bytes:   [78] [56] [34] [12]
           LSB              MSB

This affects how programs read binary data:

x86/x64/most ARM: Little-endian
Network protocols: Big-endian (by convention)
Java .class files: Big-endian
Most file formats: Specify one or include marker

Binary File Anatomy:

Well-designed binary formats typically include:

┌─────────────────────────────────────────────┐
│ Magic Number (format identifier)             │
│ e.g., 0x89PNG for PNG, 0x7fELF for ELF      │
├─────────────────────────────────────────────┤
│ Header (metadata, version, sizes)            │
│ - File version                               │
│ - Size fields                                │
│ - Offsets to sections                        │
├─────────────────────────────────────────────┤
│ Data Sections                                │
│ - Actual content in structured format        │
│ - May be compressed                          │
├─────────────────────────────────────────────┤
│ Optional: Index/Directory                    │
│ - Fast lookup tables                         │
├─────────────────────────────────────────────┤
│ Optional: Trailer/Footer                     │
│ - Checksums, end marker                      │
└─────────────────────────────────────────────┘

Never Open Binary Files in Text Mode

Magic Numbers and File Identification

Common Magic Numbers:

Format	Magic Bytes (Hex)	ASCII	Offset
PDF	`25 50 44 46`	`%PDF`	0
PNG	`89 50 4E 47 0D 0A 1A 0A`	`\x89PNG\r

The file Command:

Unix's file command uses a database of magic numbers (typically /usr/share/misc/magic) to identify file types:

$ file mystery
mystery: ELF 64-bit LSB executable, x86-64

$ file document.pdf
document.pdf: PDF document, version 1.7

$ file image.png
image.png: PNG image data, 1920 x 1080, 8-bit/color RGBA

$ file renamed.xyz
renamed.xyz: JPEG image data, JFIF standard 1.01
# Extension is .xyz but content is JPEG!

The file command examines content, not extensions—far more reliable for security.

Programmatic Magic Number Checking:

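#include <stdio.h>
#include <stdint.h>
#include <string.h>  // memcmp

// Returns 1 if the file starts with the 8-byte PNG signature, 0 otherwise.
int is_png(const char *filepath) {
    FILE *f = fopen(filepath, "rb");  // binary mode: no newline translation
    if (!f) return 0;

    uint8_t header[8];
    if (fread(header, 1, 8, f) != 8) {
        fclose(f);
        return 0;  // file is shorter than the signature
    }
    fclose(f);

    // PNG magic: 89 50 4E 47 0D 0A 1A 0A
    static const uint8_t png_magic[] = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
    return memcmp(header, png_magic, 8) == 0;
}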
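
The same check extends to several formats at once by driving it from a table, which is roughly what a magic database does on a much larger scale. This is a minimal sketch: `magic_entry` and `identify_format` are hypothetical names, the table holds only a handful of well-known signatures, and every signature is assumed to sit at offset 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry per format: the signature bytes expected at offset 0. */
struct magic_entry {
    const char    *name;
    const uint8_t *bytes;
    size_t         len;
};

static const uint8_t MAGIC_PDF[]  = {0x25, 0x50, 0x44, 0x46};   /* %PDF */
static const uint8_t MAGIC_PNG[]  = {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
static const uint8_t MAGIC_ELF[]  = {0x7F, 0x45, 0x4C, 0x46};   /* \x7F E L F */
static const uint8_t MAGIC_JPEG[] = {0xFF, 0xD8, 0xFF};         /* SOI + next marker */
static const uint8_t MAGIC_ZIP[]  = {0x50, 0x4B, 0x03, 0x04};   /* PK\3\4 local header */

static const struct magic_entry MAGIC_TABLE[] = {
    {"PDF",  MAGIC_PDF,  sizeof MAGIC_PDF},
    {"PNG",  MAGIC_PNG,  sizeof MAGIC_PNG},
    {"ELF",  MAGIC_ELF,  sizeof MAGIC_ELF},
    {"JPEG", MAGIC_JPEG, sizeof MAGIC_JPEG},
    {"ZIP",  MAGIC_ZIP,  sizeof MAGIC_ZIP},
};

/* Return the name of the first format whose signature prefixes buf,
 * or "unknown" if none matches or the buffer is too short. */
const char *identify_format(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < sizeof MAGIC_TABLE / sizeof MAGIC_TABLE[0]; i++) {
        const struct magic_entry *m = &MAGIC_TABLE[i];
        if (len >= m->len && memcmp(buf, m->bytes, m->len) == 0)
            return m->name;
    }
    return "unknown";
}
```

A caller would read the first few bytes of a file in binary mode and pass them in. A ZIP match only says the file is a ZIP container; telling a DOCX from a JAR requires looking at the entries inside.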

Magic at Non-Zero Offsets

Record-Oriented Files

Some applications organize file data into records—discrete units of related data. While the OS sees only bytes, the application imposes record structure on top.

Fixed-Length Records:

All records are the same size—simple to navigate but potentially wasteful:

Record 0:  [Name: 30 bytes    ][Age: 4 bytes][Salary: 8 bytes]
           Offset 0           Offset 30      Offset 34

Record 1:  [Name: 30 bytes    ][Age: 4 bytes][Salary: 8 bytes]
           Offset 42          Offset 72      Offset 76

To read record N: seek to offset (N × 42)
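
The offset arithmetic can be sketched in C as follows. Several details are assumptions made for illustration: the `employee` struct and function names are hypothetical, the age and salary are stored as little-endian integers, and the name is zero-padded to 30 bytes. Fields are serialized byte by byte so that compiler padding in the struct cannot change the 42-byte on-disk size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN    30
#define RECORD_SIZE 42   /* 30 (name) + 4 (age) + 8 (salary) */

struct employee {
    char    name[NAME_LEN + 1];   /* NUL-terminated in memory only */
    int32_t age;
    int64_t salary;
};

/* Write record n at byte offset n * RECORD_SIZE. Returns 0 on success. */
int write_record(FILE *f, long n, const struct employee *e) {
    uint8_t buf[RECORD_SIZE] = {0};
    assert(f != NULL && e != NULL && n >= 0);
    strncpy((char *)buf, e->name, NAME_LEN);           /* zero-padded name */
    for (int i = 0; i < 4; i++)                        /* age at offset 30 */
        buf[30 + i] = (uint8_t)((uint32_t)e->age >> (8 * i));
    for (int i = 0; i < 8; i++)                        /* salary at offset 34 */
        buf[34 + i] = (uint8_t)((uint64_t)e->salary >> (8 * i));
    if (fseek(f, n * RECORD_SIZE, SEEK_SET) != 0) return -1;
    return fwrite(buf, 1, RECORD_SIZE, f) == RECORD_SIZE ? 0 : -1;
}

/* Read record n directly, without touching records 0..n-1. */
int read_record(FILE *f, long n, struct employee *e) {
    uint8_t  buf[RECORD_SIZE];
    uint32_t age = 0;
    uint64_t salary = 0;
    assert(f != NULL && e != NULL && n >= 0);
    if (fseek(f, n * RECORD_SIZE, SEEK_SET) != 0) return -1;
    if (fread(buf, 1, RECORD_SIZE, f) != RECORD_SIZE) return -1;
    memcpy(e->name, buf, NAME_LEN);
    e->name[NAME_LEN] = '\0';
    for (int i = 0; i < 4; i++) age    |= (uint32_t)buf[30 + i] << (8 * i);
    for (int i = 0; i < 8; i++) salary |= (uint64_t)buf[34 + i] << (8 * i);
    e->age    = (int32_t)age;
    e->salary = (int64_t)salary;
    return 0;
}
```

The file must be opened in binary mode (for example `"r+b"` or `"w+b"`) so that offsets correspond exactly to byte positions.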

Advantages:

Direct access to any record: O(1) seek
Simple implementation
Easy update in place

Disadvantages:

Wasted space for short data
Cannot store data exceeding field size
Schema changes require restructuring entire file

Variable-Length Records:

Records vary in size—more space-efficient but harder to navigate:

[Length][Record 0 data...][Length][Record 1 data......][Length][Record 2]
   ↓                          ↓
   Describes next record size  Must read sequentially (or build index)

Common Approaches:

Strategy	Description	Example
Length prefix	Each record starts with size	Protocol buffers
Delimiter	Special byte marks record end	CSV (newline)
Index/directory	Separate table of offsets	ZIP central directory
Chunked	Fixed chunks, records span chunks	Database pages

Example: Length-Prefixed Records

[05 00] [H  e  l  l  o  ] [08 00] [W  o  r  l  d  !  !  !  ]
  └─ 5 bytes of data─┘      └─ 8 bytes of data──────────┘

Each length prefix here is a 16-bit little-endian integer.
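
A reader for this layout has to walk the records in order, since each length tells it where the next record begins. The following sketch assumes a two-byte little-endian length prefix, as the two-byte prefixes in the example suggest; `for_each_record` and `record_fn` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback invoked once per record with a pointer into the buffer. */
typedef void (*record_fn)(const uint8_t *data, uint16_t len, void *ctx);

/*
 * Walk records of the form [len_lo][len_hi][len bytes of data].
 * Returns the number of complete records, or -1 if the buffer
 * ends in the middle of a length prefix or a payload.
 */
int for_each_record(const uint8_t *buf, size_t size, record_fn fn, void *ctx) {
    size_t pos = 0;
    int count = 0;
    assert(buf != NULL || size == 0);
    while (pos < size) {
        if (size - pos < 2) return -1;                 /* truncated length */
        uint16_t len = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
        pos += 2;
        if (size - pos < len) return -1;               /* truncated payload */
        if (fn != NULL) fn(buf + pos, len, ctx);
        pos += len;                                    /* jump to next prefix */
        count++;
    }
    return count;
}
```

Checking the remaining size before every read matters: a corrupted length field would otherwise send the reader past the end of the buffer.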

Databases Use Sophisticated Record Structures

Structured Text Formats

Structured text formats are text files with defined syntax for representing complex data. They combine human readability with machine parseability.

JSON (JavaScript Object Notation):

{
  "name": "Alice",
  "age": 30,
  "active": true,
  "roles": ["admin", "developer"],
  "address": {
    "city": "Seattle",
    "zip": "98101"
  }
}

Characteristics:

Human-readable
Maps directly to programming language data structures
No schema required (self-describing)
Universal support across languages
Not binary-efficient (integers stored as text)

XML (eXtensible Markup Language):

<?xml version="1.0" encoding="UTF-8"?>
<person>
    <name>Alice</name>
    <age>30</age>
    <roles>
        <role>admin</role>
        <role>developer</role>
    </roles>
</person>

CSV (Comma-Separated Values):

name,age,city
Alice,30,Seattle
Bob,25,Portland

YAML (YAML Ain't Markup Language):

name: Alice
age: 30
roles:
  - admin
  - developer

Structured Text Format Comparison
Format	Best For	Weaknesses
JSON	APIs, config, data exchange	No comments, limited types
XML	Documents, complex schemas	Verbose, complex parsing
CSV	Tabular data, spreadsheets	No nesting, escaping issues
YAML	Configuration files	Whitespace sensitivity, security
TOML	Configuration files	Less expressive for complex data

YAML Security Concerns

Binary Serialization Formats

For efficiency-critical applications, binary serialization formats provide compact encoding with fast parsing.

Protocol Buffers (Protobuf) - Google:

Schema definition:

message Person {
  string name = 1;
  int32 age = 2;
  repeated string roles = 3;
}

The binary encoding uses field numbers (1, 2, 3) rather than names, saving space. A Person with name="Alice", age=30 encodes to 9 bytes (0A 05 41 6C 69 63 65 10 1E), compared with 25 bytes for the equivalent compact JSON {"name":"Alice","age":30}.
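
To make the wire format concrete, the sketch below encodes the two scalar fields of Person by hand instead of using generated Protobuf code. `put_varint` and `encode_person` are hypothetical helper names, and the age is taken as unsigned for simplicity (a negative int32 would occupy 10 bytes on the wire). For name="Alice" and age=30 it produces 9 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append v as a base-128 varint: 7 payload bits per byte,
 * high bit set on every byte except the last. Returns bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/*
 * Encode Person{name, age}. Each field starts with a tag byte:
 * (field_number << 3) | wire_type, where wire type 0 is a varint
 * and wire type 2 is length-delimited data.
 * out must hold at least strlen(name) + 17 bytes.
 */
size_t encode_person(uint8_t *out, const char *name, uint32_t age) {
    size_t n = 0, name_len;
    assert(out != NULL && name != NULL);
    name_len = strlen(name);
    n += put_varint(out + n, (1u << 3) | 2);   /* field 1, length-delimited */
    n += put_varint(out + n, name_len);
    memcpy(out + n, name, name_len);
    n += name_len;
    n += put_varint(out + n, (2u << 3) | 0);   /* field 2, varint */
    n += put_varint(out + n, age);
    return n;
}
```

Field names never appear in the output, which is why a reader needs the schema to interpret the bytes.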

MessagePack:

Binary JSON-like format—no schema needed:

JSON (25 bytes):  {"name":"Alice","age":30}
MsgPack (17 bytes): [0x82, 0xA4, n,a,m,e, 0xA5, A,l,i,c,e, 0xA3, a,g,e, 0x1E]

Binary Format Comparison
Format	Schema	Size	Speed	Use Case
Protobuf	Required	Smallest	Fast (parse step)	gRPC, internal services
FlatBuffers	Required	Small	Zero-copy	Games, mobile apps
Cap'n Proto	Required	Small	Zero-copy	High-performance IPC
MessagePack	None	Medium	Fast	JSON replacement
CBOR	None	Medium	Fast	IoT, web tokens
BSON	None	Larger	Fast	MongoDB documents

Zero-Copy Formats:

Formats like FlatBuffers and Cap'n Proto are designed for zero-copy access—you can read data directly from the file buffer without parsing into intermediate structures:

// Traditional (e.g. Protobuf): parse the entire message into an object first
Person person;
person.ParseFromArray(buffer, size);  // decodes and copies every field
printf("Name: %s\n", person.name().c_str());

// Zero-copy: access data in place
auto p = flatbuffers::GetRoot<Person>(buffer);
printf("Name: %s\n", p->name()->c_str());  // Points into buffer!

This is crucial for performance in games, embedded systems, and high-frequency trading.

Choose Format Based on Requirements

Container Formats

Container formats encapsulate multiple data streams or files within a single file. They provide organization while the actual data uses format-specific encodings.

ZIP - The Universal Container:

ZIP File Structure:
┌─────────────────────────────────────────────┐
│ [Local File Header 1] [File 1 Data]          │
│ [Local File Header 2] [File 2 Data]          │
│ ...                                          │
│ [Central Directory]                          │
│   - List of all files with offsets          │
│   - File attributes, sizes, CRCs            │
│ [End of Central Directory Record]            │
└─────────────────────────────────────────────┘
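
Because the central directory sits at the end, a ZIP reader starts from the back of the file. The sketch below scans backward for the End of Central Directory record (signature PK\5\6) in a buffer that holds the whole file or at least its tail. `zip_eocd` and `zip_find_eocd` are hypothetical names, and ZIP64 and multi-disk archives are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct zip_eocd {
    uint16_t total_entries;   /* number of files in the archive */
    uint32_t cd_size;         /* central directory size in bytes */
    uint32_t cd_offset;       /* file offset of the central directory */
};

static uint16_t le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * The record is 22 bytes plus an optional trailing comment, so its
 * position is not fixed. A candidate is accepted only when its comment
 * length field accounts for every byte that follows it.
 * Returns 0 and fills *out on success, -1 if no record is found.
 */
int zip_find_eocd(const uint8_t *buf, size_t size, struct zip_eocd *out) {
    assert(buf != NULL && out != NULL);
    if (size < 22) return -1;
    for (size_t i = size - 22 + 1; i-- > 0; ) {
        if (buf[i] == 'P' && buf[i + 1] == 'K' &&
            buf[i + 2] == 5 && buf[i + 3] == 6 &&
            (size_t)le16(buf + i + 20) == size - i - 22) {
            out->total_entries = le16(buf + i + 10);
            out->cd_size       = le32(buf + i + 12);
            out->cd_offset     = le32(buf + i + 16);
            return 0;
        }
    }
    return -1;
}
```

With `cd_offset` and `cd_size` in hand, a reader can jump straight to the file list without scanning any of the compressed data.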

Fun fact: DOCX, XLSX, JAR, APK, and EPUB are all ZIP files with specific internal structures!

Media Containers:

Video files are containers holding multiple streams:

Container	Extension	Typical Contents
MP4/M4V	`.mp4`, `.m4v`	H.264/H.265 video + AAC audio
MKV (Matroska)	`.mkv`	Any codec + subtitles + chapters
AVI	`.avi`	Legacy video + audio streams
WebM	`.webm`	VP8/VP9 video + Vorbis/Opus audio

MP4 Container Structure:
┌──────────────────────────────────────────────┐
│ ftyp box (file type)                          │
│ moov box (movie metadata)                     │
│   ├── trak box (track 1 - video)             │
│   └── trak box (track 2 - audio)             │
│ mdat box (actual media data)                  │
└──────────────────────────────────────────────┘
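
Every box in this layout begins with a 4-byte big-endian size (which includes the 8-byte header) followed by a 4-byte type code, so listing the top-level boxes takes only a short loop. This is a sketch under simplifying assumptions: `mp4_list_boxes` is a hypothetical name, the whole file is in memory, and boxes that use the 64-bit size extension (size field equal to 1) are rejected rather than handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Copy the four-character type of each top-level box into types[],
 * stopping after max boxes. Returns the number of boxes found,
 * or -1 on malformed input.
 */
int mp4_list_boxes(const uint8_t *buf, size_t size, char types[][5], int max) {
    size_t pos = 0;
    int count = 0;
    assert(buf != NULL && types != NULL);
    while (pos < size && count < max) {
        if (size - pos < 8) return -1;                  /* no room for a header */
        size_t box_size = be32(buf + pos);
        if (box_size == 0) box_size = size - pos;       /* box runs to end of file */
        if (box_size < 8 || box_size > size - pos) return -1;
        memcpy(types[count], buf + pos + 4, 4);
        types[count][4] = '\0';
        count++;
        pos += box_size;                                /* skip the box body */
    }
    return count;
}
```

Descending into `moov` to find the `trak` boxes works the same way, applied to the body of the parent box.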

Document Containers:

Format	Actually Is	Contents
DOCX	ZIP	XML documents + media
PDF	Custom	Objects, streams, dictionaries
ODF (.odt)	ZIP	XML + resources
EPUB	ZIP	XHTML chapters + metadata

Exploring Containers:

# DOCX is just ZIP
$ unzip -l document.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1234  2024-01-15 10:00   [Content_Types].xml
     5678  2024-01-15 10:00   word/document.xml
     2345  2024-01-15 10:00   word/styles.xml
...

Containers Enable Flexibility

Executable Formats

Executable files have highly structured formats that allow the OS loader to set up a process for execution. They contain code, data, and metadata about memory layout and dynamic linking.

ELF (Executable and Linkable Format) - Unix/Linux:

ELF File Structure:
┌─────────────────────────────────────────────┐
│ ELF Header (magic: 0x7F 'E' 'L' 'F')         │
│   - Machine type (x86-64, ARM, etc.)        │
│   - Entry point address                      │
│   - Program header offset                    │
│   - Section header offset                    │
├─────────────────────────────────────────────┤
│ Program Headers (for loading)                │
│   - Segment addresses and sizes              │
│   - Memory permissions (RX, RW, R)          │
├─────────────────────────────────────────────┤
│ Sections                                     │
│   .text   - Executable code                  │
│   .data   - Initialized data                 │
│   .bss    - Uninitialized data               │
│   .rodata - Read-only data (strings)         │
│   .dynamic- Dynamic linking info             │
│   .symtab - Symbol table                     │
├─────────────────────────────────────────────┤
│ Section Headers (for linking/debugging)      │
└─────────────────────────────────────────────┘
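
The first bytes of the ELF header are enough to tell a loader or analysis tool what kind of file it is looking at. The sketch below reads the identification bytes and two header fields from a buffer; `elf_info` and `elf_identify` are hypothetical names, and only a few fields are decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct elf_info {
    int      bits;         /* 32 or 64 */
    int      big_endian;   /* 0 = LSB, 1 = MSB */
    uint16_t type;         /* e_type: 2 = executable, 3 = shared object or PIE */
    uint16_t machine;      /* e_machine: e.g. 0x3E = x86-64, 0xB7 = AArch64 */
};

/* Returns 0 and fills *out if buf starts with a plausible ELF header. */
int elf_identify(const uint8_t *buf, size_t size, struct elf_info *out) {
    assert(buf != NULL && out != NULL);
    /* The literal is split so that "E" is not read as part of the hex escape. */
    if (size < 20 || memcmp(buf, "\x7f" "ELF", 4) != 0) return -1;
    if (buf[4] != 1 && buf[4] != 2) return -1;          /* EI_CLASS */
    if (buf[5] != 1 && buf[5] != 2) return -1;          /* EI_DATA */
    out->bits       = (buf[4] == 1) ? 32 : 64;
    out->big_endian = (buf[5] == 2);
    /* e_type (offset 16) and e_machine (offset 18) use the file's byte order. */
    if (out->big_endian) {
        out->type    = (uint16_t)((buf[16] << 8) | buf[17]);
        out->machine = (uint16_t)((buf[18] << 8) | buf[19]);
    } else {
        out->type    = (uint16_t)(buf[16] | (buf[17] << 8));
        out->machine = (uint16_t)(buf[18] | (buf[19] << 8));
    }
    return 0;
}
```

ELF is one of the formats that carries its own byte-order marker: the EI_DATA byte tells the reader how to interpret every multi-byte field that follows.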

PE (Portable Executable) - Windows:

PE File Structure:
┌─────────────────────────────────────────────┐
│ DOS Header (magic: 'MZ')                     │
│   - Offset to PE header                      │
├─────────────────────────────────────────────┤
│ DOS Stub (legacy: "This program cannot...")  │
├─────────────────────────────────────────────┤
│ PE Header (magic: 'PE\0\0')                  │
│   - Machine type                             │
│   - Number of sections                       │
│   - Optional header (image size, entry)      │
├─────────────────────────────────────────────┤
│ Section Headers                              │
├─────────────────────────────────────────────┤
│ Sections                                     │
│   .text   - Code                             │
│   .data   - Data                             │
│   .rdata  - Read-only data, imports          │
│   .rsrc   - Resources (icons, dialogs)       │
└─────────────────────────────────────────────┘
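
The two-stage lookup in this layout (DOS header first, then the PE header it points to) can be sketched as follows. `pe_info` and `pe_identify` are hypothetical names; the sketch stops at the first two fields after the PE signature and does not parse the optional header or sections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pe_info {
    uint16_t machine;        /* e.g. 0x8664 = x64, 0x014C = x86 */
    uint16_t num_sections;
};

static uint16_t le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and fills *out if buf holds a DOS header that leads to a PE header. */
int pe_identify(const uint8_t *buf, size_t size, struct pe_info *out) {
    assert(buf != NULL && out != NULL);
    if (size < 0x40 || buf[0] != 'M' || buf[1] != 'Z') return -1;
    uint32_t pe_off = le32(buf + 0x3C);                 /* offset to PE header */
    if (pe_off > size || size - pe_off < 8) return -1;  /* must fit in buffer */
    if (memcmp(buf + pe_off, "PE\0\0", 4) != 0) return -1;
    out->machine      = le16(buf + pe_off + 4);
    out->num_sections = le16(buf + pe_off + 6);
    return 0;
}
```

The bounds check on the stored offset is essential, since that value comes straight from the file and cannot be trusted.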

Executable Format Comparison
Format	Platforms	Tools
ELF	Linux, BSD, Solaris, Android	`readelf`, `objdump`, `nm`
PE/COFF	Windows	`dumpbin`, PE Explorer
Mach-O	macOS, iOS	`otool`, `nm`, `MachOView`
a.out	Legacy Unix (obsolete)	Historical interest only

Analyzing Executables

Image and Media Formats

Images, audio, and video use specialized formats optimized for their data characteristics—particularly compression.

Image Format Structures:

PNG (Portable Network Graphics):

PNG Structure:
┌─────────────────────────────────────────────┐
│ Signature: 89 50 4E 47 0D 0A 1A 0A          │
├─────────────────────────────────────────────┤
│ IHDR Chunk (header)                          │
│   - Width, height, bit depth, color type    │
├─────────────────────────────────────────────┤
│ [Optional chunks: tEXt, gAMA, cHRM...]       │
├─────────────────────────────────────────────┤
│ IDAT Chunks (compressed image data)          │
│   - zlib-compressed filtered pixel data      │
├─────────────────────────────────────────────┤
│ IEND Chunk (end marker)                      │
└─────────────────────────────────────────────┘

PNG is lossless—every pixel is preserved exactly.
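
Since IHDR must be the first chunk, an image's dimensions sit at fixed offsets right after the signature. The sketch below reads them from a buffer holding the first 26 bytes of a file; `png_info` and `png_read_ihdr` are hypothetical names, and the chunk CRC is not verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct png_info {
    uint32_t width, height;
    uint8_t  bit_depth;
    uint8_t  color_type;    /* e.g. 2 = RGB, 6 = RGBA */
};

static uint32_t be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 and fills *out if buf starts with a PNG signature and IHDR chunk. */
int png_read_ihdr(const uint8_t *buf, size_t size, struct png_info *out) {
    static const uint8_t sig[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    assert(buf != NULL && out != NULL);
    if (size < 26 || memcmp(buf, sig, 8) != 0) return -1;
    /* A chunk is: 4-byte length, 4-byte type, data, 4-byte CRC.
     * IHDR data is always 13 bytes long. */
    if (be32(buf + 8) != 13 || memcmp(buf + 12, "IHDR", 4) != 0) return -1;
    out->width      = be32(buf + 16);
    out->height     = be32(buf + 20);
    out->bit_depth  = buf[24];
    out->color_type = buf[25];
    return 0;
}
```

All multi-byte integers in PNG are big-endian, which is why the helper assembles them most-significant byte first.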

JPEG Structure (Lossy):

JPEG Structure:
┌─────────────────────────────────────────────┐
│ SOI Marker (FF D8)                           │
├─────────────────────────────────────────────┤
│ APP0 (JFIF metadata)                         │
├─────────────────────────────────────────────┤
│ DQT (Quantization tables)                    │
│   - These determine quality level            │
├─────────────────────────────────────────────┤
│ SOF (Start of Frame)                         │
│   - Dimensions, components                   │
├─────────────────────────────────────────────┤
│ DHT (Huffman tables)                         │
├─────────────────────────────────────────────┤
│ SOS (Start of Scan) + Compressed Data        │
│   - DCT-compressed image blocks              │
├─────────────────────────────────────────────┤
│ EOI Marker (FF D9)                           │
└─────────────────────────────────────────────┘

JPEG uses lossy compression—some information is discarded for smaller size.
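
Unlike PNG, JPEG does not keep its dimensions at a fixed offset; a reader has to hop from marker to marker until it reaches the Start of Frame segment. The sketch below does that for a buffer in memory. `jpeg_dimensions` is a hypothetical name, only the baseline (FF C0) and progressive (FF C2) frame markers are recognized, and padding bytes between segments are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Each segment after SOI is: FF, marker byte, then a 2-byte big-endian
 * length that counts itself plus the payload. The SOF payload is
 * precision (1 byte), height (2 bytes), width (2 bytes), ...
 * Returns 0 on success, -1 if no frame header is found.
 */
int jpeg_dimensions(const uint8_t *buf, size_t size,
                    uint16_t *width, uint16_t *height) {
    size_t pos = 2;
    assert(buf != NULL && width != NULL && height != NULL);
    if (size < 4 || buf[0] != 0xFF || buf[1] != 0xD8) return -1;   /* SOI */
    while (size - pos >= 4) {
        if (buf[pos] != 0xFF) return -1;
        uint8_t marker = buf[pos + 1];
        size_t seg_len = ((size_t)buf[pos + 2] << 8) | buf[pos + 3];
        if (seg_len < 2 || seg_len > size - pos - 2) return -1;
        if (marker == 0xC0 || marker == 0xC2) {
            if (seg_len < 7) return -1;
            *height = (uint16_t)((buf[pos + 5] << 8) | buf[pos + 6]);
            *width  = (uint16_t)((buf[pos + 7] << 8) | buf[pos + 8]);
            return 0;
        }
        if (marker == 0xDA) return -1;   /* SOS reached without a frame header */
        pos += 2 + seg_len;              /* skip marker and segment */
    }
    return -1;
}
```

Note that the height is stored before the width, the reverse of the order in PNG's IHDR.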

Image Format Comparison
Format	Compression	Best For	Limitations
PNG	Lossless	Screenshots, graphics, transparency	Large for photos
JPEG	Lossy	Photographs	Artifacts, no transparency
GIF	Lossless (limited palette)	Simple animations	256 colors max
WebP	Both modes	Web images	Less universal support
AVIF	Both modes	Modern web images	Newer, less support
TIFF	Various	Archival, professional	Large, complex

EXIF Metadata in Images

Database File Formats

Database files contain highly structured data optimized for random access, concurrent modification, and crash recovery.

SQLite File Format:

SQLite stores an entire database in a single file:

SQLite File Structure:
┌─────────────────────────────────────────────┐
│ Header (100 bytes)                           │
│   - Magic: "SQLite format 3\0"               │
│   - Page size (typically 4096)               │
│   - File format versions                     │
│   - Database size in pages                   │
├─────────────────────────────────────────────┤
│ Page 1: Database Schema                      │
│   - sqlite_master table                      │
│   - CREATE TABLE statements                  │
├─────────────────────────────────────────────┤
│ Pages 2-N: B-tree nodes, overflow, freelist  │
│   - Table B-trees (data storage)             │
│   - Index B-trees (for fast lookup)          │
│   - Overflow pages (for large rows)          │
│   - Free pages (deleted, reusable)           │
└─────────────────────────────────────────────┘
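
The header fields shown above live at fixed offsets, so they can be read without any SQLite library. The following sketch checks the magic string and decodes the page size and page count from the first 100 bytes of a file. `sqlite_header` and `sqlite_read_header` are hypothetical names, and the page count field may be stale in files last written by old SQLite versions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sqlite_header {
    uint32_t page_size;    /* bytes per page */
    uint32_t page_count;   /* database size in pages, as recorded in the header */
};

/* Returns 0 and fills *out if buf starts with a SQLite 3 database header. */
int sqlite_read_header(const uint8_t *buf, size_t size, struct sqlite_header *out) {
    assert(buf != NULL && out != NULL);
    /* The magic is 15 characters plus a terminating NUL: 16 bytes in total. */
    if (size < 100 || memcmp(buf, "SQLite format 3", 16) != 0) return -1;
    /* Page size: 2-byte big-endian at offset 16; the value 1 means 65536. */
    uint32_t ps = ((uint32_t)buf[16] << 8) | buf[17];
    out->page_size = (ps == 1) ? 65536u : ps;
    /* Database size in pages: 4-byte big-endian at offset 28. */
    out->page_count = ((uint32_t)buf[28] << 24) | ((uint32_t)buf[29] << 16) |
                      ((uint32_t)buf[30] << 8)  |  (uint32_t)buf[31];
    return 0;
}
```

Multiplying the two fields gives the expected file size, a quick sanity check when inspecting a possibly truncated database.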

Page-Based Storage:

Nearly all databases use pages (fixed-size blocks, typically 4-16 KB):

Typical Database Page:
┌─────────────────────────────────────────────┐
│ Page Header                                  │
│   - Page type (leaf, interior, overflow)    │
│   - Number of cells                          │
│   - Free space offset                        │
├─────────────────────────────────────────────┤
│ Cell Pointer Array                           │
│   [offset1][offset2][offset3]...             │
├─────────────────────────────────────────────┤
│ Free Space                                   │
│                                              │
├─────────────────────────────────────────────┤
│ Cell Content Area (grows upward)             │
│   [Cell 3 data]                              │
│   [Cell 2 data]                              │
│   [Cell 1 data]                              │
└─────────────────────────────────────────────┘

This "slotted page" design allows efficient insertion without moving all data.
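
A toy version of the slotted page shows why insertion is cheap: the pointer array and the cell data grow toward each other through the shared free space, so adding a cell touches only a few bytes. Everything here is an illustrative assumption rather than any real database's layout: a 4-byte header, 2-byte little-endian slot entries, and the names `page_init`, `page_insert`, and `page_cell`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_PAGE_SIZE 4096
#define HEADER_SIZE    4      /* num_cells (2 bytes) + content_start (2 bytes) */

static uint16_t get16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static void put16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)v; p[1] = (uint8_t)(v >> 8); }

/* Layout: [num_cells][content_start][slot 0][slot 1]... free ...[cell 1][cell 0] */
void page_init(uint8_t *page) {
    assert(page != NULL);
    put16(page, 0);                     /* no cells yet */
    put16(page + 2, SLOT_PAGE_SIZE);    /* content area is empty */
}

/* Insert a cell; returns its slot index, or -1 if it does not fit. */
int page_insert(uint8_t *page, const void *cell, uint16_t len) {
    assert(page != NULL && cell != NULL);
    uint16_t n = get16(page), start = get16(page + 2);
    size_t slots_end = HEADER_SIZE + 2u * n;
    if ((size_t)start - slots_end < (size_t)len + 2) return -1;  /* data + slot */
    start = (uint16_t)(start - len);
    memcpy(page + start, cell, len);        /* cell data grows toward the header */
    put16(page + slots_end, start);         /* slot array grows toward the data */
    put16(page, (uint16_t)(n + 1));
    put16(page + 2, start);
    return n;
}

/* Look up a cell by slot index; returns a pointer into the page or NULL. */
const uint8_t *page_cell(const uint8_t *page, int slot) {
    assert(page != NULL);
    if (slot < 0 || slot >= get16(page)) return NULL;
    return page + get16(page + HEADER_SIZE + 2 * slot);
}
```

This sketch does not store cell lengths, so cells are assumed to be self-describing (the test data would typically be NUL-terminated strings). Real pages also track deleted cells so that their space can be reclaimed.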

Write-Ahead Logging (WAL)

Summary: Understanding File Structure

We've explored the vast landscape of file structures—from the OS view of undifferentiated bytes to the complex formats that encode our data. Let's consolidate:

Key Takeaways

• The OS sees files as unstructured byte streams: all structure is imposed by applications, not the operating system.
• Text files store human-readable characters: encoding (UTF-8) and line endings (LF vs CRLF) are critical details.
• Binary files encode data efficiently: endianness, alignment, and format specifications matter.
• Magic numbers identify file types reliably: content inspection beats filename extensions for security.
• Record-oriented files organize data systematically: fixed-length for random access, variable-length for efficiency.
• Structured text formats (JSON, XML) balance readability and structure: use for configuration and data exchange.
• Binary serialization (Protobuf, FlatBuffers) maximizes efficiency: use when performance matters.
• Container formats hold multiple streams: ZIP, MP4, DOCX are all containers with different contents.
• Executable formats enable program loading: ELF, PE, Mach-O define how code becomes a running process.
• Database files use page-based structures: B-trees, slotted pages, and WAL enable efficient data management.
