When you store a document in a database, what actually happens on disk? The document you write—perhaps a JSON object in your application code—undergoes a transformation before it's persisted. This transformation balances human readability against machine efficiency, query performance against storage density.
Understanding this physical layer is essential for document database practitioners. It explains why some operations are fast and others slow, why certain data types exist, and how to optimize your schema for storage efficiency.
In this page, we'll trace the journey from JSON (the ubiquitous text format) to BSON (Binary JSON, MongoDB's internal format), examining the engineering decisions that make document databases performant.
By the end of this page, you will understand: JSON's role as the universal document interchange format; BSON's binary structure and type system; how documents are serialized, stored, and deserialized; performance implications of different data types and document structures; storage optimization techniques; and how other document databases approach the same problems.
JavaScript Object Notation (JSON) has become the lingua franca of data interchange. Its success stems from a perfect balance of simplicity, expressiveness, and human readability.
JSON supports six data types, falling into two categories:
Primitive Types:
"Hello, World!"42, -3.14, 1.2e10true or falsenullComposite Types:
{"name": "John", "age": 30}[1, 2, 3], ["a", "b", "c"]{
"user": {
"id": "usr_12345",
"profile": {
"name": "Alice Chen",
"verified": true,
"followers": 1523,
"bio": null
},
"tags": ["developer", "speaker", "author"]
}
}
Why JSON dominates:
• Human readable and writable with any text editor
• A minimal grammar that is easy to parse and generate
• Native to JavaScript, with parsers available in virtually every language
• The de facto interchange format for web APIs and configuration
While excellent for interchange, JSON has significant limitations for database storage:
• No Date type — dates must be encoded as strings or numbers
• No Binary type — binary data requires Base64 encoding (33% overhead)
• No Integer vs. Float distinction — all numbers are IEEE 754 doubles
• No ObjectId/UUID — unique identifiers are stored as strings
• Parsing overhead — text must be parsed on every read
• Format overhead — field names are repeated in every document
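To make the Base64 cost concrete, here is a minimal Python sketch (illustrative only) of the roughly 33% inflation that binary data suffers inside JSON:

```python
import base64

payload = bytes(300)                 # 300 bytes of raw binary data
encoded = base64.b64encode(payload)  # what a JSON document would have to carry instead

print(len(payload), len(encoded))    # 300 vs 400: roughly 33% larger before compression
```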
BSON (Binary JSON) is MongoDB's binary-encoded serialization format. It was designed to overcome JSON's limitations for database storage while maintaining JSON's document model semantics.
Every BSON document follows this structure:
┌─────────────────────────────────────────────────────────────┐
│ Document │
├─────────┬───────────────────────────────────────────────────┤
│ 4 bytes │ Total document size (including this header) │
├─────────┼───────────────────────────────────────────────────┤
│ Element │ Type (1 byte) + Name (cstring) + Value │
├─────────┼───────────────────────────────────────────────────┤
│ Element │ Type (1 byte) + Name (cstring) + Value │
├─────────┼───────────────────────────────────────────────────┤
│ ... │ Additional elements │
├─────────┼───────────────────────────────────────────────────┤
│ 1 byte │ Null terminator (0x00) │
└─────────┴───────────────────────────────────────────────────┘
Key insight: The 4-byte size prefix at the document start enables O(1) document skipping. When scanning a collection, the database can jump from document to document without parsing field contents.
| Component | Size | Description |
|---|---|---|
| Type byte | 1 byte | Identifies the value type (0x01=double, 0x02=string, etc.) |
| Field name | Variable | Null-terminated C-string (UTF-8) |
| Value | Variable | Type-specific binary encoding |
Consider this JSON document:
{"hello": "world"}
In BSON, this becomes 22 bytes:
\x16\x00\x00\x00 // Document size: 22 bytes
\x02 // Type: string (0x02)
hello\x00 // Field name + null terminator
\x06\x00\x00\x00 // String length: 6 (including null)
world\x00 // String value + null terminator
\x00 // Document terminator
Compare to the JSON text representation (18 bytes). BSON is slightly larger here, but the length prefixes enable direct field access without parsing.
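You can reproduce this layout with a driver's BSON library. A minimal sketch using the bson package that ships with PyMongo (assuming PyMongo 3.9+ is installed):

```python
import struct

import bson  # ships with PyMongo

data = bson.encode({"hello": "world"})

print(len(data))      # 22
print(data.hex(" "))  # 16 00 00 00 02 68 65 6c 6c 6f 00 06 00 00 00 77 6f 72 6c 64 00 00

# The 4-byte little-endian size prefix is what lets a scanner hop from
# document to document without parsing field contents.
size = struct.unpack_from("<i", data, 0)[0]
print(size)           # 22
```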
BSON extends JSON with additional types critical for database operations. Understanding these types is essential for efficient schema design.
| Type | Code | Description | Size |
|---|---|---|---|
| Double | 0x01 | 64-bit IEEE 754 floating point | 8 bytes |
| String | 0x02 | UTF-8 string with length prefix | 4 + len + 1 |
| Document | 0x03 | Embedded BSON document | Variable |
| Array | 0x04 | BSON document with numeric keys | Variable |
| Binary | 0x05 | Binary data with subtype byte | 5 + len |
| ObjectId | 0x07 | 12-byte unique identifier | 12 bytes |
| Boolean | 0x08 | Single byte boolean (0/1) | 1 byte |
| UTC DateTime | 0x09 | 64-bit milliseconds since epoch | 8 bytes |
| Null | 0x0A | Null value (no data bytes) | 0 bytes |
| Regular Expression | 0x0B | Pattern + options strings | Variable |
| 32-bit Integer | 0x10 | Signed 32-bit integer | 4 bytes |
| 64-bit Integer | 0x12 | Signed 64-bit integer | 8 bytes |
| Decimal128 | 0x13 | 128-bit decimal floating point | 16 bytes |
ObjectId (12 bytes)
The ObjectId is MongoDB's default _id type, designed for distributed uniqueness:
┌────────────┬────────────┬─────────────┐
│ Timestamp │ Random │ Counter │
│ 4 bytes │ 5 bytes │ 3 bytes │
└────────────┴────────────┴─────────────┘
This structure enables roughly time-ordered _id values, uniqueness without any central coordination, and a creation timestamp embedded in the identifier itself.
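As a small illustration, PyMongo's ObjectId exposes the embedded timestamp directly (a sketch, assuming PyMongo is installed):

```python
from bson import ObjectId

oid = ObjectId()
print(oid)                  # e.g. 65a51c2e9f1b4a3d2c8e7f01
print(oid.generation_time)  # creation time recovered from the 4-byte timestamp prefix
```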
DateTime
Stored as a 64-bit signed integer representing milliseconds since the Unix epoch (January 1, 1970 UTC). This provides millisecond precision, timezone-independent storage, and native sorting and range queries without any string parsing.
Decimal128
For financial and scientific applications requiring exact decimal representation:
With Decimal128, 0.1 + 0.2 equals exactly 0.3 (not 0.30000000000000004, as with binary doubles).

Type selection guidelines:
• Use Int32 for small integers (counters, ages) — 4 bytes vs 8 for Double
• Use Int64 for IDs, timestamps — avoids precision loss
• Use Decimal128 for money — never use Double for currency
• Use Date for timestamps — native sorting and querying
• Use Binary for files/images — avoid Base64's 33% overhead
• Use ObjectId for _id unless you have specific requirements
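The guidelines above map directly onto driver-level types. A minimal PyMongo sketch (the field names and values are illustrative, not taken from a real schema):

```python
import datetime

import bson
from bson import ObjectId
from bson.binary import Binary
from bson.decimal128 import Decimal128
from bson.int64 import Int64

order = {
    "_id": ObjectId(),                                            # 12-byte distributed-unique id (0x07)
    "created_at": datetime.datetime.now(datetime.timezone.utc),   # UTC DateTime (0x09)
    "item_count": 3,                                              # small integer, stored as Int32 (0x10)
    "external_ref": Int64(9_007_199_254_740_993),                 # beyond double precision, Int64 (0x12)
    "total": Decimal128("19.99"),                                 # exact decimal for money (0x13)
    "invoice_pdf": Binary(b"%PDF-1.7 ..."),                       # raw bytes, no Base64 inflation (0x05)
}

print(len(bson.encode(order)), "bytes as BSON")
```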
The choice between JSON and BSON has profound performance implications. Let's analyze the trade-offs quantitatively.
JSON Parsing: every byte of text must be scanned; numbers are converted from decimal strings, strings are validated and unescaped, and the document's structure is only discovered as the scan proceeds.
BSON Decoding: type codes and length prefixes are read directly; numbers and dates are copied as native binary values, and fields that aren't needed can be skipped using their size information.
Performance comparison (approximate, varies by implementation):
| Operation | JSON | BSON | BSON Advantage |
|---|---|---|---|
| Full document parse | 100% | 40-60% | 1.7-2.5× faster |
| Skip to specific field | O(n) | O(1)* | Dramatic |
| Number parsing | String→Binary | Native read | 5-10× faster |
| Date parsing | Parse ISO string | Read 8 bytes | 10× faster |
*BSON field access is O(1) for known offsets, O(n) for name lookup on first access
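A rough way to check these ratios on your own hardware; this is a sketch rather than a rigorous benchmark, and results vary with document shape and library versions:

```python
import json
import timeit

import bson  # ships with PyMongo

doc = {"user_id": 12345, "name": "Alice Chen", "scores": list(range(100)), "active": True}
json_text = json.dumps(doc)
bson_bytes = bson.encode(doc)

json_time = timeit.timeit(lambda: json.loads(json_text), number=50_000)
bson_time = timeit.timeit(lambda: bson.decode(bson_bytes), number=50_000)

print(f"json.loads : {json_time:.3f}s")
print(f"bson.decode: {bson_time:.3f}s")
```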
Small document with numbers:
{"x": 1, "y": 2, "z": 3}
Document with date and binary:
{
"timestamp": "2024-01-15T10:30:00.000Z",
"data": "SGVsbG8gV29ybGQh" // Base64-encoded "Hello World!"
}
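To compare the two encodings by size, a small sketch (assuming PyMongo's bson package; exact byte counts depend on how the driver picks integer widths):

```python
import base64
import datetime
import json

import bson
from bson.binary import Binary

# Small document with numbers
small = {"x": 1, "y": 2, "z": 3}
print(len(json.dumps(small)))   # 24 bytes of JSON text
print(len(bson.encode(small)))  # 26 bytes of BSON (int32 values)

# Document with a date and 12 bytes of binary data
raw = b"Hello World!"
as_json = {"timestamp": "2024-01-15T10:30:00.000Z", "data": base64.b64encode(raw).decode()}
as_bson = {"timestamp": datetime.datetime(2024, 1, 15, 10, 30), "data": Binary(raw)}
print(len(json.dumps(as_json)))   # 69 bytes of JSON text
print(len(bson.encode(as_bson)))  # 47 bytes of BSON
```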
Key insight: BSON wins for documents with rich types (dates, binaries, large numbers) and efficient field access. JSON wins for small documents with simple string content.
How documents are organized on disk impacts I/O efficiency, compression, and query performance. Different document databases employ various storage strategies.
WiredTiger (MongoDB default since 3.2)
WiredTiger is a high-performance storage engine that provides document-level concurrency control, on-disk block compression, and durability through checkpoints combined with a write-ahead journal.
Storage Hierarchy:
┌─────────────────────────────────────────┐
│ WiredTiger Cache │
│ (Frequently accessed documents) │
├─────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ B-tree Internal Nodes │ │
│ ├─────────────────────────────┤ │
│ │ B-tree Leaf Pages │ │
│ │ ┌───────┬───────┬───────┐ │ │
│ │ │ Doc 1 │ Doc 2 │ Doc 3 │ │ │
│ │ └───────┴───────┴───────┘ │ │
│ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Data Files on Disk │
│ (Compressed, journaled) │
└─────────────────────────────────────────┘
Compression is critical for document databases since field names and structural metadata are repeated in every document, much of the content is text, and disk I/O is frequently the bottleneck for query performance.
Compression Options:
| Algorithm | Speed | Ratio | Use Case |
|---|---|---|---|
| None | Fastest | 1.0× | Latency-critical, incompressible data |
| Snappy | Fast | 2-4× | Default balance of speed and ratio |
| zlib | Medium | 4-8× | Better ratio, moderate CPU overhead |
| zstd | Fast | 5-10× | Best balance (newer engines) |
Block-level compression:
WiredTiger compresses at the page level, not the document level. This means neighboring documents are compressed together, so repeated field names and similar values across a page compress very well, and reading any single document decompresses its entire page into the cache.
Higher compression ratios can paradoxically improve query performance. Reason: Less data to read from disk means faster I/O. The CPU cost of decompression is often less than the I/O savings. This is especially true for disk-bound workloads where I/O is the bottleneck.
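Compression is configurable per collection at creation time. A hedged PyMongo sketch (database and collection names are illustrative; zstd requires MongoDB 4.2+):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["telemetry"]

# Extra keyword arguments to create_collection are passed through to the
# createCollection command, so WiredTiger options can be set per collection.
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)
```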
A unique characteristic of document databases is that field names are stored with every document. This enables schema flexibility but has storage implications.
Consider a collection with 1 million documents:
{
"customer_email_address": "user@example.com",
"shipping_street_address": "123 Main Street",
"preferred_contact_method": "email",
"account_creation_timestamp": "2024-01-15T10:30:00Z"
}
Field names alone consume:
• customer_email_address: 22 bytes
• shipping_street_address: 23 bytes
• preferred_contact_method: 24 bytes
• account_creation_timestamp: 26 bytes

That is 95 bytes of field names per document, or roughly 95 MB across the million-document collection before compression.

1. Short Field Names
Use abbreviated field names in storage:
{
"cEmail": "user@example.com",
"sAddr": "123 Main Street",
"cMethod": "email",
"cTime": "2024-01-15T10:30:00Z"
}
Field names drop to 23 bytes per document, roughly a 75% saving.
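You can measure the difference directly; a small sketch using the documents above (assuming PyMongo's bson package):

```python
import bson

long_names = {
    "customer_email_address": "user@example.com",
    "shipping_street_address": "123 Main Street",
    "preferred_contact_method": "email",
    "account_creation_timestamp": "2024-01-15T10:30:00Z",
}
short_names = {
    "cEmail": "user@example.com",
    "sAddr": "123 Main Street",
    "cMethod": "email",
    "cTime": "2024-01-15T10:30:00Z",
}

# The values are identical, so the gap between the two sizes is exactly the
# field-name overhead, and it is paid again for every document in the collection.
print(len(bson.encode(long_names)), len(bson.encode(short_names)))
```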
Trade-off: Reduced readability. Mitigate with an application-level mapping layer (ODM/ORM field aliases that translate readable names into the stored ones) and clear schema documentation.
2. Rely on Compression
Compression algorithms excel at repetitive patterns, and field names repeated across millions of documents compress extremely well at the block level, so their on-disk cost is usually modest. Keep in mind, though, that documents sit uncompressed in the WiredTiger cache, where long field names still consume memory.
3. Schema Validation Documentation
Even without code-level mapping, maintain clear documentation:
{
"$jsonSchema": {
"properties": {
"cEmail": { "description": "Customer email address" },
"sAddr": { "description": "Shipping street address" }
}
}
}
While BSON is MongoDB-specific, other document databases and systems use different binary formats. Understanding the landscape helps evaluate technology choices.
| Format | Used By | Key Features |
|---|---|---|
| BSON | MongoDB | Length-prefixed, rich types, traversable without parsing |
| MessagePack | Many systems | Compact, cross-platform, simpler type system |
| CBOR | IoT, WebAuthn | IETF standard, streaming support, schema-optional |
| Protocol Buffers | gRPC, Google | Schema-required, compact, fast, explicit field IDs |
| Avro | Hadoop ecosystem | Schema-with-data, row/columnar modes, schema evolution |
| Ion | Amazon | Self-describing, text and binary modes, typed nulls |
MessagePack is often described as "binary JSON":
JSON: {"compact":true,"schema":0} (27 bytes)
MsgPk: 82 a7 compact c3 a6 schema 00 (18 bytes)
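A quick way to reproduce that comparison in Python (a sketch assuming the msgpack package is installed):

```python
import json

import msgpack  # pip install msgpack

doc = {"compact": True, "schema": 0}

print(len(json.dumps(doc, separators=(",", ":"))))  # 27
print(len(msgpack.packb(doc)))                      # 18
```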
CBOR (Concise Binary Object Representation) is the IETF-standardized format defined in RFC 8949: it targets small encoders and small messages, supports streaming, works with or without a schema, and uses a tag system to extend the basic types with dates, bignums, and more. It is widely used in IoT protocols and WebAuthn.
Amazon Ion was developed by Amazon for internal use and later open-sourced: it is self-describing, offers fully interchangeable text and binary encodings, and extends the JSON model with types such as timestamps, arbitrary-precision decimals, and typed nulls.
| Requirement | Recommended Format |
|---|---|
| MongoDB compatibility | BSON |
| Maximum compactness | MessagePack |
| IoT/Constrained devices | CBOR |
| Request/response APIs | Protocol Buffers |
| Big data/Analytics | Avro |
| Multi-mode (text+binary) | Ion |
Optimizing document storage requires understanding both the format and access patterns. Here are proven techniques for production systems.
Problem: Documents that grow unboundedly (e.g., arrays that receive endless pushes) cause repeated rewrites as the document outgrows its allocated space, ever-larger reads and cache usage on every access, and an eventual collision with MongoDB's 16 MB document size limit.
Solutions:
// Instead of one document with 100,000 events:
{
"sensor_id": "s1",
"events": [/* 100,000 events */] // ❌ Will exceed limits
}
// Use bucketed documents:
{
"sensor_id": "s1",
"bucket": 1,
"events": [/* events 1-1000 */],
"count": 1000
}
{
"sensor_id": "s1",
"bucket": 2,
"events": [/* events 1001-2000 */],
"count": 1000
}
Another technique is to precompute aggregates that are read far more often than they change, rather than recalculating them from a large array on every query:

{
"product_id": "p1",
"ratings": [5, 4, 5, 3, 5, 4, ...],
"_computed": {
"avg_rating": 4.3,
"rating_count": 1523,
"last_updated": "2024-01-15T10:30:00Z"
}
}
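As a sketch of how the bucket pattern above might be driven from application code with PyMongo (connection details, names, and the 1,000-event limit are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["telemetry"]["sensor_events"]

def record_event(sensor_id: str, event: dict) -> None:
    # Push into the current bucket while it still has room; when no bucket
    # matches, the upsert creates a fresh one. Every document stays bounded.
    events.update_one(
        {"sensor_id": sensor_id, "count": {"$lt": 1000}},
        {"$push": {"events": event}, "$inc": {"count": 1}},
        upsert=True,
    )
```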
Avoid these common mistakes:
• Storing Base64-encoded binaries — use the Binary type
• ISO date strings — use the native DateTime type
• Floating point for currency — use Decimal128
• Deeply nested structures (more than ~5 levels) — flattening often improves performance
• Arrays as a poor man's index — use proper indexes instead
We've explored the physical foundation of document storage. Let's consolidate the key insights:
• JSON is ideal for interchange but lacks the types and efficiency a database needs.
• BSON adds rich types (ObjectId, DateTime, Decimal128, Binary) and length prefixes that allow traversal without full parsing.
• Type choices (Int32 vs. Double, Binary vs. Base64, Decimal128 for money) directly affect storage size and correctness.
• Block-level compression offsets much of BSON's overhead, including repeated field names.
• Unbounded document growth and Base64-encoded binaries are among the most common storage anti-patterns.
What's Next:
With an understanding of how documents are modeled and stored, we'll explore MongoDB as the canonical document database example. You'll learn MongoDB's architecture, replication model, sharding capabilities, and how it implements the document model principles we've established.
You now understand the physical storage layer of document databases—from JSON's ubiquitous text format to BSON's optimized binary representation. This knowledge enables you to make informed decisions about data types, schema design, and storage optimization in production systems.