When you store a document in a database, what actually happens on disk? The document you write—perhaps a JSON object in your application code—undergoes a transformation before it's persisted. This transformation balances human readability against machine efficiency, query performance against storage density.
Understanding this physical layer is essential for document database practitioners. It explains why some operations are fast and others slow, why certain data types exist, and how to optimize your schema for storage efficiency.
In this page, we'll trace the journey from JSON (the ubiquitous text format) to BSON (Binary JSON, MongoDB's internal format), examining the engineering decisions that make document databases performant.
By the end of this page, you will understand: JSON's role as the universal document interchange format; BSON's binary structure and type system; how documents are serialized, stored, and deserialized; performance implications of different data types and document structures; storage optimization techniques; and how other document databases approach the same problems.
JavaScript Object Notation (JSON) has become the lingua franca of data interchange. Its success stems from a perfect balance of simplicity, expressiveness, and human readability.
JSON supports six data types, falling into two categories:
Primitive Types:
"Hello, World!"42, -3.14, 1.2e10true or falsenullComposite Types:
{"name": "John", "age": 30}[1, 2, 3], ["a", "b", "c"]{
"user": {
"id": "usr_12345",
"profile": {
"name": "Alice Chen",
"verified": true,
"followers": 1523,
"bio": null
},
"tags": ["developer", "speaker", "author"]
}
}
Why JSON dominates:
• Human readable and writable with any text editor
• A minimal grammar that is easy to parse and generate
• Native to JavaScript, with parsers available in virtually every language
• The de facto interchange format for web APIs and configuration
While excellent for interchange, JSON has significant limitations for database storage:
• No Date type — dates must be encoded as strings or numbers
• No Binary type — binary data requires Base64 encoding (33% overhead)
• No Integer vs. Float distinction — all numbers are IEEE 754 doubles
• No ObjectId/UUID — unique identifiers are stored as strings
• Parsing overhead — text must be parsed on every read
• Format overhead — field names are repeated in every document
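To make the Base64 cost concrete, here is a minimal Python sketch (illustrative only) of the roughly 33% inflation that binary data suffers inside JSON:

```python
import base64

payload = bytes(300)                 # 300 bytes of raw binary data
encoded = base64.b64encode(payload)  # what a JSON document would have to carry instead

print(len(payload), len(encoded))    # 300 vs 400: roughly 33% larger before compression
```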
BSON (Binary JSON) is MongoDB's binary-encoded serialization format. It was designed to overcome JSON's limitations for database storage while maintaining JSON's document model semantics.
Every BSON document follows this structure:
┌─────────────────────────────────────────────────────────────┐
│ Document │
├─────────┬───────────────────────────────────────────────────┤
│ 4 bytes │ Total document size (including this header) │
├─────────┼───────────────────────────────────────────────────┤
│ Element │ Type (1 byte) + Name (cstring) + Value │
├─────────┼───────────────────────────────────────────────────┤
│ Element │ Type (1 byte) + Name (cstring) + Value │
├─────────┼───────────────────────────────────────────────────┤
│ ... │ Additional elements │
├─────────┼───────────────────────────────────────────────────┤
│ 1 byte │ Null terminator (0x00) │
└─────────┴───────────────────────────────────────────────────┘
Key insight: The 4-byte size prefix at the document start enables O(1) document skipping. When scanning a collection, the database can jump from document to document without parsing field contents.
| Component | Size | Description |
|---|---|---|
| Type byte | 1 byte | Identifies the value type (0x01=double, 0x02=string, etc.) |
| Field name | Variable | Null-terminated C-string (UTF-8) |
| Value | Variable | Type-specific binary encoding |
Consider this JSON document:
{"hello": "world"}
In BSON, this becomes 22 bytes:
\x16\x00\x00\x00 // Document size: 22 bytes
\x02 // Type: string (0x02)
hello\x00 // Field name + null terminator
\x06\x00\x00\x00 // String length: 6 (including null)
world\x00 // String value + null terminator
\x00 // Document terminator
Compare to the JSON text representation (18 bytes). BSON is slightly larger here, but the length prefixes enable direct field access without parsing.
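You can reproduce this layout with a driver's BSON library. A minimal sketch using the bson package that ships with PyMongo (assuming PyMongo 3.9+ is installed):

```python
import struct

import bson  # ships with PyMongo

data = bson.encode({"hello": "world"})

print(len(data))      # 22
print(data.hex(" "))  # 16 00 00 00 02 68 65 6c 6c 6f 00 06 00 00 00 77 6f 72 6c 64 00 00

# The 4-byte little-endian size prefix is what lets a scanner hop from
# document to document without parsing field contents.
size = struct.unpack_from("<i", data, 0)[0]
print(size)           # 22
```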
BSON extends JSON with additional types critical for database operations. Understanding these types is essential for efficient schema design.
| Type | Code | Description | Size |
|---|---|---|---|
| Double | 0x01 | 64-bit IEEE 754 floating point | 8 bytes |
| String | 0x02 | UTF-8 string with length prefix | 4 + len + 1 |
| Document | 0x03 | Embedded BSON document | Variable |
| Array | 0x04 | BSON document with numeric keys | Variable |
| Binary | 0x05 | Binary data with subtype byte | 5 + len |
| ObjectId | 0x07 | 12-byte unique identifier | 12 bytes |
| Boolean | 0x08 | Single byte boolean (0/1) | 1 byte |
| UTC DateTime | 0x09 | 64-bit milliseconds since epoch | 8 bytes |
| Null | 0x0A | Null value (no data bytes) | 0 bytes |
| Regular Expression | 0x0B | Pattern + options strings | Variable |
| 32-bit Integer | 0x10 | Signed 32-bit integer | 4 bytes |
| 64-bit Integer | 0x12 | Signed 64-bit integer | 8 bytes |
| Decimal128 | 0x13 | 128-bit decimal floating point | 16 bytes |
ObjectId (12 bytes)
The ObjectId is MongoDB's default _id type, designed for distributed uniqueness:
┌────────────┬────────────┬─────────────┐
│ Timestamp │ Random │ Counter │
│ 4 bytes │ 5 bytes │ 3 bytes │
└────────────┴────────────┴─────────────┘
This structure enables roughly time-ordered _id values, uniqueness without any central coordination, and a creation timestamp embedded in the identifier itself.
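As a small illustration, PyMongo's ObjectId exposes the embedded timestamp directly (a sketch, assuming PyMongo is installed):

```python
from bson import ObjectId

oid = ObjectId()
print(oid)                  # e.g. 65a51c2e9f1b4a3d2c8e7f01
print(oid.generation_time)  # creation time recovered from the 4-byte timestamp prefix
```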
DateTime
Stored as a 64-bit signed integer representing milliseconds since the Unix epoch (January 1, 1970 UTC). This provides millisecond precision, timezone-independent storage, and native sorting and range queries without any string parsing.
Decimal128
For financial and scientific applications requiring exact decimal representation:
With Decimal128, 0.1 + 0.2 equals exactly 0.3 (not 0.30000000000000004, as with binary doubles).

Type selection guidelines:
• Use Int32 for small integers (counters, ages) — 4 bytes vs 8 for Double
• Use Int64 for IDs, timestamps — avoids precision loss
• Use Decimal128 for money — never use Double for currency
• Use Date for timestamps — native sorting and querying
• Use Binary for files/images — avoid Base64's 33% overhead
• Use ObjectId for _id unless you have specific requirements
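The guidelines above map directly onto driver-level types. A minimal PyMongo sketch (the field names and values are illustrative, not taken from a real schema):

```python
import datetime

import bson
from bson import ObjectId
from bson.binary import Binary
from bson.decimal128 import Decimal128
from bson.int64 import Int64

order = {
    "_id": ObjectId(),                                            # 12-byte distributed-unique id (0x07)
    "created_at": datetime.datetime.now(datetime.timezone.utc),   # UTC DateTime (0x09)
    "item_count": 3,                                              # small integer, stored as Int32 (0x10)
    "external_ref": Int64(9_007_199_254_740_993),                 # beyond double precision, Int64 (0x12)
    "total": Decimal128("19.99"),                                 # exact decimal for money (0x13)
    "invoice_pdf": Binary(b"%PDF-1.7 ..."),                       # raw bytes, no Base64 inflation (0x05)
}

print(len(bson.encode(order)), "bytes as BSON")
```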
The choice between JSON and BSON has profound performance implications. Let's analyze the trade-offs quantitatively.
JSON Parsing: every byte of text must be scanned; numbers are converted from decimal strings, strings are validated and unescaped, and the document's structure is only discovered as the scan proceeds.
BSON Decoding: type codes and length prefixes are read directly; numbers and dates are copied as native binary values, and fields that aren't needed can be skipped using their size information.
Performance comparison (approximate, varies by implementation):
| Operation | JSON | BSON | BSON Advantage |
|---|---|---|---|
| Full document parse | 100% | 40-60% | 1.7-2.5× faster |
| Skip to specific field | O(n) | O(1)* | Dramatic |
| Number parsing | String→Binary | Native read | 5-10× faster |
| Date parsing | Parse ISO string | Read 8 bytes | 10× faster |
*BSON field access is O(1) for known offsets, O(n) for name lookup on first access
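A rough way to check these ratios on your own hardware; this is a sketch rather than a rigorous benchmark, and results vary with document shape and library versions:

```python
import json
import timeit

import bson  # ships with PyMongo

doc = {"user_id": 12345, "name": "Alice Chen", "scores": list(range(100)), "active": True}
json_text = json.dumps(doc)
bson_bytes = bson.encode(doc)

json_time = timeit.timeit(lambda: json.loads(json_text), number=50_000)
bson_time = timeit.timeit(lambda: bson.decode(bson_bytes), number=50_000)

print(f"json.loads : {json_time:.3f}s")
print(f"bson.decode: {bson_time:.3f}s")
```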
Small document with numbers:
{"x": 1, "y": 2, "z": 3}
Document with date and binary:
{
"timestamp": "2024-01-15T10:30:00.000Z",
"data": "SGVsbG8gV29ybGQh" // Base64-encoded "Hello World!"
}
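To compare the two encodings by size, a small sketch (assuming PyMongo's bson package; exact byte counts depend on how the driver picks integer widths):

```python
import base64
import datetime
import json

import bson
from bson.binary import Binary

# Small document with numbers
small = {"x": 1, "y": 2, "z": 3}
print(len(json.dumps(small)))   # 24 bytes of JSON text
print(len(bson.encode(small)))  # 26 bytes of BSON (int32 values)

# Document with a date and 12 bytes of binary data
raw = b"Hello World!"
as_json = {"timestamp": "2024-01-15T10:30:00.000Z", "data": base64.b64encode(raw).decode()}
as_bson = {"timestamp": datetime.datetime(2024, 1, 15, 10, 30), "data": Binary(raw)}
print(len(json.dumps(as_json)))   # 69 bytes of JSON text
print(len(bson.encode(as_bson)))  # 47 bytes of BSON
```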
Key insight: BSON wins for documents with rich types (dates, binaries, large numbers) and efficient field access. JSON wins for small documents with simple string content.
How documents are organized on disk impacts I/O efficiency, compression, and query performance. Different document databases employ various storage strategies.
WiredTiger (MongoDB default since 3.2)
WiredTiger is a high-performance storage engine that provides document-level concurrency control, on-disk block compression, and durability through checkpoints combined with a write-ahead journal.
Storage Hierarchy:
┌─────────────────────────────────────────┐
│ WiredTiger Cache │
│ (Frequently accessed documents) │
├─────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────┐ │
│ │ B-tree Internal Nodes │ │
│ ├─────────────────────────────┤ │
│ │ B-tree Leaf Pages │ │
│ │ ┌───────┬───────┬───────┐ │ │
│ │ │ Doc 1 │ Doc 2 │ Doc 3 │ │ │
│ │ └───────┴───────┴───────┘ │ │
│ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Data Files on Disk │
│ (Compressed, journaled) │
└─────────────────────────────────────────┘
Compression is critical for document databases since field names and structural metadata are repeated in every document, much of the content is text, and disk I/O is frequently the bottleneck for query performance.
Compression Options:
| Algorithm | Speed | Ratio | Use Case |
|---|---|---|---|
| None | Fastest | 1.0× | Latency-critical, incompressible data |
| Snappy | Fast | 2-4× | Default balance of speed and ratio |
| zlib | Medium | 4-8× | Better ratio, moderate CPU overhead |
| zstd | Fast | 5-10× | Best balance (newer engines) |
Block-level compression:
WiredTiger compresses at the page level, not the document level. This means neighboring documents are compressed together, so repeated field names and similar values across a page compress very well, and reading any single document decompresses its entire page into the cache.
Higher compression ratios can paradoxically improve query performance. Reason: Less data to read from disk means faster I/O. The CPU cost of decompression is often less than the I/O savings. This is especially true for disk-bound workloads where I/O is the bottleneck.
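Compression is configurable per collection at creation time. A hedged PyMongo sketch (database and collection names are illustrative; zstd requires MongoDB 4.2+):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["telemetry"]

# Extra keyword arguments to create_collection are passed through to the
# createCollection command, so WiredTiger options can be set per collection.
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)
```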
A unique characteristic of document databases is that field names are stored with every document. This enables schema flexibility but has storage implications.
Consider a collection with 1 million documents:
{
"customer_email_address": "user@example.com",
"shipping_street_address": "123 Main Street",
"preferred_contact_method": "email",
"account_creation_timestamp": "2024-01-15T10:30:00Z"
}
Field names alone consume:
• customer_email_address: 22 bytes
• shipping_street_address: 23 bytes
• preferred_contact_method: 24 bytes
• account_creation_timestamp: 26 bytes

That is 95 bytes of field names per document, or roughly 95 MB across the million-document collection before compression.

1. Short Field Names
Use abbreviated field names in storage:
{
"cEmail": "user@example.com",
"sAddr": "123 Main Street",
"cMethod": "email",
"cTime": "2024-01-15T10:30:00Z"
}
Field names drop to 23 bytes per document, roughly a 75% saving.
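You can measure the difference directly; a small sketch using the documents above (assuming PyMongo's bson package):

```python
import bson

long_names = {
    "customer_email_address": "user@example.com",
    "shipping_street_address": "123 Main Street",
    "preferred_contact_method": "email",
    "account_creation_timestamp": "2024-01-15T10:30:00Z",
}
short_names = {
    "cEmail": "user@example.com",
    "sAddr": "123 Main Street",
    "cMethod": "email",
    "cTime": "2024-01-15T10:30:00Z",
}

# The values are identical, so the gap between the two sizes is exactly the
# field-name overhead, and it is paid again for every document in the collection.
print(len(bson.encode(long_names)), len(bson.encode(short_names)))
```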
Trade-off: Reduced readability. Mitigate with an application-level mapping layer (ODM/ORM field aliases that translate readable names into the stored ones) and clear schema documentation.
2. Rely on Compression
Compression algorithms excel at repetitive patterns, and field names repeated across millions of documents compress extremely well at the block level, so their on-disk cost is usually modest. Keep in mind, though, that documents sit uncompressed in the WiredTiger cache, where long field names still consume memory.
3. Schema Validation Documentation
Even without code-level mapping, maintain clear documentation:
{
"$jsonSchema": {
"properties": {
"cEmail": { "description": "Customer email address" },
"sAddr": { "description": "Shipping street address" }
}
}
}
While BSON is MongoDB-specific, other document databases and systems use different binary formats. Understanding the landscape helps evaluate technology choices.
| Format | Used By | Key Features |
|---|---|---|
| BSON | MongoDB | Length-prefixed, rich types, traversable without parsing |
| MessagePack | Many systems | Compact, cross-platform, simpler type system |
| CBOR | IoT, WebAuthn | IETF standard, streaming support, schema-optional |
| Protocol Buffers | gRPC, Google | Schema-required, compact, fast, explicit field IDs |
| Avro | Hadoop ecosystem | Schema-with-data, row/columnar modes, schema evolution |
| Ion | Amazon | Self-describing, text and binary modes, typed nulls |
MessagePack is often described as "binary JSON":
JSON: {"compact":true,"schema":0} (27 bytes)
MsgPk: 82 a7 compact c3 a6 schema 00 (18 bytes)
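A quick way to reproduce that comparison in Python (a sketch assuming the msgpack package is installed):

```python
import json

import msgpack  # pip install msgpack

doc = {"compact": True, "schema": 0}

print(len(json.dumps(doc, separators=(",", ":"))))  # 27
print(len(msgpack.packb(doc)))                      # 18
```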
CBOR (Concise Binary Object Representation) is the IETF-standardized format defined in RFC 8949: it targets small encoders and small messages, supports streaming, works with or without a schema, and uses a tag system to extend the basic types with dates, bignums, and more. It is widely used in IoT protocols and WebAuthn.
Amazon Ion was developed by Amazon for internal use and later open-sourced: it is self-describing, offers fully interchangeable text and binary encodings, and extends the JSON model with types such as timestamps, arbitrary-precision decimals, and typed nulls.
| Requirement | Recommended Format |
|---|---|
| MongoDB compatibility | BSON |
| Maximum compactness | MessagePack |
| IoT/Constrained devices | CBOR |
| Request/response APIs | Protocol Buffers |
| Big data/Analytics | Avro |
| Multi-mode (text+binary) | Ion |
Optimizing document storage requires understanding both the format and access patterns. Here are proven techniques for production systems.
Problem: Documents that grow unboundedly (e.g., arrays that receive endless pushes) cause repeated rewrites as the document outgrows its allocated space, ever-larger reads and cache usage on every access, and an eventual collision with MongoDB's 16 MB document size limit.
Solutions:
// Instead of one document with 100,000 events:
{
"sensor_id": "s1",
"events": [/* 100,000 events */] // ❌ Will exceed limits
}
// Use bucketed documents:
{
"sensor_id": "s1",
"bucket": 1,
"events": [/* events 1-1000 */],
"count": 1000
}
{
"sensor_id": "s1",
"bucket": 2,
"events": [/* events 1001-2000 */],
"count": 1000
}
Another technique is to precompute aggregates that are read far more often than they change, rather than recalculating them from a large array on every query:

{
"product_id": "p1",
"ratings": [5, 4, 5, 3, 5, 4, ...],
"_computed": {
"avg_rating": 4.3,
"rating_count": 1523,
"last_updated": "2024-01-15T10:30:00Z"
}
}
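As a sketch of how the bucket pattern above might be driven from application code with PyMongo (connection details, names, and the 1,000-event limit are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["telemetry"]["sensor_events"]

def record_event(sensor_id: str, event: dict) -> None:
    # Push into the current bucket while it still has room; when no bucket
    # matches, the upsert creates a fresh one. Every document stays bounded.
    events.update_one(
        {"sensor_id": sensor_id, "count": {"$lt": 1000}},
        {"$push": {"events": event}, "$inc": {"count": 1}},
        upsert=True,
    )
```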
Avoid these common mistakes:
• Storing Base64-encoded binaries — use the Binary type
• ISO date strings — use the native DateTime type
• Floating point for currency — use Decimal128
• Deeply nested structures (more than ~5 levels) — flattening often improves performance
• Arrays as a poor man's index — use proper indexes instead
We've explored the physical foundation of document storage. Let's consolidate the key insights:
• JSON is ideal for interchange but lacks the types and efficiency a database needs.
• BSON adds rich types (ObjectId, DateTime, Decimal128, Binary) and length prefixes that allow traversal without full parsing.
• Type choices (Int32 vs. Double, Binary vs. Base64, Decimal128 for money) directly affect storage size and correctness.
• Block-level compression offsets much of BSON's overhead, including repeated field names.
• Unbounded document growth and Base64-encoded binaries are among the most common storage anti-patterns.
What's Next:
With an understanding of how documents are modeled and stored, we'll explore MongoDB as the canonical document database example. You'll learn MongoDB's architecture, replication model, sharding capabilities, and how it implements the document model principles we've established.
You now understand the physical storage layer of document databases—from JSON's ubiquitous text format to BSON's optimized binary representation. This knowledge enables you to make informed decisions about data types, schema design, and storage optimization in production systems.