Every second, the digital world generates an incomprehensible volume of time-stamped data. A single Tesla vehicle produces 25 gigabytes of sensor data per hour. Uber's real-time pricing engine ingests billions of location updates daily. Netflix's streaming infrastructure emits trillions of metrics to monitor the health of systems serving 230 million subscribers. This data shares one fundamental characteristic: it is indexed by time.
Traditional databases—both relational and NoSQL—were not designed for this workload. When you try to store 100,000 data points per second in PostgreSQL or MongoDB, performance degrades catastrophically. Queries that scan time ranges become agonizingly slow. Storage consumption balloons uncontrollably. The fundamental assumptions of general-purpose databases—optimized for random access patterns and CRUD operations—fail spectacularly when confronted with the relentless, append-heavy, time-ordered nature of sensor readings, metrics, logs, and events.
By the end of this page, you will understand the unique characteristics of time-series data, why general-purpose databases fail at this workload, and the specialized optimizations—from columnar storage to time-based partitioning to compression algorithms—that enable time-series databases to achieve 10-100x better performance than their general-purpose counterparts.
Before we can optimize for time-series data, we must precisely understand what makes it fundamentally different from other data types. Time-series data is not simply "data with a timestamp"—it exhibits a constellation of characteristics that, taken together, demand specialized handling.
Definition: Time-series data is a sequence of data points collected or recorded at discrete time intervals, where the temporal ordering is the primary dimension of access and analysis.
This definition seems simple, but its implications are profound. Let's examine the core characteristics:
The Data Shape:
Time-series data typically takes one of a few structural forms, each with distinct implications for storage and query patterns:
| Structure | Description | Example | Storage Implication |
|---|---|---|---|
| Regular (Fixed Interval) | Data points at consistent time intervals | CPU usage sampled every 10 seconds | Highly compressible; timestamp can be computed from index |
| Irregular (Event-Driven) | Data points at variable intervals | User login events, error logs | Must store explicit timestamps; less compressible |
| Multivariate | Multiple metrics per timestamp | Temperature, humidity, pressure from same sensor | Can share timestamp storage across metrics |
| Tagged/Labeled | Metrics with dimensional metadata | cpu_usage{host=web01, region=us-east} | Tag indexing critical for query performance |
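To make the tagged and multivariate shapes concrete, here is a small illustrative sketch in Python; the measurement name, tags, and fields are invented for the example, not drawn from any particular product:

```python
# One multivariate, tagged data point (illustrative layout)
point = {
    "measurement": "environment",
    "tags": {"host": "web01", "region": "us-east"},      # dimensional metadata
    "fields": {"temperature": 21.4, "humidity": 0.53},   # several metrics...
    "timestamp": 1704067200,                             # ...sharing one timestamp
}

# The sorted tag set identifies the series; timestamps order points within it
series_key = (point["measurement"], tuple(sorted(point["tags"].items())))
print(series_key)
# ('environment', (('host', 'web01'), ('region', 'us-east')))
```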
To truly appreciate time-series database optimizations, we must understand why PostgreSQL, MySQL, MongoDB, and other general-purpose databases struggle with time-series workloads. The failures aren't bugs—they're fundamental design mismatches.
The Write Amplification Problem:
Relational databases use B-tree indexes as their primary data structure. B-trees are excellent for random reads and balanced read/write workloads, but they're catastrophic for append-heavy time-series ingestion.
When you insert a new row into a B-tree indexed table, the database must append the change to the write-ahead log, traverse the index from root to leaf to find the insertion point, and occasionally split full pages, dirtying several pages for a single logical insert.
For time-series data, every single write goes to the "end" of the time index—but B-trees don't have a fast path for this. Each insert is treated as a random write, causing 10-40x write amplification compared to a simple append.
```sql
-- Simulating time-series insert into PostgreSQL
-- Table: sensor_readings (timestamp, device_id, value)
-- Index: btree on (timestamp)

-- For each of 100,000 inserts per second:
-- 1. WAL write: ~200 bytes per insert = 20 MB/s journal I/O
-- 2. B-tree traversal: O(log N) page reads
-- 3. Page split probability increases with table size
-- 4. Autovacuum overhead: ~10-20% additional I/O

-- Result: a table with 1 billion rows requires:
-- - 4-5 B-tree levels (~30 binary-search comparisons per lookup)
-- - ~100ms per write at sustained load
-- - Eventually: writes queue faster than they complete

-- PostgreSQL write throughput degrades:
-- 1M rows:   ~50,000 writes/sec
-- 100M rows: ~10,000 writes/sec
-- 1B rows:   ~1,000 writes/sec (or less)
```

The Query Pattern Mismatch:
General-purpose databases optimize for point queries: "Get user with ID 12345." Time-series workloads require range queries: "Get all readings from device X in the last hour." This difference has profound implications:
Index Selectivity: B-tree indexes assume high selectivity (each query hits a small percentage of data). Time-series range queries can easily hit millions of rows.
Sequential vs. Random I/O: Disk drives—even SSDs—perform dramatically better on sequential reads than random reads. B-trees scatter logically sequential data across random physical locations.
Aggregation Costs: Time-series queries frequently require aggregations (averages, sums, percentiles) over millions of points. General-purpose databases compute these row by row; specialized databases use columnar scans and precomputed rollups.
In benchmarks, time-series databases typically achieve 10x better write throughput, 10x faster range queries, and 10x better compression compared to general-purpose databases on identical workloads. This isn't marginal improvement—it's the difference between a system that works and one that doesn't.
The foundation of time-series database performance is the storage engine—the subsystem responsible for writing data to disk and reading it back. Time-series databases use specialized storage engines designed from the ground up for append-heavy, time-ordered data. Let's examine the three dominant approaches:
1. Log-Structured Merge Trees (LSM-Trees):
LSM-Trees, originally developed for write-optimized databases like LevelDB and RocksDB, have become the foundation for many time-series databases. The key insight: by buffering writes in memory and periodically flushing sorted files to disk, LSM-Trees convert random writes into sequential writes.
```text
LSM-Tree Write Path:

┌─────────────────────────────────────────────────────┐
│ Incoming Writes (100K/sec)                          │
└─────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────┐
│ MemTable (In-Memory, Sorted)                        │
│ ├── Red-Black Tree / Skip List                      │
│ ├── Size: 64MB - 256MB                              │
│ └── Writes are O(log n) - extremely fast            │
└─────────────────────────────────────────────────────┘
                         │  (Flush when full)
                         ▼
┌─────────────────────────────────────────────────────┐
│ Level 0 SSTable Files (Disk)                        │
│ ├── Immutable, sorted files                         │
│ ├── Each file: 64-256MB                             │
│ └── Multiple overlapping files allowed              │
└─────────────────────────────────────────────────────┘
                         │  (Compaction)
                         ▼
┌─────────────────────────────────────────────────────┐
│ Level 1-N SSTables (Disk)                           │
│ ├── Non-overlapping key ranges per level            │
│ ├── Level N+1 is 10x larger than Level N            │
│ └── Compaction merges and removes duplicates        │
└─────────────────────────────────────────────────────┘

Why This Works for Time-Series:
- All writes go to memory first (microseconds, not milliseconds)
- Disk writes are large, sequential batches
- Time-ordered data naturally creates non-overlapping ranges
- Read amplification is acceptable for range-heavy workloads
```

2. Time-Structured Merge Trees (TSM):
InfluxDB developed the Time-Structured Merge (TSM) engine specifically for time-series data. TSM extends LSM concepts with time-series-specific optimizations:
```text
TSM Shard Structure (InfluxDB):

┌─────────────────────────────────────────────────────┐
│ Shard: 2024-01-01 to 2024-01-07                     │
│ (Time-based partition)                              │
└─────────────────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  TSM File 1  │ │  TSM File 2  │ │  TSM File N  │
│ (Immutable)  │ │ (Immutable)  │ │ (Immutable)  │
└──────────────┘ └──────────────┘ └──────────────┘

Inside a TSM File:
┌─────────────────────────────────────────────────────┐
│ Header                                              │
├─────────────────────────────────────────────────────┤
│ Block 1: Series=cpu,host=web01                      │
│ ├── Timestamps: [compressed delta-encoded]          │
│ └── Values: [Gorilla-compressed floats]             │
├─────────────────────────────────────────────────────┤
│ Block 2: Series=cpu,host=web02                      │
│ ├── Timestamps: [compressed]                        │
│ └── Values: [compressed]                            │
├─────────────────────────────────────────────────────┤
│ Index: Series → Block Offset mapping                │
└─────────────────────────────────────────────────────┘

Compression Ratios:
- Timestamps: 95% reduction (delta + varint encoding)
- Float values: 80% reduction (Gorilla XOR compression)
- Overall: 10-20x compression vs raw storage
```

3. Chunk-Based Storage (TimescaleDB/Hypertables):
TimescaleDB takes a different approach: rather than replacing PostgreSQL's storage engine, it partitions data into chunks—smaller PostgreSQL tables organized by time. Each chunk is a standard PostgreSQL heap table, but the chunking layer provides time-series optimizations: the planner excludes chunks whose time range cannot match a query, per-chunk indexes stay small enough to fit in memory while a chunk is actively written, closed chunks can be compressed columnar-style, and retention becomes a cheap metadata operation (dropping a chunk) rather than a mass DELETE.
This approach sacrifices some raw performance for compatibility with the PostgreSQL ecosystem—critical for teams with existing Postgres expertise and tooling.
Time-series data is extraordinarily compressible—but only if you exploit its unique characteristics. General-purpose compression algorithms like gzip or lz4 achieve 2-4x compression on time-series data. Purpose-built algorithms achieve 10-50x. Understanding these compression techniques is essential for system design.
Delta Encoding for Timestamps:
When data arrives at regular intervals (e.g., every 10 seconds), storing the full timestamp for each point is wasteful. Delta encoding stores only the difference between consecutive timestamps:
```python
# Raw timestamps (8 bytes each):
timestamps = [
    1704067200,  # 2024-01-01 00:00:00
    1704067210,  # 2024-01-01 00:00:10
    1704067220,  # 2024-01-01 00:00:20
    1704067230,  # 2024-01-01 00:00:30
    # ... 1 million more points
]
# Storage: 1,000,000 × 8 bytes = 8 MB

# Delta encoded:
deltas = [
    1704067200,  # First timestamp (full)
    10,          # Δ from previous
    10,          # Δ from previous
    10,          # Δ from previous
    # ... 1 million more deltas (all = 10)
]
# Storage: 8 bytes + 999,999 × 1 byte ≈ 1 MB

# Delta-of-delta encoded (for regular intervals):
delta_of_deltas = [
    1704067200,  # First timestamp (full)
    10,          # First delta
    0, 0, 0, 0,  # All subsequent delta-of-deltas are 0!
]
# Storage: 8 bytes + 4 bytes + 999,998 × 1 bit ≈ 125 KB

# Compression ratio: 8 MB → 125 KB = 64x
```

Gorilla Compression for Floating-Point Values:
Facebook's Gorilla paper introduced XOR-based compression for time-series float values. The insight: consecutive readings from the same sensor tend to be similar, and XOR of similar floats produces values with many leading zeros:
```python
# CPU temperature readings: 67.2°C, 67.3°C, 67.1°C, 67.4°C

# IEEE 754 double representation:
# 67.2 = 0x4050CCCCCCCCCCCD (64 bits)
# 67.3 = 0x4050D33333333333 (64 bits)

# XOR of consecutive values:
# 67.2 XOR 67.3 = 0x00001FFFFFFFFFFE
# Result has 19 leading zeros and 1 trailing zero:
# only 44 of the 64 bits are meaningful

# Gorilla encoding:
# 1. First value: stored in full (64 bits)
# 2. Subsequent values encoded as:
#    - If XOR == 0: store '0' (1 bit) - values identical
#    - If XOR has at least as many leading/trailing zeros as the
#      previous block: store '10' + meaningful bits
#    - Otherwise: store '11' + leading-zeros count (5 bits) +
#      meaningful-bits length (6 bits) + meaningful bits

# The Gorilla paper reports an average of ~1.37 bytes per
# {timestamp, value} pair (vs 16 bytes raw): roughly 12x
# compression on Facebook's production metrics

import struct

def float_to_bits(value: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack('>Q', struct.pack('>d', value))[0]

def count_leading_zeros(x: int) -> int:
    return 64 - x.bit_length()

def count_trailing_zeros(x: int) -> int:
    return (x & -x).bit_length() - 1

def to_bin(x: int, width: int) -> str:
    return format(x, f'0{width}b')

class GorillaEncoder:
    """Value-encoder sketch; emits bits as a '0'/'1' string for clarity."""

    def __init__(self):
        self.prev_value = 0.0
        # Sentinels force the '11' branch for the first non-identical value,
        # which effectively stores it in (nearly) full precision
        self.prev_leading = 65
        self.prev_trailing = 65

    def encode(self, value: float) -> str:
        xor = float_to_bits(value) ^ float_to_bits(self.prev_value)
        self.prev_value = value
        if xor == 0:
            return '0'  # identical value: a single bit
        leading = min(count_leading_zeros(xor), 31)  # cap to fit 5 bits
        trailing = count_trailing_zeros(xor)
        if leading >= self.prev_leading and trailing >= self.prev_trailing:
            # Meaningful bits fit inside the previous block position
            block = 64 - self.prev_leading - self.prev_trailing
            return '10' + to_bin(xor >> self.prev_trailing, block)
        # Store a new block position
        self.prev_leading, self.prev_trailing = leading, trailing
        block = 64 - leading - trailing  # real encoders store 64 as 0
        return ('11' + to_bin(leading, 5) + to_bin(block % 64, 6)
                + to_bin(xor >> trailing, block))
```

Dictionary Encoding for Tags:
Time-series data often includes repetitive string tags (hostnames, regions, metric names) that recur across millions of points. Dictionary encoding maps each distinct string to a compact integer identifier and stores only the integers.
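Here is a minimal sketch of the technique in Python (illustrative only; real engines bit-pack the integer IDs and persist the dictionary per block):

```python
def dictionary_encode(tags: list[str]) -> tuple[dict[str, int], list[int]]:
    """Map each distinct string to a small integer; store only the integers."""
    dictionary: dict[str, int] = {}
    encoded: list[int] = []
    for tag in tags:
        if tag not in dictionary:
            dictionary[tag] = len(dictionary)  # next available ID
        encoded.append(dictionary[tag])
    return dictionary, encoded

hosts = ["web01", "web02", "web01", "web01", "db01"] * 200_000
dictionary, ids = dictionary_encode(hosts)
# 1M strings (~5 bytes each) become 1M small ints plus a 3-entry
# dictionary; bit-packed, each ID here needs only 2 bits
```

The table below summarizes typical ratios for each technique: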
| Data Type | Technique | Typical Ratio | Usage |
|---|---|---|---|
| Timestamps (regular) | Delta-of-Delta + Varint | 50-100x | Metrics at fixed intervals |
| Timestamps (irregular) | Delta + Varint | 5-10x | Event logs, user actions |
| Float values | Gorilla XOR | 20-50x | Sensor readings, metrics |
| Integer values | Delta + ZigZag + Varint | 10-30x | Counters, gauges |
| String tags | Dictionary encoding | 10-100x | Host names, regions |
| Boolean flags | Bit-packing | 8x (minimum) | Status indicators |
For time-series workloads, compression isn't a storage optimization—it's a performance optimization. Compressed data means less I/O, which means faster queries. A 20x compression ratio means you can scan 20x more data in the same time. Disk I/O is almost always the bottleneck; reducing data size directly improves query performance.
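A back-of-the-envelope calculation makes the point; the 500 MB/s figure is an assumed sequential-read throughput for a typical SSD:

```python
scan_bytes_raw = 10 * 1024**3      # query must scan 10 GB of raw data
disk_throughput = 500 * 1024**2    # assumed sequential read: 500 MB/s
compression_ratio = 20

time_raw = scan_bytes_raw / disk_throughput                             # ~20.5 s
time_compressed = scan_bytes_raw / compression_ratio / disk_throughput  # ~1.0 s
print(f"raw scan: {time_raw:.1f}s, compressed scan: {time_compressed:.1f}s")
```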
Time-series databases universally employ time-based partitioning—dividing data into discrete chunks by time range. This seemingly simple optimization has profound effects on both write and read performance.
Why Time Partitioning Works:
```text
Time-Partitioned Data Layout:

Year of Metrics Data (10TB uncompressed, 500GB compressed)
├── shard_group_2024_01              ← 1-week partition
│   ├── shard_1 (active)             ← Current writes
│   └── shard_2 (read-only)          ← Flushed, compressed
├── shard_group_2023_12 (read-only, highly compressed)
├── shard_group_2023_11 (read-only, on cold storage)
├── ...
└── shard_group_2023_01 (read-only, archived to S3)

Query: SELECT mean(cpu) WHERE time > now() - 1h

Execution Plan:
1. Identify relevant partitions: shard_group_2024_01
2. Scan only in-memory buffer + recent TSM files
3. Skip 51 weeks of partitions entirely
4. Result: <50ms query on 10TB dataset

Query: SELECT mean(cpu) WHERE time > now() - 1y GROUP BY month

Execution Plan:
1. Identify relevant partitions: all 52 shard_groups
2. Execute parallel aggregation across 52 shards
3. Merge monthly aggregates
4. Result: <5s query with full parallelization
```

Choosing Partition Granularity:
Partition size is a critical design decision. Partitions that are too small create management overhead and limit query parallelism. Partitions that are too large reduce the benefit of partition pruning.
| Write Volume | Recommended Partition | Rationale |
|---|---|---|
| < 1,000 pts/sec | Daily or Weekly | Write volume is low; fewer files simplify management |
| 1,000 - 100,000 pts/sec | Daily | Balance between pruning benefit and file count |
| 100,000 - 1M pts/sec | 6-hour to Daily | Keeps active partition in memory; rapid compaction |
| > 1M pts/sec | 1-hour to 6-hour | Prevents memory pressure; enables per-hour queries |
Time-series databases require specialized indexing strategies to handle their unique access patterns: time-range queries and high-cardinality tag filtering. Traditional B-tree indexes are suboptimal; instead, time-series databases employ a combination of techniques:
1. Inverted Indexes for Tags:
When you query SELECT * FROM cpu WHERE host='web01' AND region='us-east', the database needs to quickly find all series matching those tags. An inverted index maps tag values to series IDs:
```text
Inverted Index for Tags:

┌─────────────────────────────┬──────────────────────────┐
│ Tag                         │ Series IDs               │
├─────────────────────────────┼──────────────────────────┤
│ host=web01                  │ [1, 5, 892, 1204]        │
│ host=web02                  │ [2, 6, 893, 1205]        │
│ host=db01                   │ [3, 7, 894, 1206]        │
│ region=us-east              │ [1, 2, 3, 892, 893]      │
│ region=us-west              │ [5, 6, 7, 1204, 1205]    │
│ metric=cpu                  │ [1, 2, 3, 5, 6, 7]       │
│ metric=memory               │ [892, 893, 894]          │
└─────────────────────────────┴──────────────────────────┘

Query: SELECT * FROM cpu WHERE host='web01' AND region='us-east'

Execution:
1. Lookup host=web01     → [1, 5, 892, 1204]
2. Lookup region=us-east → [1, 2, 3, 892, 893]
3. Intersection          → [1, 892]
4. Lookup metric=cpu     → [1, 2, 3, 5, 6, 7]
5. Final intersection    → [1]

Result: Only series ID 1 matches all predicates
```

2. Time-Range Indexes:
Within each series, data is organized by time. Time-series databases typically store metadata about the time range covered by each file or block, enabling rapid range-based filtering without scanning the data itself.
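A minimal sketch of this block-level pruning, assuming a simple min/max-per-block index layout (not any specific engine's on-disk format):

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    """Per-block metadata kept in the file index (assumed layout)."""
    min_time: int
    max_time: int
    offset: int  # byte offset of the compressed block on disk

def blocks_to_read(index: list[BlockMeta], t_start: int, t_end: int) -> list[BlockMeta]:
    # Keep only blocks whose [min_time, max_time] overlaps the query range;
    # every other block is skipped without reading its data
    return [b for b in index if b.max_time >= t_start and b.min_time <= t_end]
```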
3. Bloom Filters for Series Existence:
When cardinality is extremely high (millions of unique series), even inverted index lookups become expensive. Bloom filters provide a probabilistic shortcut: quickly eliminate series that definitely don't match before doing exact lookups.
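A toy Bloom filter over series keys illustrates the mechanics; real engines use much faster non-cryptographic hashes, but the idea is the same:

```python
import hashlib

class SeriesBloomFilter:
    """Toy Bloom filter over series keys (illustrative sizing)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "do the exact lookup"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A False result lets the engine skip the inverted-index lookup entirely, at the cost of a small false-positive rate on True results.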
High cardinality—having millions of unique tag combinations—is the Achilles' heel of time-series databases. Each unique series requires index entries, memory for series metadata, and separate storage blocks. Databases like InfluxDB 1.x famously struggled with cardinality; newer versions and alternatives like ClickHouse and QuestDB have made significant improvements. When designing schemas, minimize cardinality by avoiding high-cardinality tags (UUIDs, user IDs) in metric labels.
Time-series workloads are write-dominated—often 10:1 or 100:1 write-to-read ratios. This inverts the traditional database optimization priority. Time-series databases employ several architectural patterns to maximize write throughput:
1. In-Memory Write Buffers:
All incoming data first enters an in-memory buffer (MemTable, WAL cache, or equivalent). This buffer absorbs the full ingestion rate with microsecond-level latency. Memory is organized as a sorted structure (skip list, red-black tree) to enable efficient range queries even before data reaches disk.
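A toy sketch of such a buffer, using Python's bisect to stand in for a skip list (a real engine's structure avoids the O(n) list shift on insert):

```python
import bisect

class MemBuffer:
    """Toy sorted write buffer; real engines use skip lists or RB-trees."""

    def __init__(self):
        self._points: list[tuple[int, float]] = []  # kept sorted by timestamp

    def insert(self, ts: int, value: float) -> None:
        # O(log n) search + O(n) shift; a skip list makes both O(log n)
        bisect.insort(self._points, (ts, value))

    def range_query(self, t_start: int, t_end: int) -> list[tuple[int, float]]:
        # Buffered data is immediately queryable by time range
        lo = bisect.bisect_left(self._points, (t_start, float("-inf")))
        hi = bisect.bisect_right(self._points, (t_end, float("inf")))
        return self._points[lo:hi]
```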
2. Batch Flushing:
Rather than writing each data point individually, time-series databases accumulate writes and flush them in large batches. A 256MB flush of 10 million points achieves far better disk throughput than 10 million individual 256-byte writes.
3. Write-Ahead Logging:
To ensure durability without sacrificing write speed, incoming data is simultaneously written to a sequential log (WAL). If the system crashes before flushing, the WAL can be replayed to recover buffered data. This pattern decouples durability from organized storage.
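A minimal sketch of the pattern, assuming length-prefixed records appended to a single log file:

```python
import os
import struct

class WriteAheadLog:
    """Minimal WAL sketch: length-prefixed records, appended sequentially."""

    def __init__(self, path: str):
        self._f = open(path, "ab")

    def append(self, record: bytes, sync: bool = True) -> None:
        # Sequential append: 4-byte big-endian length, then the payload
        self._f.write(struct.pack(">I", len(record)) + record)
        if sync:
            self._f.flush()
            os.fsync(self._f.fileno())  # durable before acknowledging

    @staticmethod
    def replay(path: str):
        # On restart, re-read every record to rebuild the in-memory buffer
        with open(path, "rb") as f:
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                yield f.read(length)
```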
```text
High-Performance Write Path:

Client Writes (100K points/sec per node)
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 1: Parse & Validate                                │
│ ├── Protocol parsing (Line protocol, Prometheus, JSON)  │
│ ├── Schema inference (for schema-less DBs)              │
│ ├── Tag and field validation                            │
│ └── Latency: ~10μs per batch                            │
└─────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 2: Write-Ahead Log (WAL)                           │
│ ├── Append to sequential log file                       │
│ ├── fsync configurable (per-write, batched, or async)   │
│ ├── Enables crash recovery                              │
│ └── Latency: ~100μs (with batched fsync)                │
└─────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 3: MemTable Insert                                 │
│ ├── Sorted in-memory structure (per-shard)              │
│ ├── Lock-free or sharded for concurrency                │
│ ├── Immediately queryable                               │
│ └── Latency: ~1μs per point                             │
└─────────────────────────────────────────────────────────┘
                │
                │ (Async background flush when MemTable full)
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 4: Flush to SSTable/TSM                            │
│ ├── Sort and compress data                              │
│ ├── Build index blocks                                  │
│ ├── Sequential write to disk                            │
│ └── Delete WAL segments after flush                     │
└─────────────────────────────────────────────────────────┘

Total Write Latency: ~100-500μs (with durable acknowledgment)
Throughput: 100K-1M points/sec per node
```

Production time-series systems batch aggressively at every level: clients batch writes before sending, servers batch WAL fsyncs, background processes batch flushes. Unbatched writes (one point per network round-trip, one fsync per write) impose fatal overhead. Always configure appropriate batch sizes for your ingestion rate.
Time-series databases achieve extraordinary performance through a comprehensive suite of optimizations, each addressing a specific characteristic of time-stamped data. Let's consolidate the key concepts:

- Specialized storage engines (LSM-Trees, TSM, chunked hypertables) convert random writes into sequential ones
- Purpose-built compression (delta-of-delta, Gorilla XOR, dictionary encoding) achieves 10-50x reductions, accelerating queries as much as it saves storage
- Time-based partitioning lets queries prune entire time ranges and makes retention a cheap drop operation
- Inverted indexes, time-range metadata, and Bloom filters make tag filtering fast while containing the cost of high cardinality
- A write path of WAL plus in-memory buffer plus batched flushes sustains 100K-1M points/sec per node
What's Next:
Now that we understand the fundamental optimizations for time-stamped data, we'll explore the two industry-leading time-series databases in depth: InfluxDB and TimescaleDB. You'll learn their architectures, query languages, operational characteristics, and the specific scenarios where each excels.
You now understand why time-series data demands specialized databases and the core optimizations that enable them to achieve 10-100x better performance than general-purpose alternatives. These fundamentals will inform your understanding of specific time-series database implementations in the pages ahead.