Every second, the digital world generates an incomprehensible volume of time-stamped data. A single Tesla vehicle produces 25 gigabytes of sensor data per hour. Uber's real-time pricing engine ingests billions of location updates daily. Netflix's streaming infrastructure emits trillions of metrics to monitor the health of systems serving 230 million subscribers. This data shares one fundamental characteristic: it is indexed by time.
Traditional databases—both relational and NoSQL—were not designed for this workload. When you try to store 100,000 data points per second in PostgreSQL or MongoDB, performance degrades catastrophically. Queries that scan time ranges become agonizingly slow. Storage consumption balloons uncontrollably. The fundamental assumptions of general-purpose databases—optimized for random access patterns and CRUD operations—fail spectacularly when confronted with the relentless, append-heavy, time-ordered nature of sensor readings, metrics, logs, and events.
By the end of this page, you will understand the unique characteristics of time-series data, why general-purpose databases fail at this workload, and the specialized optimizations—from columnar storage to time-based partitioning to compression algorithms—that enable time-series databases to achieve 10-100x better performance than their general-purpose counterparts.
Before we can optimize for time-series data, we must precisely understand what makes it fundamentally different from other data types. Time-series data is not simply "data with a timestamp"—it exhibits a constellation of characteristics that, taken together, demand specialized handling.
Definition: Time-series data is a sequence of data points collected or recorded at discrete time intervals, where the temporal ordering is the primary dimension of access and analysis.
This definition seems simple, but its implications are profound. Let's examine the core characteristics:
The Data Shape:
Time-series data typically takes one of a few structural forms, each with distinct implications for storage and query patterns:
| Structure | Description | Example | Storage Implication |
|---|---|---|---|
| Regular (Fixed Interval) | Data points at consistent time intervals | CPU usage sampled every 10 seconds | Highly compressible; timestamp can be computed from index |
| Irregular (Event-Driven) | Data points at variable intervals | User login events, error logs | Must store explicit timestamps; less compressible |
| Multivariate | Multiple metrics per timestamp | Temperature, humidity, pressure from same sensor | Can share timestamp storage across metrics |
| Tagged/Labeled | Metrics with dimensional metadata | cpu_usage{host=web01, region=us-east} | Tag indexing critical for query performance |
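To make the tagged and multivariate shapes concrete, here is a small illustrative sketch in Python; the measurement name, tags, and fields are invented for the example, not drawn from any particular product:

```python
# One multivariate, tagged data point (illustrative layout)
point = {
    "measurement": "environment",
    "tags": {"host": "web01", "region": "us-east"},      # dimensional metadata
    "fields": {"temperature": 21.4, "humidity": 0.53},   # several metrics...
    "timestamp": 1704067200,                             # ...sharing one timestamp
}

# The sorted tag set identifies the series; timestamps order points within it
series_key = (point["measurement"], tuple(sorted(point["tags"].items())))
print(series_key)
# ('environment', (('host', 'web01'), ('region', 'us-east')))
```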
To truly appreciate time-series database optimizations, we must understand why PostgreSQL, MySQL, MongoDB, and other general-purpose databases struggle with time-series workloads. The failures aren't bugs—they're fundamental design mismatches.
The Write Amplification Problem:
Relational databases use B-tree indexes as their primary data structure. B-trees are excellent for random reads and balanced read/write workloads, but they're catastrophic for append-heavy time-series ingestion.
When you insert a new row into a B-tree indexed table, the database must append the change to the write-ahead log, traverse the index from root to leaf to find the insertion point, and occasionally split full pages, dirtying several pages for a single logical insert.
For time-series data, every single write goes to the "end" of the time index—but B-trees don't have a fast path for this. Each insert is treated as a random write, causing 10-40x write amplification compared to a simple append.
```sql
-- Simulating time-series insert into PostgreSQL
-- Table: sensor_readings (timestamp, device_id, value)
-- Index: btree on (timestamp)

-- For each of 100,000 inserts per second:
-- 1. WAL write: ~200 bytes per insert = 20 MB/s journal I/O
-- 2. B-tree traversal: O(log N) page reads
-- 3. Page split probability increases with table size
-- 4. Autovacuum overhead: ~10-20% additional I/O

-- Result: a table with 1 billion rows requires:
-- - 4-5 B-tree levels (~30 binary-search comparisons per lookup)
-- - ~100ms per write at sustained load
-- - Eventually: writes queue faster than they complete

-- PostgreSQL write throughput degrades:
-- 1M rows:   ~50,000 writes/sec
-- 100M rows: ~10,000 writes/sec
-- 1B rows:   ~1,000 writes/sec (or less)
```

The Query Pattern Mismatch:
General-purpose databases optimize for point queries: "Get user with ID 12345." Time-series workloads require range queries: "Get all readings from device X in the last hour." This difference has profound implications:
Index Selectivity: B-tree indexes assume high selectivity (each query hits a small percentage of data). Time-series range queries can easily hit millions of rows.
Sequential vs. Random I/O: Disk drives—even SSDs—perform dramatically better on sequential reads than random reads. B-trees scatter logically sequential data across random physical locations.
Aggregation Costs: Time-series queries frequently require aggregations (averages, sums, percentiles) over millions of points. General-purpose databases compute these row by row; specialized databases use columnar scans and precomputed rollups.
In benchmarks, time-series databases typically achieve 10x better write throughput, 10x faster range queries, and 10x better compression compared to general-purpose databases on identical workloads. This isn't marginal improvement—it's the difference between a system that works and one that doesn't.
The foundation of time-series database performance is the storage engine—the subsystem responsible for writing data to disk and reading it back. Time-series databases use specialized storage engines designed from the ground up for append-heavy, time-ordered data. Let's examine the three dominant approaches:
1. Log-Structured Merge Trees (LSM-Trees):
LSM-Trees, originally developed for write-optimized databases like LevelDB and RocksDB, have become the foundation for many time-series databases. The key insight: by buffering writes in memory and periodically flushing sorted files to disk, LSM-Trees convert random writes into sequential writes.
```text
LSM-Tree Write Path:

┌─────────────────────────────────────────────────────┐
│ Incoming Writes (100K/sec)                          │
└─────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────┐
│ MemTable (In-Memory, Sorted)                        │
│ ├── Red-Black Tree / Skip List                      │
│ ├── Size: 64MB - 256MB                              │
│ └── Writes are O(log n) - extremely fast            │
└─────────────────────────────────────────────────────┘
                         │  (Flush when full)
                         ▼
┌─────────────────────────────────────────────────────┐
│ Level 0 SSTable Files (Disk)                        │
│ ├── Immutable, sorted files                         │
│ ├── Each file: 64-256MB                             │
│ └── Multiple overlapping files allowed              │
└─────────────────────────────────────────────────────┘
                         │  (Compaction)
                         ▼
┌─────────────────────────────────────────────────────┐
│ Level 1-N SSTables (Disk)                           │
│ ├── Non-overlapping key ranges per level            │
│ ├── Level N+1 is 10x larger than Level N            │
│ └── Compaction merges and removes duplicates        │
└─────────────────────────────────────────────────────┘

Why This Works for Time-Series:
- All writes go to memory first (microseconds, not milliseconds)
- Disk writes are large, sequential batches
- Time-ordered data naturally creates non-overlapping ranges
- Read amplification is acceptable for range-heavy workloads
```

2. Time-Structured Merge Trees (TSM):
InfluxDB developed the Time-Structured Merge (TSM) engine specifically for time-series data. TSM extends LSM concepts with time-series-specific optimizations:
```text
TSM Shard Structure (InfluxDB):

┌─────────────────────────────────────────────────────┐
│ Shard: 2024-01-01 to 2024-01-07                     │
│ (Time-based partition)                              │
└─────────────────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  TSM File 1  │ │  TSM File 2  │ │  TSM File N  │
│ (Immutable)  │ │ (Immutable)  │ │ (Immutable)  │
└──────────────┘ └──────────────┘ └──────────────┘

Inside a TSM File:
┌─────────────────────────────────────────────────────┐
│ Header                                              │
├─────────────────────────────────────────────────────┤
│ Block 1: Series=cpu,host=web01                      │
│ ├── Timestamps: [compressed delta-encoded]          │
│ └── Values: [Gorilla-compressed floats]             │
├─────────────────────────────────────────────────────┤
│ Block 2: Series=cpu,host=web02                      │
│ ├── Timestamps: [compressed]                        │
│ └── Values: [compressed]                            │
├─────────────────────────────────────────────────────┤
│ Index: Series → Block Offset mapping                │
└─────────────────────────────────────────────────────┘

Compression Ratios:
- Timestamps: 95% reduction (delta + varint encoding)
- Float values: 80% reduction (Gorilla XOR compression)
- Overall: 10-20x compression vs raw storage
```

3. Chunk-Based Storage (TimescaleDB/Hypertables):
TimescaleDB takes a different approach: rather than replacing PostgreSQL's storage engine, it partitions data into chunks—smaller PostgreSQL tables organized by time. Each chunk is a standard PostgreSQL heap table, but the chunking layer provides time-series optimizations: the planner excludes chunks whose time range cannot match a query, per-chunk indexes stay small enough to fit in memory while a chunk is actively written, closed chunks can be compressed columnar-style, and retention becomes a cheap metadata operation (dropping a chunk) rather than a mass DELETE.
This approach sacrifices some raw performance for compatibility with the PostgreSQL ecosystem—critical for teams with existing Postgres expertise and tooling.
Time-series data is extraordinarily compressible—but only if you exploit its unique characteristics. General-purpose compression algorithms like gzip or lz4 achieve 2-4x compression on time-series data. Purpose-built algorithms achieve 10-50x. Understanding these compression techniques is essential for system design.
Delta Encoding for Timestamps:
When data arrives at regular intervals (e.g., every 10 seconds), storing the full timestamp for each point is wasteful. Delta encoding stores only the difference between consecutive timestamps:
```python
# Raw timestamps (8 bytes each):
timestamps = [
    1704067200,  # 2024-01-01 00:00:00
    1704067210,  # 2024-01-01 00:00:10
    1704067220,  # 2024-01-01 00:00:20
    1704067230,  # 2024-01-01 00:00:30
    # ... 1 million more points
]
# Storage: 1,000,000 × 8 bytes = 8 MB

# Delta encoded:
deltas = [
    1704067200,  # First timestamp (full)
    10,          # Δ from previous
    10,          # Δ from previous
    10,          # Δ from previous
    # ... 1 million more deltas (all = 10)
]
# Storage: 8 bytes + 999,999 × 1 byte ≈ 1 MB

# Delta-of-delta encoded (for regular intervals):
delta_of_deltas = [
    1704067200,  # First timestamp (full)
    10,          # First delta
    0, 0, 0, 0,  # All subsequent delta-of-deltas are 0!
]
# Storage: 8 bytes + 4 bytes + 999,998 × 1 bit ≈ 125 KB

# Compression ratio: 8 MB → 125 KB = 64x
```

Gorilla Compression for Floating-Point Values:
Facebook's Gorilla paper introduced XOR-based compression for time-series float values. The insight: consecutive readings from the same sensor tend to be similar, and XOR of similar floats produces values with many leading zeros:
```python
# CPU temperature readings: 67.2°C, 67.3°C, 67.1°C, 67.4°C

# IEEE 754 double representation:
# 67.2 = 0x4050CCCCCCCCCCCD (64 bits)
# 67.3 = 0x4050D33333333333 (64 bits)

# XOR of consecutive values:
# 67.2 XOR 67.3 = 0x00001FFFFFFFFFFE
# Result has 19 leading zeros and 1 trailing zero:
# only 44 of the 64 bits are meaningful

# Gorilla encoding:
# 1. First value: stored in full (64 bits)
# 2. Subsequent values encoded as:
#    - If XOR == 0: store '0' (1 bit) - values identical
#    - If XOR has at least as many leading/trailing zeros as the
#      previous block: store '10' + meaningful bits
#    - Otherwise: store '11' + leading-zeros count (5 bits) +
#      meaningful-bits length (6 bits) + meaningful bits

# The Gorilla paper reports an average of ~1.37 bytes per
# {timestamp, value} pair (vs 16 bytes raw): roughly 12x
# compression on Facebook's production metrics

import struct

def float_to_bits(value: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack('>Q', struct.pack('>d', value))[0]

def count_leading_zeros(x: int) -> int:
    return 64 - x.bit_length()

def count_trailing_zeros(x: int) -> int:
    return (x & -x).bit_length() - 1

def to_bin(x: int, width: int) -> str:
    return format(x, f'0{width}b')

class GorillaEncoder:
    """Value-encoder sketch; emits bits as a '0'/'1' string for clarity."""

    def __init__(self):
        self.prev_value = 0.0
        # Sentinels force the '11' branch for the first non-identical value,
        # which effectively stores it in (nearly) full precision
        self.prev_leading = 65
        self.prev_trailing = 65

    def encode(self, value: float) -> str:
        xor = float_to_bits(value) ^ float_to_bits(self.prev_value)
        self.prev_value = value
        if xor == 0:
            return '0'  # identical value: a single bit
        leading = min(count_leading_zeros(xor), 31)  # cap to fit 5 bits
        trailing = count_trailing_zeros(xor)
        if leading >= self.prev_leading and trailing >= self.prev_trailing:
            # Meaningful bits fit inside the previous block position
            block = 64 - self.prev_leading - self.prev_trailing
            return '10' + to_bin(xor >> self.prev_trailing, block)
        # Store a new block position
        self.prev_leading, self.prev_trailing = leading, trailing
        block = 64 - leading - trailing  # real encoders store 64 as 0
        return ('11' + to_bin(leading, 5) + to_bin(block % 64, 6)
                + to_bin(xor >> trailing, block))
```

Dictionary Encoding for Tags:
Time-series data often includes repetitive string tags (hostnames, regions, metric names) that recur across millions of points. Dictionary encoding maps each distinct string to a compact integer identifier and stores only the integers.
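Here is a minimal sketch of the technique in Python (illustrative only; real engines bit-pack the integer IDs and persist the dictionary per block):

```python
def dictionary_encode(tags: list[str]) -> tuple[dict[str, int], list[int]]:
    """Map each distinct string to a small integer; store only the integers."""
    dictionary: dict[str, int] = {}
    encoded: list[int] = []
    for tag in tags:
        if tag not in dictionary:
            dictionary[tag] = len(dictionary)  # next available ID
        encoded.append(dictionary[tag])
    return dictionary, encoded

hosts = ["web01", "web02", "web01", "web01", "db01"] * 200_000
dictionary, ids = dictionary_encode(hosts)
# 1M strings (~5 bytes each) become 1M small ints plus a 3-entry
# dictionary; bit-packed, each ID here needs only 2 bits
```

The table below summarizes typical ratios for each technique: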
| Data Type | Technique | Typical Ratio | Usage |
|---|---|---|---|
| Timestamps (regular) | Delta-of-Delta + Varint | 50-100x | Metrics at fixed intervals |
| Timestamps (irregular) | Delta + Varint | 5-10x | Event logs, user actions |
| Float values | Gorilla XOR | 20-50x | Sensor readings, metrics |
| Integer values | Delta + ZigZag + Varint | 10-30x | Counters, gauges |
| String tags | Dictionary encoding | 10-100x | Host names, regions |
| Boolean flags | Bit-packing | 8x (minimum) | Status indicators |
For time-series workloads, compression isn't a storage optimization—it's a performance optimization. Compressed data means less I/O, which means faster queries. A 20x compression ratio means you can scan 20x more data in the same time. Disk I/O is almost always the bottleneck; reducing data size directly improves query performance.
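A back-of-the-envelope calculation makes the point; the 500 MB/s figure is an assumed sequential-read throughput for a typical SSD:

```python
scan_bytes_raw = 10 * 1024**3      # query must scan 10 GB of raw data
disk_throughput = 500 * 1024**2    # assumed sequential read: 500 MB/s
compression_ratio = 20

time_raw = scan_bytes_raw / disk_throughput                             # ~20.5 s
time_compressed = scan_bytes_raw / compression_ratio / disk_throughput  # ~1.0 s
print(f"raw scan: {time_raw:.1f}s, compressed scan: {time_compressed:.1f}s")
```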
Time-series databases universally employ time-based partitioning—dividing data into discrete chunks by time range. This seemingly simple optimization has profound effects on both write and read performance.
Why Time Partitioning Works:
```text
Time-Partitioned Data Layout:

Year of Metrics Data (10TB uncompressed, 500GB compressed)
├── shard_group_2024_01              ← 1-week partition
│   ├── shard_1 (active)             ← Current writes
│   └── shard_2 (read-only)          ← Flushed, compressed
├── shard_group_2023_12 (read-only, highly compressed)
├── shard_group_2023_11 (read-only, on cold storage)
├── ...
└── shard_group_2023_01 (read-only, archived to S3)

Query: SELECT mean(cpu) WHERE time > now() - 1h

Execution Plan:
1. Identify relevant partitions: shard_group_2024_01
2. Scan only in-memory buffer + recent TSM files
3. Skip 51 weeks of partitions entirely
4. Result: <50ms query on 10TB dataset

Query: SELECT mean(cpu) WHERE time > now() - 1y GROUP BY month

Execution Plan:
1. Identify relevant partitions: all 52 shard_groups
2. Execute parallel aggregation across 52 shards
3. Merge monthly aggregates
4. Result: <5s query with full parallelization
```

Choosing Partition Granularity:
Partition size is a critical design decision. Partitions that are too small create management overhead and limit query parallelism. Partitions that are too large reduce the benefit of partition pruning.
| Write Volume | Recommended Partition | Rationale |
|---|---|---|
| < 1,000 pts/sec | Daily or Weekly | Write volume is low; fewer files simplify management |
| 1,000 - 100,000 pts/sec | Daily | Balance between pruning benefit and file count |
| 100,000 - 1M pts/sec | 6-hour to Daily | Keeps active partition in memory; rapid compaction |
| > 1M pts/sec | 1-hour to 6-hour | Prevents memory pressure; enables per-hour queries |
Time-series databases require specialized indexing strategies to handle their unique access patterns: time-range queries and high-cardinality tag filtering. Traditional B-tree indexes are suboptimal; instead, time-series databases employ a combination of techniques:
1. Inverted Indexes for Tags:
When you query SELECT * FROM cpu WHERE host='web01' AND region='us-east', the database needs to quickly find all series matching those tags. An inverted index maps tag values to series IDs:
```text
Inverted Index for Tags:

┌─────────────────────────────┬──────────────────────────┐
│ Tag                         │ Series IDs               │
├─────────────────────────────┼──────────────────────────┤
│ host=web01                  │ [1, 5, 892, 1204]        │
│ host=web02                  │ [2, 6, 893, 1205]        │
│ host=db01                   │ [3, 7, 894, 1206]        │
│ region=us-east              │ [1, 2, 3, 892, 893]      │
│ region=us-west              │ [5, 6, 7, 1204, 1205]    │
│ metric=cpu                  │ [1, 2, 3, 5, 6, 7]       │
│ metric=memory               │ [892, 893, 894]          │
└─────────────────────────────┴──────────────────────────┘

Query: SELECT * FROM cpu WHERE host='web01' AND region='us-east'

Execution:
1. Lookup host=web01     → [1, 5, 892, 1204]
2. Lookup region=us-east → [1, 2, 3, 892, 893]
3. Intersection          → [1, 892]
4. Lookup metric=cpu     → [1, 2, 3, 5, 6, 7]
5. Final intersection    → [1]

Result: Only series ID 1 matches all predicates
```

2. Time-Range Indexes:
Within each series, data is organized by time. Time-series databases typically store metadata about the time range covered by each file or block, enabling rapid range-based filtering without scanning the data itself.
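A minimal sketch of this block-level pruning, assuming a simple min/max-per-block index layout (not any specific engine's on-disk format):

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    """Per-block metadata kept in the file index (assumed layout)."""
    min_time: int
    max_time: int
    offset: int  # byte offset of the compressed block on disk

def blocks_to_read(index: list[BlockMeta], t_start: int, t_end: int) -> list[BlockMeta]:
    # Keep only blocks whose [min_time, max_time] overlaps the query range;
    # every other block is skipped without reading its data
    return [b for b in index if b.max_time >= t_start and b.min_time <= t_end]
```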
3. Bloom Filters for Series Existence:
When cardinality is extremely high (millions of unique series), even inverted index lookups become expensive. Bloom filters provide a probabilistic shortcut: quickly eliminate series that definitely don't match before doing exact lookups.
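A toy Bloom filter over series keys illustrates the mechanics; real engines use much faster non-cryptographic hashes, but the idea is the same:

```python
import hashlib

class SeriesBloomFilter:
    """Toy Bloom filter over series keys (illustrative sizing)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "do the exact lookup"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A False result lets the engine skip the inverted-index lookup entirely, at the cost of a small false-positive rate on True results.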
High cardinality—having millions of unique tag combinations—is the Achilles' heel of time-series databases. Each unique series requires index entries, memory for series metadata, and separate storage blocks. Databases like InfluxDB 1.x famously struggled with cardinality; newer versions and alternatives like ClickHouse and QuestDB have made significant improvements. When designing schemas, minimize cardinality by avoiding high-cardinality tags (UUIDs, user IDs) in metric labels.
Time-series workloads are write-dominated—often 10:1 or 100:1 write-to-read ratios. This inverts the traditional database optimization priority. Time-series databases employ several architectural patterns to maximize write throughput:
1. In-Memory Write Buffers:
All incoming data first enters an in-memory buffer (MemTable, WAL cache, or equivalent). This buffer absorbs the full ingestion rate with microsecond-level latency. Memory is organized as a sorted structure (skip list, red-black tree) to enable efficient range queries even before data reaches disk.
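A toy sketch of such a buffer, using Python's bisect to stand in for a skip list (a real engine's structure avoids the O(n) list shift on insert):

```python
import bisect

class MemBuffer:
    """Toy sorted write buffer; real engines use skip lists or RB-trees."""

    def __init__(self):
        self._points: list[tuple[int, float]] = []  # kept sorted by timestamp

    def insert(self, ts: int, value: float) -> None:
        # O(log n) search + O(n) shift; a skip list makes both O(log n)
        bisect.insort(self._points, (ts, value))

    def range_query(self, t_start: int, t_end: int) -> list[tuple[int, float]]:
        # Buffered data is immediately queryable by time range
        lo = bisect.bisect_left(self._points, (t_start, float("-inf")))
        hi = bisect.bisect_right(self._points, (t_end, float("inf")))
        return self._points[lo:hi]
```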
2. Batch Flushing:
Rather than writing each data point individually, time-series databases accumulate writes and flush them in large batches. A 256MB flush of 10 million points achieves far better disk throughput than 10 million individual 256-byte writes.
3. Write-Ahead Logging:
To ensure durability without sacrificing write speed, incoming data is simultaneously written to a sequential log (WAL). If the system crashes before flushing, the WAL can be replayed to recover buffered data. This pattern decouples durability from organized storage.
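A minimal sketch of the pattern, assuming length-prefixed records appended to a single log file:

```python
import os
import struct

class WriteAheadLog:
    """Minimal WAL sketch: length-prefixed records, appended sequentially."""

    def __init__(self, path: str):
        self._f = open(path, "ab")

    def append(self, record: bytes, sync: bool = True) -> None:
        # Sequential append: 4-byte big-endian length, then the payload
        self._f.write(struct.pack(">I", len(record)) + record)
        if sync:
            self._f.flush()
            os.fsync(self._f.fileno())  # durable before acknowledging

    @staticmethod
    def replay(path: str):
        # On restart, re-read every record to rebuild the in-memory buffer
        with open(path, "rb") as f:
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                yield f.read(length)
```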
```text
High-Performance Write Path:

Client Writes (100K points/sec per node)
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 1: Parse & Validate                                │
│ ├── Protocol parsing (Line protocol, Prometheus, JSON)  │
│ ├── Schema inference (for schema-less DBs)              │
│ ├── Tag and field validation                            │
│ └── Latency: ~10μs per batch                            │
└─────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 2: Write-Ahead Log (WAL)                           │
│ ├── Append to sequential log file                       │
│ ├── fsync configurable (per-write, batched, or async)   │
│ ├── Enables crash recovery                              │
│ └── Latency: ~100μs (with batched fsync)                │
└─────────────────────────────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 3: MemTable Insert                                 │
│ ├── Sorted in-memory structure (per-shard)              │
│ ├── Lock-free or sharded for concurrency                │
│ ├── Immediately queryable                               │
│ └── Latency: ~1μs per point                             │
└─────────────────────────────────────────────────────────┘
                │
                │ (Async background flush when MemTable full)
                ▼
┌─────────────────────────────────────────────────────────┐
│ STEP 4: Flush to SSTable/TSM                            │
│ ├── Sort and compress data                              │
│ ├── Build index blocks                                  │
│ ├── Sequential write to disk                            │
│ └── Delete WAL segments after flush                     │
└─────────────────────────────────────────────────────────┘

Total Write Latency: ~100-500μs (with durable acknowledgment)
Throughput: 100K-1M points/sec per node
```

Production time-series systems batch aggressively at every level: clients batch writes before sending, servers batch WAL fsyncs, background processes batch flushes. Unbatched writes (one point per network round-trip, one fsync per write) impose fatal overhead. Always configure appropriate batch sizes for your ingestion rate.
Time-series databases achieve extraordinary performance through a comprehensive suite of optimizations, each addressing a specific characteristic of time-stamped data. Let's consolidate the key concepts:

- Specialized storage engines (LSM-Trees, TSM, chunked hypertables) convert random writes into sequential ones
- Purpose-built compression (delta-of-delta, Gorilla XOR, dictionary encoding) achieves 10-50x reductions, accelerating queries as much as it saves storage
- Time-based partitioning lets queries prune entire time ranges and makes retention a cheap drop operation
- Inverted indexes, time-range metadata, and Bloom filters make tag filtering fast while containing the cost of high cardinality
- A write path of WAL plus in-memory buffer plus batched flushes sustains 100K-1M points/sec per node
What's Next:
Now that we understand the fundamental optimizations for time-stamped data, we'll explore the two industry-leading time-series databases in depth: InfluxDB and TimescaleDB. You'll learn their architectures, query languages, operational characteristics, and the specific scenarios where each excels.
You now understand why time-series data demands specialized databases and the core optimizations that enable them to achieve 10-100x better performance than general-purpose alternatives. These fundamentals will inform your understanding of specific time-series database implementations in the pages ahead.