Every second, an astronomical volume of data flows through modern digital infrastructure—each data point tagged with a precise timestamp. Server metrics pulse every 10 seconds across thousands of nodes. Stock prices tick hundreds of times per second during market hours. IoT sensors in smart factories emit measurements continuously, 24/7/365. Vehicle telemetry streams from millions of connected cars. Weather stations, fitness trackers, network equipment, power grids—all generating waves of time-series data.
This isn't just regular data happening to have timestamps. Time-series data represents a fundamentally different paradigm—one where time is the primary axis of organization, where data is predominantly append-only, where recency matters more than history, and where the sheer volume can overwhelm systems designed for traditional workloads. Understanding time-series data is prerequisite to understanding why we need specialized database systems to handle it.
By the end of this page, you will understand: (1) The formal definition and unique characteristics of time-series data, (2) Why traditional relational databases struggle with time-series workloads, (3) The fundamental requirements that drive TSDB design, and (4) The core data model patterns that time-series databases employ. This foundation prepares you for deep exploration of specific TSDB implementations.
Time-series data is a sequence of data points collected, recorded, or observed at successive points in time, typically at uniform intervals. Each data point consists of at least two components: a timestamp indicating when the measurement occurred, and one or more values representing the observed measurements at that time.
Formally, a time-series can be expressed as:
T = {(t₁, v₁), (t₂, v₂), ..., (tₙ, vₙ)}
Where:

- tᵢ is the timestamp of the i-th observation, with t₁ < t₂ < ... < tₙ
- vᵢ is the value (or vector of values) observed at time tᵢ
This deceptively simple definition masks significant complexity. In practice, time-series data exhibits characteristics that profoundly impact how we store, query, and analyze it.
| Component | Description | Example |
|---|---|---|
| Timestamp | The temporal coordinate; when the measurement occurred | 2024-01-15T14:30:00.000Z |
| Metric Name | Identifier for what is being measured | cpu_usage, temperature, stock_price |
| Value | The actual measurement (numeric, boolean, string) | 78.5, true, running |
| Tags/Labels | Metadata for grouping and filtering | host=server-01, region=us-east |
| Fields | Additional measurements at same timestamp | user_time=45.2, system_time=33.3 |
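These components map naturally onto a small in-memory structure. Here is a minimal Python sketch (the `TimeSeries` class is hypothetical, for illustration only, not a real TSDB client):

```python
from dataclasses import dataclass, field

@dataclass
class TimeSeries:
    metric: str                   # what is being measured, e.g. "cpu_usage"
    tags: dict                    # identifying metadata, e.g. {"host": "server-01"}
    points: list = field(default_factory=list)  # ordered (timestamp, value) pairs

    def append(self, timestamp: float, value: float) -> None:
        # Time-series data is append-mostly: points arrive in time order.
        assert not self.points or timestamp >= self.points[-1][0]
        self.points.append((timestamp, value))

series = TimeSeries("cpu_usage", {"host": "server-01", "region": "us-east-1"})
series.append(1705329000.0, 78.5)
series.append(1705329010.0, 79.1)
print(len(series.points))  # 2
```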
The Multidimensional Nature:
Real-world time-series data is rarely a simple sequence of values. Consider CPU utilization monitoring across a data center:
```
measurement: cpu_usage_percent
tags:        host=web-server-47, datacenter=us-east-1, cpu=0, environment=production
fields:      user=45.2, system=22.1, iowait=8.3, idle=24.4
```

This single data point actually represents measurements across multiple dimensions (host, datacenter, CPU core) with multiple field values—and millions of such points arrive every minute. The combinatorial explosion of dimensions creates what's called high cardinality, one of the defining challenges of time-series systems.
Cardinality refers to the number of unique combinations of tag values. With 10,000 hosts × 16 CPU cores × 5 datacenters × 3 environments = 2.4 million unique time series. Each unique combination creates a separate series that must be tracked, indexed, and stored efficiently. High cardinality is the bane of time-series database performance.
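The cardinality arithmetic is worth making concrete. A quick Python check of the numbers above:

```python
import math

# Each tag dimension multiplies the number of unique series:
# total series = product of distinct values per tag.
tag_dimensions = {
    "host": 10_000,
    "cpu_core": 16,
    "datacenter": 5,
    "environment": 3,
}

total_series = math.prod(tag_dimensions.values())
print(f"{total_series:,} unique time series")  # 2,400,000 unique time series

# At one sample per series every 10 seconds, the ingest rate is:
points_per_second = total_series / 10
print(f"{points_per_second:,.0f} points/sec")  # 240,000 points/sec
```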
Time-series workloads differ fundamentally from traditional OLTP or OLAP workloads. Understanding these characteristics explains why specialized databases emerged and why they make specific architectural choices.
In most time-series deployments, 95% of queries access data from the last 5% of the time range. This extreme recency bias is why TSDBs optimize hot/cold data separation—keeping recent data in fast storage while migrating older data to cheaper, slower tiers.
Before time-series databases existed, organizations attempted to store time-series data in traditional RDBMS systems like PostgreSQL, MySQL, or Oracle. While technically possible, this approach encounters fundamental limitations that become severe at scale.
```sql
-- Naive approach: store metrics in a relational table
CREATE TABLE metrics (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_name VARCHAR(255) NOT NULL,
    host VARCHAR(255) NOT NULL,
    datacenter VARCHAR(64),
    value DOUBLE PRECISION NOT NULL
);

-- Index for time-range queries
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);

-- Index for filtering by host
CREATE INDEX idx_metrics_host_time ON metrics(host, timestamp);

-- Compound index for common query patterns
CREATE INDEX idx_metrics_name_host_time ON metrics(metric_name, host, timestamp);
```

This seemingly reasonable schema fails catastrophically:
- Repeating the `host`, `metric_name`, and `datacenter` strings on every row wastes enormous storage. Time-series databases instead use tag dictionaries and columnar encoding, achieving 10-100x compression.

| Metric | PostgreSQL (Optimized) | Purpose-Built TSDB |
|---|---|---|
| Ingestion Rate | 10K-100K points/sec | 1M-10M+ points/sec |
| Storage per 1B points | ~500 GB | ~50 GB (10x compression) |
| Time-range query (1 hour) | 10-60 seconds | < 100 milliseconds |
| Retention cleanup | Hours (VACUUM) | Seconds (drop partition) |
| Downsampling | Complex ETL required | Built-in continuous queries |
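The tag-dictionary encoding behind much of that storage gap can be sketched in a few lines of Python (a simplified illustration of string interning, not any particular TSDB's on-disk format):

```python
class TagDictionary:
    """Store each repeated tag string once; reference it by a small integer id."""

    def __init__(self):
        self.str_to_id = {}
        self.id_to_str = []

    def encode(self, s: str) -> int:
        if s not in self.str_to_id:
            self.str_to_id[s] = len(self.id_to_str)
            self.id_to_str.append(s)
        return self.str_to_id[s]

    def decode(self, i: int) -> str:
        return self.id_to_str[i]

d = TagDictionary()
# One million rows from the same host store the 13-byte string once,
# plus a small integer per row, instead of repeating the string.
ids = [d.encode("web-server-47") for _ in range(1_000_000)]
print(len(d.id_to_str), d.decode(ids[0]))  # 1 web-server-47
```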
Organizations typically hit the RDBMS wall at 10-100 million data points. At this scale, ingestion lags behind data arrival, queries timeout, and storage costs balloon. This is precisely when teams discover they need a purpose-built time-series solution.
Time-series databases adopt a metric-based data model optimized for the unique characteristics of time-stamped data. While implementations vary, most TSDBs share common conceptual elements.
```
┌────────────────────────────────────────────────────────────┐
│                   TIME-SERIES DATA MODEL                   │
└────────────────────────────────────────────────────────────┘

MEASUREMENT (or METRIC):
┌────────────────────────────────────────────────────────────┐
│ measurement: "cpu_usage"                                   │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ TAGS (Indexed, Low Cardinality Preferred):             │ │
│ │   host   = "server-01"                                 │ │
│ │   region = "us-east-1"                                 │ │
│ │   env    = "production"                                │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ FIELDS (Not Indexed, Multiple Values):                 │ │
│ │   user   = 45.2                                        │ │
│ │   system = 22.1                                        │ │
│ │   idle   = 32.7                                        │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ TIMESTAMP:                                             │ │
│ │   time = 2024-01-15T14:30:00.000000000Z                │ │
│ └────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘

SERIES = Measurement + Unique Tag Set

  Series 1: cpu_usage{host="server-01",region="us-east-1"}
  Series 2: cpu_usage{host="server-02",region="us-east-1"}
  Series 3: cpu_usage{host="server-01",region="eu-west-1"}

Each series is stored and queried independently,
enabling parallel processing.
```

Key Concepts Explained:
Tags are metadata for identifying and grouping (host, region, service). They're indexed and should have bounded cardinality. Fields are the actual measurements (cpu_percent, latency_ms). Storing high-cardinality data (like user_id or request_id) as tags is a common mistake that devastates TSDB performance.
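A series identity is just the measurement name plus its sorted tag set. A small Python sketch (`series_key` is a hypothetical helper) shows why a high-cardinality tag explodes the series count:

```python
def series_key(measurement: str, tags: dict) -> str:
    """Build the canonical series identifier: measurement + sorted tag set.
    Two points with the same key belong to the same series."""
    tag_str = ",".join(f'{k}="{v}"' for k, v in sorted(tags.items()))
    return f"{measurement}{{{tag_str}}}"

# Bounded-cardinality tags: a handful of hosts and regions -> few series.
key = series_key("cpu_usage", {"region": "us-east-1", "host": "server-01"})
print(key)  # cpu_usage{host="server-01",region="us-east-1"}

# The anti-pattern: a per-request ID as a tag creates one series per request,
# so the series count (and the tag index) grows without bound.
bad_keys = {series_key("latency_ms", {"request_id": str(i)}) for i in range(1000)}
print(len(bad_keys))  # 1000 distinct series from just 1000 requests
```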
Purpose-built time-series databases employ architectural principles specifically designed for time-series workloads. These principles enable the massive performance improvements over traditional databases.
```
TIME-SERIES DATABASE STORAGE LAYOUT
===================================

Data organized by time first, then by series:

Day 1 Partition (2024-01-15)
├── Series Index (inverted index for tags)
│   ├── host=server-01  → [series_1, series_4, series_7]
│   ├── host=server-02  → [series_2, series_5, series_8]
│   └── region=us-east  → [series_1, series_2, series_3]
│
├── Series 1: cpu_usage{host=server-01,region=us-east}
│   ├── Timestamps:     [t1, t2, t3, ...]       (delta-encoded)
│   ├── Field 'user':   [45.2, 46.1, 44.8, ...] (XOR compressed)
│   └── Field 'system': [22.1, 23.0, 21.9, ...] (XOR compressed)
│
├── Series 2: cpu_usage{host=server-02,region=us-east}
│   ├── Timestamps:     [t1, t2, t3, ...]
│   ├── Field 'user':   [38.1, 39.0, 37.5, ...]
│   └── Field 'system': [18.4, 19.1, 18.0, ...]
│
└── Metadata Block
    ├── Min/Max timestamps
    ├── Series count
    └── Compression statistics

Day 2 Partition (2024-01-16)
└── ... (same structure)

Retention: Simply delete entire day partitions older than threshold
```

Modern TSDBs achieve 10-20x compression ratios. A naive approach storing 64-bit timestamps + 64-bit floats uses 16 bytes per point. With delta-delta and XOR encoding, this drops to 1-2 bytes per point. At hundreds of billions of points, this is the difference between terabytes of raw data and a few hundred gigabytes compressed.
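The delta-delta timestamp encoding referenced in the layout can be sketched in a few lines of Python (a simplified illustration; real TSDBs pack the resulting values at the bit level, which is where the 1-2 bytes per point comes from):

```python
def delta_of_delta(timestamps):
    """Encode timestamps as second-order deltas. Regular scrape intervals
    (e.g. one point every 10 s) collapse to runs of zeros, which compress
    extremely well."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dods

def decode(first, dods):
    """Lossless inverse: rebuild the original timestamps."""
    ts, delta = [first], 0
    for dod in dods:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts

# A perfectly regular 10-second scrape interval:
ts = [1705329000 + 10 * i for i in range(6)]
first, dods = delta_of_delta(ts)
print(dods)                       # [10, 0, 0, 0, 0] -- almost all zeros
print(decode(first, dods) == ts)  # True: lossless round trip
```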
Time-series data appears across virtually every domain of modern technology. Understanding the categories helps in recognizing when a time-series database is the appropriate solution.
| Domain | Examples | Characteristics |
|---|---|---|
| Infrastructure Monitoring | CPU, memory, disk, network metrics from servers, containers, VMs | High frequency (1-60 sec), moderate cardinality, critical for operations |
| Application Performance (APM) | Request latency, error rates, throughput, custom business metrics | Tied to services/endpoints, often with distributed tracing context |
| Internet of Things (IoT) | Sensor readings, telemetry from vehicles, smart devices, industrial equipment | Massive device counts, irregular intervals, edge processing needs |
| Financial Markets | Stock prices, order book depth, trade volumes, economic indicators | Microsecond precision, regulatory retention, low latency requirements |
| DevOps/Observability | Logs, metrics, traces—the three pillars | Correlation across signals, distributed systems context |
| Scientific/Research | Climate data, genomics, physics experiments, astronomical observations | Long retention, precision requirements, often batch loaded |
| Business Intelligence | User activity, conversion events, feature usage over time | Event-driven, often ties to business outcomes and A/B testing |
Scale Examples in Production:
To appreciate why specialized databases emerged, consider real-world scale:
At these scales, every architectural decision in the database has massive cost and performance implications.
Consider a time-series database when: (1) Timestamp is the primary query dimension, (2) Data is append-mostly with rare updates, (3) You're ingesting more than 100K points/second, (4) Queries aggregate over time windows, (5) Data has natural expiration. If your workload is transactional with random updates, stick with OLTP databases.
Time-series queries follow predictable patterns that differ markedly from traditional SQL queries. TSDBs optimize heavily for these patterns, often providing specialized query languages or extensions.
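The dominant shape in these patterns is bucket-by-time, then aggregate. A plain-Python sketch of that core operation (`time_bucket` here is a hypothetical helper mirroring the SQL function, not a library API):

```python
from collections import defaultdict

def time_bucket(width_sec, points):
    """Group (timestamp, value) points into fixed-width time buckets and
    average each bucket -- the same shape as time_bucket(...) + AVG(...)."""
    buckets = defaultdict(list)
    for ts, val in points:
        buckets[ts - ts % width_sec].append(val)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# Four raw points collapse into two 5-minute (300 s) buckets:
points = [(0, 10.0), (60, 20.0), (120, 30.0), (400, 50.0)]
print(time_bucket(300, points))  # {0: 20.0, 300: 50.0}
```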
```sql
-- PATTERN 1: Time-Range Selection with Aggregation
-- "What was the average CPU usage per host in the last hour?"
SELECT host,
       time_bucket('5 minutes', time) AS bucket,
       AVG(cpu_usage) AS avg_cpu
FROM metrics
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY host, bucket
ORDER BY bucket DESC;

-- PATTERN 2: Downsampling
-- "Give me hourly averages instead of raw second-level data"
SELECT time_bucket('1 hour', time) AS hour,
       AVG(value) AS avg_val,
       MAX(value) AS max_val,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS p95
FROM measurements
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour;

-- PATTERN 3: Rate Calculation
-- "What's the request rate per second over time?"
SELECT time_bucket('1 minute', time) AS minute,
       (MAX(counter_value) - MIN(counter_value)) / 60.0 AS requests_per_sec
FROM http_request_total
WHERE time > NOW() - INTERVAL '30 minutes'
GROUP BY minute;

-- PATTERN 4: Top-N Analysis
-- "Which hosts had the highest error rates?"
SELECT host,
       SUM(errors) / SUM(requests) AS error_rate
FROM web_metrics
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY host
ORDER BY error_rate DESC
LIMIT 10;

-- PATTERN 5: Gap Filling
-- "Show me values for every minute, filling gaps with interpolation"
SELECT time_bucket_gapfill('1 minute', time) AS minute,
       locf(AVG(temperature)) AS temperature  -- Last Observation Carried Forward
FROM sensor_data
WHERE time BETWEEN '2024-01-15' AND '2024-01-16'
GROUP BY minute;
```

We've established the foundational understanding of time-series data—what it is, how it differs from traditional workloads, and why specialized databases emerged to handle it.
What's Next:
Now that we understand the nature of time-series data and why specialized systems are necessary, we'll explore InfluxDB—one of the pioneering purpose-built time-series databases. We'll examine its architecture, data model, query language (Flux), and the design decisions that made it a leader in the observability space.
You now understand the fundamental nature of time-series data and the architectural requirements it imposes on database systems. This foundation prepares you to critically evaluate specific TSDB implementations and make informed decisions about when and how to deploy them.