Every second, an astronomical volume of data flows through modern digital infrastructure—each data point tagged with a precise timestamp. Server metrics pulse every 10 seconds across thousands of nodes. Stock prices tick hundreds of times per second during market hours. IoT sensors in smart factories emit measurements continuously, 24/7/365. Vehicle telemetry streams from millions of connected cars. Weather stations, fitness trackers, network equipment, power grids—all generating waves of time-series data.
This isn't just regular data happening to have timestamps. Time-series data represents a fundamentally different paradigm—one where time is the primary axis of organization, where data is predominantly append-only, where recency matters more than history, and where the sheer volume can overwhelm systems designed for traditional workloads. Understanding time-series data is prerequisite to understanding why we need specialized database systems to handle it.
By the end of this page, you will understand: (1) The formal definition and unique characteristics of time-series data, (2) Why traditional relational databases struggle with time-series workloads, (3) The fundamental requirements that drive TSDB design, and (4) The core data model patterns that time-series databases employ. This foundation prepares you for deep exploration of specific TSDB implementations.
Time-series data is a sequence of data points collected, recorded, or observed at successive points in time, typically at uniform intervals. Each data point consists of at least two components: a timestamp indicating when the measurement occurred, and one or more values representing the observed measurements at that time.
Formally, a time-series can be expressed as:
T = {(t₁, v₁), (t₂, v₂), ..., (tₙ, vₙ)}
Where:

- tᵢ is the timestamp of the i-th observation, with t₁ < t₂ < ... < tₙ
- vᵢ is the value (or vector of values) observed at time tᵢ
This deceptively simple definition masks significant complexity. In practice, time-series data exhibits characteristics that profoundly impact how we store, query, and analyze it.
| Component | Description | Example |
|---|---|---|
| Timestamp | The temporal coordinate; when the measurement occurred | 2024-01-15T14:30:00.000Z |
| Metric Name | Identifier for what is being measured | cpu_usage, temperature, stock_price |
| Value | The actual measurement (numeric, boolean, string) | 78.5, true, running |
| Tags/Labels | Metadata for grouping and filtering | host=server-01, region=us-east |
| Fields | Additional measurements at same timestamp | user_time=45.2, system_time=33.3 |
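These components map naturally onto a small in-memory structure. Here is a minimal Python sketch (the `TimeSeries` class is hypothetical, for illustration only, not a real TSDB client):

```python
from dataclasses import dataclass, field

@dataclass
class TimeSeries:
    metric: str                   # what is being measured, e.g. "cpu_usage"
    tags: dict                    # identifying metadata, e.g. {"host": "server-01"}
    points: list = field(default_factory=list)  # ordered (timestamp, value) pairs

    def append(self, timestamp: float, value: float) -> None:
        # Time-series data is append-mostly: points arrive in time order.
        assert not self.points or timestamp >= self.points[-1][0]
        self.points.append((timestamp, value))

series = TimeSeries("cpu_usage", {"host": "server-01", "region": "us-east-1"})
series.append(1705329000.0, 78.5)
series.append(1705329010.0, 79.1)
print(len(series.points))  # 2
```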
The Multidimensional Nature:
Real-world time-series data is rarely a simple sequence of values. Consider CPU utilization monitoring across a data center:
```
measurement: cpu_usage_percent
tags:        host=web-server-47, datacenter=us-east-1, cpu=0, environment=production
fields:      user=45.2, system=22.1, iowait=8.3, idle=24.4
```

This single data point actually represents measurements across multiple dimensions (host, datacenter, CPU core) with multiple field values—and millions of such points arrive every minute. The combinatorial explosion of dimensions creates what's called high cardinality, one of the defining challenges of time-series systems.
Cardinality refers to the number of unique combinations of tag values. With 10,000 hosts × 16 CPU cores × 5 datacenters × 3 environments = 2.4 million unique time series. Each unique combination creates a separate series that must be tracked, indexed, and stored efficiently. High cardinality is the bane of time-series database performance.
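The cardinality arithmetic is worth making concrete. A quick Python check of the numbers above:

```python
import math

# Each tag dimension multiplies the number of unique series:
# total series = product of distinct values per tag.
tag_dimensions = {
    "host": 10_000,
    "cpu_core": 16,
    "datacenter": 5,
    "environment": 3,
}

total_series = math.prod(tag_dimensions.values())
print(f"{total_series:,} unique time series")  # 2,400,000 unique time series

# At one sample per series every 10 seconds, the ingest rate is:
points_per_second = total_series / 10
print(f"{points_per_second:,.0f} points/sec")  # 240,000 points/sec
```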
Time-series workloads differ fundamentally from traditional OLTP or OLAP workloads. Understanding these characteristics explains why specialized databases emerged and why they make specific architectural choices.
In most time-series deployments, 95% of queries access data from the last 5% of the time range. This extreme recency bias is why TSDBs optimize hot/cold data separation—keeping recent data in fast storage while migrating older data to cheaper, slower tiers.
Before time-series databases existed, organizations attempted to store time-series data in traditional RDBMS systems like PostgreSQL, MySQL, or Oracle. While technically possible, this approach encounters fundamental limitations that become severe at scale.
```sql
-- Naive approach: store metrics in a relational table
CREATE TABLE metrics (
    id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    metric_name VARCHAR(255) NOT NULL,
    host VARCHAR(255) NOT NULL,
    datacenter VARCHAR(64),
    value DOUBLE PRECISION NOT NULL
);

-- Index for time-range queries
CREATE INDEX idx_metrics_timestamp ON metrics(timestamp);

-- Index for filtering by host
CREATE INDEX idx_metrics_host_time ON metrics(host, timestamp);

-- Compound index for common query patterns
CREATE INDEX idx_metrics_name_host_time ON metrics(metric_name, host, timestamp);
```

This seemingly reasonable schema fails catastrophically:
- Repeating the `host`, `metric_name`, and `datacenter` strings on every row wastes enormous storage. Time-series databases instead use tag dictionaries and columnar encoding, achieving 10-100x compression.

| Metric | PostgreSQL (Optimized) | Purpose-Built TSDB |
|---|---|---|
| Ingestion Rate | 10K-100K points/sec | 1M-10M+ points/sec |
| Storage per 1B points | ~500 GB | ~50 GB (10x compression) |
| Time-range query (1 hour) | 10-60 seconds | < 100 milliseconds |
| Retention cleanup | Hours (VACUUM) | Seconds (drop partition) |
| Downsampling | Complex ETL required | Built-in continuous queries |
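The tag-dictionary encoding behind much of that storage gap can be sketched in a few lines of Python (a simplified illustration of string interning, not any particular TSDB's on-disk format):

```python
class TagDictionary:
    """Store each repeated tag string once; reference it by a small integer id."""

    def __init__(self):
        self.str_to_id = {}
        self.id_to_str = []

    def encode(self, s: str) -> int:
        if s not in self.str_to_id:
            self.str_to_id[s] = len(self.id_to_str)
            self.id_to_str.append(s)
        return self.str_to_id[s]

    def decode(self, i: int) -> str:
        return self.id_to_str[i]

d = TagDictionary()
# One million rows from the same host store the 13-byte string once,
# plus a small integer per row, instead of repeating the string.
ids = [d.encode("web-server-47") for _ in range(1_000_000)]
print(len(d.id_to_str), d.decode(ids[0]))  # 1 web-server-47
```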
Organizations typically hit the RDBMS wall at 10-100 million data points. At this scale, ingestion lags behind data arrival, queries timeout, and storage costs balloon. This is precisely when teams discover they need a purpose-built time-series solution.
Time-series databases adopt a metric-based data model optimized for the unique characteristics of time-stamped data. While implementations vary, most TSDBs share common conceptual elements.
```
┌────────────────────────────────────────────────────────────┐
│                   TIME-SERIES DATA MODEL                   │
└────────────────────────────────────────────────────────────┘

MEASUREMENT (or METRIC):
┌────────────────────────────────────────────────────────────┐
│ measurement: "cpu_usage"                                   │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ TAGS (Indexed, Low Cardinality Preferred):             │ │
│ │   host   = "server-01"                                 │ │
│ │   region = "us-east-1"                                 │ │
│ │   env    = "production"                                │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ FIELDS (Not Indexed, Multiple Values):                 │ │
│ │   user   = 45.2                                        │ │
│ │   system = 22.1                                        │ │
│ │   idle   = 32.7                                        │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ TIMESTAMP:                                             │ │
│ │   time = 2024-01-15T14:30:00.000000000Z                │ │
│ └────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘

SERIES = Measurement + Unique Tag Set

  Series 1: cpu_usage{host="server-01",region="us-east-1"}
  Series 2: cpu_usage{host="server-02",region="us-east-1"}
  Series 3: cpu_usage{host="server-01",region="eu-west-1"}

Each series is stored and queried independently,
enabling parallel processing.
```

Key Concepts Explained:
Tags are metadata for identifying and grouping (host, region, service). They're indexed and should have bounded cardinality. Fields are the actual measurements (cpu_percent, latency_ms). Storing high-cardinality data (like user_id or request_id) as tags is a common mistake that devastates TSDB performance.
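A series identity is just the measurement name plus its sorted tag set. A small Python sketch (`series_key` is a hypothetical helper) shows why a high-cardinality tag explodes the series count:

```python
def series_key(measurement: str, tags: dict) -> str:
    """Build the canonical series identifier: measurement + sorted tag set.
    Two points with the same key belong to the same series."""
    tag_str = ",".join(f'{k}="{v}"' for k, v in sorted(tags.items()))
    return f"{measurement}{{{tag_str}}}"

# Bounded-cardinality tags: a handful of hosts and regions -> few series.
key = series_key("cpu_usage", {"region": "us-east-1", "host": "server-01"})
print(key)  # cpu_usage{host="server-01",region="us-east-1"}

# The anti-pattern: a per-request ID as a tag creates one series per request,
# so the series count (and the tag index) grows without bound.
bad_keys = {series_key("latency_ms", {"request_id": str(i)}) for i in range(1000)}
print(len(bad_keys))  # 1000 distinct series from just 1000 requests
```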
Purpose-built time-series databases employ architectural principles specifically designed for time-series workloads. These principles enable the massive performance improvements over traditional databases.
```
TIME-SERIES DATABASE STORAGE LAYOUT
===================================

Data organized by time first, then by series:

Day 1 Partition (2024-01-15)
├── Series Index (inverted index for tags)
│   ├── host=server-01  → [series_1, series_4, series_7]
│   ├── host=server-02  → [series_2, series_5, series_8]
│   └── region=us-east  → [series_1, series_2, series_3]
│
├── Series 1: cpu_usage{host=server-01,region=us-east}
│   ├── Timestamps:     [t1, t2, t3, ...]       (delta-encoded)
│   ├── Field 'user':   [45.2, 46.1, 44.8, ...] (XOR compressed)
│   └── Field 'system': [22.1, 23.0, 21.9, ...] (XOR compressed)
│
├── Series 2: cpu_usage{host=server-02,region=us-east}
│   ├── Timestamps:     [t1, t2, t3, ...]
│   ├── Field 'user':   [38.1, 39.0, 37.5, ...]
│   └── Field 'system': [18.4, 19.1, 18.0, ...]
│
└── Metadata Block
    ├── Min/Max timestamps
    ├── Series count
    └── Compression statistics

Day 2 Partition (2024-01-16)
└── ... (same structure)

Retention: Simply delete entire day partitions older than threshold
```

Modern TSDBs achieve 10-20x compression ratios. A naive approach storing 64-bit timestamps + 64-bit floats uses 16 bytes per point. With delta-delta and XOR encoding, this drops to 1-2 bytes per point. At hundreds of billions of points, this is the difference between terabytes of raw data and a few hundred gigabytes compressed.
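The delta-delta timestamp encoding referenced in the layout can be sketched in a few lines of Python (a simplified illustration; real TSDBs pack the resulting values at the bit level, which is where the 1-2 bytes per point comes from):

```python
def delta_of_delta(timestamps):
    """Encode timestamps as second-order deltas. Regular scrape intervals
    (e.g. one point every 10 s) collapse to runs of zeros, which compress
    extremely well."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dods

def decode(first, dods):
    """Lossless inverse: rebuild the original timestamps."""
    ts, delta = [first], 0
    for dod in dods:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts

# A perfectly regular 10-second scrape interval:
ts = [1705329000 + 10 * i for i in range(6)]
first, dods = delta_of_delta(ts)
print(dods)                       # [10, 0, 0, 0, 0] -- almost all zeros
print(decode(first, dods) == ts)  # True: lossless round trip
```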
Time-series data appears across virtually every domain of modern technology. Understanding the categories helps in recognizing when a time-series database is the appropriate solution.
| Domain | Examples | Characteristics |
|---|---|---|
| Infrastructure Monitoring | CPU, memory, disk, network metrics from servers, containers, VMs | High frequency (1-60 sec), moderate cardinality, critical for operations |
| Application Performance (APM) | Request latency, error rates, throughput, custom business metrics | Tied to services/endpoints, often with distributed tracing context |
| Internet of Things (IoT) | Sensor readings, telemetry from vehicles, smart devices, industrial equipment | Massive device counts, irregular intervals, edge processing needs |
| Financial Markets | Stock prices, order book depth, trade volumes, economic indicators | Microsecond precision, regulatory retention, low latency requirements |
| DevOps/Observability | Logs, metrics, traces—the three pillars | Correlation across signals, distributed systems context |
| Scientific/Research | Climate data, genomics, physics experiments, astronomical observations | Long retention, precision requirements, often batch loaded |
| Business Intelligence | User activity, conversion events, feature usage over time | Event-driven, often ties to business outcomes and A/B testing |
Scale Examples in Production:
To appreciate why specialized databases emerged, consider real-world scale:
At these scales, every architectural decision in the database has massive cost and performance implications.
Consider a time-series database when: (1) Timestamp is the primary query dimension, (2) Data is append-mostly with rare updates, (3) You're ingesting more than 100K points/second, (4) Queries aggregate over time windows, (5) Data has natural expiration. If your workload is transactional with random updates, stick with OLTP databases.
Time-series queries follow predictable patterns that differ markedly from traditional SQL queries. TSDBs optimize heavily for these patterns, often providing specialized query languages or extensions.
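The dominant shape in these patterns is bucket-by-time, then aggregate. A plain-Python sketch of that core operation (`time_bucket` here is a hypothetical helper mirroring the SQL function, not a library API):

```python
from collections import defaultdict

def time_bucket(width_sec, points):
    """Group (timestamp, value) points into fixed-width time buckets and
    average each bucket -- the same shape as time_bucket(...) + AVG(...)."""
    buckets = defaultdict(list)
    for ts, val in points:
        buckets[ts - ts % width_sec].append(val)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# Four raw points collapse into two 5-minute (300 s) buckets:
points = [(0, 10.0), (60, 20.0), (120, 30.0), (400, 50.0)]
print(time_bucket(300, points))  # {0: 20.0, 300: 50.0}
```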
```sql
-- PATTERN 1: Time-Range Selection with Aggregation
-- "What was the average CPU usage per host in the last hour?"
SELECT host,
       time_bucket('5 minutes', time) AS bucket,
       AVG(cpu_usage) AS avg_cpu
FROM metrics
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY host, bucket
ORDER BY bucket DESC;

-- PATTERN 2: Downsampling
-- "Give me hourly averages instead of raw second-level data"
SELECT time_bucket('1 hour', time) AS hour,
       AVG(value) AS avg_val,
       MAX(value) AS max_val,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS p95
FROM measurements
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour;

-- PATTERN 3: Rate Calculation
-- "What's the request rate per second over time?"
SELECT time_bucket('1 minute', time) AS minute,
       (MAX(counter_value) - MIN(counter_value)) / 60.0 AS requests_per_sec
FROM http_request_total
WHERE time > NOW() - INTERVAL '30 minutes'
GROUP BY minute;

-- PATTERN 4: Top-N Analysis
-- "Which hosts had the highest error rates?"
SELECT host,
       SUM(errors) / SUM(requests) AS error_rate
FROM web_metrics
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY host
ORDER BY error_rate DESC
LIMIT 10;

-- PATTERN 5: Gap Filling
-- "Show me values for every minute, filling gaps with interpolation"
SELECT time_bucket_gapfill('1 minute', time) AS minute,
       locf(AVG(temperature)) AS temperature  -- Last Observation Carried Forward
FROM sensor_data
WHERE time BETWEEN '2024-01-15' AND '2024-01-16'
GROUP BY minute;
```

We've established the foundational understanding of time-series data—what it is, how it differs from traditional workloads, and why specialized databases emerged to handle it.
What's Next:
Now that we understand the nature of time-series data and why specialized systems are necessary, we'll explore InfluxDB—one of the pioneering purpose-built time-series databases. We'll examine its architecture, data model, query language (Flux), and the design decisions that made it a leader in the observability space.
You now understand the fundamental nature of time-series data and the architectural requirements it imposes on database systems. This foundation prepares you to critically evaluate specific TSDB implementations and make informed decisions about when and how to deploy them.