Time-series data is relentless. A modest IoT deployment with 1,000 sensors reporting every second generates 86.4 million data points per day—over 31 billion per year. At even 20 bytes per point (after compression), that's roughly 630 gigabytes annually from a single deployment. Scale to enterprise—millions of sensors, hundreds of applications, thousands of servers—and you're looking at petabytes.
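As a sanity check, the arithmetic is simple enough to reproduce in a few lines of Python; the sensor count, reporting rate, and 20-byte compressed point size are the figures from the scenario above, and the rest is multiplication:

```python
# Back-of-the-envelope volume for the 1,000-sensor scenario described above
sensors = 1_000
points_per_sensor_per_second = 1
bytes_per_point_compressed = 20   # assumed post-compression size from the text
seconds_per_day = 86_400

points_per_day = sensors * points_per_sensor_per_second * seconds_per_day
points_per_year = points_per_day * 365
bytes_per_year = points_per_year * bytes_per_point_compressed

print(f"{points_per_day:,} points/day")        # 86,400,000 points/day
print(f"{points_per_year:,} points/year")      # 31,536,000,000 points/year
print(f"{bytes_per_year / 1e9:,.0f} GB/year")  # 631 GB/year
```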
Without intelligent data lifecycle management, time-series systems face an existential threat: storage costs that grow without bound while the value of that data diminishes exponentially over time. Last hour's metrics are queried constantly; last year's might be accessed once for an annual review.
Retention policies are the mechanism by which time-series databases balance data preservation against practical constraints. They answer critical questions: How long should we keep raw data? When should we downsample? What can we safely delete? How do we tier storage to optimize costs?
By the end of this page, you will master retention policy design: understanding the economics of time-series storage, implementing multi-tier retention strategies, configuring downsampling policies, and making informed trade-offs between data granularity, query performance, and cost.
Before diving into retention strategies, we must understand the economic forces that make retention policies necessary. Time-series storage costs are not linear—they compound in ways that can bankrupt engineering budgets.
Cost Drivers:
```python
# Time-Series Storage Cost Estimation
# Scenario: Medium-scale monitoring deployment

metrics_per_second = 100_000
bytes_per_metric_raw = 100   # timestamp + tags + value
compression_ratio = 10       # Typical for time-series data
replication_factor = 3

# Daily data volume
seconds_per_day = 86_400
raw_daily_bytes = metrics_per_second * bytes_per_metric_raw * seconds_per_day
# = 100,000 × 100 × 86,400 = 864 GB/day raw

compressed_daily_bytes = raw_daily_bytes / compression_ratio
# = 86.4 GB/day compressed

replicated_daily_bytes = compressed_daily_bytes * replication_factor
# = 259.2 GB/day with replication

# Annual storage at full retention
one_year_storage = replicated_daily_bytes * 365
# = 94.6 TB/year

# Cost comparison by storage tier (monthly)
# Scenario: 1 year retention with the current month on hot storage
hot_storage_bytes = replicated_daily_bytes * 30     # Last 30 days
cold_storage_bytes = replicated_daily_bytes * 335   # Older 11 months

# Hot storage (NVMe SSD or high-performance cloud storage)
hot_cost_per_gb_month = 0.10
hot_monthly_cost = (hot_storage_bytes / 1e9) * hot_cost_per_gb_month
# = 7,776 GB × $0.10 = $778/month

# Cold storage (S3 Standard)
cold_cost_per_gb_month = 0.023
cold_monthly_cost = (cold_storage_bytes / 1e9) * cold_cost_per_gb_month
# = 86,832 GB × $0.023 = $1,997/month

# TOTAL with tiering: $2,775/month = $33,300/year

# COMPARISON: All data on hot storage
all_hot_monthly = (one_year_storage / 1e9) * hot_cost_per_gb_month
# = 94,608 GB × $0.10 ≈ $9,460/month ≈ $113,520/year

# SAVINGS from tiering: $113,520 - $33,300 = $80,220/year (71% reduction)
```

Industry data consistently shows that 90% of time-series queries access data from the last 24-48 hours. The remaining 10% are historical queries for incident investigation or capacity planning. This access pattern is the foundation for tiered retention: hot storage for recent data, cold storage for historical.
A retention policy defines how long data is kept before automatic deletion. Modern time-series databases provide sophisticated retention management far beyond simple "delete after N days."
Retention Policy Components:
```
# InfluxDB 2.x: Retention is set at the bucket level

# Create bucket with 30-day retention
influx bucket create \
  --name metrics-hot \
  --retention 30d

# Create bucket with 1-year retention
influx bucket create \
  --name metrics-cold \
  --retention 365d

# Create bucket with infinite retention (for compliance)
influx bucket create \
  --name metrics-archive \
  --retention 0          # 0 means infinite

// Data flow: write to the hot bucket, downsample to cold.
// Flux task for downsampling (runs hourly):
option task = {name: "downsample-to-cold", every: 1h}

from(bucket: "metrics-hot")
    |> range(start: -2h, stop: -1h)
    |> aggregateWindow(every: 5m, fn: mean)
    |> to(bucket: "metrics-cold")
```

Downsampling is the process of reducing data resolution while preserving essential information. Instead of deleting old data entirely, downsampling converts high-resolution data (15-second intervals) into lower-resolution aggregates (5-minute or hourly averages). This dramatically reduces storage while maintaining queryability.
Why Downsampling Works:
For most analytical purposes, the exact value of CPU usage at 14:23:15 last month doesn't matter. What matters is:

- The overall trend (rising, falling, cyclical)
- The typical level over each period (averages)
- The extremes (minimums and maximums) that indicate spikes, dips, or capacity limits
Downsampling preserves these insights while reducing storage by 10-100x.
```text
Multi-Resolution Retention Strategy:

Time Since Ingestion │ Resolution │ Retention │ Storage Factor
─────────────────────┼────────────┼───────────┼───────────────
0 - 24 hours         │ 15 seconds │ 24 hours  │ 1x (baseline)
1 - 7 days           │ 1 minute   │ 6 days    │ 0.25x
7 - 30 days          │ 5 minutes  │ 23 days   │ 0.05x
30 - 365 days        │ 1 hour     │ 335 days  │ 0.004x
365+ days            │ 1 day      │ Forever   │ 0.0002x

Storage Calculation for 1 year of data:
- Raw (15s): 5,760 points/day × 1 day    = 5,760 points
- 1m agg:    1,440 points/day × 6 days   = 8,640 points
- 5m agg:      288 points/day × 23 days  = 6,624 points
- 1h agg:       24 points/day × 335 days = 8,040 points
- 1d agg:        1 point/day  × ongoing

Total: ~29,000 points for 1 year vs ~2.1M points if all at 15s
Compression ratio: 72x reduction while preserving queryability

Data Flow:
┌────────────────────────────────────────────┐
│ Ingest (15-second resolution)              │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 24 hours: Downsample to 1-minute     │
│ Aggregations: avg, min, max, count         │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 7 days: Downsample to 5-minute       │
│ Aggregations: avg, min, max, count         │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 30 days: Downsample to 1-hour        │
│ Delete original 5-minute data              │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 1 year: Downsample to 1-day          │
│ Delete original hourly data                │
└────────────────────────────────────────────┘
```

Aggregation Functions for Downsampling:
Choosing the right aggregation functions is critical. Different metrics require different aggregations to preserve meaningful information:
| Metric Type | Example | Aggregations to Preserve | Rationale |
|---|---|---|---|
| Counter (rate) | requests_total | sum (for rate calculation) | Rate = sum of deltas over interval |
| Gauge (level) | cpu_usage | avg, min, max | Captures typical + extremes |
| Histogram | latency_bucket | sum (per bucket) | Enables percentile reconstruction |
| Event count | errors | sum, count | Total and frequency |
| Temperature | sensor_temp | avg, min, max | Detect spikes and drops |
| Business metric | revenue | sum | Total over period |
Downsampling irreversibly discards information. A CPU spike that lasted 10 seconds becomes invisible in 5-minute averages. For metrics where brief anomalies matter, preserve max/min aggregations or consider longer high-resolution retention.
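To make these choices concrete, here is a minimal downsampling sketch in TimescaleDB (which also appears in the monitoring examples later on this page). The hypertable cpu_metrics and its columns are assumptions for illustration; the point is that the continuous aggregate keeps avg, min, max, and count so that short spikes remain visible after the raw 15-second data is dropped:

```sql
-- Assumed hypertable: cpu_metrics(time TIMESTAMPTZ, host TEXT, usage DOUBLE PRECISION)
CREATE MATERIALIZED VIEW cpu_metrics_5m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', time) AS bucket,
    host,
    avg(usage) AS avg_usage,
    min(usage) AS min_usage,
    max(usage) AS max_usage,   -- keeps short spikes visible after downsampling
    count(*)   AS sample_count
FROM cpu_metrics
GROUP BY bucket, host;

-- Keep the 5-minute aggregate fresh, then drop raw 15-second data after 7 days
SELECT add_continuous_aggregate_policy('cpu_metrics_5m',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

SELECT add_retention_policy('cpu_metrics', INTERVAL '7 days');
```

Because the 5-minute rollups are materialized into their own storage, dropping the raw chunks after seven days does not affect them.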
Modern time-series databases increasingly support tiered storage—automatically moving data between storage tiers based on age. This enables the best of both worlds: fast queries on recent data with economical long-term retention.
Common Storage Tiers:

- Hot: local NVMe/SSD or high-performance cloud block storage holding the most recent data for low-latency queries
- Warm: cheaper SSD/HDD or standard cloud storage for data that is queried occasionally
- Cold: object storage (e.g., S3 Standard) for historical data that is rarely queried but must stay online
- Archive: offline-class storage (e.g., S3 Glacier Deep Archive) for compliance data that is almost never read
```yaml
# Thanos: Prometheus with unlimited retention via object storage

# Thanos Sidecar config (runs alongside Prometheus)
# Uploads blocks to object storage
thanos:
  sidecar:
    objstore_config:
      type: S3
      config:
        bucket: my-thanos-metrics
        endpoint: s3.amazonaws.com
        region: us-east-1

# Thanos Compactor: manages retention and downsampling
# (these are flags passed to the compactor process)
#   --retention.resolution-raw=30d    # raw data (15s resolution)
#   --retention.resolution-5m=180d    # 5-minute downsampled data
#   --retention.resolution-1h=365d    # 1-hour downsampled data
#   --downsample.concurrency=4        # automatic downsampling

# Query latency by tier:
# - Hot (Prometheus): <100ms
# - Cold (S3 via Store Gateway): 1-5s
# - Thanos Query federates the tiers transparently
```

The best tiered storage implementations are query-transparent: users write the same query regardless of where data resides. Systems like Thanos and InfluxDB IOx automatically fan out queries to the appropriate tiers and merge results. This simplifies application code and enables gradual tiering refinement without query changes.
Different use cases demand different retention strategies. Here are proven patterns for common scenarios:
Pattern 1: SRE/Ops Monitoring
Goal: Fast incident response for recent data; trend analysis for capacity planning.
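One possible shape for the hot tier of this pattern, assuming a Prometheus-based stack like the Thanos setup shown earlier (the 7-day window and the size cap are illustrative values, not recommendations):

```bash
# Hot tier: 7 days of full-resolution data on local SSD, with a size cap as a safety net.
# --storage.tsdb.retention.time and --storage.tsdb.retention.size are standard Prometheus
# flags; older data is served from object storage via the Thanos sidecar and compactor.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=500GB
```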
Pattern 2: IoT Sensor Data
Goal: Real-time monitoring with long-term analytics for predictive maintenance.
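A sketch of how this pattern might look in TimescaleDB, assuming a sensor_data hypertable and the hot/total windows from the comparison table that follows: keep the last day uncompressed for real-time dashboards, compress older chunks, and drop data only once it leaves the analytics window.

```sql
-- Assumed hypertable: sensor_data(time TIMESTAMPTZ, sensor_id TEXT, value DOUBLE PRECISION)
ALTER TABLE sensor_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id'
);

-- Compress chunks once they leave the 24-hour hot window
SELECT add_compression_policy('sensor_data', INTERVAL '1 day');

-- Drop data entirely after the 2-year analytics window
SELECT add_retention_policy('sensor_data', INTERVAL '2 years');
```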
Pattern 3: Financial/Compliance
Goal: Regulatory compliance requires long retention with audit capability.
| Use Case | Hot Duration | Total Retention | Downsampling | Cost Profile |
|---|---|---|---|---|
| SRE/Ops Monitoring | 7 days | 1-3 years | Aggressive | Optimize for cost |
| IoT/Sensors | 24 hours | 90 days - 2 years | Moderate | Balance cost/resolution |
| Financial/Compliance | 30 days | 7+ years | None (regulatory) | Optimize for compliance |
| Application Analytics | 24 hours | 90 days | Aggressive | Optimize for cost |
| Scientific Research | 7 days | Forever | Selective | Preserve raw + aggregates |
```yaml
# Enterprise Monitoring Retention Policy
organization: acme-corp
policy_version: "2.0"
effective_date: "2024-01-01"

# Default policy for all metrics
default_policy:
  hot_retention: 7d
  hot_resolution: 15s
  warm_retention: 30d
  warm_resolution: 1m
  cold_retention: 365d
  cold_resolution: 1h
  archive_retention: 3y
  archive_resolution: 1d

# Override for critical infrastructure metrics
critical_infrastructure:
  applies_to:
    - measurement: kubernetes_*
    - measurement: database_*
    - tags: { tier: "critical" }
  hot_retention: 14d    # Longer for debugging
  hot_resolution: 10s   # Higher resolution
  warm_retention: 90d
  cold_retention: 2y

# Override for high-cardinality application metrics
high_cardinality_apps:
  applies_to:
    - measurement: app_request_*
    - tags: { cardinality: "high" }
  hot_retention: 3d     # Shorter due to volume
  warm_retention: 14d
  cold_retention: 90d
  # Drop high-cardinality labels in cold tier
  cold_label_policy:
    drop: ["request_id", "trace_id", "session_id"]
    keep: ["service", "method", "status", "region"]

# Compliance-sensitive data (no downsampling)
compliance_data:
  applies_to:
    - measurement: audit_*
    - measurement: financial_*
  hot_retention: 30d
  hot_resolution: original    # No downsampling
  cold_retention: 7y
  cold_resolution: original
  archive_storage: s3-glacier-deep-archive
  encryption: required
  deletion_approval: legal-team

# Aggregation functions for downsampling
downsampling_config:
  default:
    functions: [avg, min, max, count]
  counters:
    functions: [sum]
  histograms:
    functions: [sum]    # per bucket
```

Designing retention policies is one thing; operating them reliably in production is another. Here are critical operational considerations:
1. Gradual Rollout:
Never apply a new retention policy to all data at once. Start with a pilot: apply it to a single low-risk measurement or namespace, verify that storage shrinks as expected and that dashboards and alerts still work, then expand in stages.
2. Monitoring Retention Operations:
Retention jobs can fail silently. Monitor job success and failure counts, the age of the oldest data versus the expected retention window, and the storage actually reclaimed; the TimescaleDB queries below illustrate each check.
```sql
-- TimescaleDB: Monitor chunk retention

-- Check chunk age distribution per hypertable
SELECT
    hypertable_name,
    count(*)         AS chunk_count,
    min(range_start) AS oldest_chunk,
    max(range_end)   AS newest_chunk
FROM timescaledb_information.chunks
GROUP BY hypertable_name
ORDER BY chunk_count DESC;

-- Total on-disk size of a hypertable (replace 'metrics' with your hypertable)
SELECT pg_size_pretty(hypertable_size('metrics')) AS total_size;

-- Verify compression is working (per-chunk savings for one hypertable)
SELECT
    chunk_name,
    compression_status,
    before_compression_total_bytes,
    after_compression_total_bytes,
    round(
        (1 - after_compression_total_bytes::numeric
             / nullif(before_compression_total_bytes, 0)) * 100, 1
    ) AS space_saved_pct
FROM chunk_compression_stats('metrics')
WHERE compression_status = 'Compressed';

-- Monitor retention and compression job execution
SELECT
    j.job_id,
    j.application_name,
    j.schedule_interval,
    s.last_run_duration,
    s.last_successful_finish,
    s.total_successes,
    s.total_failures
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.proc_name LIKE '%retention%'
   OR j.proc_name LIKE '%compression%';

-- Alert if the oldest data exceeds the expected retention window
SELECT
    hypertable_name,
    min(range_start)         AS oldest_data,
    NOW() - min(range_start) AS data_age
FROM timescaledb_information.chunks
GROUP BY hypertable_name
HAVING NOW() - min(range_start) > INTERVAL '35 days'  -- Expected: 30 days
ORDER BY data_age DESC;
```

3. Handling Deletion Failures:
Retention operations can fail for mundane reasons: long-running queries holding locks on chunks due for deletion, unavailable or misconfigured object storage, expired credentials, or jobs that crash partway through.
Build alerting for these failures and have runbooks for recovery.
4. Testing Recovery:
Periodically verify that archived data is actually recoverable: restore a sample of archived data into a staging environment, run representative queries against it, and measure how long the end-to-end restore takes.
Once retention policies delete data, it's gone. Always ensure you have appropriate backups before enabling aggressive retention. For compliance-sensitive data, implement soft-delete (mark as deleted, purge later) rather than immediate deletion.
Retention policies don't exist in a vacuum. Regulatory requirements, legal holds, and data governance policies impose constraints that override pure cost optimization.
Key Compliance Frameworks:
| Regulation | Scope | Typical Retention | Key Requirements |
|---|---|---|---|
| SOX (Sarbanes-Oxley) | Financial records (US) | 7 years | Audit trail integrity, immutability |
| GDPR | Personal data (EU) | As short as possible | Right to deletion, purpose limitation |
| HIPAA | Healthcare data (US) | 6 years | Access logging, encryption at rest |
| PCI-DSS | Payment card data | 1 year (audit logs) | Audit trail of all access, encryption |
| SEC Rule 17a-4 | Broker-dealer records | 6 years (some 3) | WORM storage (write-once) |
| MiFID II | Trading records (EU) | 5-7 years | Transaction reconstruction |
Legal Hold Considerations:
When litigation or investigation is anticipated, organizations must implement legal holds—suspension of normal retention/deletion for relevant data. Time-series databases must support placing holds on specific series, tenants, or time ranges; exempting held data from retention and downsampling jobs; and documenting the scope of the hold for auditors.
GDPR Right to Erasure:
GDPR creates a tension: users can request deletion of their personal data, but your retention policies might keep it for operational reasons. Practical solutions include keeping personal identifiers out of time-series labels in the first place, pseudonymizing user identifiers so the linking key can be destroyed on request, and applying short retention to any series that does contain personal data.
Retention policy design for regulated industries should involve legal and compliance teams. The cost of over-retention (storage) is far less than the cost of under-retention (regulatory fines, inability to respond to discovery requests).
We've explored the complete landscape of time-series data retention. Let's consolidate the key insights:

- Storage economics make retention policies mandatory: costs grow without bound while the value of old data decays, and roughly 90% of queries touch only recent data.
- Downsampling reduces storage by 10-100x while preserving trends, typical levels, and extremes; choose aggregation functions that match each metric type.
- Tiered storage (hot, warm, cold, archive) matches storage cost to access patterns, ideally behind a query-transparent interface.
- Retention patterns should match the use case: aggressive downsampling for ops monitoring, original-resolution long retention for compliance.
- Retention jobs are production infrastructure: roll them out gradually, monitor them, and test that archives are recoverable.
- Regulatory requirements and legal holds override pure cost optimization.
What's Next:
Having covered time-stamped data optimization, database implementations, metrics infrastructure, and retention policies, we'll conclude with Use Cases and Trade-offs—a comprehensive decision framework for when and how to apply time-series databases in your architecture.
You now understand the complete lifecycle of time-series data retention—from economic drivers through implementation patterns to compliance considerations. This knowledge enables you to design retention policies that balance performance, cost, and regulatory requirements.