Time-series data is relentless. A modest IoT deployment with 1,000 sensors reporting every second generates 86.4 million data points per day—over 31 billion per year. At even 20 bytes per point (after compression), that's roughly 630 gigabytes annually from a single deployment. Scale to enterprise—millions of sensors, hundreds of applications, thousands of servers—and you're looking at petabytes.
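As a sanity check, the arithmetic is simple enough to reproduce in a few lines of Python; the sensor count, reporting rate, and 20-byte compressed point size are the figures from the scenario above, and the rest is multiplication:

```python
# Back-of-the-envelope volume for the 1,000-sensor scenario described above
sensors = 1_000
points_per_sensor_per_second = 1
bytes_per_point_compressed = 20   # assumed post-compression size from the text
seconds_per_day = 86_400

points_per_day = sensors * points_per_sensor_per_second * seconds_per_day
points_per_year = points_per_day * 365
bytes_per_year = points_per_year * bytes_per_point_compressed

print(f"{points_per_day:,} points/day")        # 86,400,000 points/day
print(f"{points_per_year:,} points/year")      # 31,536,000,000 points/year
print(f"{bytes_per_year / 1e9:,.0f} GB/year")  # 631 GB/year
```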
Without intelligent data lifecycle management, time-series systems face an existential threat: storage costs that grow without bound while the value of that data diminishes exponentially over time. Last hour's metrics are queried constantly; last year's might be accessed once for an annual review.
Retention policies are the mechanism by which time-series databases balance data preservation against practical constraints. They answer critical questions: How long should we keep raw data? When should we downsample? What can we safely delete? How do we tier storage to optimize costs?
By the end of this page, you will master retention policy design: understanding the economics of time-series storage, implementing multi-tier retention strategies, configuring downsampling policies, and making informed trade-offs between data granularity, query performance, and cost.
Before diving into retention strategies, we must understand the economic forces that make retention policies necessary. Time-series storage costs are not linear—they compound in ways that can bankrupt engineering budgets.
Cost Drivers:
```python
# Time-Series Storage Cost Estimation
# Scenario: Medium-scale monitoring deployment

metrics_per_second = 100_000
bytes_per_metric_raw = 100   # timestamp + tags + value
compression_ratio = 10       # Typical for time-series data
replication_factor = 3

# Daily data volume
seconds_per_day = 86_400
raw_daily_bytes = metrics_per_second * bytes_per_metric_raw * seconds_per_day
# = 100,000 × 100 × 86,400 = 864 GB/day raw

compressed_daily_bytes = raw_daily_bytes / compression_ratio
# = 86.4 GB/day compressed

replicated_daily_bytes = compressed_daily_bytes * replication_factor
# = 259.2 GB/day with replication

# Annual storage at full retention
one_year_storage = replicated_daily_bytes * 365
# = 94.6 TB/year

# Cost comparison by storage tier (monthly)
# Scenario: 1 year retention with the current month on hot storage
hot_storage_bytes = replicated_daily_bytes * 30     # Last 30 days
cold_storage_bytes = replicated_daily_bytes * 335   # Older 11 months

# Hot storage (NVMe SSD or high-performance cloud storage)
hot_cost_per_gb_month = 0.10
hot_monthly_cost = (hot_storage_bytes / 1e9) * hot_cost_per_gb_month
# = 7,776 GB × $0.10 = $778/month

# Cold storage (S3 Standard)
cold_cost_per_gb_month = 0.023
cold_monthly_cost = (cold_storage_bytes / 1e9) * cold_cost_per_gb_month
# = 86,832 GB × $0.023 = $1,997/month

# TOTAL with tiering: $2,775/month = $33,300/year

# COMPARISON: All data on hot storage
all_hot_monthly = (one_year_storage / 1e9) * hot_cost_per_gb_month
# = 94,608 GB × $0.10 ≈ $9,460/month ≈ $113,520/year

# SAVINGS from tiering: $113,520 - $33,300 = $80,220/year (71% reduction)
```

Industry data consistently shows that 90% of time-series queries access data from the last 24-48 hours. The remaining 10% are historical queries for incident investigation or capacity planning. This access pattern is the foundation for tiered retention: hot storage for recent data, cold storage for historical.
A retention policy defines how long data is kept before automatic deletion. Modern time-series databases provide sophisticated retention management far beyond simple "delete after N days."
Retention Policy Components:
```
# InfluxDB 2.x: Retention is set at the bucket level

# Create bucket with 30-day retention
influx bucket create \
  --name metrics-hot \
  --retention 30d

# Create bucket with 1-year retention
influx bucket create \
  --name metrics-cold \
  --retention 365d

# Create bucket with infinite retention (for compliance)
influx bucket create \
  --name metrics-archive \
  --retention 0          # 0 means infinite

// Data flow: write to the hot bucket, downsample to cold.
// Flux task for downsampling (runs hourly):
option task = {name: "downsample-to-cold", every: 1h}

from(bucket: "metrics-hot")
    |> range(start: -2h, stop: -1h)
    |> aggregateWindow(every: 5m, fn: mean)
    |> to(bucket: "metrics-cold")
```

Downsampling is the process of reducing data resolution while preserving essential information. Instead of deleting old data entirely, downsampling converts high-resolution data (15-second intervals) into lower-resolution aggregates (5-minute or hourly averages). This dramatically reduces storage while maintaining queryability.
Why Downsampling Works:
For most analytical purposes, the exact value of CPU usage at 14:23:15 last month doesn't matter. What matters is:

- The overall trend (rising, falling, cyclical)
- The typical level over each period (averages)
- The extremes (minimums and maximums) that indicate spikes, dips, or capacity limits
Downsampling preserves these insights while reducing storage by 10-100x.
```text
Multi-Resolution Retention Strategy:

Time Since Ingestion │ Resolution │ Retention │ Storage Factor
─────────────────────┼────────────┼───────────┼───────────────
0 - 24 hours         │ 15 seconds │ 24 hours  │ 1x (baseline)
1 - 7 days           │ 1 minute   │ 6 days    │ 0.25x
7 - 30 days          │ 5 minutes  │ 23 days   │ 0.05x
30 - 365 days        │ 1 hour     │ 335 days  │ 0.004x
365+ days            │ 1 day      │ Forever   │ 0.0002x

Storage Calculation for 1 year of data:
- Raw (15s): 5,760 points/day × 1 day    = 5,760 points
- 1m agg:    1,440 points/day × 6 days   = 8,640 points
- 5m agg:      288 points/day × 23 days  = 6,624 points
- 1h agg:       24 points/day × 335 days = 8,040 points
- 1d agg:        1 point/day  × ongoing

Total: ~29,000 points for 1 year vs ~2.1M points if all at 15s
Compression ratio: 72x reduction while preserving queryability

Data Flow:
┌────────────────────────────────────────────┐
│ Ingest (15-second resolution)              │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 24 hours: Downsample to 1-minute     │
│ Aggregations: avg, min, max, count         │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 7 days: Downsample to 5-minute       │
│ Aggregations: avg, min, max, count         │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 30 days: Downsample to 1-hour        │
│ Delete original 5-minute data              │
└──────────────────────┬─────────────────────┘
                       │
┌──────────────────────┴─────────────────────┐
│ After 1 year: Downsample to 1-day          │
│ Delete original hourly data                │
└────────────────────────────────────────────┘
```

Aggregation Functions for Downsampling:
Choosing the right aggregation functions is critical. Different metrics require different aggregations to preserve meaningful information:
| Metric Type | Example | Aggregations to Preserve | Rationale |
|---|---|---|---|
| Counter (rate) | requests_total | sum (for rate calculation) | Rate = sum of deltas over interval |
| Gauge (level) | cpu_usage | avg, min, max | Captures typical + extremes |
| Histogram | latency_bucket | sum (per bucket) | Enables percentile reconstruction |
| Event count | errors | sum, count | Total and frequency |
| Temperature | sensor_temp | avg, min, max | Detect spikes and drops |
| Business metric | revenue | sum | Total over period |
Downsampling irreversibly discards information. A CPU spike that lasted 10 seconds becomes invisible in 5-minute averages. For metrics where brief anomalies matter, preserve max/min aggregations or consider longer high-resolution retention.
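To make these choices concrete, here is a minimal downsampling sketch in TimescaleDB (which also appears in the monitoring examples later on this page). The hypertable cpu_metrics and its columns are assumptions for illustration; the point is that the continuous aggregate keeps avg, min, max, and count so that short spikes remain visible after the raw 15-second data is dropped:

```sql
-- Assumed hypertable: cpu_metrics(time TIMESTAMPTZ, host TEXT, usage DOUBLE PRECISION)
CREATE MATERIALIZED VIEW cpu_metrics_5m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', time) AS bucket,
    host,
    avg(usage) AS avg_usage,
    min(usage) AS min_usage,
    max(usage) AS max_usage,   -- keeps short spikes visible after downsampling
    count(*)   AS sample_count
FROM cpu_metrics
GROUP BY bucket, host;

-- Keep the 5-minute aggregate fresh, then drop raw 15-second data after 7 days
SELECT add_continuous_aggregate_policy('cpu_metrics_5m',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

SELECT add_retention_policy('cpu_metrics', INTERVAL '7 days');
```

Because the 5-minute rollups are materialized into their own storage, dropping the raw chunks after seven days does not affect them.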
Modern time-series databases increasingly support tiered storage—automatically moving data between storage tiers based on age. This enables the best of both worlds: fast queries on recent data with economical long-term retention.
Common Storage Tiers:

- Hot: local NVMe/SSD or high-performance cloud block storage holding the most recent data for low-latency queries
- Warm: cheaper SSD/HDD or standard cloud storage for data that is queried occasionally
- Cold: object storage (e.g., S3 Standard) for historical data that is rarely queried but must stay online
- Archive: offline-class storage (e.g., S3 Glacier Deep Archive) for compliance data that is almost never read
```yaml
# Thanos: Prometheus with unlimited retention via object storage

# Thanos Sidecar config (runs alongside Prometheus)
# Uploads blocks to object storage
thanos:
  sidecar:
    objstore_config:
      type: S3
      config:
        bucket: my-thanos-metrics
        endpoint: s3.amazonaws.com
        region: us-east-1

# Thanos Compactor: manages retention and downsampling
# (these are flags passed to the compactor process)
#   --retention.resolution-raw=30d    # raw data (15s resolution)
#   --retention.resolution-5m=180d    # 5-minute downsampled data
#   --retention.resolution-1h=365d    # 1-hour downsampled data
#   --downsample.concurrency=4        # automatic downsampling

# Query latency by tier:
# - Hot (Prometheus): <100ms
# - Cold (S3 via Store Gateway): 1-5s
# - Thanos Query federates the tiers transparently
```

The best tiered storage implementations are query-transparent: users write the same query regardless of where data resides. Systems like Thanos and InfluxDB IOx automatically fan out queries to the appropriate tiers and merge results. This simplifies application code and enables gradual tiering refinement without query changes.
Different use cases demand different retention strategies. Here are proven patterns for common scenarios:
Pattern 1: SRE/Ops Monitoring
Goal: Fast incident response for recent data; trend analysis for capacity planning.
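One possible shape for the hot tier of this pattern, assuming a Prometheus-based stack like the Thanos setup shown earlier (the 7-day window and the size cap are illustrative values, not recommendations):

```bash
# Hot tier: 7 days of full-resolution data on local SSD, with a size cap as a safety net.
# --storage.tsdb.retention.time and --storage.tsdb.retention.size are standard Prometheus
# flags; older data is served from object storage via the Thanos sidecar and compactor.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=500GB
```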
Pattern 2: IoT Sensor Data
Goal: Real-time monitoring with long-term analytics for predictive maintenance.
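A sketch of how this pattern might look in TimescaleDB, assuming a sensor_data hypertable and the hot/total windows from the comparison table that follows: keep the last day uncompressed for real-time dashboards, compress older chunks, and drop data only once it leaves the analytics window.

```sql
-- Assumed hypertable: sensor_data(time TIMESTAMPTZ, sensor_id TEXT, value DOUBLE PRECISION)
ALTER TABLE sensor_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id'
);

-- Compress chunks once they leave the 24-hour hot window
SELECT add_compression_policy('sensor_data', INTERVAL '1 day');

-- Drop data entirely after the 2-year analytics window
SELECT add_retention_policy('sensor_data', INTERVAL '2 years');
```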
Pattern 3: Financial/Compliance
Goal: Regulatory compliance requires long retention with audit capability.
| Use Case | Hot Duration | Total Retention | Downsampling | Cost Profile |
|---|---|---|---|---|
| SRE/Ops Monitoring | 7 days | 1-3 years | Aggressive | Optimize for cost |
| IoT/Sensors | 24 hours | 90 days - 2 years | Moderate | Balance cost/resolution |
| Financial/Compliance | 30 days | 7+ years | None (regulatory) | Optimize for compliance |
| Application Analytics | 24 hours | 90 days | Aggressive | Optimize for cost |
| Scientific Research | 7 days | Forever | Selective | Preserve raw + aggregates |
```yaml
# Enterprise Monitoring Retention Policy
organization: acme-corp
policy_version: "2.0"
effective_date: "2024-01-01"

# Default policy for all metrics
default_policy:
  hot_retention: 7d
  hot_resolution: 15s
  warm_retention: 30d
  warm_resolution: 1m
  cold_retention: 365d
  cold_resolution: 1h
  archive_retention: 3y
  archive_resolution: 1d

# Override for critical infrastructure metrics
critical_infrastructure:
  applies_to:
    - measurement: kubernetes_*
    - measurement: database_*
    - tags: { tier: "critical" }
  hot_retention: 14d    # Longer for debugging
  hot_resolution: 10s   # Higher resolution
  warm_retention: 90d
  cold_retention: 2y

# Override for high-cardinality application metrics
high_cardinality_apps:
  applies_to:
    - measurement: app_request_*
    - tags: { cardinality: "high" }
  hot_retention: 3d     # Shorter due to volume
  warm_retention: 14d
  cold_retention: 90d
  # Drop high-cardinality labels in cold tier
  cold_label_policy:
    drop: ["request_id", "trace_id", "session_id"]
    keep: ["service", "method", "status", "region"]

# Compliance-sensitive data (no downsampling)
compliance_data:
  applies_to:
    - measurement: audit_*
    - measurement: financial_*
  hot_retention: 30d
  hot_resolution: original    # No downsampling
  cold_retention: 7y
  cold_resolution: original
  archive_storage: s3-glacier-deep-archive
  encryption: required
  deletion_approval: legal-team

# Aggregation functions for downsampling
downsampling_config:
  default:
    functions: [avg, min, max, count]
  counters:
    functions: [sum]
  histograms:
    functions: [sum]    # per bucket
```

Designing retention policies is one thing; operating them reliably in production is another. Here are critical operational considerations:
1. Gradual Rollout:
Never apply a new retention policy to all data at once. Start with a pilot: apply it to a single low-risk measurement or namespace, verify that storage shrinks as expected and that dashboards and alerts still work, then expand in stages.
2. Monitoring Retention Operations:
Retention jobs can fail silently. Monitor job success and failure counts, the age of the oldest data versus the expected retention window, and the storage actually reclaimed; the TimescaleDB queries below illustrate each check.
```sql
-- TimescaleDB: Monitor chunk retention

-- Check chunk age distribution per hypertable
SELECT
    hypertable_name,
    count(*)         AS chunk_count,
    min(range_start) AS oldest_chunk,
    max(range_end)   AS newest_chunk
FROM timescaledb_information.chunks
GROUP BY hypertable_name
ORDER BY chunk_count DESC;

-- Total on-disk size of a hypertable (replace 'metrics' with your hypertable)
SELECT pg_size_pretty(hypertable_size('metrics')) AS total_size;

-- Verify compression is working (per-chunk savings for one hypertable)
SELECT
    chunk_name,
    compression_status,
    before_compression_total_bytes,
    after_compression_total_bytes,
    round(
        (1 - after_compression_total_bytes::numeric
             / nullif(before_compression_total_bytes, 0)) * 100, 1
    ) AS space_saved_pct
FROM chunk_compression_stats('metrics')
WHERE compression_status = 'Compressed';

-- Monitor retention and compression job execution
SELECT
    j.job_id,
    j.application_name,
    j.schedule_interval,
    s.last_run_duration,
    s.last_successful_finish,
    s.total_successes,
    s.total_failures
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.proc_name LIKE '%retention%'
   OR j.proc_name LIKE '%compression%';

-- Alert if the oldest data exceeds the expected retention window
SELECT
    hypertable_name,
    min(range_start)         AS oldest_data,
    NOW() - min(range_start) AS data_age
FROM timescaledb_information.chunks
GROUP BY hypertable_name
HAVING NOW() - min(range_start) > INTERVAL '35 days'  -- Expected: 30 days
ORDER BY data_age DESC;
```

3. Handling Deletion Failures:
Retention operations can fail for mundane reasons: long-running queries holding locks on chunks due for deletion, unavailable or misconfigured object storage, expired credentials, or jobs that crash partway through.
Build alerting for these failures and have runbooks for recovery.
4. Testing Recovery:
Periodically verify that archived data is actually recoverable: restore a sample of archived data into a staging environment, run representative queries against it, and measure how long the end-to-end restore takes.
Once retention policies delete data, it's gone. Always ensure you have appropriate backups before enabling aggressive retention. For compliance-sensitive data, implement soft-delete (mark as deleted, purge later) rather than immediate deletion.
Retention policies don't exist in a vacuum. Regulatory requirements, legal holds, and data governance policies impose constraints that override pure cost optimization.
Key Compliance Frameworks:
| Regulation | Scope | Typical Retention | Key Requirements |
|---|---|---|---|
| SOX (Sarbanes-Oxley) | Financial records (US) | 7 years | Audit trail integrity, immutability |
| GDPR | Personal data (EU) | As short as possible | Right to deletion, purpose limitation |
| HIPAA | Healthcare data (US) | 6 years | Access logging, encryption at rest |
| PCI-DSS | Payment card data | 1 year (audit logs) | Audit trail of all access, encryption |
| SEC Rule 17a-4 | Broker-dealer records | 6 years (some 3) | WORM storage (write-once) |
| MiFID II | Trading records (EU) | 5-7 years | Transaction reconstruction |
Legal Hold Considerations:
When litigation or investigation is anticipated, organizations must implement legal holds—suspension of normal retention/deletion for relevant data. Time-series databases must support placing holds on specific series, tenants, or time ranges; exempting held data from retention and downsampling jobs; and documenting the scope of the hold for auditors.
GDPR Right to Erasure:
GDPR creates a tension: users can request deletion of their personal data, but your retention policies might keep it for operational reasons. Practical solutions include keeping personal identifiers out of time-series labels in the first place, pseudonymizing user identifiers so the linking key can be destroyed on request, and applying short retention to any series that does contain personal data.
Retention policy design for regulated industries should involve legal and compliance teams. The cost of over-retention (storage) is far less than the cost of under-retention (regulatory fines, inability to respond to discovery requests).
We've explored the complete landscape of time-series data retention. Let's consolidate the key insights:

- Storage economics make retention policies mandatory: costs grow without bound while the value of old data decays, and roughly 90% of queries touch only recent data.
- Downsampling reduces storage by 10-100x while preserving trends, typical levels, and extremes; choose aggregation functions that match each metric type.
- Tiered storage (hot, warm, cold, archive) matches storage cost to access patterns, ideally behind a query-transparent interface.
- Retention patterns should match the use case: aggressive downsampling for ops monitoring, original-resolution long retention for compliance.
- Retention jobs are production infrastructure: roll them out gradually, monitor them, and test that archives are recoverable.
- Regulatory requirements and legal holds override pure cost optimization.
What's Next:
Having covered time-stamped data optimization, database implementations, metrics infrastructure, and retention policies, we'll conclude with Use Cases and Trade-offs—a comprehensive decision framework for when and how to apply time-series databases in your architecture.
You now understand the complete lifecycle of time-series data retention—from economic drivers through implementation patterns to compliance considerations. This knowledge enables you to design retention policies that balance performance, cost, and regulatory requirements.