Log storage costs can become one of the largest line items in an engineering organization's budget. At scale, it's not uncommon for logging infrastructure to cost more than the application infrastructure it monitors.
Consider a medium-sized organization generating 1TB of logs per day. At $0.023/GB/month for S3 Standard storage, keeping 90 days of logs costs ~$2,000/month just for cold storage. But logs need fast search, which requires Elasticsearch or similar, where costs can reach $0.10-0.30/GB/month. Now those 90 days of searchable logs cost $9,000-27,000/month.
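To make that arithmetic concrete, here is a quick back-of-envelope sketch using the same illustrative volumes and prices (not vendor quotes):

```python
# Back-of-envelope monthly cost of keeping 90 days of logs at 1TB/day.
# Prices are the illustrative figures above, not quotes.
DAILY_VOLUME_GB = 1024   # 1TB/day
RETENTION_DAYS = 90

def monthly_storage_cost(price_per_gb_month: float) -> float:
    """Steady-state cost of holding RETENTION_DAYS of logs at a given $/GB/month."""
    stored_gb = DAILY_VOLUME_GB * RETENTION_DAYS
    return stored_gb * price_per_gb_month

print(f"S3 Standard:       ${monthly_storage_cost(0.023):,.0f}/month")  # ~$2,100
print(f"Searchable (low):  ${monthly_storage_cost(0.10):,.0f}/month")   # ~$9,200
print(f"Searchable (high): ${monthly_storage_cost(0.30):,.0f}/month")   # ~$27,600
```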
This page addresses the economics of logging: How to balance debugging needs, compliance requirements, and budget constraints through strategic retention policies, tiered storage, and cost optimization techniques.
By the end of this page, you'll understand how to design retention policies aligned with business needs, implement tiered storage to cut costs by 10x or more, navigate compliance requirements that mandate retention periods, calculate and optimize total logging costs, and make informed trade-offs between searchability and budget.
Log storage costs are multi-dimensional. Understanding the components helps identify optimization opportunities:
Cost Components:
| Storage Tier | Access Speed | Cost/GB/Month | Use Case |
|---|---|---|---|
| Elasticsearch Hot | < 100ms | $0.15-0.30 | Active debugging, real-time queries |
| Elasticsearch Warm | < 1s | $0.05-0.10 | Recent historical queries |
| Elasticsearch Cold/Frozen | 10-60s | $0.02-0.05 | Rare historical access |
| Object Storage (S3) | Minutes (restore) | $0.02-0.03 | Archive, compliance retention |
| S3 Glacier | Hours | $0.004 | Long-term archive, rare access |
| S3 Glacier Deep Archive | 12-48 hours | $0.00099 | Regulatory compliance, never accessed |
```
Log Volume: 1TB/day (compressed)

Retention Requirements:
  - Hot (searchable):  7 days
  - Warm (searchable): 14 days
  - Cold (archived):   60 days
  - Glacier:           1 year

Cost Breakdown (AWS-based estimates):

HOT TIER (7 days = 7TB):
  Storage: 7TB × $0.20/GB   = $1,400/month
  Compute: 6 × r5.2xlarge   = $2,700/month
  Subtotal:                   $4,100/month

WARM TIER (14 days = 14TB):
  Storage: 14TB × $0.07/GB  = $1,000/month
  Compute: 3 × r5.xlarge    = $675/month
  Subtotal:                   $1,675/month

COLD/FROZEN (60 days = 60TB):
  Storage: 60TB × $0.025/GB = $1,500/month
  Compute: 2 × r5.xlarge    = $450/month
  Subtotal:                   $1,950/month

GLACIER ARCHIVE (365 days = 365TB):
  Storage: 365TB × $0.004/GB = $1,460/month
  No active compute needed
  Subtotal:                    $1,460/month

TOTAL MONTHLY: ~$9,185/month

WITHOUT TIERING (all hot, 365 days):
  Storage alone: 365TB × $0.20/GB = $73,000/month
  Compute scaled proportionally:   ~$40,000/month
  TOTAL:                           ~$113,000/month

SAVINGS FROM TIERING: ~$103,000/month (92% reduction)
```

Cost scales linearly with log volume, but log volume tends to grow exponentially. Each new microservice adds logs. Debug logging left enabled multiplies volume. Over-instrumentation is invisible until the bill arrives. Monitor log volume growth as a key metric.
Retention policies answer three questions: How long do logs need to stay searchable for debugging? How long do regulations require them to be kept? How much is the organization willing to spend?
Retention Policy Design Factors:
| Industry/Use Case | Hot (Searchable) | Warm | Archive | Total Retention |
|---|---|---|---|---|
| SaaS (General) | 7 days | 30 days | 90 days | 90 days |
| E-Commerce | 14 days | 60 days | 1 year | 1 year |
| Financial Services | 30 days | 90 days | 7 years | 7 years |
| Healthcare (HIPAA) | 30 days | 1 year | 6 years | 6 years |
| PCI-DSS (Payments) | 90 days | 1 year | 1 year | 1 year (minimum) |
| Government/Defense | 90 days | 1 year | Indefinite | Indefinite |
| Startup (Cost-Focused) | 3 days | 7 days | 30 days | 30 days |
```yaml
# retention-policy.yaml
# Different log types have different retention needs

policies:
  # Application debug logs - high volume, low value over time
  application-debug:
    description: "DEBUG and TRACE level logs"
    hot_days: 1
    warm_days: 0
    archive_days: 0
    total_retention_days: 1
    sampling_after_hot: 0.01   # Keep 1% sample

  # Application operational logs
  application-info:
    description: "INFO and WARN level logs"
    hot_days: 7
    warm_days: 14
    archive_days: 69
    total_retention_days: 90

  # Application error logs - need longer for investigation
  application-error:
    description: "ERROR and FATAL level logs"
    hot_days: 14
    warm_days: 76
    archive_days: 275
    total_retention_days: 365

  # Security audit logs - compliance requires long retention
  security-audit:
    description: "Authentication, authorization, admin actions"
    hot_days: 30
    warm_days: 335
    archive_days: 2555          # 7 years
    total_retention_days: 2920  # 8 years
    immutable: true
    legal_hold_eligible: true

  # Infrastructure logs - medium retention
  infrastructure:
    description: "Kubernetes, network, database logs"
    hot_days: 7
    warm_days: 23
    archive_days: 60
    total_retention_days: 90

  # Access logs - high volume, summarize instead of retain
  access-logs:
    description: "HTTP request/response logs"
    hot_days: 3
    warm_days: 4
    archive_days: 0
    total_retention_days: 7
    aggregate_after_hot: true   # Convert to metrics
```

Not all logs are equal. DEBUG logs (often 50%+ of volume) rarely need more than 1-day retention. Security audit logs (typically 1% of volume) may need 7-year retention. Applying uniform retention to all logs wastes enormous resources. Classify logs and apply appropriate policies.
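A housekeeping job can resolve an index's target tier from this file. Here is a minimal sketch, assuming the YAML above is saved as `retention-policy.yaml` and PyYAML is installed (the function name and logic are illustrative, not part of any tool):

```python
# Minimal sketch: map a log class and index age onto a target tier,
# using the retention-policy.yaml structure shown above.
import yaml

def resolve_tier(policy: dict, age_days: int) -> str:
    """Return 'hot', 'warm', 'archive', or 'delete' for a given index age."""
    hot_limit = policy.get("hot_days", 0)
    warm_limit = hot_limit + policy.get("warm_days", 0)
    total_limit = policy.get("total_retention_days", warm_limit)
    if age_days < hot_limit:
        return "hot"
    if age_days < warm_limit:
        return "warm"
    if age_days < total_limit:
        return "archive"
    return "delete"

with open("retention-policy.yaml") as f:
    policies = yaml.safe_load(f)["policies"]

print(resolve_tier(policies["application-info"], age_days=10))   # warm
print(resolve_tier(policies["application-debug"], age_days=3))   # delete
print(resolve_tier(policies["security-audit"], age_days=400))    # archive
```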
Tiered storage moves data to progressively cheaper storage as it ages, matching storage cost to access frequency. This is the single most effective cost optimization for logging.
The Hot/Warm/Cold Model:
```
Time →       Day 0-7          Day 7-30         Day 30-365       Day 365+
             ────────────     ────────────     ────────────     ────────────
             ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
             │   HOT    │ ──▶ │   WARM   │ ──▶ │   COLD   │ ──▶ │ ARCHIVE  │
             │          │     │          │     │          │     │          │
             │ NVMe SSD │     │ Standard │     │ Frozen   │     │ Object   │
             │ Full RAM │     │ SSD      │     │ Index    │     │ Store    │
             │ Replicas │     │ Less RAM │     │ No RAM   │     │ Snapshots│
             └──────────┘     └──────────┘     └──────────┘     └──────────┘

Query:       <100ms           <1s              10-60s           Minutes+
Cost:        $$$$$            $$$              $$               $
Usage:       Active           Recent           Rare             Compliance
             debugging        investigation    investigation    only
```
{ "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }, "set_priority": { "priority": 100 } } }, "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 }, "allocate": { "number_of_replicas": 0, "require": { "data": "warm" } }, "set_priority": { "priority": 50 } } }, "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } }, "freeze": {} } }, "frozen": { "min_age": "90d", "actions": { "searchable_snapshot": { "snapshot_repository": "logs-archive-repo" } } }, "delete": { "min_age": "365d", "actions": { "delete": {} } } } }}| Transition | Operation | Impact | Duration |
|---|---|---|---|
| Hot → Warm | Shrink + Forcemerge | Index becomes read-only, smaller footprint | Minutes per shard |
| Warm → Cold | Freeze + Relocate | Index unloaded from RAM until queried | Quick (metadata change) |
| Cold → Frozen | Searchable Snapshot | Data moves to object store, mount on query | Minutes-hours depending on size |
| Frozen → Delete | Snapshot Delete | Data permanently removed from all stores | Quick |
Frozen tier queries can take 30-60+ seconds as indices are restored from object storage on demand. This is acceptable for compliance queries run once a month—not for real-time debugging. Ensure hot/warm tiers cover your debugging time windows.
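The ILM policy above only takes effect once it is registered with the cluster and attached to the log indices. A minimal sketch of doing that through the Elasticsearch REST API, assuming the policy JSON is saved locally (the cluster URL, policy name, and template name are placeholders):

```python
# Minimal sketch: register the lifecycle policy above and attach it to new log
# indices via an index template. URL and names are example values.
import json
import requests

ES_URL = "http://elasticsearch.internal:9200"

with open("logs-ilm-policy.json") as f:
    policy = json.load(f)

# Register (or update) the lifecycle policy
requests.put(f"{ES_URL}/_ilm/policy/logs-tiered", json=policy).raise_for_status()

# Attach the policy to future indices matching logs-*
template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "index.lifecycle.name": "logs-tiered",
            "index.lifecycle.rollover_alias": "logs"
        }
    }
}
requests.put(f"{ES_URL}/_index_template/logs-tiered", json=template).raise_for_status()
```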
Beyond tiering, reducing log volume at the source provides the most fundamental cost savings. Every byte not logged is a byte not stored, indexed, and paid for.
Volume Reduction Hierarchy:
1. Don't generate it: disable DEBUG/TRACE in production and drop health-check and other zero-value noise at the collector.
2. Sample it: keep a representative fraction (for example, 1% of access logs) of high-volume, low-value events.
3. Rate-limit it: cap repeated identical errors so a retry storm doesn't multiply storage costs.
4. Aggregate it into metrics: instead of logging every cache_hit, increment a cache_hits metric. Logs store context; metrics store counts.

The sampling and rate-limiting techniques look like this in code:
```python
import random
import time
from collections import defaultdict

import structlog


class SampledLogger:
    """Logger that samples high-frequency events."""

    def __init__(self, base_logger, sample_rate=0.01):
        self.logger = base_logger
        self.sample_rate = sample_rate

    def info(self, event, **kwargs):
        # Always log if marked critical, otherwise keep a random sample
        if kwargs.pop('always_log', False) or random.random() < self.sample_rate:
            self.logger.info(event, sampled=True, sample_rate=self.sample_rate, **kwargs)


class RateLimitedLogger:
    """Logger that rate-limits identical messages."""

    def __init__(self, base_logger, max_per_minute=10):
        self.logger = base_logger
        self.max_per_minute = max_per_minute
        self.event_counts = defaultdict(lambda: {'count': 0, 'last_reset': time.time()})

    def error(self, event, **kwargs):
        # Create key from event and critical identifying fields
        key = f"{event}:{kwargs.get('error_type', '')}:{kwargs.get('service', '')}"
        now = time.time()
        entry = self.event_counts[key]

        # Reset counter if a minute has passed
        if now - entry['last_reset'] > 60:
            entry['count'] = 0
            entry['last_reset'] = now

        entry['count'] += 1

        # Log if under limit, or on round numbers for suppressed events
        if entry['count'] <= self.max_per_minute:
            self.logger.error(event, **kwargs)
        elif entry['count'] in [100, 1000, 10000]:
            self.logger.error(
                event,
                suppressed_count=entry['count'],
                note="Rate limited - showing count milestone",
                **kwargs
            )


# Example usage
base_logger = structlog.get_logger()
access_logger = SampledLogger(base_logger, sample_rate=0.01)      # 1% sample
error_logger = RateLimitedLogger(base_logger, max_per_minute=10)

# High-frequency access logging
access_logger.info("http_request", path="/api/users", status=200)

# Rate-limited error logging
for i in range(1000):
    error_logger.error("database_connection_failed", error_type="TimeoutError")
# Only logs 10 times + milestones at 100 and 1000
```

| Technique | Typical Reduction | Trade-off |
|---|---|---|
| Disable DEBUG in production | 30-60% | Less detail when investigating |
| Filter health check logs | 5-20% | None (health in metrics) |
| Sample access logs at 1% | 20-40% | May miss specific request |
| Rate-limit duplicate errors | 10-30% | Lose count accuracy |
| Aggregate to metrics | 10-50% | Context lost, counts retained |
| Truncate stack traces | 5-15% | Deep debugging harder |
| Combined optimizations | 60-80% | Careful balance required |
Use head-based sampling (decide at request start) not tail-based (decide after request completes). This ensures correlated logs across services are all sampled or all dropped. Random sampling per log entry fragments traces.
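A minimal sketch of head-based sampling, assuming the decision is carried in a custom header so downstream services honor it (the header name and 1% rate are illustrative choices, not a standard):

```python
# Minimal sketch of head-based sampling: the keep/drop decision is made once at
# the edge and propagated, so every service logs (or drops) the same request.
import random

SAMPLE_RATE = 0.01
SAMPLING_HEADER = "X-Log-Sampled"   # hypothetical header propagated downstream

def decide_at_ingress(headers: dict) -> bool:
    """Honor an upstream decision if present; otherwise decide here."""
    if SAMPLING_HEADER in headers:
        return headers[SAMPLING_HEADER] == "1"
    return random.random() < SAMPLE_RATE

def outgoing_headers(sampled: bool) -> dict:
    """Headers to attach to downstream calls so the decision travels with the request."""
    return {SAMPLING_HEADER: "1" if sampled else "0"}

# At request start (no upstream header here, so we decide locally):
sampled = decide_at_ingress({})
if sampled:
    pass  # log request details as usual; downstream services will do the same
downstream_headers = outgoing_headers(sampled)  # attach to outbound HTTP calls
```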
Retention policies must satisfy regulatory and legal requirements. These requirements often conflict with cost optimization goals—compliance typically demands longer retention than operationally necessary.
Key Regulatory Frameworks:
| Regulation | Scope | Log Retention Requirements | Notes |
|---|---|---|---|
| PCI-DSS | Payment card data | 1 year (3 months immediately available) | Audit trail for all access to cardholder data |
| HIPAA | Healthcare data (US) | 6 years | Includes access logs for PHI |
| SOX | Financial reporting (US) | 7 years | Audit trails for financial systems |
| GDPR | Personal data (EU) | As long as necessary (minimize) | Conflicts with long retention—anonymize or delete |
| CCPA | Personal data (California) | 12 months (consumer requests) | Must be able to delete on request |
| SEC Rule 17a-4 | Financial records (US) | 6 years (2 immediately accessible) | Broker-dealer specific |
| GLBA | Financial institutions (US) | At least 5 years | Customer information safeguards |
```yaml
# Compliance-driven retention configuration

data_classification:
  security_audit:
    description: "Authentication, authorization, admin actions"
    contains_pii: true
    regulatory_frameworks: [SOX, PCI-DSS, HIPAA]
    minimum_retention_days: 2555   # 7 years (SOX requirement)
    storage:
      hot_days: 30
      warm_days: 335
      archive_days: 2190
      archive_tier: GLACIER
    immutability:
      enabled: true
      method: WORM_STORAGE
    deletion:
      requires_approval: true
      audit_trail: true

  access_logs_with_pii:
    description: "HTTP access logs containing user IDs"
    contains_pii: true
    regulatory_frameworks: [GDPR, CCPA]
    maximum_retention_days: 365    # GDPR data minimization
    right_to_deletion: true
    anonymization:
      enabled: true
      fields: [user_id, client_ip, user_agent]
      anonymize_after_days: 30
    storage:
      hot_days: 7
      warm_days: 23
      archive_days: 0              # Anonymize or delete after warm

  operational_logs:
    description: "System logs without PII"
    contains_pii: false
    regulatory_frameworks: []
    storage:
      hot_days: 7
      warm_days: 23
      archive_days: 60
      total_retention_days: 90

legal_hold:
  enabled: true
  trigger: MANUAL    # Legal team initiates
  scope: BY_TAG      # Hold specific indices by tag
  notification:
    - legal@company.com
    - sre@company.com
```

GDPR's data minimization principle may conflict with long compliance retention. You cannot keep data "just in case" if it's not necessary. Work with legal to define specific purposes for retention. Consider early anonymization: keep logs but remove identifiers after the debugging window passes.
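The `access_logs_with_pii` class above anonymizes `user_id`, `client_ip`, and `user_agent` after 30 days. A minimal sketch of that transformation, assuming JSON log records and a keyed hash as the replacement value (whether keyed hashing counts as anonymization or merely pseudonymization is a question for your legal team):

```python
# Minimal sketch: replace identifier fields in a log record with keyed hashes
# once the debugging window has passed. Field list mirrors the config above.
import hashlib
import hmac

ANONYMIZE_FIELDS = ["user_id", "client_ip", "user_agent"]
HASH_KEY = b"rotate-me-regularly"   # hypothetical secret, stored outside the logs

def anonymize_record(record: dict) -> dict:
    """Return a copy of the record with identifier fields replaced by keyed hashes."""
    cleaned = dict(record)
    for field in ANONYMIZE_FIELDS:
        if cleaned.get(field) is not None:
            digest = hmac.new(HASH_KEY, str(cleaned[field]).encode(), hashlib.sha256)
            cleaned[field] = digest.hexdigest()[:16]
    return cleaned

original = {"event": "http_request", "user_id": "u-1234", "client_ip": "203.0.113.7", "status": 200}
print(anonymize_record(original))   # user_id and client_ip are now opaque hashes
```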
Effective cost management requires visibility into where costs originate and active optimization based on that visibility.
Cost Attribution:
Know which teams, services, and log types consume resources. Without attribution, cost optimization is guesswork.
```yaml
# Vector (log collector) exposes metrics for cost attribution

groups:
  - name: logging_cost
    rules:
      # Bytes ingested by service
      - record: logging:bytes_ingested_by_service:rate5m
        expr: sum by (service) (rate(vector_component_sent_bytes_total{component_type="sink"}[5m]))

      # Estimate monthly cost by service ($0.10 per GB ingested, simplified)
      - record: logging:estimated_monthly_cost_by_service
        expr: |
          (logging:bytes_ingested_by_service:rate5m * 2592000)  # bytes/sec × seconds/month
          / 1073741824                                          # to GB
          * 0.10                                                # cost per GB

      # Top cost-generating services
      - alert: LoggingCostAnomaly
        expr: |
          logging:estimated_monthly_cost_by_service
            > 1.5 * avg_over_time(logging:estimated_monthly_cost_by_service[7d])
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} logging costs increased 50%+"
          description: "Service {{ $labels.service }} is generating significantly more logs than usual."

      # Detect debug logging enabled
      - alert: DebugLoggingInProduction
        expr: |
          sum by (service) (rate(log_messages_total{level="debug",environment="production"}[5m])) > 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "DEBUG logging enabled for {{ $labels.service }} in production"
          description: "Consider disabling DEBUG logging to reduce costs."
```

| Optimization | Effort | Typical Savings |
|---|---|---|
| Enable compression (gzip/zstd) | Low | 30-50% storage |
| Reduce replica count (warm/cold) | Low | 33-50% storage |
| Implement ILM if not present | Medium | 60-80% long-term |
| Disable DEBUG in production | Medium | 30-60% volume |
| Sample access logs | Medium | 20-40% volume |
| Use searchable snapshots | Medium | 40-70% cold storage |
| Migrate to Loki (if applicable) | High | 50-90% total cost |
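Two of the low-effort rows above, compression and replica reduction, are plain index settings. A minimal sketch of applying them via the Elasticsearch REST API (cluster URL and index patterns are placeholders; the compression codec is a static setting, so it goes into the template for new indices):

```python
# Minimal sketch: enable best_compression for new log indices and drop replicas
# on warm indices. URL and patterns are example values.
import requests

ES_URL = "http://elasticsearch.internal:9200"

# best_compression trades some CPU for roughly 30-50% smaller indices;
# it must be set at index creation, hence the index template.
requests.put(
    f"{ES_URL}/_index_template/logs-compression",
    json={
        "index_patterns": ["logs-*"],
        "template": {"settings": {"index.codec": "best_compression"}},
    },
).raise_for_status()

# Replica count can be changed on existing (e.g. warm) indices at any time.
requests.put(
    f"{ES_URL}/logs-warm-*/_settings",
    json={"index": {"number_of_replicas": 0}},
).raise_for_status()
```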
Create a dedicated dashboard showing: total log volume by service, estimated cost by service, log level distribution in production, and tier utilization. Review weekly with team leads. Visibility drives behavior—teams optimize when they see their costs.
Reliable data deletion is as important as reliable storage. Retention policies only reduce costs if deletion actually happens.
Deletion Mechanisms:
- ILM/lifecycle policies: let the storage engine delete indices automatically via delete.min_age in the policy. Monitor for stuck ILM indices.
- Scheduled deletion jobs: a cron job that runs the equivalent of DELETE logs-* where date < (now - retention). Simple but rigid.
- Custom enforcement scripts: encode per-class policies, legal holds, and archive-before-delete checks, as in the example below.
```python
#!/usr/bin/env python3
"""Log retention enforcement script.

Run daily via cron or Kubernetes CronJob.
"""

import logging
from datetime import datetime

import boto3
from elasticsearch import Elasticsearch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
ES_HOST = "elasticsearch.internal:9200"
S3_BUCKET = "logs-archive"

RETENTION_POLICIES = {
    "logs-debug-*": {"max_age_days": 1, "delete": True},
    "logs-info-*": {"max_age_days": 30, "archive_after_days": 7, "delete_after_archive": True},
    "logs-error-*": {"max_age_days": 365, "archive_after_days": 30},
    "logs-audit-*": {"max_age_days": 2920, "archive_after_days": 90, "immutable": True},
}


def extract_date(idx_name):
    """Parse the date suffix from an index name, e.g. logs-info-2024.01.15 → datetime."""
    return datetime.strptime(idx_name.rsplit('-', 1)[-1], "%Y.%m.%d")


def enforce_retention():
    es = Elasticsearch([ES_HOST])
    s3 = boto3.client('s3')

    for pattern, policy in RETENTION_POLICIES.items():
        indices = es.cat.indices(index=pattern, format='json')

        for idx in indices:
            idx_name = idx['index']
            age_days = (datetime.now() - extract_date(idx_name)).days

            # Check immutability / legal hold
            if policy.get('immutable') or is_under_legal_hold(idx_name):
                logger.info(f"Skipping {idx_name} (immutable/legal hold)")
                continue

            # Archive if past threshold
            if policy.get('archive_after_days') and age_days >= policy['archive_after_days']:
                if not is_archived(idx_name, s3):
                    logger.info(f"Archiving {idx_name}")
                    archive_index(es, idx_name, s3)

            # Delete if past max age
            if age_days >= policy['max_age_days']:
                if policy.get('delete_after_archive') and not is_archived(idx_name, s3):
                    logger.warning(f"Cannot delete {idx_name}: not archived")
                    continue
                logger.info(f"Deleting {idx_name} (age: {age_days} days)")
                delete_index(es, idx_name)


def archive_index(es, idx_name, s3):
    """Snapshot index to S3 before deletion."""
    snapshot_name = f"{idx_name}-{datetime.now().strftime('%Y%m%d')}"
    es.snapshot.create(
        repository="logs-archive-repo",
        snapshot=snapshot_name,
        body={"indices": idx_name},
        wait_for_completion=True
    )
    logger.info(f"Created snapshot {snapshot_name}")


def delete_index(es, idx_name):
    """Delete index with safety checks."""
    es.indices.delete(index=idx_name)
    logger.info(f"Deleted index {idx_name}")


def is_under_legal_hold(idx_name):
    """Check if index is under legal hold."""
    # Implementation: check metadata or an external hold registry
    return False


def is_archived(idx_name, s3):
    """Check if an index snapshot exists."""
    # Implementation: check the snapshot repository
    return False


if __name__ == "__main__":
    enforce_retention()
```

Always verify deletions are occurring. Set up alerts for: indices older than the retention threshold still existing, disk usage not decreasing as expected, and ILM policies in ERROR state. Silent deletion failures lead to storage exhaustion and surprise costs.
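A minimal sketch of that verification, reusing the date-suffixed index naming and the shape of the retention map from the script above (the alerting hook itself is left as a placeholder):

```python
# Minimal sketch: detect indices that are past their retention limit but still exist.
from datetime import datetime

from elasticsearch import Elasticsearch

# Same shape as the RETENTION_POLICIES map in the enforcement script above
RETENTION_POLICIES = {"logs-info-*": {"max_age_days": 30}}

def find_overdue_indices(es, retention_policies):
    """Return (index, age_days, max_age_days) tuples for indices past their limit."""
    overdue = []
    for pattern, policy in retention_policies.items():
        for idx in es.cat.indices(index=pattern, format='json'):
            suffix = idx['index'].rsplit('-', 1)[-1]   # e.g. 2024.01.15
            age_days = (datetime.now() - datetime.strptime(suffix, "%Y.%m.%d")).days
            if age_days > policy['max_age_days']:
                overdue.append((idx['index'], age_days, policy['max_age_days']))
    return overdue

# Run after the enforcement job; page on-call or post to Slack if non-empty (hook not shown)
overdue = find_overdue_indices(Elasticsearch(["elasticsearch.internal:9200"]), RETENTION_POLICIES)
```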
Log retention strategy balances debugging needs, compliance requirements, and cost constraints. Done poorly, logging becomes a runaway expense. Done well, tiered retention provides necessary visibility at manageable cost.
Key Takeaways:
- Classify logs and give each class its own retention policy; uniform retention wastes enormous resources.
- Tiered storage (hot/warm/cold/frozen/archive) is the single biggest cost lever, often cutting total cost by 90%+ versus keeping everything hot.
- Reduce volume at the source: disable DEBUG in production, sample access logs, rate-limit duplicate errors, and convert high-frequency events to metrics.
- Compliance frameworks (PCI-DSS, HIPAA, SOX, GDPR) set retention floors and ceilings; involve legal, and anonymize identifiers early where possible.
- Attribute costs to teams and services, and verify that deletion actually happens.
What's next:
The final piece of production logging is connecting logs across distributed systems. The next page covers correlation IDs—how to link logs from a single request across dozens of microservices, enabling end-to-end debugging of distributed transactions.
You now understand the economics of logging at scale and can design cost-effective retention strategies. You know how to implement tiered storage, satisfy compliance requirements, and optimize logging costs. Next, we'll learn how correlation IDs tie it all together for distributed debugging.