Cardinality—the number of unique time series in your metrics system—is the primary scaling dimension of observability systems. It determines memory usage, query performance, and ultimately, whether your monitoring survives the production environment it's meant to observe.
Cardinality explosions are subtle. A single well-intentioned label addition can transform a manageable 10,000-series deployment into an unmanageable 10-million-series catastrophe. By the time you notice degraded query performance or memory exhaustion, the damage is done.
This page provides a deep understanding of cardinality: how to calculate it, how to identify dangerous patterns, and how to design metrics that scale with your systems rather than exploding beyond them.
By the end of this page, you will understand what cardinality is, how to calculate cardinality impact before deploying metrics, how to identify cardinality bombs in existing systems, and practical strategies for keeping cardinality under control.
What is Cardinality?
In time-series databases, cardinality refers to the total number of unique time series. Each unique combination of metric name and label values creates a distinct time series.
Consider this metric:
http_requests_total{method="GET", endpoint="/users", status="200"}
This is ONE time series. If you have 5 HTTP methods, 50 endpoints, and 10 status codes in use:
The total cardinality for this metric is: 5 × 50 × 10 = 2,500 time series
Why Cardinality Matters:
Every time series consumes resources:
| Resource | Per-Series Cost | Impact |
|---|---|---|
| Memory | ~1-3 KB | Active series kept in RAM |
| Storage | Variable | Each series stored separately |
| Query Time | Linear+ | More series = slower queries |
| Index Size | Logarithmic | Label index grows with cardinality |
| Compaction | Linear | Background work scales with series |
The Cardinality Equation:
For a single metric with N labels, cardinality is:
Cardinality = ∏(i=1 to N) |values_i|
# Where |values_i| is the number of unique values for label i
This is multiplicative, not additive. Adding a new label doesn't add to cardinality—it multiplies it.
Example Calculation:
http_request_duration_seconds{method, endpoint, status, customer_tier}
|method| = 5 (GET, POST, PUT, DELETE, PATCH)
|endpoint| = 100 (API endpoints)
|status| = 20 (HTTP status codes used)
|customer_tier| = 3 (free, pro, enterprise)
Cardinality = 5 × 100 × 20 × 3 = 30,000 series
# But this is a histogram with 10 buckets + sum + count = 12 series each:
Actual Cardinality = 30,000 × 12 = 360,000 series!
Histograms dramatically amplify cardinality. A histogram with default buckets (10 buckets) generates 12 time series per label combination. Before adding labels to histograms, multiply your expected cardinality by 12.
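To sanity-check a proposed metric before shipping it, the multiplication can be scripted. The following is a minimal Python sketch, not a measurement tool; the function name is made up here, and the buckets-plus-two multiplier follows the rule of thumb above (10 buckets → ×12):

```python
from math import prod

def estimate_series(label_value_counts, histogram_buckets=0):
    """Rough series count for one metric.

    label_value_counts: {label_name: number_of_unique_values}
    histogram_buckets:  0 for counters/gauges; for histograms, pass the
    bucket count and _sum/_count are added per label combination,
    following the rule of thumb above (10 buckets -> x12).
    """
    base = prod(label_value_counts.values()) if label_value_counts else 1
    multiplier = histogram_buckets + 2 if histogram_buckets else 1
    return base * multiplier

# The histogram example above: 5 x 100 x 20 x 3 combinations, x12 for the histogram
print(estimate_series(
    {"method": 5, "endpoint": 100, "status": 20, "customer_tier": 3},
    histogram_buckets=10,
))  # 360000
```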
Just as you budget compute and storage, you must budget cardinality. Different Prometheus-compatible systems have different practical limits:
| System | Practical Limit | Notes |
|---|---|---|
| Prometheus (16GB RAM) | 1-2 million series | Single instance |
| Prometheus (64GB RAM) | 5-8 million series | Single instance |
| Thanos / Cortex / Mimir | 50-500+ million | Horizontally scaled |
| Managed (Datadog, etc.) | Billing-based | Pay per active series |
Calculating Your Budget:
Prometheus Memory ≈ (active_series × 1-3 KB) + query_overhead
Example:
- Target: 32GB RAM Prometheus instance
- Query overhead: ~8GB
- Available for series: 24GB
- Series limit: 24GB / 2KB = 12 million series (theoretical max)
- Safe limit: 8-10 million (leave headroom)
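The same back-of-the-envelope calculation as a small Python sketch; the 2 KB-per-series figure and the 8 GB query overhead are the assumptions from the example above, not measured values:

```python
def series_budget(total_ram_gb, query_overhead_gb=8, bytes_per_series=2048,
                  headroom_fraction=0.25):
    """Back-of-the-envelope series budget for one Prometheus instance.

    bytes_per_series (1-3 KB is typical) and query_overhead_gb are
    assumptions carried over from the example above; headroom_fraction
    reserves room for spikes and series churn.
    """
    available_bytes = (total_ram_gb - query_overhead_gb) * 1024**3
    theoretical_max = available_bytes / bytes_per_series
    return int(theoretical_max * (1 - headroom_fraction))

print(series_budget(32))  # ~9.4 million series, inside the 8-10M safe range
```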
Planning Your Budget:
A practical approach is to allocate cardinality budgets per service or team:
Total Budget: 2,000,000 series
Infrastructure:
- node_exporter (per node): ~500 series
- 100 nodes: 50,000 series (2.5%)
Kubernetes Metrics:
- kube-state-metrics: ~200 series per pod
- 500 pods: 100,000 series (5%)
Application Metrics:
- Average per service: 10,000 series
- 100 services: 1,000,000 series (50%)
Custom Business Metrics:
- Reserved: 200,000 series (10%)
Headroom:
- Reserved: 650,000 series (32.5%)
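To keep a plan like this honest, the allocations can live in code or config and be checked automatically. A minimal sketch, with names and numbers mirroring the illustrative plan above:

```python
# Illustrative cardinality budget tracker; numbers mirror the plan above
TOTAL_BUDGET = 2_000_000

allocations = {
    "infrastructure (node_exporter, 100 nodes)": 50_000,
    "kubernetes (kube-state-metrics, 500 pods)": 100_000,
    "applications (100 services x 10k)": 1_000_000,
    "custom business metrics": 200_000,
}

allocated = sum(allocations.values())
for name, series in allocations.items():
    print(f"{name}: {series:,} series ({series / TOTAL_BUDGET:.1%})")

headroom = TOTAL_BUDGET - allocated
print(f"headroom: {headroom:,} series ({headroom / TOTAL_BUDGET:.1%})")  # 32.5%
```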
Use PromQL to track your budget: 'prometheus_tsdb_head_series' shows the current count of active series. Set alerts at 70% and 90% of capacity. 'increase(prometheus_tsdb_head_series_created_total[1h])' shows the series creation rate; sudden spikes indicate cardinality bombs.
A cardinality bomb is a metric design that creates unbounded or explosively large numbers of time series. Here are the most common patterns:
Two especially common sources are raw URLs with query strings (/search?q=... creates a series for every unique query) and unnormalized path segments (/users/12345 instead of /users/:id).
```python
from prometheus_client import Counter, Histogram

# ❌ CARDINALITY BOMB: User ID as label
# If you have 1 million users, this creates 1 million series
requests_by_user = Counter(
    'http_requests_by_user_total',
    'Requests per user',
    ['user_id']  # BOMB! Unbounded
)

# ❌ CARDINALITY BOMB: Request ID as label
# Every request creates a new series - grows infinitely
request_latency = Histogram(
    'http_request_latency_seconds',
    'Request latency',
    ['request_id']  # BOMB! Every request is unique
)

# ❌ CARDINALITY BOMB: Full path with IDs
# /users/123, /users/456, etc. each create new series
requests_by_path = Counter(
    'http_requests_total',
    'HTTP requests',
    ['full_path']  # BOMB! Dynamic path segments
)

# ❌ CARDINALITY BOMB: IP address
# Thousands of unique IPs, especially during attacks
requests_by_ip = Counter(
    'http_requests_by_ip_total',
    'Requests by IP',
    ['client_ip']  # BOMB! Thousands of IPs
)

# ❌ CARDINALITY BOMB: Error message as label
# Stack traces, connection strings, etc.
errors_by_message = Counter(
    'application_errors_total',
    'Errors by message',
    ['error_message']  # BOMB! Free-form text
)
```

How to Detect Cardinality Bombs:
Use PromQL to find high-cardinality metrics:
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__!=""}))
# Metrics with more than 10,000 series
count by (__name__)({__name__!=""}) > 10000
# Cardinality contributed by one label (replace label_name with a real label)
count by (label_name)(http_requests_total)
# Series created in the last hour
increase(prometheus_tsdb_head_series_created_total[1h])
Warning Signs:
prometheus_tsdb_head_series growing continuously is the clearest signal.

Cardinality bombs often cascade. High cardinality causes slow queries. Slow queries time out. Timeouts trigger retries. Retries create more load. More load creates more slowness. By the time you notice, your observability system may be unusable.
Sometimes you legitimately need to track high-cardinality dimensions. Here are safe patterns:
Pattern 1: Aggregate in Application
Instead of exposing per-user metrics, aggregate in your application and expose summaries:
```go
// ❌ BAD: Per-user metrics (unbounded cardinality)
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "requests_total"},
    []string{"user_id"}, // Millions of users!
)

// ✅ GOOD: Aggregate by user tier
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "requests_total"},
    []string{"user_tier"}, // free, pro, enterprise = 3 values
)

// ✅ GOOD: Track distributions without per-user labels
requestsPerUser := prometheus.NewHistogram(
    prometheus.HistogramOpts{
        Name:    "requests_per_user_distribution",
        Help:    "Distribution of requests per user",
        Buckets: prometheus.ExponentialBuckets(1, 2, 10),
    },
)

// In your code: periodically sample user request counts
func recordUserRequestDistribution() {
    for _, user := range users {
        requestCount := getRequestCount(user.ID)
        requestsPerUser.Observe(float64(requestCount))
    }
}
```

Pattern 2: Use Exemplars
Exemplars attach high-cardinality trace IDs to metric samples without creating new series:
```go
// Exemplars: attach trace_id without creating new series
requestDuration := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint"}, // Bounded labels only
)

func handleRequest(ctx context.Context, w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    traceID := getTraceID(ctx)

    // ... handle request ...

    duration := time.Since(start).Seconds()

    // Record with exemplar - trace_id is NOT a label
    requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
        ObserveWithExemplar(duration, prometheus.Labels{
            "traceID": traceID, // Exemplar, not label!
        })
}

// In Prometheus, exemplars are stored separately from series
// You can query them to drill down to specific traces
```

Pattern 3: Bucketing/Categorization
Transform unbounded values into bounded categories:
| Raw Value | Bucketed Value |
|---|---|
| Request size: 1,234 bytes | size_bucket: "1KB-10KB" |
| Response time: 156ms | latency_bucket: "100-500ms" |
| User age: 34 days | cohort: "month_1" |
| IP: 192.168.1.42 | ip_class: "private" |
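In application code this usually amounts to a small mapping function applied before the value becomes a label. A Python sketch; the bucket boundaries and category names are illustrative assumptions chosen to match the table above:

```python
import ipaddress

def size_bucket(size_bytes: int) -> str:
    """Collapse an unbounded request size into a handful of label values."""
    for limit, name in [(1_024, "<1KB"), (10_240, "1KB-10KB"),
                        (102_400, "10KB-100KB"), (1_048_576, "100KB-1MB")]:
        if size_bytes < limit:
            return name
    return ">=1MB"

def ip_class(ip: str) -> str:
    """Label the class of a client IP instead of the raw address."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"

print(size_bucket(1_234))        # "1KB-10KB"
print(ip_class("192.168.1.42"))  # "private"
```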
Pattern 4: Use Logs/Traces for Details
Metrics answer "how much?" and "how often?" For "who?" and "why?", use logs and traces:
Metrics: http_requests_total{status="500"} = 47 in last hour
Logs: Individual request details with user IDs, request IDs
Traces: End-to-end request flow with timing breakdown
Metrics are for aggregates; logs and traces are for individuals. If you need per-user, per-request, or per-session data, that's a logging or tracing use case. Forcing it into metrics creates cardinality explosions.
Even with careful design, cardinality can grow unexpectedly. Here are production controls:
Metric Relabeling (Drop Expensive Labels):
Use metric_relabel_configs to drop or transform labels before storage:
```yaml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8080']

    # Relabeling BEFORE scraping (affects what is collected)
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_team]
        target_label: team

    # Relabeling AFTER scraping (affects what is stored)
    metric_relabel_configs:
      # Drop specific high-cardinality metrics entirely
      - source_labels: [__name__]
        regex: 'expensive_metric_.*'
        action: drop

      # Drop a high-cardinality label (labeldrop matches label names
      # and applies to every metric scraped by this job)
      - action: labeldrop
        regex: 'instance_id'

      # Replace high-cardinality label with "aggregated"
      - source_labels: [customer_id]
        regex: '.+'
        target_label: customer_id
        replacement: 'aggregated'

      # Mask IP addresses to reduce cardinality
      - source_labels: [client_ip]
        regex: '(\d+\.\d+)\..*'
        target_label: client_ip_prefix
        replacement: '${1}.x.x'

      # Keep only specific label values
      - source_labels: [environment]
        regex: 'prod|staging'
        action: keep
```

Recording Rules for Aggregation:
Pre-aggregate high-cardinality metrics into lower-cardinality summaries:
```yaml
groups:
  - name: cardinality_reduction
    interval: 1m
    rules:
      # Aggregate per-endpoint metrics into per-service
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Drop instance label for capacity planning
      - record: service:memory_usage:avg
        expr: avg by (service) (process_resident_memory_bytes)

      # Aggregate histograms to reduce label dimensions
      - record: service:request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Then you can drop the raw high-cardinality data
      # using retention policies or metric_relabel_configs
```

Per-Metric Limits (Advanced):
Some Prometheus-compatible systems offer per-metric cardinality limits:
# Mimir/Cortex limits configuration
limits:
max_label_names_per_series: 20
max_label_value_length: 2048
max_series_per_metric: 50000
Cardinality Explorer Tools:
The Prometheus UI's /tsdb-status page shows the top metrics and labels by series count.

A powerful strategy: keep full-detail metrics with short retention (24h), aggregated metrics with medium retention (30d), and highly aggregated metrics with long retention (1y). This balances detail for debugging with cost for trends.
When designing new metrics, cardinality should be a primary consideration. Here's a systematic approach:
````markdown
# Metric Proposal: Payment Processing Latency

## Proposed Metric

```
payment_processing_duration_seconds (histogram)
Labels: method, currency, country, customer_type
```

## Cardinality Analysis

| Label | Estimated Values | Bounded? |
|-------|-----------------|----------|
| method | 5 (card, ach, wire, paypal, crypto) | ✅ |
| currency | 50 (supported currencies) | ✅ |
| country | 200 (potential countries) | ✅ |
| customer_type | 3 (individual, business, enterprise) | ✅ |

### Calculation
- Base cardinality: 5 × 50 × 200 × 3 = 150,000
- Histogram multiplier (10 buckets + sum + count): × 12
- **Total: 1,800,000 series**

## Decision
**REJECTED** - Too high for a single metric.

## Revised Proposal

```
payment_processing_duration_seconds (histogram)
Labels: method, region, customer_type
```

| Label | Estimated Values |
|-------|-----------------|
| method | 5 |
| region | 10 (aggregated regions) |
| customer_type | 3 |

- Base: 5 × 10 × 3 = 150
- With histogram: 150 × 12 = **1,800 series** ✅

Country-level breakdown available via logs for debugging.
````

The Label Necessity Test:
For each label, ask whether its set of values is bounded, whether you have a concrete query that needs it, and whether logs or traces could answer that question instead.
It's easy to add labels later but painful to remove them (breaks existing queries). Start with minimal labels covering your core use cases. Add more only when you have a concrete query you can't answer with existing labels.
Proactive cardinality monitoring prevents disasters. Here are essential queries and alerts:
```yaml
groups:
  - name: cardinality_alerts
    rules:
      # Alert when total series exceeds threshold
      - alert: HighTotalCardinality
        expr: prometheus_tsdb_head_series > 8000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High total series count: {{ $value }}"
          description: "Prometheus has more than 8M active series"

      # Alert on rapid cardinality growth
      - alert: CardinalityExplosion
        expr: |
          (
            prometheus_tsdb_head_series
            - prometheus_tsdb_head_series offset 1h
          ) > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cardinality explosion detected"
          description: "{{ $value }} new series created in the last hour"

      # Alert when specific metric has too many series
      - alert: HighCardinalityMetric
        expr: count by (__name__)({__name__!=""}) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High cardinality metric: {{ $labels.__name__ }}"
          description: "Metric has {{ $value }} series"

      # Alert on memory pressure from cardinality
      - alert: PrometheusHighMemoryUsage
        expr: |
          process_resident_memory_bytes{job="prometheus"}
          / on() prometheus_tsdb_head_series > 3000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory per series: {{ $value }} bytes"
```

Useful Monitoring Queries:
# Current total series
prometheus_tsdb_head_series
# Series created in last hour
increase(prometheus_tsdb_head_series_created_total[1h])
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__!=""}))
# Cardinality of specific metric
count(http_requests_total)
# Cardinality by label for a metric
count by (method)(http_requests_total)
count by (endpoint)(http_requests_total)
count by (status)(http_requests_total)
# Memory per series (should be ~1-3KB)
process_resident_memory_bytes{job="prometheus"} / prometheus_tsdb_head_series
# Chunks in memory (each series has chunks)
prometheus_tsdb_head_chunks
# WAL size (indicates write load)
prometheus_tsdb_wal_storage_size_bytes
Building a Cardinality Dashboard:
Create a Grafana dashboard with panels for total active series, the series creation rate, the top metrics by series count, and memory per series, built from the queries above.
Distinguish between high-but-stable cardinality (many series, constant set) and series churn (series constantly created/deleted). Churn is worse—it defeats compression and bloats the index. Watch 'prometheus_tsdb_head_series_created_total' for churn indicators.
Learning from real cardinality incidents helps build intuition for the risks:
Case Study 1: The Customer ID Incident
Scenario: A SaaS company added customer_id as a label to track per-customer latency. With 50,000 customers and 20 endpoints, they went from 200 series to 1,000,000 series overnight.
Impact: Prometheus memory usage spiked from 8GB to 64GB. Queries timed out. Dashboards became unusable.
Resolution: Removed customer_id label. Added customer_tier label (free/pro/enterprise = 3 values). Used logs for customer-specific debugging.
Lesson: Never use unbounded entity IDs as labels.
Case Study 2: The URL Query Parameter Bomb
Scenario: An e-commerce site used url as a label for request tracking. URLs included search queries: /search?q=red+shoes, /search?q=blue+hat, etc.
Impact: Millions of unique URLs created millions of series. Combined with histogram buckets (×12), cardinality reached tens of millions.
Resolution: Normalized URLs to route patterns (/search). Query parameters moved to logs.
Lesson: Always normalize dynamic URL components.
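A sketch of the kind of route normalization that resolves this, applied before the path is used as a label value; the function name and regexes here are illustrative, not the team's actual code:

```python
import re
from urllib.parse import urlsplit

def normalize_route(url: str) -> str:
    """Reduce a raw URL to a bounded route pattern before labelling."""
    path = urlsplit(url).path                    # drops ?q=... entirely
    path = re.sub(r"/\d+(?=/|$)", "/:id", path)  # numeric IDs -> placeholder
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)",
        "/:uuid", path, flags=re.IGNORECASE)     # UUIDs -> placeholder
    return path or "/"

print(normalize_route("/search?q=red+shoes"))   # /search
print(normalize_route("/users/12345/orders"))   # /users/:id/orders
```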
Case Study 3: The Error Message Trap
Scenario: A team added error_message as a label to categorize errors. Messages included stack traces and dynamic content.
Impact: Each unique error message created a new series. Some messages included request IDs, creating infinite cardinality.
Resolution: Replaced error_message with error_type (5 categories). Full error messages went to logging system.
Lesson: Categorize, don't quote. Free-form text has unbounded cardinality.
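What the fix might look like in code, using a bounded error_type label with prometheus_client; the exception-to-category mapping is an illustrative assumption:

```python
from prometheus_client import Counter

# Bounded error_type label instead of free-form error_message
errors_total = Counter(
    'application_errors_total',
    'Errors by bounded category',
    ['error_type'],
)

# Illustrative mapping from exception class to a small, fixed set of categories
ERROR_TYPES = {
    TimeoutError: 'timeout',
    ConnectionError: 'connection',
    PermissionError: 'permission',
    ValueError: 'validation',
}

def record_error(exc: Exception) -> None:
    """Count the error under a bounded category; the full message
    (stack trace, request ID, etc.) belongs in the logging system."""
    error_type = ERROR_TYPES.get(type(exc), 'other')
    errors_total.labels(error_type=error_type).inc()
```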
All these cases share a pattern: using labels for data that belongs in logs or traces. Metrics answer 'how many?' and 'how fast?'. Logs answer 'which one?' and 'what happened?'. Mixing these purposes causes cardinality explosions.
Cardinality is the fundamental scaling dimension of metrics systems. Understanding and controlling it separates successful observability implementations from failed ones.
Monitor prometheus_tsdb_head_series and alert on its growth rate.

What's Next:
With cardinality understood, the final page covers aggregation and queries—how to effectively query metrics at scale, use recording rules for performance, and build efficient dashboards. You'll learn PromQL patterns that work with high cardinality rather than against it.
You now understand cardinality as the critical scaling dimension of observability. This knowledge will protect you from the most common cause of metrics system failures and enable you to design metrics that scale with your systems.