Cardinality—the number of unique time series in your metrics system—is the primary scaling dimension of observability systems. It determines memory usage, query performance, and ultimately, whether your monitoring survives the production environment it's meant to observe.
Cardinality explosions are subtle. A single well-intentioned label addition can transform a manageable 10,000-series deployment into an unmanageable 10-million-series catastrophe. By the time you notice degraded query performance or memory exhaustion, the damage is done.
This page provides a deep understanding of cardinality: how to calculate it, how to identify dangerous patterns, and how to design metrics that scale with your systems rather than exploding beyond them.
By the end of this page, you will understand what cardinality is, how to calculate cardinality impact before deploying metrics, how to identify cardinality bombs in existing systems, and practical strategies for keeping cardinality under control.
What is Cardinality?
In time-series databases, cardinality refers to the total number of unique time series. Each unique combination of metric name and label values creates a distinct time series.
Consider this metric:
http_requests_total{method="GET", endpoint="/users", status="200"}
This is ONE time series. If you have 5 HTTP methods, 50 endpoints, and 10 status codes in use:
The total cardinality for this metric is: 5 × 50 × 10 = 2,500 time series
Why Cardinality Matters:
Every time series consumes resources:
| Resource | Per-Series Cost | Impact |
|---|---|---|
| Memory | ~1-3 KB | Active series kept in RAM |
| Storage | Variable | Each series stored separately |
| Query Time | Linear+ | More series = slower queries |
| Index Size | Logarithmic | Label index grows with cardinality |
| Compaction | Linear | Background work scales with series |
The Cardinality Equation:
For a single metric with N labels, cardinality is:
Cardinality = ∏(i=1 to N) |values_i|
# Where |values_i| is the number of unique values for label i
This is multiplicative, not additive. Adding a new label doesn't add to cardinality—it multiplies it.
Example Calculation:
http_request_duration_seconds{method, endpoint, status, customer_tier}
|method| = 5 (GET, POST, PUT, DELETE, PATCH)
|endpoint| = 100 (API endpoints)
|status| = 20 (HTTP status codes used)
|customer_tier| = 3 (free, pro, enterprise)
Cardinality = 5 × 100 × 20 × 3 = 30,000 series
# But this is a histogram with 10 buckets + sum + count = 12 series each:
Actual Cardinality = 30,000 × 12 = 360,000 series!
Histograms dramatically amplify cardinality. A histogram with default buckets (10 buckets) generates 12 time series per label combination. Before adding labels to histograms, multiply your expected cardinality by 12.
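To sanity-check a proposed metric before shipping it, the multiplication can be scripted. The following is a minimal Python sketch, not a measurement tool; the function name is made up here, and the buckets-plus-two multiplier follows the rule of thumb above (10 buckets → ×12):

```python
from math import prod

def estimate_series(label_value_counts, histogram_buckets=0):
    """Rough series count for one metric.

    label_value_counts: {label_name: number_of_unique_values}
    histogram_buckets:  0 for counters/gauges; for histograms, pass the
    bucket count and _sum/_count are added per label combination,
    following the rule of thumb above (10 buckets -> x12).
    """
    base = prod(label_value_counts.values()) if label_value_counts else 1
    multiplier = histogram_buckets + 2 if histogram_buckets else 1
    return base * multiplier

# The histogram example above: 5 x 100 x 20 x 3 combinations, x12 for the histogram
print(estimate_series(
    {"method": 5, "endpoint": 100, "status": 20, "customer_tier": 3},
    histogram_buckets=10,
))  # 360000
```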
Just as you budget compute and storage, you must budget cardinality. Different Prometheus-compatible systems have different practical limits:
| System | Practical Limit | Notes |
|---|---|---|
| Prometheus (16GB RAM) | 1-2 million series | Single instance |
| Prometheus (64GB RAM) | 5-8 million series | Single instance |
| Thanos / Cortex / Mimir | 50-500+ million | Horizontally scaled |
| Managed (Datadog, etc.) | Billing-based | Pay per active series |
Calculating Your Budget:
Prometheus Memory ≈ (active_series × 1-3 KB) + query_overhead
Example:
- Target: 32GB RAM Prometheus instance
- Query overhead: ~8GB
- Available for series: 24GB
- Series limit: 24GB / 2KB = 12 million series (theoretical max)
- Safe limit: 8-10 million (leave headroom)
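The same back-of-the-envelope calculation as a small Python sketch; the 2 KB-per-series figure and the 8 GB query overhead are the assumptions from the example above, not measured values:

```python
def series_budget(total_ram_gb, query_overhead_gb=8, bytes_per_series=2048,
                  headroom_fraction=0.25):
    """Back-of-the-envelope series budget for one Prometheus instance.

    bytes_per_series (1-3 KB is typical) and query_overhead_gb are
    assumptions carried over from the example above; headroom_fraction
    reserves room for spikes and series churn.
    """
    available_bytes = (total_ram_gb - query_overhead_gb) * 1024**3
    theoretical_max = available_bytes / bytes_per_series
    return int(theoretical_max * (1 - headroom_fraction))

print(series_budget(32))  # ~9.4 million series, inside the 8-10M safe range
```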
Planning Your Budget:
A practical approach is to allocate cardinality budgets per service or team:
Total Budget: 2,000,000 series
Infrastructure:
- node_exporter (per node): ~500 series
- 100 nodes: 50,000 series (2.5%)
Kubernetes Metrics:
- kube-state-metrics: ~200 series per pod
- 500 pods: 100,000 series (5%)
Application Metrics:
- Average per service: 10,000 series
- 100 services: 1,000,000 series (50%)
Custom Business Metrics:
- Reserved: 200,000 series (10%)
Headroom:
- Reserved: 650,000 series (32.5%)
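To keep a plan like this honest, the allocations can live in code or config and be checked automatically. A minimal sketch, with names and numbers mirroring the illustrative plan above:

```python
# Illustrative cardinality budget tracker; numbers mirror the plan above
TOTAL_BUDGET = 2_000_000

allocations = {
    "infrastructure (node_exporter, 100 nodes)": 50_000,
    "kubernetes (kube-state-metrics, 500 pods)": 100_000,
    "applications (100 services x 10k)": 1_000_000,
    "custom business metrics": 200_000,
}

allocated = sum(allocations.values())
for name, series in allocations.items():
    print(f"{name}: {series:,} series ({series / TOTAL_BUDGET:.1%})")

headroom = TOTAL_BUDGET - allocated
print(f"headroom: {headroom:,} series ({headroom / TOTAL_BUDGET:.1%})")  # 32.5%
```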
Use PromQL to track your budget: 'prometheus_tsdb_head_series' shows the current count of active series. Set alerts at 70% and 90% of capacity. 'increase(prometheus_tsdb_head_series_created_total[1h])' shows the series creation rate; sudden spikes indicate cardinality bombs.
A cardinality bomb is a metric design that creates unbounded or explosively large numbers of time series. Here are the most common patterns:
Two especially common sources are raw URLs with query strings (/search?q=... creates a series for every unique query) and unnormalized path segments (/users/12345 instead of /users/:id).
```python
from prometheus_client import Counter, Histogram

# ❌ CARDINALITY BOMB: User ID as label
# If you have 1 million users, this creates 1 million series
requests_by_user = Counter(
    'http_requests_by_user_total',
    'Requests per user',
    ['user_id']  # BOMB! Unbounded
)

# ❌ CARDINALITY BOMB: Request ID as label
# Every request creates a new series - grows infinitely
request_latency = Histogram(
    'http_request_latency_seconds',
    'Request latency',
    ['request_id']  # BOMB! Every request is unique
)

# ❌ CARDINALITY BOMB: Full path with IDs
# /users/123, /users/456, etc. each create new series
requests_by_path = Counter(
    'http_requests_total',
    'HTTP requests',
    ['full_path']  # BOMB! Dynamic path segments
)

# ❌ CARDINALITY BOMB: IP address
# Thousands of unique IPs, especially during attacks
requests_by_ip = Counter(
    'http_requests_by_ip_total',
    'Requests by IP',
    ['client_ip']  # BOMB! Thousands of IPs
)

# ❌ CARDINALITY BOMB: Error message as label
# Stack traces, connection strings, etc.
errors_by_message = Counter(
    'application_errors_total',
    'Errors by message',
    ['error_message']  # BOMB! Free-form text
)
```

How to Detect Cardinality Bombs:
Use PromQL to find high-cardinality metrics:
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__!=""}))
# Metrics with more than 10,000 series
count by (__name__)({__name__!=""}) > 10000
# Cardinality contributed by one label (replace label_name with a real label)
count by (label_name)(http_requests_total)
# Series created in the last hour
increase(prometheus_tsdb_head_series_created_total[1h])
Warning Signs:
prometheus_tsdb_head_series growing continuously is the clearest signal.

Cardinality bombs often cascade. High cardinality causes slow queries. Slow queries time out. Timeouts trigger retries. Retries create more load. More load creates more slowness. By the time you notice, your observability system may be unusable.
Sometimes you legitimately need to track high-cardinality dimensions. Here are safe patterns:
Pattern 1: Aggregate in Application
Instead of exposing per-user metrics, aggregate in your application and expose summaries:
```go
// ❌ BAD: Per-user metrics (unbounded cardinality)
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "requests_total"},
    []string{"user_id"}, // Millions of users!
)

// ✅ GOOD: Aggregate by user tier
userRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "requests_total"},
    []string{"user_tier"}, // free, pro, enterprise = 3 values
)

// ✅ GOOD: Track distributions without per-user labels
requestsPerUser := prometheus.NewHistogram(
    prometheus.HistogramOpts{
        Name:    "requests_per_user_distribution",
        Help:    "Distribution of requests per user",
        Buckets: prometheus.ExponentialBuckets(1, 2, 10),
    },
)

// In your code: periodically sample user request counts
func recordUserRequestDistribution() {
    for _, user := range users {
        requestCount := getRequestCount(user.ID)
        requestsPerUser.Observe(float64(requestCount))
    }
}
```

Pattern 2: Use Exemplars
Exemplars attach high-cardinality trace IDs to metric samples without creating new series:
```go
// Exemplars: attach trace_id without creating new series
requestDuration := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint"}, // Bounded labels only
)

func handleRequest(ctx context.Context, w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    traceID := getTraceID(ctx)

    // ... handle request ...

    duration := time.Since(start).Seconds()

    // Record with exemplar - trace_id is NOT a label
    requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
        ObserveWithExemplar(duration, prometheus.Labels{
            "traceID": traceID, // Exemplar, not label!
        })
}

// In Prometheus, exemplars are stored separately from series
// You can query them to drill down to specific traces
```

Pattern 3: Bucketing/Categorization
Transform unbounded values into bounded categories:
| Raw Value | Bucketed Value |
|---|---|
| Request size: 1,234 bytes | size_bucket: "1KB-10KB" |
| Response time: 156ms | latency_bucket: "100-500ms" |
| User age: 34 days | cohort: "month_1" |
| IP: 192.168.1.42 | ip_class: "private" |
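In application code this usually amounts to a small mapping function applied before the value becomes a label. A Python sketch; the bucket boundaries and category names are illustrative assumptions chosen to match the table above:

```python
import ipaddress

def size_bucket(size_bytes: int) -> str:
    """Collapse an unbounded request size into a handful of label values."""
    for limit, name in [(1_024, "<1KB"), (10_240, "1KB-10KB"),
                        (102_400, "10KB-100KB"), (1_048_576, "100KB-1MB")]:
        if size_bytes < limit:
            return name
    return ">=1MB"

def ip_class(ip: str) -> str:
    """Label the class of a client IP instead of the raw address."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"

print(size_bucket(1_234))        # "1KB-10KB"
print(ip_class("192.168.1.42"))  # "private"
```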
Pattern 4: Use Logs/Traces for Details
Metrics answer "how much?" and "how often?" For "who?" and "why?", use logs and traces:
Metrics: http_requests_total{status="500"} = 47 in last hour
Logs: Individual request details with user IDs, request IDs
Traces: End-to-end request flow with timing breakdown
Metrics are for aggregates; logs and traces are for individuals. If you need per-user, per-request, or per-session data, that's a logging or tracing use case. Forcing it into metrics creates cardinality explosions.
Even with careful design, cardinality can grow unexpectedly. Here are production controls:
Metric Relabeling (Drop Expensive Labels):
Use metric_relabel_configs to drop or transform labels before storage:
```yaml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app:8080']

    # Relabeling BEFORE scraping (affects what is collected)
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_team]
        target_label: team

    # Relabeling AFTER scraping (affects what is stored)
    metric_relabel_configs:
      # Drop specific high-cardinality metrics entirely
      - source_labels: [__name__]
        regex: 'expensive_metric_.*'
        action: drop

      # Drop a high-cardinality label (labeldrop matches label names
      # and applies to every metric scraped by this job)
      - action: labeldrop
        regex: 'instance_id'

      # Replace high-cardinality label with "aggregated"
      - source_labels: [customer_id]
        regex: '.+'
        target_label: customer_id
        replacement: 'aggregated'

      # Mask IP addresses to reduce cardinality
      - source_labels: [client_ip]
        regex: '(\d+\.\d+)\..*'
        target_label: client_ip_prefix
        replacement: '${1}.x.x'

      # Keep only specific label values
      - source_labels: [environment]
        regex: 'prod|staging'
        action: keep
```

Recording Rules for Aggregation:
Pre-aggregate high-cardinality metrics into lower-cardinality summaries:
```yaml
groups:
  - name: cardinality_reduction
    interval: 1m
    rules:
      # Aggregate per-endpoint metrics into per-service
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))

      # Drop instance label for capacity planning
      - record: service:memory_usage:avg
        expr: avg by (service) (process_resident_memory_bytes)

      # Aggregate histograms to reduce label dimensions
      - record: service:request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Then you can drop the raw high-cardinality data
      # using retention policies or metric_relabel_configs
```

Per-Metric Limits (Advanced):
Some Prometheus-compatible systems offer per-metric cardinality limits:
# Mimir/Cortex limits configuration
limits:
max_label_names_per_series: 20
max_label_value_length: 2048
max_series_per_metric: 50000
Cardinality Explorer Tools:
The Prometheus UI's /tsdb-status page shows the top metrics and labels by series count.

A powerful strategy: keep full-detail metrics with short retention (24h), aggregated metrics with medium retention (30d), and highly aggregated metrics with long retention (1y). This balances detail for debugging with cost for trends.
When designing new metrics, cardinality should be a primary consideration. Here's a systematic approach:
````markdown
# Metric Proposal: Payment Processing Latency

## Proposed Metric

```
payment_processing_duration_seconds (histogram)
Labels: method, currency, country, customer_type
```

## Cardinality Analysis

| Label | Estimated Values | Bounded? |
|-------|-----------------|----------|
| method | 5 (card, ach, wire, paypal, crypto) | ✅ |
| currency | 50 (supported currencies) | ✅ |
| country | 200 (potential countries) | ✅ |
| customer_type | 3 (individual, business, enterprise) | ✅ |

### Calculation
- Base cardinality: 5 × 50 × 200 × 3 = 150,000
- Histogram multiplier (10 buckets + sum + count): × 12
- **Total: 1,800,000 series**

## Decision
**REJECTED** - Too high for a single metric.

## Revised Proposal

```
payment_processing_duration_seconds (histogram)
Labels: method, region, customer_type
```

| Label | Estimated Values |
|-------|-----------------|
| method | 5 |
| region | 10 (aggregated regions) |
| customer_type | 3 |

- Base: 5 × 10 × 3 = 150
- With histogram: 150 × 12 = **1,800 series** ✅

Country-level breakdown available via logs for debugging.
````

The Label Necessity Test:
For each label, ask whether its set of values is bounded, whether you have a concrete query that needs it, and whether logs or traces could answer that question instead.
It's easy to add labels later but painful to remove them (breaks existing queries). Start with minimal labels covering your core use cases. Add more only when you have a concrete query you can't answer with existing labels.
Proactive cardinality monitoring prevents disasters. Here are essential queries and alerts:
```yaml
groups:
  - name: cardinality_alerts
    rules:
      # Alert when total series exceeds threshold
      - alert: HighTotalCardinality
        expr: prometheus_tsdb_head_series > 8000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High total series count: {{ $value }}"
          description: "Prometheus has more than 8M active series"

      # Alert on rapid cardinality growth
      - alert: CardinalityExplosion
        expr: |
          (
            prometheus_tsdb_head_series
            - prometheus_tsdb_head_series offset 1h
          ) > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cardinality explosion detected"
          description: "{{ $value }} new series created in the last hour"

      # Alert when specific metric has too many series
      - alert: HighCardinalityMetric
        expr: count by (__name__)({__name__!=""}) > 100000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High cardinality metric: {{ $labels.__name__ }}"
          description: "Metric has {{ $value }} series"

      # Alert on memory pressure from cardinality
      - alert: PrometheusHighMemoryUsage
        expr: |
          process_resident_memory_bytes{job="prometheus"}
          / on() prometheus_tsdb_head_series > 3000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory per series: {{ $value }} bytes"
```

Useful Monitoring Queries:
# Current total series
prometheus_tsdb_head_series
# Series created in last hour
increase(prometheus_tsdb_head_series_created_total[1h])
# Top 10 metrics by series count
topk(10, count by (__name__)({__name__!=""}))
# Cardinality of specific metric
count(http_requests_total)
# Cardinality by label for a metric
count by (method)(http_requests_total)
count by (endpoint)(http_requests_total)
count by (status)(http_requests_total)
# Memory per series (should be ~1-3KB)
process_resident_memory_bytes{job="prometheus"} / prometheus_tsdb_head_series
# Chunks in memory (each series has chunks)
prometheus_tsdb_head_chunks
# WAL size (indicates write load)
prometheus_tsdb_wal_storage_size_bytes
Building a Cardinality Dashboard:
Create a Grafana dashboard with panels for total active series, the series creation rate, the top metrics by series count, and memory per series, built from the queries above.
Distinguish between high-but-stable cardinality (many series, constant set) and series churn (series constantly created/deleted). Churn is worse—it defeats compression and bloats the index. Watch 'prometheus_tsdb_head_series_created_total' for churn indicators.
Learning from real cardinality incidents helps build intuition for the risks:
Case Study 1: The Customer ID Incident
Scenario: A SaaS company added customer_id as a label to track per-customer latency. With 50,000 customers and 20 endpoints, they went from 200 series to 1,000,000 series overnight.
Impact: Prometheus memory usage spiked from 8GB to 64GB. Queries timed out. Dashboards became unusable.
Resolution: Removed customer_id label. Added customer_tier label (free/pro/enterprise = 3 values). Used logs for customer-specific debugging.
Lesson: Never use unbounded entity IDs as labels.
Case Study 2: The URL Query Parameter Bomb
Scenario: An e-commerce site used url as a label for request tracking. URLs included search queries: /search?q=red+shoes, /search?q=blue+hat, etc.
Impact: Millions of unique URLs created millions of series. Combined with histogram buckets (×12), cardinality reached tens of millions.
Resolution: Normalized URLs to route patterns (/search). Query parameters moved to logs.
Lesson: Always normalize dynamic URL components.
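A sketch of the kind of route normalization that resolves this, applied before the path is used as a label value; the function name and regexes here are illustrative, not the team's actual code:

```python
import re
from urllib.parse import urlsplit

def normalize_route(url: str) -> str:
    """Reduce a raw URL to a bounded route pattern before labelling."""
    path = urlsplit(url).path                    # drops ?q=... entirely
    path = re.sub(r"/\d+(?=/|$)", "/:id", path)  # numeric IDs -> placeholder
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)",
        "/:uuid", path, flags=re.IGNORECASE)     # UUIDs -> placeholder
    return path or "/"

print(normalize_route("/search?q=red+shoes"))   # /search
print(normalize_route("/users/12345/orders"))   # /users/:id/orders
```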
Case Study 3: The Error Message Trap
Scenario: A team added error_message as a label to categorize errors. Messages included stack traces and dynamic content.
Impact: Each unique error message created a new series. Some messages included request IDs, creating infinite cardinality.
Resolution: Replaced error_message with error_type (5 categories). Full error messages went to logging system.
Lesson: Categorize, don't quote. Free-form text has unbounded cardinality.
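What the fix might look like in code, using a bounded error_type label with prometheus_client; the exception-to-category mapping is an illustrative assumption:

```python
from prometheus_client import Counter

# Bounded error_type label instead of free-form error_message
errors_total = Counter(
    'application_errors_total',
    'Errors by bounded category',
    ['error_type'],
)

# Illustrative mapping from exception class to a small, fixed set of categories
ERROR_TYPES = {
    TimeoutError: 'timeout',
    ConnectionError: 'connection',
    PermissionError: 'permission',
    ValueError: 'validation',
}

def record_error(exc: Exception) -> None:
    """Count the error under a bounded category; the full message
    (stack trace, request ID, etc.) belongs in the logging system."""
    error_type = ERROR_TYPES.get(type(exc), 'other')
    errors_total.labels(error_type=error_type).inc()
```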
All these cases share a pattern: using labels for data that belongs in logs or traces. Metrics answer 'how many?' and 'how fast?'. Logs answer 'which one?' and 'what happened?'. Mixing these purposes causes cardinality explosions.
Cardinality is the fundamental scaling dimension of metrics systems. Understanding and controlling it separates successful observability implementations from failed ones.
Monitor prometheus_tsdb_head_series and alert on its growth rate.

What's Next:
With cardinality understood, the final page covers aggregation and queries—how to effectively query metrics at scale, use recording rules for performance, and build efficient dashboards. You'll learn PromQL patterns that work with high cardinality rather than against it.
You now understand cardinality as the critical scaling dimension of observability. This knowledge will protect you from the most common cause of metrics system failures and enable you to design metrics that scale with your systems.