Collecting millions of samples means nothing if you can't extract meaning from them. The power of a metrics system lies not in the data it stores, but in the questions it can answer. How many requests did we serve last hour? What's the 99th percentile latency by service? Which endpoints are degrading?
Effective querying and aggregation transform raw time-series data into actionable insights. But writing efficient queries at scale requires understanding how time-series databases process data, when to pre-aggregate with recording rules, and how to structure dashboards for performance.
This page provides a comprehensive guide to PromQL aggregation patterns, performance optimization, and building dashboards that remain responsive even with millions of time series.
By the end of this page, you will master PromQL aggregation operators, understand query performance characteristics, know when to use recording rules, and be able to build dashboards that scale. You'll also learn common query anti-patterns that kill performance.
PromQL provides powerful aggregation operators that collapse multiple time series into fewer series. Understanding these operators is fundamental to effective querying.
Core Aggregation Operators:
| Operator | Description | Example |
|---|---|---|
| sum | Total of all values | sum(rate(http_requests_total[5m])) |
| avg | Arithmetic mean | avg(node_cpu_seconds_total) |
| min | Minimum value | min(up) |
| max | Maximum value | max(container_memory_usage_bytes) |
| count | Number of series | count(up{job="api"}) |
| stddev | Standard deviation | stddev(http_request_duration_seconds) |
| stdvar | Standard variance | stdvar(http_request_duration_seconds) |
| topk | Largest k values | topk(5, rate(http_requests_total[5m])) |
| bottomk | Smallest k values | bottomk(3, up) |
| count_values | Count by value | count_values("version", build_info) |
| quantile | φ-quantile over series | quantile(0.9, rate(http_requests_total[5m])) |
| group | Group series (value=1) | group(up) by (job) |
Grouping with by and without:
Aggregations can preserve specific labels using by (keep only these) or without (drop these, keep all others):
# Keep only job and endpoint labels
sum by (job, endpoint) (rate(http_requests_total[5m]))
# Drop instance label, keep all others
sum without (instance) (rate(http_requests_total[5m]))
# Both produce grouped aggregates, but:
# - by(): result has ONLY specified labels
# - without(): result has ALL labels EXCEPT specified ones
Choosing by vs without:
| Scenario | Use |
|---|---|
| You know exactly which labels you want | by (label1, label2) |
| You want all labels except a few | without (instance, pod) |
| Service-level aggregation | without (instance, pod) |
| Capacity planning by dimension | by (region, service) |
# Total request rate across all instances
sum(rate(http_requests_total[5m]))
# Result: single value

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))
# Result: one series per service

# Request rate per endpoint, dropping instance detail
sum without (instance, pod) (rate(http_requests_total[5m]))
# Result: aggregated by all other labels

# Multiple aggregations combined
# Total requests by service AND status category
sum by (service, status_class) (
  rate(http_requests_total[5m])
)
# Note: status_class might come from label_replace or a recording rule

# Top 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# Average CPU across all nodes vs max CPU on any node
avg(node_cpu_usage_ratio)  # Typical utilization
max(node_cpu_usage_ratio)  # Hottest node

# Count of unique values for a label
count_values("version", myapp_build_info)
# Returns series like {version="1.2.3"} = 5 (5 instances running this version)

Apply rate() before sum(). 'sum(rate(counter[5m]))' is correct; 'rate(sum(counter)[5m:])' produces wrong results because summing counters before rate() masks individual counter resets and produces discontinuities.
When combining two vectors (e.g., dividing errors by total requests), PromQL must match series from both sides. This matching is precise and can be customized.
Default Matching:
By default, series match when ALL labels are identical:
# Error ratio: errors / total
sum by (service) (rate(http_errors_total[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
# Both sides must have exactly matching 'service' labels
# {service="api"} / {service="api"} ✓
# {service="api"} / {service="web"} ✗ (no match)
Explicit Matching with on and ignoring:
When series have different label sets, use on or ignoring:
# Match only on specified labels
http_requests_total
/ on (job, instance)
http_requests_started_total
# Match on all labels except specified ones
http_requests_total
/ ignoring (status_code)
http_requests_total{status_code="200"}
One-to-Many Matching with group_left and group_right:
When one side has fewer series than the other (many-to-one relationship), use group modifiers:
# Scenario: Add team name from service metadata
# Metrics: http_requests_total{service="api", endpoint="/users"}
# Metadata: service_team_info{service="api", team="platform"} = 1
# Add team label to each request series
http_requests_total
* on (service) group_left (team)
service_team_info
# Result: http_requests_total{service="api", endpoint="/users", team="platform"}
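group_right works the same way with the operands swapped: the "many" side is on the right, and the listed labels are copied from the left. A minimal sketch reusing the hypothetical metadata metric above:
# Same enrichment, written with the metadata metric on the left
service_team_info
* on (service) group_right (team)
http_requests_total
# Result labels (and values) are identical to the group_left version above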
# BASIC: Error rate calculation (same labels on both sides)
sum by (service) (rate(http_errors_total[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# EXPLICIT MATCHING: Join with fewer labels
# Use case: divide per-endpoint metrics by per-service capacity
# group_left is required because many endpoint series share one capacity series
sum by (service, endpoint) (rate(http_requests_total[5m]))
/ on (service) group_left
service_capacity_requests_per_second

# MANY-TO-ONE with group_left: Enrich with metadata
# Add region label from instance metadata
rate(http_requests_total[5m])
* on (instance) group_left (region)
instance_region_info

# ALTERNATIVE: Use label_replace for computed labels
# If you don't have a metadata metric
sum by (team) (
  label_replace(
    rate(http_requests_total[5m]),
    "team", "$1", "service", "(api|auth|users).*"  # Regex derives team from the service name
  )
)

# COMPLEX JOIN: Multiple enrichments
rate(http_requests_total[5m])
* on (instance) group_left (region, environment)
instance_metadata
* on (service) group_left (team, cost_center)
service_metadata

PromQL doesn't support many-to-many matching. If multiple series on the left match multiple series on the right, the query fails. Ensure your matching produces one-to-one or one-to-many (with group_left/group_right) relationships.
Time-series data demands functions that operate over time windows. PromQL provides a rich set of functions for analyzing trends, rates, and changes.
Range Vector Functions:
These functions apply to range vectors (e.g., metric[5m]):
| Function | Use With | Description |
|---|---|---|
| rate() | Counters | Per-second average rate over the range |
| irate() | Counters | Instant rate using last two samples |
| increase() | Counters | Total increase over the range |
| delta() | Gauges | Difference between first and last |
| deriv() | Gauges | Per-second derivative (slope) |
| avg_over_time() | Any | Average value over the range |
| max_over_time() | Any | Maximum value over the range |
| min_over_time() | Any | Minimum value over the range |
| sum_over_time() | Any | Sum of all values in range |
| count_over_time() | Any | Number of samples in range |
| quantile_over_time() | Any | φ-quantile over time |
| predict_linear() | Gauges | Linear prediction of future value |
| changes() | Any | Number of value changes |
| resets() | Counters | Number of counter resets |
# RATE vs IRATE
# rate(): smoothed average rate - better for alerting
rate(http_requests_total[5m])

# irate(): instant rate from the last two samples - shows spikes
irate(http_requests_total[5m])

# INCREASE: Total over a period
# "How many requests in the last hour?"
increase(http_requests_total[1h])

# DELTA for gauges
# "How much did memory change in the last hour?"
delta(process_resident_memory_bytes[1h])

# PREDICTIONS
# "When will the disk fill up?"
predict_linear(node_filesystem_free_bytes[6h], 24*3600)  # Predict 24h ahead

# Time until the disk is full at the current rate
node_filesystem_free_bytes / (-deriv(node_filesystem_free_bytes[6h]))

# OVER TIME for gauges
# Average CPU over the last hour
avg_over_time(node_cpu_usage_ratio[1h])

# Maximum memory in the last day
max_over_time(process_resident_memory_bytes[1d])

# CHANGES: Detect instability
# How many times did this process restart?
changes(process_start_time_seconds[1h])

# RESETS: Detect counter resets (process restarts)
resets(http_requests_total[1d])

# SUBQUERIES: Apply a function over another function's output
# Average of the 5-minute rate, sampled every minute, over the last hour
avg_over_time(rate(http_requests_total[5m])[1h:1m])

Subqueries: Nested Time Operations
Subqueries allow applying a function over a range of another function's output:
# Syntax: <expression>[<range>:<resolution>]
# Average rate over the last hour, sampled every minute
avg_over_time(rate(http_requests_total[5m])[1h:1m])
# Max of the 5-minute rate in the last 24 hours
max_over_time(rate(http_requests_total[5m])[24h:5m])
# Smooths noisy rates by averaging them over time
Note: Subqueries are computationally expensive. Consider recording rules for frequently used patterns.
The rate() range should be at least 4x your scrape interval to handle missed scrapes gracefully. For 15s scrape interval, use rate(metric[1m]) minimum. 5-minute windows are commonly used for stability.
Histograms enable powerful latency analysis, but querying them correctly requires understanding their structure.
Histogram Anatomy:
A histogram metric creates multiple series:
http_request_duration_seconds_bucket{le="0.1"} # Observations ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"} # Observations ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"} # All observations
http_request_duration_seconds_sum # Total of all observations
http_request_duration_seconds_count # Number of observations
Buckets are cumulative—the 0.5s bucket includes all values that are also in the 0.1s bucket.
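As a concrete illustration (hypothetical values), three observations of 0.05s, 0.3s, and 0.7s would produce these series:
http_request_duration_seconds_bucket{le="0.1"}   1     # only the 0.05s observation
http_request_duration_seconds_bucket{le="0.5"}   2     # 0.05s and 0.3s
http_request_duration_seconds_bucket{le="+Inf"}  3     # all three observations
http_request_duration_seconds_sum                1.05  # 0.05 + 0.3 + 0.7
http_request_duration_seconds_count              3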
# PERCENTILES (quantiles)
# p50 (median) request duration
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# p90 request duration
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))

# p99 request duration (common SLO target)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# AGGREGATED PERCENTILES
# p99 by service (aggregate buckets first, then calculate)
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

# AVERAGE request duration
# sum / count = average
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# AVERAGE by endpoint
sum by (endpoint) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))

# REQUEST RATE from the histogram
rate(http_request_duration_seconds_count[5m])
# This equals rate(http_request_duration_seconds_bucket{le="+Inf"}[5m])

# HEATMAP DATA (for Grafana)
# Use the bucket metric with rate() - Grafana understands this format
rate(http_request_duration_seconds_bucket[5m])

# SLO: "What % of requests are under 100ms?"
sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# APDEX: Approximate user satisfaction
# Apdex = (satisfied + tolerating/2) / total
# Satisfied: <400ms, Tolerating: 400ms-2s, Frustrated: >2s
# The le="2" bucket is cumulative (it already includes the satisfied requests),
# so (satisfied + cumulative_2s) / 2 = satisfied + tolerating/2
(
  sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Histogram Aggregation Rules:
- Always apply rate() to the buckets before aggregating or computing quantiles.
- The le label must be preserved in any aggregation feeding histogram_quantile().
- Apply histogram_quantile() last; never average already-computed quantiles.

Correct vs Incorrect:
# CORRECT: rate → aggregate → quantile
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# WRONG: quantile first, then average (statistically invalid)
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# WRONG: missing le in aggregation (histogram_quantile fails)
histogram_quantile(0.99,
sum by (service) (rate(http_request_duration_seconds_bucket[5m])) # No 'le'!
)
histogram_quantile() uses linear interpolation between buckets. If your buckets are sparse in the relevant range (e.g., no bucket between 100ms and 1s when measuring p99 around 200ms), accuracy suffers. Place buckets around your SLO thresholds for best accuracy.
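A rough worked example (hypothetical counts) shows how far the interpolation can drift when the only buckets around your p99 are le="0.1" and le="1":
# Hypothetical cumulative bucket counts over the query range:
#   le="0.1"  -> 980 observations
#   le="1"    -> 1000 observations
#   le="+Inf" -> 1000 observations
# histogram_quantile(0.99, ...) looks for rank 0.99 * 1000 = 990, which falls
# in the (0.1, 1] bucket at position (990 - 980) / (1000 - 980) = 0.5.
# Linear interpolation reports 0.1 + 0.5 * (1 - 0.1) = 0.55s,
# even if every observation in that bucket was actually close to 0.2s.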
Query performance directly impacts dashboard responsiveness and alerting reliability. Slow queries cause timeouts, stale data, and frustrated users.
Performance Factors:
| Factor | Impact | Mitigation |
|---|---|---|
| Series count | Linear memory/CPU | Be selective with labels |
| Time range | Linear data scanned | Use shorter ranges when possible |
| Step/resolution | More points = more work | Match resolution to visualization need |
| Functions applied | Some are expensive | Avoid subqueries if possible |
| Regex in selectors | Can be slow | Prefer exact matches |
| Query complexity | Multiplicative | Simplify, use recording rules |
Selector Optimization:
# FAST: Exact label match
http_requests_total{job="api", status="200"}
# SLOWER: Regex match
http_requests_total{endpoint=~"/api/.*"}
# SLOWEST: Negative regex across many series
http_requests_total{endpoint!~"/(health|metrics|ready)"}
# OPTIMIZATION: Prefer positive filters
# Instead of matching everything except health endpoints,
# match the specific endpoints you care about
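One way to express that positive filter (the endpoint names here are placeholders):
# Anchored, positive alternation instead of a negative regex
http_requests_total{endpoint=~"/api/(users|orders|payments)"}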
Recording Rules for Expensive Queries:
If a query is used in multiple dashboards or takes >500ms, create a recording rule:
groups:
  - name: performance_rules
    interval: 30s  # Pre-compute every 30 seconds
    rules:
      # Expensive: aggregates across all instances and endpoints
      - record: job:http_request_rate:sum5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Very expensive: histogram quantile with aggregation
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Expensive: error ratio calculation
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

# Now dashboards query the recorded metrics instead:
# job:http_request_rate:sum5m is instant, not computed on query

Prometheus exposes its own query metrics: prometheus_engine_query_duration_seconds shows the query latency distribution, and prometheus_engine_queries_concurrent_max shows the configured concurrency limit. Track these to identify performance regressions.
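A quick sketch of how you might watch those engine metrics (label names such as slice and the exposed quantiles can vary by Prometheus version):
# Slowest query-execution phase at the 90th percentile
max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.9"})

# Queries currently executing or queued vs the configured concurrency limit
prometheus_engine_queries / prometheus_engine_queries_concurrent_max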
A dashboard with 20 panels, each querying a 7-day range at 1-minute resolution, can overwhelm even a well-tuned Prometheus. Smart dashboard design maintains responsiveness.
Panel Efficiency:
| Approach | Impact | Recommendation |
|---|---|---|
| Time range | Data volume per panel | Default to 1-6 hours; use longer ranges only when needed |
| Resolution | Points computed | Use auto resolution or match visual need |
| Query reuse | Duplicate work | Use variables and templating |
| Recording rules | Query execution time | Pre-compute any query that takes >500ms |
| Panel count | Concurrent queries | Max 20-30 panels per dashboard |
| Refresh rate | Query frequency | Match to data cadence (not faster than scrape) |
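A rough back-of-the-envelope calculation (hypothetical dashboard shape) shows why these limits matter:
# 7-day range at 1m resolution      -> 7 * 24 * 60 = 10,080 points per series
# 20 panels x 10 series per panel   -> 200 series
# Points computed per refresh       -> 200 * 10,080 ≈ 2,000,000
# Same dashboard at a 15m step      -> 200 * 672 ≈ 134,000 (roughly 15x less work)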
Dashboard Structure Best Practices:
Service Dashboard Structure:
├── Overview Row (4-6 panels)
│ ├── Traffic overview (single stat)
│ ├── Error rate (single stat with threshold)
│ ├── P99 latency (single stat)
│ └── Availability (single stat)
├── Request Details Row (3-4 panels)
│ ├── Request rate over time (graph)
│ ├── Error rate over time (graph)
│ └── Latency percentiles (graph)
├── Resource Utilization Row (3-4 panels)
│ ├── CPU by instance (graph)
│ ├── Memory by instance (graph)
│ └── Network I/O (graph)
└── Detailed Breakdown (collapsed by default)
├── Per-endpoint latency heatmap
├── Per-endpoint error rates
└── Instance details
Key principles are illustrated in the panel queries below:
# SINGLE STAT: Use an instant query, not a range query
# For "current error rate", you don't need a graph
# Query type: instant
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# GRAPH: Use appropriate resolution
# For a 1-hour view: 15s step is plenty (240 points)
# For a 24-hour view: 1m step is sufficient (1440 points)
# For a 7-day view: 15m step still gives 672 points

# USE RECORDING RULES in graphs
# Instead of computing histogram_quantile each time
job:http_request_duration:p99{job="$service"}

# TEMPLATE VARIABLES for reuse
# A single query populates the variable, which is then used in all panels
sum by (job) (up{job=~"$service"})

# CONDITIONAL DISPLAY
# Use Grafana thresholds instead of computing in the query
# Let Grafana color based on value; don't compute alert state in PromQL

Use Grafana's 'Lazy load panels' option for heavy dashboards. Panels below the fold won't query until scrolled into view. Combined with collapsible rows, this dramatically reduces initial load.
Avoid these common mistakes that hurt performance or produce incorrect results:
- Averaging percentiles: avg(histogram_quantile(...)) is statistically invalid. Aggregate buckets first, then compute the quantile.
- Unanchored selectors: {__name__=~".*"} matches every series in the database. Be specific.
# ❌ WRONG: rate() on a gauge
rate(process_resident_memory_bytes[5m])  # Memory doesn't accumulate!
# ✅ RIGHT: deriv() for the rate of change of a gauge
deriv(process_resident_memory_bytes[5m])

# ❌ WRONG: sum() before rate()
rate(sum(http_requests_total)[5m:])  # Hides individual counter resets
# ✅ RIGHT: rate() first, then sum()
sum(rate(http_requests_total[5m]))

# ❌ WRONG: Averaging percentiles
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# ✅ RIGHT: Aggregate buckets, then calculate the percentile
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# ❌ WRONG: Range too short for the scrape interval
rate(http_requests_total[15s])  # If scrape_interval is 15s, may have 0-1 samples
# ✅ RIGHT: Range should be 4x+ the scrape interval
rate(http_requests_total[1m])  # 4 samples minimum for a 15s scrape

# ❌ WRONG: Subquery in an alert (expensive)
alert: HighLatencyP99
expr: |
  avg_over_time(
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))[1h:1m]
  ) > 1
# ✅ RIGHT: Use a recording rule, alert on the recorded metric
alert: HighLatencyP99
expr: job:http_request_duration:p99 > 1

# ❌ WRONG: Missing le in histogram aggregation
histogram_quantile(0.99,
  sum by (service) (rate(http_request_duration_seconds_bucket[5m]))  # Where's le?
)
# ✅ RIGHT: Include le in the aggregation
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

Many anti-patterns don't cause errors; they produce wrong numbers. A sum()-before-rate() query will run and return values; they'll just be incorrect around counter resets. Test your queries against expected behavior.
These advanced patterns solve common real-world problems:
# AVAILABILITY (SLO)
# "What percentage of time was the service up in the last 30 days?"
avg_over_time(up{job="api"}[30d]) * 100
# Returns a percentage (e.g., 99.95)

# SLO BURN RATE
# "How fast are we consuming our error budget?"
# Error budget for a 99.9% SLO: 0.1%
# If the error rate is 0.2%, burn rate = 0.2 / 0.1 = 2x
(
  1 - (
    sum(rate(http_requests_total{status="200"}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  )
) / 0.001  # 0.001 = error budget for a 99.9% SLO

# MULTI-WINDOW SLO (Google SRE approach)
# Fast window catches acute outages, slow window catches gradual degradation
# Alert if burning budget fast over 1h AND confirmed over 6h
(
  (1 - sum(rate(http_requests_total{status="200"}[1h])) / sum(rate(http_requests_total[1h])))
    > (14.4 * 0.001)  # 14.4x burn rate
  and
  (1 - sum(rate(http_requests_total{status="200"}[6h])) / sum(rate(http_requests_total[6h])))
    > (6 * 0.001)  # 6x burn rate
)

# COMPARISON TO LAST WEEK
# "Is traffic higher than at the same time last week?"
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1w)
# > 1 means higher than last week

# ANOMALY DETECTION (simple z-score)
# "Is the current value more than 3 standard deviations from the average?"
(
  rate(http_requests_total[5m])
  -
  avg_over_time(rate(http_requests_total[5m])[1d:5m])
)
/
stddev_over_time(rate(http_requests_total[5m])[1d:5m])
# Values beyond ±3 are likely anomalies

# TOP RESOURCE CONSUMERS
# "Which pods are using the most memory relative to their limits?"
topk(10,
  container_memory_usage_bytes{container!=""}
  / on (namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory"}
)

# SERVICE DEPENDENCY HEALTH
# "What's the health of the services I depend on, weighted by call volume?"
sum by (downstream_service) (
  rate(http_client_requests_total{status!~"5.."}[5m])
)
/
sum by (downstream_service) (
  rate(http_client_requests_total[5m])
)

Label Manipulation Functions:
# label_replace: Add/modify labels using regex
label_replace(
rate(http_requests_total[5m]),
"status_class",
"${1}xx",
"status",
"(.).*"
)
# Adds status_class="2xx" from status="200"
# label_join: Combine multiple labels into one
label_join(
up,
"full_target",
":",
"job", "instance"
)
# Creates full_target="api:10.0.0.1:8080"
# group(): Create a series with value 1 for grouping/joining
http_requests_total * on (instance) group_left
group(up) by (instance)
# Keeps each request series whose instance has an up series (group() always returns 1, so values are unchanged)
Build complex queries step by step. Start with the innermost expression and add layers. At each step, verify the output makes sense before proceeding. PromQL debugging is easier when you isolate which layer produces unexpected results.
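For example, the p99-by-service query from earlier can be built one layer at a time, checking each stage's output before adding the next:
# Step 1: raw range vector - do the bucket series exist with the labels you expect?
http_request_duration_seconds_bucket[5m]

# Step 2: per-series rate - are the values non-zero and plausible?
rate(http_request_duration_seconds_bucket[5m])

# Step 3: aggregate, keeping 'le' - one set of buckets per service?
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))

# Step 4: apply the quantile last
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))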
Effective querying and aggregation transform raw metrics into actionable insights. The patterns and techniques covered here will serve you from ad-hoc debugging to production dashboard design.
Module Complete:
You have completed the Metrics Collection module.
This knowledge enables you to design, collect, and query metrics that scale with your systems and provide actionable observability.
Congratulations! You have mastered metrics collection. From understanding metric types through advanced PromQL patterns, you now have the knowledge to build and maintain world-class observability systems. The next module covers Distributed Tracing—the second pillar of observability.