Collecting millions of samples means nothing if you can't extract meaning from them. The power of a metrics system lies not in the data it stores, but in the questions it can answer. How many requests did we serve last hour? What's the 99th percentile latency by service? Which endpoints are degrading?
Effective querying and aggregation transform raw time-series data into actionable insights. But writing efficient queries at scale requires understanding how time-series databases process data, when to pre-aggregate with recording rules, and how to structure dashboards for performance.
This page provides a comprehensive guide to PromQL aggregation patterns, performance optimization, and building dashboards that remain responsive even with millions of time series.
By the end of this page, you will master PromQL aggregation operators, understand query performance characteristics, know when to use recording rules, and be able to build dashboards that scale. You'll also learn common query anti-patterns that kill performance.
PromQL provides powerful aggregation operators that collapse multiple time series into fewer series. Understanding these operators is fundamental to effective querying.
Core Aggregation Operators:
| Operator | Description | Example |
|---|---|---|
| sum | Total of all values | sum(rate(http_requests_total[5m])) |
| avg | Arithmetic mean | avg(node_cpu_seconds_total) |
| min | Minimum value | min(up) |
| max | Maximum value | max(container_memory_usage_bytes) |
| count | Number of series | count(up{job="api"}) |
| stddev | Standard deviation | stddev(http_request_duration_seconds) |
| stdvar | Standard variance | stdvar(http_request_duration_seconds) |
| topk | Largest k values | topk(5, rate(http_requests_total[5m])) |
| bottomk | Smallest k values | bottomk(3, up) |
| count_values | Count by value | count_values("version", build_info) |
| quantile | φ-quantile over series | quantile(0.9, rate(http_requests_total[5m])) |
| group | Group series (value=1) | group(up) by (job) |
Grouping with by and without:
Aggregations can preserve specific labels using by (keep only these) or without (drop these, keep all others):
# Keep only job and endpoint labels
sum by (job, endpoint) (rate(http_requests_total[5m]))
# Drop instance label, keep all others
sum without (instance) (rate(http_requests_total[5m]))
# Both produce grouped aggregates, but:
# - by(): result has ONLY specified labels
# - without(): result has ALL labels EXCEPT specified ones
Choosing by vs without:
| Scenario | Use |
|---|---|
| You know exactly which labels you want | by (label1, label2) |
| You want all labels except a few | without (instance, pod) |
| Service-level aggregation | without (instance, pod) |
| Capacity planning by dimension | by (region, service) |
# Total request rate across all instances
sum(rate(http_requests_total[5m]))
# Result: single value

# Request rate per service
sum by (service) (rate(http_requests_total[5m]))
# Result: one series per service

# Request rate per endpoint, dropping instance detail
sum without (instance, pod) (rate(http_requests_total[5m]))
# Result: aggregated by all other labels

# Multiple aggregations combined
# Total requests by service AND status category
sum by (service, status_class) (
  rate(http_requests_total[5m])
)
# Note: status_class might come from label_replace or a recording rule

# Top 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# Average CPU across all nodes vs max CPU on any node
avg(node_cpu_usage_ratio)  # Typical utilization
max(node_cpu_usage_ratio)  # Hottest node

# Count of unique values for a label
count_values("version", myapp_build_info)
# Returns series like {version="1.2.3"} = 5 (5 instances running this version)

Apply rate() before sum(). 'sum(rate(counter[5m]))' is correct; 'rate(sum(counter)[5m:])' produces wrong results because summing counters before rate() masks individual counter resets and produces discontinuities.
When combining two vectors (e.g., dividing errors by total requests), PromQL must match series from both sides. This matching is precise and can be customized.
Default Matching:
By default, series match when ALL labels are identical:
# Error ratio: errors / total
sum by (service) (rate(http_errors_total[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
# Both sides must have exactly matching 'service' labels
# {service="api"} / {service="api"} ✓
# {service="api"} / {service="web"} ✗ (no match)
Explicit Matching with on and ignoring:
When series have different label sets, use on or ignoring:
# Match only on specified labels
http_requests_total
/ on (job, instance)
http_requests_started_total
# Match on all labels except specified ones
http_requests_total
/ ignoring (status_code)
http_requests_total{status_code="200"}
One-to-Many Matching with group_left and group_right:
When one side has fewer series than the other (many-to-one relationship), use group modifiers:
# Scenario: Add team name from service metadata
# Metrics: http_requests_total{service="api", endpoint="/users"}
# Metadata: service_team_info{service="api", team="platform"} = 1
# Add team label to each request series
http_requests_total
* on (service) group_left (team)
service_team_info
# Result: http_requests_total{service="api", endpoint="/users", team="platform"}
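group_right works the same way with the operands swapped: the "many" side is on the right, and the listed labels are copied from the left. A minimal sketch reusing the hypothetical metadata metric above:
# Same enrichment, written with the metadata metric on the left
service_team_info
* on (service) group_right (team)
http_requests_total
# Result labels (and values) are identical to the group_left version above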
# BASIC: Error rate calculation (same labels on both sides)
sum by (service) (rate(http_errors_total[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# EXPLICIT MATCHING: Join with fewer labels
# Use case: divide per-endpoint metrics by per-service capacity
# group_left is required because many endpoint series share one capacity series
sum by (service, endpoint) (rate(http_requests_total[5m]))
/ on (service) group_left
service_capacity_requests_per_second

# MANY-TO-ONE with group_left: Enrich with metadata
# Add region label from instance metadata
rate(http_requests_total[5m])
* on (instance) group_left (region)
instance_region_info

# ALTERNATIVE: Use label_replace for computed labels
# If you don't have a metadata metric
sum by (team) (
  label_replace(
    rate(http_requests_total[5m]),
    "team", "$1", "service", "(api|auth|users).*"  # Regex derives team from the service name
  )
)

# COMPLEX JOIN: Multiple enrichments
rate(http_requests_total[5m])
* on (instance) group_left (region, environment)
instance_metadata
* on (service) group_left (team, cost_center)
service_metadata

PromQL doesn't support many-to-many matching. If multiple series on the left match multiple series on the right, the query fails. Ensure your matching produces one-to-one or one-to-many (with group_left/group_right) relationships.
Time-series data demands functions that operate over time windows. PromQL provides a rich set of functions for analyzing trends, rates, and changes.
Range Vector Functions:
These functions apply to range vectors (e.g., metric[5m]):
| Function | Use With | Description |
|---|---|---|
| rate() | Counters | Per-second average rate over the range |
| irate() | Counters | Instant rate using last two samples |
| increase() | Counters | Total increase over the range |
| delta() | Gauges | Difference between first and last |
| deriv() | Gauges | Per-second derivative (slope) |
| avg_over_time() | Any | Average value over the range |
| max_over_time() | Any | Maximum value over the range |
| min_over_time() | Any | Minimum value over the range |
| sum_over_time() | Any | Sum of all values in range |
| count_over_time() | Any | Number of samples in range |
| quantile_over_time() | Any | φ-quantile over time |
| predict_linear() | Gauges | Linear prediction of future value |
| changes() | Any | Number of value changes |
| resets() | Counters | Number of counter resets |
# RATE vs IRATE
# rate(): smoothed average rate - better for alerting
rate(http_requests_total[5m])

# irate(): instant rate from the last two samples - shows spikes
irate(http_requests_total[5m])

# INCREASE: Total over a period
# "How many requests in the last hour?"
increase(http_requests_total[1h])

# DELTA for gauges
# "How much did memory change in the last hour?"
delta(process_resident_memory_bytes[1h])

# PREDICTIONS
# "When will the disk fill up?"
predict_linear(node_filesystem_free_bytes[6h], 24*3600)  # Predict 24h ahead

# Time until the disk is full at the current rate
node_filesystem_free_bytes / (-deriv(node_filesystem_free_bytes[6h]))

# OVER TIME for gauges
# Average CPU over the last hour
avg_over_time(node_cpu_usage_ratio[1h])

# Maximum memory in the last day
max_over_time(process_resident_memory_bytes[1d])

# CHANGES: Detect instability
# How many times did this process restart?
changes(process_start_time_seconds[1h])

# RESETS: Detect counter resets (process restarts)
resets(http_requests_total[1d])

# SUBQUERIES: Apply a function over another function's output
# Average of the 5-minute rate, sampled every minute, over the last hour
avg_over_time(rate(http_requests_total[5m])[1h:1m])

Subqueries: Nested Time Operations
Subqueries allow applying a function over a range of another function's output:
# Syntax: <expression>[<range>:<resolution>]
# Average rate over the last hour, sampled every minute
avg_over_time(rate(http_requests_total[5m])[1h:1m])
# Max of the 5-minute rate in the last 24 hours
max_over_time(rate(http_requests_total[5m])[24h:5m])
# Smooths noisy rates by averaging them over time
Note: Subqueries are computationally expensive. Consider recording rules for frequently used patterns.
The rate() range should be at least 4x your scrape interval to handle missed scrapes gracefully. For 15s scrape interval, use rate(metric[1m]) minimum. 5-minute windows are commonly used for stability.
Histograms enable powerful latency analysis, but querying them correctly requires understanding their structure.
Histogram Anatomy:
A histogram metric creates multiple series:
http_request_duration_seconds_bucket{le="0.1"} # Observations ≤ 0.1s
http_request_duration_seconds_bucket{le="0.5"} # Observations ≤ 0.5s
http_request_duration_seconds_bucket{le="+Inf"} # All observations
http_request_duration_seconds_sum # Total of all observations
http_request_duration_seconds_count # Number of observations
Buckets are cumulative—the 0.5s bucket includes all values that are also in the 0.1s bucket.
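As a concrete illustration (hypothetical values), three observations of 0.05s, 0.3s, and 0.7s would produce these series:
http_request_duration_seconds_bucket{le="0.1"}   1     # only the 0.05s observation
http_request_duration_seconds_bucket{le="0.5"}   2     # 0.05s and 0.3s
http_request_duration_seconds_bucket{le="+Inf"}  3     # all three observations
http_request_duration_seconds_sum                1.05  # 0.05 + 0.3 + 0.7
http_request_duration_seconds_count              3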
# PERCENTILES (quantiles)
# p50 (median) request duration
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# p90 request duration
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))

# p99 request duration (common SLO target)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# AGGREGATED PERCENTILES
# p99 by service (aggregate buckets first, then calculate)
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

# AVERAGE request duration
# sum / count = average
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# AVERAGE by endpoint
sum by (endpoint) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))

# REQUEST RATE from the histogram
rate(http_request_duration_seconds_count[5m])
# This equals rate(http_request_duration_seconds_bucket{le="+Inf"}[5m])

# HEATMAP DATA (for Grafana)
# Use the bucket metric with rate() - Grafana understands this format
rate(http_request_duration_seconds_bucket[5m])

# SLO: "What % of requests are under 100ms?"
sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# APDEX: Approximate user satisfaction
# Apdex = (satisfied + tolerating/2) / total
# Satisfied: <400ms, Tolerating: 400ms-2s, Frustrated: >2s
# The le="2" bucket is cumulative (it already includes the satisfied requests),
# so (satisfied + cumulative_2s) / 2 = satisfied + tolerating/2
(
  sum(rate(http_request_duration_seconds_bucket{le="0.4"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Histogram Aggregation Rules:
- Always apply rate() to the buckets before aggregating or computing quantiles.
- The le label must be preserved in any aggregation feeding histogram_quantile().
- Apply histogram_quantile() last; never average already-computed quantiles.

Correct vs Incorrect:
# CORRECT: rate → aggregate → quantile
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# WRONG: quantile first, then average (statistically invalid)
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# WRONG: missing le in aggregation (histogram_quantile fails)
histogram_quantile(0.99,
sum by (service) (rate(http_request_duration_seconds_bucket[5m])) # No 'le'!
)
histogram_quantile() uses linear interpolation between buckets. If your buckets are sparse in the relevant range (e.g., no bucket between 100ms and 1s when measuring p99 around 200ms), accuracy suffers. Place buckets around your SLO thresholds for best accuracy.
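A rough worked example (hypothetical counts) shows how far the interpolation can drift when the only buckets around your p99 are le="0.1" and le="1":
# Hypothetical cumulative bucket counts over the query range:
#   le="0.1"  -> 980 observations
#   le="1"    -> 1000 observations
#   le="+Inf" -> 1000 observations
# histogram_quantile(0.99, ...) looks for rank 0.99 * 1000 = 990, which falls
# in the (0.1, 1] bucket at position (990 - 980) / (1000 - 980) = 0.5.
# Linear interpolation reports 0.1 + 0.5 * (1 - 0.1) = 0.55s,
# even if every observation in that bucket was actually close to 0.2s.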
Query performance directly impacts dashboard responsiveness and alerting reliability. Slow queries cause timeouts, stale data, and frustrated users.
Performance Factors:
| Factor | Impact | Mitigation |
|---|---|---|
| Series count | Linear memory/CPU | Be selective with labels |
| Time range | Linear data scanned | Use shorter ranges when possible |
| Step/resolution | More points = more work | Match resolution to visualization need |
| Functions applied | Some are expensive | Avoid subqueries if possible |
| Regex in selectors | Can be slow | Prefer exact matches |
| Query complexity | Multiplicative | Simplify, use recording rules |
Selector Optimization:
# FAST: Exact label match
http_requests_total{job="api", status="200"}
# SLOWER: Regex match
http_requests_total{endpoint=~"/api/.*"}
# SLOWEST: Negative regex across many series
http_requests_total{endpoint!~"/(health|metrics|ready)"}
# OPTIMIZATION: Prefer positive filters
# Instead of matching everything except health endpoints,
# match the specific endpoints you care about
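One way to express that positive filter (the endpoint names here are placeholders):
# Anchored, positive alternation instead of a negative regex
http_requests_total{endpoint=~"/api/(users|orders|payments)"}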
Recording Rules for Expensive Queries:
If a query is used in multiple dashboards or takes >500ms, create a recording rule:
groups:
  - name: performance_rules
    interval: 30s  # Pre-compute every 30 seconds
    rules:
      # Expensive: aggregates across all instances and endpoints
      - record: job:http_request_rate:sum5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Very expensive: histogram quantile with aggregation
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Expensive: error ratio calculation
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

# Now dashboards query the recorded metrics instead:
# job:http_request_rate:sum5m is instant, not computed on query

Prometheus exposes its own query metrics: prometheus_engine_query_duration_seconds shows the query latency distribution, and prometheus_engine_queries_concurrent_max shows the configured concurrency limit. Track these to identify performance regressions.
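A quick sketch of how you might watch those engine metrics (label names such as slice and the exposed quantiles can vary by Prometheus version):
# Slowest query-execution phase at the 90th percentile
max by (slice) (prometheus_engine_query_duration_seconds{quantile="0.9"})

# Queries currently executing or queued vs the configured concurrency limit
prometheus_engine_queries / prometheus_engine_queries_concurrent_max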
A dashboard with 20 panels, each querying a 7-day range at 1-minute resolution, can overwhelm even a well-tuned Prometheus. Smart dashboard design maintains responsiveness.
Panel Efficiency:
| Approach | Impact | Recommendation |
|---|---|---|
| Time range | Data volume per panel | Default to 1-6 hours; use longer ranges only when needed |
| Resolution | Points computed | Use auto resolution or match visual need |
| Query reuse | Duplicate work | Use variables and templating |
| Recording rules | Query execution time | Pre-compute any query that takes >500ms |
| Panel count | Concurrent queries | Max 20-30 panels per dashboard |
| Refresh rate | Query frequency | Match to data cadence (not faster than scrape) |
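A rough back-of-the-envelope calculation (hypothetical dashboard shape) shows why these limits matter:
# 7-day range at 1m resolution      -> 7 * 24 * 60 = 10,080 points per series
# 20 panels x 10 series per panel   -> 200 series
# Points computed per refresh       -> 200 * 10,080 ≈ 2,000,000
# Same dashboard at a 15m step      -> 200 * 672 ≈ 134,000 (roughly 15x less work)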
Dashboard Structure Best Practices:
Service Dashboard Structure:
├── Overview Row (4-6 panels)
│ ├── Traffic overview (single stat)
│ ├── Error rate (single stat with threshold)
│ ├── P99 latency (single stat)
│ └── Availability (single stat)
├── Request Details Row (3-4 panels)
│ ├── Request rate over time (graph)
│ ├── Error rate over time (graph)
│ └── Latency percentiles (graph)
├── Resource Utilization Row (3-4 panels)
│ ├── CPU by instance (graph)
│ ├── Memory by instance (graph)
│ └── Network I/O (graph)
└── Detailed Breakdown (collapsed by default)
├── Per-endpoint latency heatmap
├── Per-endpoint error rates
└── Instance details
Key principles are illustrated in the panel queries below:
# SINGLE STAT: Use an instant query, not a range query
# For "current error rate", you don't need a graph
# Query type: instant
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# GRAPH: Use appropriate resolution
# For a 1-hour view: 15s step is plenty (240 points)
# For a 24-hour view: 1m step is sufficient (1440 points)
# For a 7-day view: 15m step still gives 672 points

# USE RECORDING RULES in graphs
# Instead of computing histogram_quantile each time
job:http_request_duration:p99{job="$service"}

# TEMPLATE VARIABLES for reuse
# A single query populates the variable, which is then used in all panels
sum by (job) (up{job=~"$service"})

# CONDITIONAL DISPLAY
# Use Grafana thresholds instead of computing in the query
# Let Grafana color based on value; don't compute alert state in PromQL

Use Grafana's 'Lazy load panels' option for heavy dashboards. Panels below the fold won't query until scrolled into view. Combined with collapsible rows, this dramatically reduces initial load.
Avoid these common mistakes that hurt performance or produce incorrect results:
- Averaging percentiles: avg(histogram_quantile(...)) is statistically invalid. Aggregate buckets first, then compute the quantile.
- Unanchored selectors: {__name__=~".*"} matches every series in the database. Be specific.
# ❌ WRONG: rate() on a gauge
rate(process_resident_memory_bytes[5m])  # Memory doesn't accumulate!
# ✅ RIGHT: deriv() for the rate of change of a gauge
deriv(process_resident_memory_bytes[5m])

# ❌ WRONG: sum() before rate()
rate(sum(http_requests_total)[5m:])  # Hides individual counter resets
# ✅ RIGHT: rate() first, then sum()
sum(rate(http_requests_total[5m]))

# ❌ WRONG: Averaging percentiles
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
# ✅ RIGHT: Aggregate buckets, then calculate the percentile
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# ❌ WRONG: Range too short for the scrape interval
rate(http_requests_total[15s])  # If scrape_interval is 15s, may have 0-1 samples
# ✅ RIGHT: Range should be 4x+ the scrape interval
rate(http_requests_total[1m])  # 4 samples minimum for a 15s scrape

# ❌ WRONG: Subquery in an alert (expensive)
alert: HighLatencyP99
expr: |
  avg_over_time(
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))[1h:1m]
  ) > 1
# ✅ RIGHT: Use a recording rule, alert on the recorded metric
alert: HighLatencyP99
expr: job:http_request_duration:p99 > 1

# ❌ WRONG: Missing le in histogram aggregation
histogram_quantile(0.99,
  sum by (service) (rate(http_request_duration_seconds_bucket[5m]))  # Where's le?
)
# ✅ RIGHT: Include le in the aggregation
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

Many anti-patterns don't cause errors; they produce wrong numbers. A sum()-before-rate() query will run and return values; they'll just be incorrect around counter resets. Test your queries against expected behavior.
These advanced patterns solve common real-world problems:
# AVAILABILITY (SLO)
# "What percentage of time was the service up in the last 30 days?"
avg_over_time(up{job="api"}[30d]) * 100
# Returns a percentage (e.g., 99.95)

# SLO BURN RATE
# "How fast are we consuming our error budget?"
# Error budget for a 99.9% SLO: 0.1%
# If the error rate is 0.2%, burn rate = 0.2 / 0.1 = 2x
(
  1 - (
    sum(rate(http_requests_total{status="200"}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  )
) / 0.001  # 0.001 = error budget for a 99.9% SLO

# MULTI-WINDOW SLO (Google SRE approach)
# Fast window catches acute outages, slow window catches gradual degradation
# Alert if burning budget fast over 1h AND confirmed over 6h
(
  (1 - sum(rate(http_requests_total{status="200"}[1h])) / sum(rate(http_requests_total[1h])))
    > (14.4 * 0.001)  # 14.4x burn rate
  and
  (1 - sum(rate(http_requests_total{status="200"}[6h])) / sum(rate(http_requests_total[6h])))
    > (6 * 0.001)  # 6x burn rate
)

# COMPARISON TO LAST WEEK
# "Is traffic higher than at the same time last week?"
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1w)
# > 1 means higher than last week

# ANOMALY DETECTION (simple z-score)
# "Is the current value more than 3 standard deviations from the average?"
(
  rate(http_requests_total[5m])
  -
  avg_over_time(rate(http_requests_total[5m])[1d:5m])
)
/
stddev_over_time(rate(http_requests_total[5m])[1d:5m])
# Values beyond ±3 are likely anomalies

# TOP RESOURCE CONSUMERS
# "Which pods are using the most memory relative to their limits?"
topk(10,
  container_memory_usage_bytes{container!=""}
  / on (namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory"}
)

# SERVICE DEPENDENCY HEALTH
# "What's the health of the services I depend on, weighted by call volume?"
sum by (downstream_service) (
  rate(http_client_requests_total{status!~"5.."}[5m])
)
/
sum by (downstream_service) (
  rate(http_client_requests_total[5m])
)

Label Manipulation Functions:
# label_replace: Add/modify labels using regex
label_replace(
rate(http_requests_total[5m]),
"status_class",
"${1}xx",
"status",
"(.).*"
)
# Adds status_class="2xx" from status="200"
# label_join: Combine multiple labels into one
label_join(
up,
"full_target",
":",
"job", "instance"
)
# Creates full_target="api:10.0.0.1:8080"
# group(): Create a series with value 1 for grouping/joining
http_requests_total * on (instance) group_left
group(up) by (instance)
# Keeps each request series whose instance has an up series (group() always returns 1, so values are unchanged)
Build complex queries step by step. Start with the innermost expression and add layers. At each step, verify the output makes sense before proceeding. PromQL debugging is easier when you isolate which layer produces unexpected results.
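For example, the p99-by-service query from earlier can be built one layer at a time, checking each stage's output before adding the next:
# Step 1: raw range vector - do the bucket series exist with the labels you expect?
http_request_duration_seconds_bucket[5m]

# Step 2: per-series rate - are the values non-zero and plausible?
rate(http_request_duration_seconds_bucket[5m])

# Step 3: aggregate, keeping 'le' - one set of buckets per service?
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))

# Step 4: apply the quantile last
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))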
Effective querying and aggregation transform raw metrics into actionable insights. The patterns and techniques covered here will serve you from ad-hoc debugging to production dashboard design.
Module Complete:
You have completed the Metrics Collection module.
This knowledge enables you to design, collect, and query metrics that scale with your systems and provide actionable observability.
Congratulations! You have mastered metrics collection. From understanding metric types through advanced PromQL patterns, you now have the knowledge to build and maintain world-class observability systems. The next module covers Distributed Tracing—the second pillar of observability.