Every production system tells a story through numbers. The rate of incoming requests, the amount of memory consumed, the distribution of response times—these numerical signals describe what's happening inside our systems at any given moment. But raw numbers alone don't create understanding. The type of metric you choose fundamentally shapes how you interpret that number and what questions you can answer.
Consider a simple question: "How many requests is my API processing?" This seemingly straightforward question has multiple valid interpretations: the total number of requests served since the process started, the rate of requests per second right now, the number of requests currently in flight, or how request handling times are distributed.
Each interpretation requires a different metric type. Choosing the wrong type doesn't just limit your analysis—it can make your data entirely misleading. Understanding metric types is the prerequisite to building any observability system.
By the end of this page, you will deeply understand the three fundamental metric types—counters, gauges, and histograms—their mathematical properties, appropriate use cases, common pitfalls, and how they interact with time-series databases and query languages.
At their core, metrics are time-stamped numerical measurements. But not all numbers behave the same way. Some numbers only go up (total requests processed). Some fluctuate freely (current memory usage). Some capture distributions (request latencies). Each behavior pattern demands a different data model.
The observability industry has converged on three fundamental metric types, each optimized for different measurement patterns:
| Metric Type | Behavior | Primary Use Case | Key Property |
|---|---|---|---|
| Counter | Monotonically increasing | Counting events over time | Can only go up (or reset to zero) |
| Gauge | Arbitrary values | Measuring current state | Can go up, down, or stay constant |
| Histogram | Bucketed distributions | Understanding value distributions | Captures percentiles and ranges |
These three types aren't arbitrary categories—they're mathematical models that enable specific operations. A counter's monotonicity allows calculating rates. A gauge's point-in-time nature enables sampling. A histogram's buckets enable percentile estimation. Choose the right type, and your observability system works seamlessly. Choose the wrong type, and you'll fight your tools at every step.
Why only three types?
You might wonder why we don't have more metric types. The answer lies in balancing expressiveness with simplicity. These three types can represent virtually any measurement pattern while remaining simple enough to implement efficiently in time-series databases. Additional types (like summaries) exist in some systems but are typically variations on these fundamentals.
Think of metric types as contracts. When you declare a metric as a counter, you're promising the database that this value will never decrease (except on reset). The database uses this promise to optimize storage and enable specific operations like rate calculations. Breaking this contract—like using a counter for something that decreases—breaks the math that depends on it.
A counter is a cumulative metric that represents a single monotonically increasing value. It can only go up—never down—though it may reset to zero when the process restarts. Counters are the workhorse of operational metrics, used whenever you need to count discrete events.
Mathematical Properties:
Let's denote a counter value at time t as C(t). The fundamental property of a counter is:
C(t₂) ≥ C(t₁) for all t₂ > t₁ (absent resets)
This monotonicity property enables the most important operation on counters: rate calculation. The rate of events over a time window [t₁, t₂] is simply:
rate = (C(t₂) - C(t₁)) / (t₂ - t₁)
This is why the absolute value of a counter is usually uninteresting—what matters is how fast it's changing.
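To make the arithmetic concrete, here is a minimal Go sketch applying the rate formula to two scrapes of a counter (the sample values and timestamps are made up):

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    // Two hypothetical scrapes of http_requests_total, 15 seconds apart.
    c1, t1 := 1042.0, time.Unix(1_700_000_000, 0)
    c2, t2 := 1342.0, time.Unix(1_700_000_015, 0)

    // rate = (C(t2) - C(t1)) / (t2 - t1)  →  (1342 - 1042) / 15 = 20 req/s
    rate := (c2 - c1) / t2.Sub(t1).Seconds()
    fmt.Printf("%.1f requests/second\n", rate)
}
```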
In Go with the Prometheus client library, counters are defined once and then incremented (or added to) wherever the corresponding event occurs:

```go
package main

import (
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Define counters for HTTP request tracking
var (
    // Total HTTP requests received (counter)
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests received",
        },
        []string{"method", "endpoint", "status_code"},
    )

    // Total bytes received (counter)
    bytesReceivedTotal = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "http_request_bytes_total",
            Help: "Total bytes received in HTTP request bodies",
        },
    )

    // Errors by type (counter)
    errorsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "application_errors_total",
            Help: "Total number of application errors by type",
        },
        []string{"error_type", "service"},
    )
)

// Usage in request handler
func handleRequest(method, endpoint string, bodySize int64, statusCode int) {
    // Increment request counter with labels
    httpRequestsTotal.WithLabelValues(method, endpoint, strconv.Itoa(statusCode)).Inc()

    // Add bytes received
    bytesReceivedTotal.Add(float64(bodySize))

    // Track errors
    if statusCode >= 500 {
        errorsTotal.WithLabelValues("server_error", "api").Inc()
    } else if statusCode >= 400 {
        errorsTotal.WithLabelValues("client_error", "api").Inc()
    }
}
```

Handling Counter Resets
Counters reset to zero when processes restart. This is expected behavior, but your monitoring system must handle it gracefully. Most time-series databases like Prometheus automatically detect and compensate for resets using algorithms like rate() and increase().
When a counter resets, the rate() function detects that the current value is lower than the previous sample and assumes a reset occurred. It calculates the rate using the new value, treating zero as the starting point. This typically works well, but extremely short scrape intervals during rapid restarts can cause accuracy issues.
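Here is a simplified Go sketch of that reset-compensation idea; the real rate() and increase() also extrapolate to the edges of the query window, which this sketch omits:

```go
package main

import "fmt"

// increase sums counter deltas across ordered samples, compensating for
// resets: a sample lower than its predecessor is assumed to follow a reset
// to zero, so the whole new value counts as increase.
func increase(samples []float64) float64 {
    var total float64
    for i := 1; i < len(samples); i++ {
        delta := samples[i] - samples[i-1]
        if delta < 0 { // counter reset detected
            delta = samples[i]
        }
        total += delta
    }
    return total
}

func main() {
    // The process restarted between the 3rd and 4th scrape.
    samples := []float64{100, 150, 210, 30, 90}
    fmt.Println(increase(samples)) // 200: 50 + 60 + 30 + 60
}
```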
Anti-Pattern Warning:
Never decrement a counter. If you find yourself wanting to subtract from a counter, you're using the wrong metric type. Counters are for things that only accumulate. If your value can decrease, use a gauge instead.
Don't use counters for values that can decrease, like queue depth or active connections. 'http_active_requests' that decrements when requests complete is NOT a counter—it's a gauge. The counter equivalent would be 'http_requests_started_total' and 'http_requests_completed_total', where active requests = started - completed.
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are used for measured values like temperature, current memory usage, or the number of concurrent requests. Unlike counters, the absolute value of a gauge is meaningful at any point in time.
Mathematical Properties:
A gauge G(t) at time t has no constraints on its relationship to previous values:
G(t₂) can be >, <, or = to G(t₁) for any t₂ > t₁
This flexibility means gauges represent snapshots. When you query a gauge, you're asking: "What was the value at this moment?" This is fundamentally different from counters, where you typically ask about change over time.
The Sampling Challenge:
Gauges present a unique challenge: values between samples are unknown. If you sample memory usage once per minute and see 50% at 12:00 and 80% at 12:01, you don't know what happened in between. It could have spiked to 100% and come back down. For volatile gauges, sampling frequency matters enormously.
package main import ( "runtime" "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto") var ( // Memory usage gauge memoryBytes = promauto.NewGaugeVec( prometheus.GaugeOpts{ Name: "process_memory_bytes", Help: "Current memory usage in bytes", }, []string{"type"}, ) // Active connections gauge activeConnections = promauto.NewGaugeVec( prometheus.GaugeOpts{ Name: "active_connections", Help: "Number of currently active connections", }, []string{"protocol", "state"}, ) // Queue depth gauge queueDepth = promauto.NewGaugeVec( prometheus.GaugeOpts{ Name: "queue_depth", Help: "Number of items currently in queue", }, []string{"queue_name", "priority"}, ) // In-flight requests gauge inFlightRequests = promauto.NewGaugeVec( prometheus.GaugeOpts{ Name: "http_requests_in_flight", Help: "Number of HTTP requests currently being processed", }, []string{"handler"}, )) // Update memory metrics periodicallyfunc updateMemoryMetrics() { var m runtime.MemStats runtime.ReadMemStats(&m) memoryBytes.WithLabelValues("heap_alloc").Set(float64(m.HeapAlloc)) memoryBytes.WithLabelValues("heap_sys").Set(float64(m.HeapSys)) memoryBytes.WithLabelValues("stack").Set(float64(m.StackInuse))} // Track request lifecyclefunc handleWithMetrics(handler string, next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { inFlightRequests.WithLabelValues(handler).Inc() defer inFlightRequests.WithLabelValues(handler).Dec() next.ServeHTTP(w, r) })} // Track connectionstype connectionTracker struct { protocol string} func (ct *connectionTracker) OnConnect() { activeConnections.WithLabelValues(ct.protocol, "established").Inc()} func (ct *connectionTracker) OnDisconnect() { activeConnections.WithLabelValues(ct.protocol, "established").Dec()}Gauge Aggregation Challenges
Aggregating gauges across instances requires careful thought. Consider "current queue depth" across 10 service replicas: summing gives the total backlog across the fleet, averaging gives the typical per-instance load, and taking the maximum highlights the worst-off replica; each answers a different question. Contrast this with counters, where summing rates is almost always correct. Gauge aggregation depends heavily on what you're measuring.
Ephemeral Gauges and Staleness
When an instance dies, its gauge values stop updating. Prometheus marks these series as "stale" after a configurable period. For alerting on gauges, consider using `absent()` to detect metrics that have disappeared entirely.
For volatile values like in-flight requests, consider also exposing related counters (requests started, requests completed). This gives you both the instantaneous snapshot (gauge) and the ability to calculate rates over time (counters), providing a more complete picture.
A histogram samples observations (typically request latencies or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values and a total count of observations. Histograms are essential for understanding the distribution of values, not just their average.
Why Averages Lie:
Consider an API where, out of 100 requests, 98 complete in about 10ms and 2 take roughly 5 seconds.
The average is 109ms, which tells you almost nothing useful. 98% of users experience 10ms, while 2% suffer 5-second delays. A histogram reveals this bimodal distribution; an average hides it.
Histogram Structure:
A histogram actually exposes multiple time series:
- `<metric>_bucket{le="<upper_bound>"}`: cumulative count of observations ≤ upper_bound
- `<metric>_sum`: total sum of all observed values
- `<metric>_count`: total count of observations

The "le" (less than or equal) buckets are cumulative. If you have buckets at 10ms, 50ms, and 100ms, the 50ms bucket includes all observations in the 10ms bucket.
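To make the cumulative behavior concrete, here is a small Go sketch (with made-up observations and bucket bounds) that counts observations into le-style buckets by hand:

```go
package main

import "fmt"

func main() {
    // Five made-up latency observations (seconds) and three bucket bounds.
    observations := []float64{0.004, 0.008, 0.030, 0.070, 0.120}
    bounds := []float64{0.010, 0.050, 0.100} // le="0.01", le="0.05", le="0.1"

    counts := make([]int, len(bounds)+1) // last slot plays the role of le="+Inf"
    for _, v := range observations {
        for i, b := range bounds {
            if v <= b {
                counts[i]++ // cumulative: every bound at or above v counts it
            }
        }
        counts[len(bounds)]++ // the +Inf bucket counts every observation
    }

    fmt.Println(counts) // [2 3 4 5]: each bucket includes everything below it
}
```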
package main import ( "time" "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto") var ( // HTTP request duration histogram with custom buckets httpRequestDuration = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "http_request_duration_seconds", Help: "HTTP request latency distribution in seconds", // Buckets designed for typical API response times // .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10 Buckets: prometheus.DefBuckets, }, []string{"method", "endpoint", "status_code"}, ) // Custom buckets for database query times dbQueryDuration = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "db_query_duration_seconds", Help: "Database query latency distribution", // Custom buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 5}, }, []string{"query_type", "table"}, ) // Response size histogram (in bytes) responseSize = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "http_response_size_bytes", Help: "HTTP response size distribution in bytes", // Exponential buckets: 100B, 1KB, 10KB, 100KB, 1MB, 10MB Buckets: prometheus.ExponentialBuckets(100, 10, 6), }, []string{"endpoint"}, )) // Time HTTP request handlingfunc handleHTTPRequest(method, endpoint string) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { start := time.Now() // Wrap ResponseWriter to capture status and size wrapped := &responseRecorder{ResponseWriter: w, status: 200} // Handle request... handler.ServeHTTP(wrapped, r) // Record metrics duration := time.Since(start).Seconds() httpRequestDuration.WithLabelValues( method, endpoint, strconv.Itoa(wrapped.status), ).Observe(duration) responseSize.WithLabelValues(endpoint).Observe(float64(wrapped.written)) })} // Time database queriesfunc executeQuery(queryType, table, query string) (Result, error) { start := time.Now() result, err := db.Query(query) duration := time.Since(start).Seconds() dbQueryDuration.WithLabelValues(queryType, table).Observe(duration) return result, err}Choosing Bucket Boundaries
Bucket selection is both art and science. Poor bucket boundaries waste storage or lose precision:
| Problem | Cause | Solution |
|---|---|---|
| All observations in first bucket | Lower buckets too high | Add smaller buckets (e.g., 1ms, 5ms) |
| All observations in +Inf bucket | Upper buckets too low | Add larger buckets beyond expected max |
| Can't distinguish p50 from p90 | Too few buckets | Add buckets in the relevant range |
| Cardinality explosion | Too many buckets | Use fewer, strategically placed buckets |
Guidelines for bucket selection: cover the full range you expect to observe (so nothing piles up in the first or +Inf bucket), place buckets close to your SLO thresholds, use roughly exponential spacing when values span a wide range, and keep the total number of buckets small to control cardinality, as sketched below.
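As one way to apply these guidelines, the following sketch (a hypothetical checkout-latency metric) packs buckets densely around a 100ms SLO and keeps only a few coarse buckets for the tail, using client_golang's bucket helpers:

```go
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Buckets tuned around a hypothetical 100ms SLO: dense near the threshold,
// coarse for the outlier tail. LinearBuckets(start, width, count) and
// ExponentialBuckets(start, factor, count) are client_golang helpers.
var checkoutLatency = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "checkout_request_duration_seconds",
    Help: "Checkout latency, with dense buckets around the 100ms SLO",
    Buckets: append(
        prometheus.LinearBuckets(0.050, 0.025, 5), // 50ms, 75ms, 100ms, 125ms, 150ms
        0.5, 1, 2.5, 5, // coarse tail buckets for outliers
    ),
})

func main() {
    checkoutLatency.Observe(0.083) // record an 83ms checkout
}
```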
Some systems (like Prometheus) also offer 'summaries' that calculate precise quantiles client-side. Histograms calculate approximate quantiles server-side. Histograms are generally preferred because they can be aggregated across instances, while summaries cannot. The tradeoff: histograms require good bucket planning, while summaries provide exact quantiles but limited aggregation.
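For reference, here is a minimal client_golang summary sketch; the metric name and objectives are illustrative assumptions:

```go
package main

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// A client-side summary: each key in Objectives is a quantile to track,
// and the value is the allowed error in that quantile's rank.
var requestLatencySummary = promauto.NewSummary(prometheus.SummaryOpts{
    Name: "http_request_duration_summary_seconds",
    Help: "Request latency with client-side quantile estimation",
    Objectives: map[float64]float64{
        0.5:  0.05,  // p50, tracked to within ±0.05 of the target rank
        0.9:  0.01,  // p90
        0.99: 0.001, // p99
    },
})

func main() {
    requestLatencySummary.Observe(0.042) // record a 42ms request
}
```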
One of the most powerful features of histograms is the ability to estimate percentiles (quantiles). The p99 latency—the value below which 99% of observations fall—is a critical SLO metric that histograms enable.
The histogram_quantile Function:
In Prometheus, you calculate percentiles using histogram_quantile(). This function uses linear interpolation between bucket boundaries:
```promql
# Calculate p99 latency over the last 5 minutes
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Calculate p50 (median) latency
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# Calculate p99 grouped by endpoint
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
```
How Linear Interpolation Works:
Suppose you have buckets at 100ms and 250ms, 90% of observations fall below 100ms, and 98% fall below 250ms. The p95 must fall in the 100-250ms bucket, so linear interpolation estimates it at 100ms + (0.95 − 0.90) / (0.98 − 0.90) × (250ms − 100ms) ≈ 194ms.
This is an estimate—the true p95 could be anywhere in that range.
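The following Go sketch implements that interpolation directly; it is a simplification for illustration, not the exact algorithm Prometheus uses:

```go
package main

import "fmt"

// estimateQuantile sketches the interpolation behind histogram_quantile():
// find the first bucket whose cumulative fraction reaches q, then interpolate
// linearly between that bucket's lower and upper bounds.
func estimateQuantile(q float64, bounds, cumFrac []float64) float64 {
    for i, upper := range bounds {
        if cumFrac[i] >= q {
            lower, prevFrac := 0.0, 0.0
            if i > 0 {
                lower, prevFrac = bounds[i-1], cumFrac[i-1]
            }
            // How far into this bucket's width we must go to reach q.
            return lower + (upper-lower)*(q-prevFrac)/(cumFrac[i]-prevFrac)
        }
    }
    // q falls in the +Inf bucket: there is no upper bound to interpolate toward.
    return bounds[len(bounds)-1]
}

func main() {
    // Buckets at 100ms and 250ms; 90% of observations ≤ 100ms, 98% ≤ 250ms.
    p95 := estimateQuantile(0.95, []float64{0.100, 0.250}, []float64{0.90, 0.98})
    fmt.Printf("estimated p95 ≈ %.0f ms\n", p95*1000) // ≈ 194 ms
}
```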
| Percentile | Also Known As | Meaning | When to Use |
|---|---|---|---|
| p50 | Median | 50% of observations are faster | Understand typical user experience |
| p90 | — | 90% of observations are faster | Identify where the long tail begins |
| p95 | — | 95% of observations are faster | Balance between typical and worst-case |
| p99 | Two nines | 99% of observations are faster | SLO target for latency-sensitive services |
| p99.9 | Three nines | 99.9% of observations are faster | Ultra-premium tier or financial services |
Aggregating Histograms Correctly
One of histogram's superpowers is aggregation across instances. Unlike summaries (which calculate client-side quantiles), you can combine histogram buckets and then calculate quantiles:
```promql
# CORRECT: Sum buckets first, then calculate percentile
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# WRONG: Averaging percentiles is statistically invalid
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
```
Why is averaging percentiles wrong? Consider two instances: Instance A serves 10 requests per second with a p99 of 100ms, while Instance B serves 1,000 requests per second with a p99 of 500ms.
Averaging gives 300ms, but Instance B dominates the traffic. The true aggregate p99 is much closer to 500ms.
Important: Always aggregate the underlying buckets BEFORE calculating the percentile.
Histogram percentile calculations are estimates based on bucket boundaries. If your buckets are too coarse, your percentile estimates will be inaccurate. For critical SLOs, ensure you have buckets close to your target thresholds. A p99 SLO of 100ms should have buckets at 50ms, 75ms, 100ms, 125ms, and 150ms to get accurate measurement.
Selecting the appropriate metric type is one of the first and most important decisions in instrumentation design. Here's a systematic decision framework:
Use a counter when you are counting discrete events and care about rates, a gauge when the current value itself is meaningful and can move in either direction, and a histogram when you need the distribution of values, not just an average:
| Question | Counter | Gauge | Histogram |
|---|---|---|---|
| Does the value only increase? | ✅ Yes | ❌ | N/A |
| Can the value decrease? | ❌ | ✅ Yes | N/A |
| Do you need rates/throughput? | ✅ Ideal | ⚠️ Possible | N/A |
| Is the absolute value important? | ❌ Usually not | ✅ Yes | N/A |
| Do you need percentiles? | ❌ | ❌ | ✅ Yes |
| Do you need distribution info? | ❌ | ❌ | ✅ Yes |
| Is the value discrete events? | ✅ Ideal | ❌ | N/A |
| Is it a point-in-time snapshot? | ❌ | ✅ Yes | N/A |
It's often valuable to expose the same measurement as multiple metric types. For request handling, you might expose: 'http_requests_total' (counter), 'http_requests_in_flight' (gauge), and 'http_request_duration_seconds' (histogram). Each answers different questions about the same phenomenon.
Even experienced engineers make metric type mistakes: using a counter for a value that can decrease, assuming a gauge captures what happened between samples, averaging percentiles instead of aggregating buckets first, and picking bucket boundaries that hide the distribution you care about. The sections above show how to avoid each of these.
The Mixed Metric Pattern:
A sophisticated pattern uses multiple metric types together. For request tracking:
- `http_requests_started_total` (counter) → calculate the start rate
- `http_requests_completed_total` (counter) → calculate the completion rate
- `http_requests_in_flight` (gauge) → current active requests
- `http_request_duration_seconds` (histogram) → latency distribution

As a cross-check (absent counter resets), `http_requests_in_flight` should stay approximately equal to `http_requests_started_total - http_requests_completed_total`.
This multi-metric approach provides comprehensive visibility and cross-validation opportunities.
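A minimal sketch of this pattern as Go middleware follows; the metric names match the list above, while the handler wiring, port, and bucket choice are illustrative assumptions:

```go
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    started = promauto.NewCounter(prometheus.CounterOpts{
        Name: "http_requests_started_total",
        Help: "Requests that have begun processing",
    })
    completed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "http_requests_completed_total",
        Help: "Requests that have finished processing",
    })
    inFlight = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "http_requests_in_flight",
        Help: "Requests currently being processed",
    })
    duration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "Request latency distribution",
        Buckets: prometheus.DefBuckets,
    })
)

// instrument wires all four metrics into a single middleware.
func instrument(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        started.Inc()
        inFlight.Inc()
        start := time.Now()
        defer func() {
            duration.Observe(time.Since(start).Seconds())
            inFlight.Dec()
            completed.Inc()
        }()
        next.ServeHTTP(w, r)
    })
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.Handle("/", instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })))
    http.ListenAndServe(":8080", nil)
}
```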
Histograms multiply cardinality. A histogram with 10 buckets and labels {method, endpoint, status} where you have 5 methods × 100 endpoints × 5 status codes = 2500 label combinations × 12 series per histogram (10 buckets + sum + count) = 30,000 time series from a single metric. Design labels carefully.
Metric types are the foundation of observability. Choosing the right type enables powerful analysis; choosing the wrong type creates constant friction with your tools.
What's Next:
Now that you understand the fundamental metric types, the next page explores Prometheus architecture—the most widely adopted metrics collection system. You'll learn how Prometheus's pull-based model, time-series database, and powerful query language work together to make metrics collection practical at scale.
You now understand the three fundamental metric types and when to use each. This knowledge forms the vocabulary of observability—every metric you create or query will be classified by these types. Next, we'll explore how Prometheus operationalizes these concepts at scale.