If observability is about understanding system behavior, metrics are the quantitative dimension of that understanding. While logs tell you what happened and traces tell you where it happened, metrics tell you how much and how fast—the numerical heartbeat of your system.
Metrics transform ephemeral system behavior into measurable data points that can be aggregated, visualized, alerted upon, and analyzed over time. A well-instrumented class emits metrics that answer questions like: How many requests per second? What's the 99th percentile latency? How many errors occurred? What's the current queue depth?
This page explores the art and science of designing classes with built-in metric emission—creating monitoring hooks that provide genuine operational insight without overwhelming operators with noise.
By the end of this page, you will master the four fundamental metric types (counters, gauges, timers, histograms), understand when to use each, learn to design classes with built-in monitoring hooks, and develop strategies for metric naming, tagging, and cardinality management.
All metrics in modern monitoring systems derive from four fundamental types. Understanding their semantics and appropriate use cases is essential for effective instrumentation.
| Metric Type | Description | Examples | Mathematical Property |
|---|---|---|---|
| Counter | Monotonically increasing value that only goes up (or resets to zero) | Requests processed, errors occurred, bytes sent | Only increment (+) and reset operations allowed |
| Gauge | Point-in-time value that can go up or down | Current queue depth, active connections, memory usage | Arbitrary set, increment, decrement operations |
| Timer | Measures duration of events | Request latency, database query time, API call duration | Records duration values into histograms |
| Histogram | Captures distribution of values across buckets | Request sizes, latency percentiles, batch sizes | Aggregates values into configurable buckets |
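To make the semantics in the table concrete, here is a minimal in-memory sketch of the four types using only plain Java. These are teaching illustrations of the mathematical properties, not a replacement for a metrics library such as Micrometer; all class names here are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Counter: monotonically increasing, only increment or reset allowed.
class Counter {
    private final AtomicLong value = new AtomicLong();
    void increment() { value.incrementAndGet(); } // only goes up...
    long get() { return value.get(); }
    void reset() { value.set(0); }                // ...or back to zero on restart
}

// Gauge: a point-in-time value, sampled from its source on every read.
class Gauge {
    private final LongSupplier source;
    Gauge(LongSupplier source) { this.source = source; }
    long get() { return source.getAsLong(); }
}

// Histogram: aggregates observed values into configurable buckets.
class Histogram {
    private final long[] upperBounds; // ascending bucket upper bounds
    private final long[] counts;      // last slot acts as the +Inf bucket
    Histogram(long... upperBounds) {
        this.upperBounds = upperBounds;
        this.counts = new long[upperBounds.length + 1];
    }
    void record(long value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) { counts[i]++; return; }
        }
        counts[upperBounds.length]++;
    }
    long countInBucket(int i) { return counts[i]; }
}

// Timer: records event durations into a histogram.
class TimerMetric {
    private final Histogram durationsMs;
    TimerMetric(Histogram durationsMs) { this.durationsMs = durationsMs; }
    void record(Runnable work) {
        long start = System.nanoTime();
        try { work.run(); }
        finally { durationsMs.record((System.nanoTime() - start) / 1_000_000); }
    }
}
```

Note how the timer is defined in terms of the histogram: in most real systems a timer is simply a histogram whose observed values are durations.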
Counters are the simplest and most common metric type. They track cumulative totals of events and only increase (or reset to zero on restart).
When to Use Counters:
Key Properties:
- Rates are derived at query time: `rate(counter[interval])` or `increase(counter[interval])` convert the raw cumulative total into a per-interval rate
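The rate derivation above is worth seeing in miniature. The sketch below (an assumption-laden illustration, not Prometheus's actual implementation) shows how a backend turns two consecutive counter samples into a per-second rate, including the standard handling for a restart: if the current sample is lower than the previous one, the counter must have reset to zero, so the current value is itself the delta.

```java
// Sketch: deriving a per-second rate from two counter scrapes.
final class CounterRate {
    static double perSecond(long previous, long current, double intervalSeconds) {
        // A counter can only go up; a drop means the process restarted
        // and the counter reset, so everything since the reset counts.
        long delta = (current >= previous) ? current - previous : current;
        return delta / intervalSeconds;
    }
}
```

This is why persisting counter values across restarts is unnecessary: the query layer already compensates for resets.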
```java
// Counter usage patterns
public class PaymentService {
    private final MeterRegistry registry;
    private final Counter paymentsProcessed;
    private final Counter paymentAmountTotal;

    public PaymentService(MeterRegistry registry) {
        this.registry = registry;
        // Basic event counter
        this.paymentsProcessed = registry.counter("payments.processed");
        // Counter tracking a cumulative value (total money processed)
        this.paymentAmountTotal = registry.counter("payments.amount.total",
            "currency", "USD");
    }

    public PaymentResult processPayment(PaymentRequest request) {
        try {
            PaymentResult result = executePayment(request);
            paymentsProcessed.increment();
            paymentAmountTotal.increment(request.getAmount().doubleValue());
            return result;
        } catch (PaymentException e) {
            // Tagging failure with reason enables filtering; counters are
            // immutable once registered, so resolve tags at the call site
            registry.counter("payments.failed",
                "service", "payment-service",
                "reason", e.getErrorCode()
            ).increment();
            throw e;
        }
    }
}
```

Never decrement a counter—if you need to track a value that goes up and down, use a gauge. Never use counters for point-in-time values like 'current queue size'. Reset handling is automatic; don't try to persist counter values across restarts.
Consistent metric naming is crucial for discoverability, understandability, and automation. A well-designed naming scheme allows operators to find metrics intuitively and enables programmatic queries across related metrics.
Golden Rules of Metric Naming:
- Pick one separator convention and stick to it (e.g., `http_requests_total` or `http.requests.total`)
- Include the unit in the name: `request_duration_seconds`, `message_size_bytes`, `queue_depth_messages`
- Prefer descriptive, scoped names: `http_server_request_duration_seconds` > `request_time`
- Group related metrics with shared prefixes: `http_server_*`, `http_client_*`
- Suffix counters with `_total`: `http_requests_total`

| Bad Name | Good Name | Why Better |
|---|---|---|
| `reqTime` | `http_request_duration_seconds` | Clear domain, descriptive, includes unit |
| `errors` | `http_requests_failed_total` | Specific to domain, indicates counter |
| `poolSize` | `db_connection_pool_size_connections` | Indicates resource type and unit |
| `cache` | `cache_hit_ratio` | Describes what's being measured |
| `memory` | `jvm_memory_used_bytes` | Scoped, specific, unit included |
Follow this pattern: <namespace>_<subsystem>_<metric>_<unit>. Examples: myapp_database_query_duration_seconds, myapp_http_requests_total, myapp_cache_entries_count. The namespace prevents collisions with library metrics.
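A small helper can enforce this pattern mechanically so bad names never reach the registry. The class below is a hypothetical sketch (the name `MetricName` and its validation regex are assumptions, not part of any metrics library); it assembles `<namespace>_<subsystem>_<metric>_<unit>` names and rejects components that break the lowercase-with-underscores convention.

```java
// Hypothetical helper enforcing the <namespace>_<subsystem>_<metric>_<unit> pattern.
final class MetricName {
    private static final java.util.regex.Pattern COMPONENT =
        java.util.regex.Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    static String of(String namespace, String subsystem, String metric, String unit) {
        for (String part : new String[] {namespace, subsystem, metric, unit}) {
            if (!COMPONENT.matcher(part).matches()) {
                throw new IllegalArgumentException("invalid name component: " + part);
            }
        }
        return namespace + "_" + subsystem + "_" + metric + "_" + unit;
    }
}
```

Centralizing name construction like this also makes it trivial to change the separator convention in one place later.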
Tags (also called labels or dimensions) add contextual dimensions to metrics, enabling filtering and grouping. However, tags are a double-edged sword—each unique combination of tag values creates a new time series, and cardinality explosion can overwhelm monitoring systems.
Understanding Cardinality:
Cardinality is the number of unique time series a metric creates. For a metric with tags {method, status, endpoint}:
- Bounded tags—say 10 methods × 5 statuses × 100 endpoint templates—yield about 5,000 time series ✓
- Adding `{user_id}` with 1M users = 5,000,000,000 time series ❌

This is catastrophic. Storage costs explode, query performance degrades, and monitoring systems collapse.
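The arithmetic above is simple enough to automate as a pre-flight check before adding a tag. This is a minimal sketch (the class and threshold names are assumptions); the 10,000-series guideline comes from the rule of thumb later in this section.

```java
// Back-of-envelope cardinality check before adding a new tag:
// new total series = existing series x unique values of the new tag.
final class CardinalityCheck {
    static long estimate(long existingSeries, long newTagUniqueValues) {
        // multiplyExact throws on overflow instead of silently wrapping
        return Math.multiplyExact(existingSeries, newTagUniqueValues);
    }
    static boolean isSafe(long estimatedSeries) {
        return estimatedSeries <= 10_000; // per-metric guideline
    }
}
```

Running the numbers from the example: 5,000 existing series times 1,000,000 user IDs is five billion series, which fails the check by six orders of magnitude.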
```java
// Cardinality-safe metric design
public class ApiMetrics {
    private static final Logger log = LoggerFactory.getLogger(ApiMetrics.class);
    private final MeterRegistry registry;

    public ApiMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // GOOD: Bounded tag values
    public void recordRequest(String method, int statusCode, String endpoint) {
        // Normalize endpoint to template form
        // e.g., "/users/12345/orders" -> "/users/{id}/orders"
        String normalizedEndpoint = normalizeEndpoint(endpoint);

        registry.counter("http.requests.total",
            "method", method,
            "status", String.valueOf(statusCode),
            "endpoint", normalizedEndpoint
        ).increment();
    }

    // BAD: Unbounded tags - DO NOT DO THIS
    public void recordRequestBad(String userId, String requestId) {
        // This creates a new time series for every user and request!
        registry.counter("http.requests.total",
            "user_id", userId,       // ❌ Millions of values
            "request_id", requestId  // ❌ Unique per request = infinite!
        ).increment();
    }

    // GOOD: Categorize unbounded values into buckets
    public void recordErrorWithReason(Exception e) {
        // Categorize error types, don't use full messages
        String errorCategory = categorizeError(e);

        registry.counter("errors.total",
            "type", e.getClass().getSimpleName(),
            "category", errorCategory // "validation", "timeout", "unknown"
        ).increment();

        // Full error details go to logs, not metrics
        log.error("Request failed", kv("errorMessage", e.getMessage()));
    }

    private String normalizeEndpoint(String endpoint) {
        // Replace path parameters with placeholders
        return endpoint
            .replaceAll("/users/\\d+", "/users/{id}")
            .replaceAll("/orders/[a-f0-9-]+", "/orders/{orderId}");
    }

    private String categorizeError(Exception e) {
        if (e instanceof ValidationException) return "validation";
        if (e instanceof TimeoutException) return "timeout";
        if (e instanceof AuthenticationException) return "auth";
        return "unknown";
    }
}
```

Before adding a tag, estimate: (existing series) × (new tag unique values) = (new total series). If this exceeds 10,000 for a single metric, reconsider. If it exceeds 100,000, you almost certainly have a design problem. Use logs for high-cardinality data.
To maintain clean separation between business logic and instrumentation, classes should expose well-defined monitoring hooks—interfaces that observability infrastructure can plug into without coupling to specific monitoring systems.
```java
/**
 * Monitoring hook interface for order processing.
 * Implementations connect to specific monitoring systems (Prometheus, Datadog, etc.)
 */
public interface OrderProcessingMonitor {
    // Event hooks - called at key lifecycle points
    void onOrderReceived(OrderRequest request);
    void onOrderProcessingStarted(String orderId);
    void onOrderProcessingCompleted(String orderId, Duration duration);
    void onOrderProcessingFailed(String orderId, Exception error);

    // State hooks - called to report current state
    void reportQueueDepth(int depth);
    void reportActiveProcessors(int count);
}

/**
 * Prometheus implementation of order monitoring
 */
public class PrometheusOrderMonitor implements OrderProcessingMonitor {
    private final MeterRegistry registry;
    private final Counter ordersReceived;
    private final Counter ordersCompleted;
    private final Timer processingDuration;
    // Micrometer gauges observe a state object; hold the state, not the Gauge
    private final AtomicInteger queueDepth;
    private final AtomicInteger activeProcessors;

    public PrometheusOrderMonitor(MeterRegistry registry) {
        this.registry = registry;
        this.ordersReceived = registry.counter("orders.received.total");
        this.ordersCompleted = registry.counter("orders.completed.total");
        this.processingDuration = registry.timer("orders.processing.duration");
        this.queueDepth = registry.gauge("orders.queue.depth", new AtomicInteger());
        this.activeProcessors = registry.gauge("orders.processors.active", new AtomicInteger());
    }

    @Override
    public void onOrderReceived(OrderRequest request) {
        ordersReceived.increment();
    }

    @Override
    public void onOrderProcessingStarted(String orderId) {
        // Could track in-flight orders if needed
    }

    @Override
    public void onOrderProcessingCompleted(String orderId, Duration duration) {
        ordersCompleted.increment();
        processingDuration.record(duration);
    }

    @Override
    public void onOrderProcessingFailed(String orderId, Exception error) {
        // Counters are immutable once registered; resolve the error tag per call
        registry.counter("orders.failed.total",
            "error", error.getClass().getSimpleName()).increment();
    }

    @Override
    public void reportQueueDepth(int depth) {
        queueDepth.set(depth);
    }

    @Override
    public void reportActiveProcessors(int count) {
        activeProcessors.set(count);
    }
}

/**
 * No-op implementation for testing or when monitoring is disabled
 */
public class NoOpOrderMonitor implements OrderProcessingMonitor {
    @Override public void onOrderReceived(OrderRequest request) {}
    @Override public void onOrderProcessingStarted(String orderId) {}
    @Override public void onOrderProcessingCompleted(String orderId, Duration d) {}
    @Override public void onOrderProcessingFailed(String orderId, Exception e) {}
    @Override public void reportQueueDepth(int depth) {}
    @Override public void reportActiveProcessors(int count) {}
}

/**
 * Order processor using monitoring hooks
 */
public class OrderProcessor {
    private final OrderProcessingMonitor monitor;
    private final Queue<Order> orderQueue;

    public OrderProcessor(OrderProcessingMonitor monitor) {
        this.monitor = monitor;
        this.orderQueue = new ConcurrentLinkedQueue<>();
    }

    public void submitOrder(OrderRequest request) {
        monitor.onOrderReceived(request);
        orderQueue.offer(new Order(request));
        monitor.reportQueueDepth(orderQueue.size());
    }

    public void processNextOrder() {
        Order order = orderQueue.poll();
        if (order == null) return;

        monitor.reportQueueDepth(orderQueue.size());
        monitor.onOrderProcessingStarted(order.getId());
        Instant start = Instant.now();
        try {
            doProcessOrder(order);
            monitor.onOrderProcessingCompleted(order.getId(),
                Duration.between(start, Instant.now()));
        } catch (Exception e) {
            monitor.onOrderProcessingFailed(order.getId(), e);
            throw e;
        }
    }
}
```

Rather than inventing metrics ad-hoc, mature organizations adopt standardized metric methodologies that ensure consistent coverage. Two widely adopted methods are RED (for services) and USE (for resources).
RED is ideal for request-driven services (APIs, microservices). Every service should emit these three metric types.
Example: For an API endpoint:
- Rate: `http_requests_total` (counter)
- Errors: `http_requests_failed_total` (counter)
- Duration: `http_request_duration_seconds` (histogram)

USE is ideal for resources (queues, pools, caches). Every resource should emit these three metric types.
Example: For a connection pool:
- Utilization: `pool_connections_used / pool_connections_max`
- Saturation: `pool_wait_queue_size` (gauge)
- Errors: `pool_connection_errors_total` (counter)
```java
/**
 * RED method implementation for a service
 */
public class ServiceREDMetrics {
    private final Counter requestsTotal;
    private final Counter errorsTotal;
    private final Timer duration;

    public ServiceREDMetrics(MeterRegistry registry, String serviceName) {
        this.requestsTotal = registry.counter("service.requests.total",
            "service", serviceName);
        this.errorsTotal = registry.counter("service.errors.total",
            "service", serviceName);
        this.duration = Timer.builder("service.request.duration.seconds")
            .tag("service", serviceName)
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }

    public <T> T time(Supplier<T> operation) {
        requestsTotal.increment();
        try {
            return duration.record(operation);
        } catch (Exception e) {
            errorsTotal.increment();
            throw e;
        }
    }
}

/**
 * USE method implementation for a resource (e.g., connection pool)
 */
public class ResourceUSEMetrics {
    private final MeterRegistry registry;
    private final String resourceName;

    public ResourceUSEMetrics(
            MeterRegistry registry,
            String resourceName,
            Supplier<Double> utilizationFn,
            Supplier<Integer> saturationFn) {
        this.registry = registry;
        this.resourceName = resourceName;

        // Utilization: 0.0 to 1.0 representing capacity used
        Gauge.builder("resource.utilization", utilizationFn::get)
            .tag("resource", resourceName)
            .register(registry);

        // Saturation: queue depth or backpressure indicator
        Gauge.builder("resource.saturation", () -> saturationFn.get().doubleValue())
            .tag("resource", resourceName)
            .register(registry);
    }

    // Errors: count of resource-related errors
    public void recordError(String errorType) {
        registry.counter("resource.errors.total",
            "resource", resourceName,
            "type", errorType).increment();
    }
}
```

Google's SRE book extends RED with the 'Four Golden Signals': Latency, Traffic, Errors, and Saturation. For most services, implementing RED + saturation monitoring provides comprehensive visibility. Start with these before adding domain-specific metrics.
Metrics are only valuable when visualized and analyzed. When designing class instrumentation, anticipate how metrics will be aggregated, queried, and displayed on dashboards.
- Design for aggregation: `sum(http_requests_total)` gives total requests across all instances
- Include an `instance` or `pod` tag for drill-down to specific instances
```promql
# Dashboard Query Examples (PromQL)

## Service Request Rate (per second, aggregated across all instances)
sum(rate(http_requests_total{service="order-service"}[5m]))

## Error Rate (percentage of requests that failed)
sum(rate(http_requests_failed_total{service="order-service"}[5m]))
  / sum(rate(http_requests_total{service="order-service"}[5m])) * 100

## P99 Latency (99th percentile across all instances)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m])))

## Connection Pool Utilization (per instance, then averaged)
avg(db_pool_connections_used / db_pool_connections_max)

## Active Orders by Status (using tags)
sum by (status) (orders_active{service="order-service"})

## Saturation Alert (queue backing up)
sum(request_queue_depth) > 100
```

Design metrics with dashboards in mind. For each metric, know: (1) How will this be visualized? (2) What alerting thresholds apply? (3) How does it aggregate across instances? If you can't answer these questions, reconsider whether you need the metric.
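The P99 query above relies on bucket interpolation. To demystify it, here is a hedged sketch of the core idea behind `histogram_quantile`: find the bucket containing the target rank in the cumulative distribution, then interpolate linearly within that bucket. This is a simplified teaching version, not Prometheus's exact implementation (which also handles `+Inf` buckets and rate-adjusted counts).

```java
// Simplified bucket interpolation in the spirit of histogram_quantile().
final class HistogramQuantile {
    // upperBounds ascending; counts are per-bucket observations (not cumulative)
    static double estimate(double q, double[] upperBounds, long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double rank = q * total;           // position of the quantile in the CDF
        long cumulative = 0;
        double lower = 0;                  // lower edge of the current bucket
        for (int i = 0; i < counts.length; i++) {
            if (cumulative + counts[i] >= rank) {
                // Linear interpolation inside the bucket that holds the rank
                double fraction = (rank - cumulative) / counts[i];
                return lower + (upperBounds[i] - lower) * fraction;
            }
            cumulative += counts[i];
            lower = upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }
}
```

This also explains a practical consequence: quantile accuracy is bounded by bucket width, so bucket boundaries should be chosen around the latencies you actually alert on.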
We've covered the fundamental techniques for designing classes with built-in metric emission and monitoring hooks. Let's consolidate the key takeaways:
- Name metrics consistently using the `service_operation_metric_unit` pattern

What's Next:
With metrics mastered, we'll explore Tracing Considerations—how to design classes that participate in distributed tracing, propagate context across service boundaries, and enable end-to-end request visibility in microservice architectures.
You now understand how to design classes with comprehensive metric emission, from choosing the right metric types to managing cardinality and creating reusable monitoring hook interfaces. These skills form the quantitative foundation of observable systems.