If observability is about understanding system behavior, metrics are the quantitative dimension of that understanding. While logs tell you what happened and traces tell you where it happened, metrics tell you how much and how fast—the numerical heartbeat of your system.
Metrics transform ephemeral system behavior into measurable data points that can be aggregated, visualized, alerted upon, and analyzed over time. A well-instrumented class emits metrics that answer questions like: How many requests per second? What's the 99th percentile latency? How many errors occurred? What's the current queue depth?
This page explores the art and science of designing classes with built-in metric emission—creating monitoring hooks that provide genuine operational insight without overwhelming operators with noise.
By the end of this page, you will master the four fundamental metric types (counters, gauges, timers, histograms), understand when to use each, learn to design classes with built-in monitoring hooks, and develop strategies for metric naming, tagging, and cardinality management.
All metrics in modern monitoring systems derive from four fundamental types. Understanding their semantics and appropriate use cases is essential for effective instrumentation.
| Metric Type | Description | Examples | Mathematical Property |
|---|---|---|---|
| Counter | Monotonically increasing value that only goes up (or resets to zero) | Requests processed, errors occurred, bytes sent | Only increment (+) and reset operations allowed |
| Gauge | Point-in-time value that can go up or down | Current queue depth, active connections, memory usage | Arbitrary set, increment, decrement operations |
| Timer | Measures duration of events | Request latency, database query time, API call duration | Records duration values into histograms |
| Histogram | Captures distribution of values across buckets | Request sizes, latency percentiles, batch sizes | Aggregates values into configurable buckets |
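To make the semantics in the table concrete, here is a minimal in-memory sketch of the four types using only plain Java. These are teaching illustrations of the mathematical properties, not a replacement for a metrics library such as Micrometer; all class names here are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Counter: monotonically increasing, only increment or reset allowed.
class Counter {
    private final AtomicLong value = new AtomicLong();
    void increment() { value.incrementAndGet(); } // only goes up...
    long get() { return value.get(); }
    void reset() { value.set(0); }                // ...or back to zero on restart
}

// Gauge: a point-in-time value, sampled from its source on every read.
class Gauge {
    private final LongSupplier source;
    Gauge(LongSupplier source) { this.source = source; }
    long get() { return source.getAsLong(); }
}

// Histogram: aggregates observed values into configurable buckets.
class Histogram {
    private final long[] upperBounds; // ascending bucket upper bounds
    private final long[] counts;      // last slot acts as the +Inf bucket
    Histogram(long... upperBounds) {
        this.upperBounds = upperBounds;
        this.counts = new long[upperBounds.length + 1];
    }
    void record(long value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) { counts[i]++; return; }
        }
        counts[upperBounds.length]++;
    }
    long countInBucket(int i) { return counts[i]; }
}

// Timer: records event durations into a histogram.
class TimerMetric {
    private final Histogram durationsMs;
    TimerMetric(Histogram durationsMs) { this.durationsMs = durationsMs; }
    void record(Runnable work) {
        long start = System.nanoTime();
        try { work.run(); }
        finally { durationsMs.record((System.nanoTime() - start) / 1_000_000); }
    }
}
```

Note how the timer is defined in terms of the histogram: in most real systems a timer is simply a histogram whose observed values are durations.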
Counters are the simplest and most common metric type. They track cumulative totals of events and only increase (or reset to zero on restart).
When to Use Counters:
Key Properties:
- Rates are derived at query time: `rate(counter[interval])` or `increase(counter[interval])` convert the raw cumulative total into a per-interval rate
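The rate derivation above is worth seeing in miniature. The sketch below (an assumption-laden illustration, not Prometheus's actual implementation) shows how a backend turns two consecutive counter samples into a per-second rate, including the standard handling for a restart: if the current sample is lower than the previous one, the counter must have reset to zero, so the current value is itself the delta.

```java
// Sketch: deriving a per-second rate from two counter scrapes.
final class CounterRate {
    static double perSecond(long previous, long current, double intervalSeconds) {
        // A counter can only go up; a drop means the process restarted
        // and the counter reset, so everything since the reset counts.
        long delta = (current >= previous) ? current - previous : current;
        return delta / intervalSeconds;
    }
}
```

This is why persisting counter values across restarts is unnecessary: the query layer already compensates for resets.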
```java
// Counter usage patterns
public class PaymentService {
    private final MeterRegistry registry;
    private final Counter paymentsProcessed;
    private final Counter paymentAmountTotal;

    public PaymentService(MeterRegistry registry) {
        this.registry = registry;
        // Basic event counter
        this.paymentsProcessed = registry.counter("payments.processed");
        // Counter tracking a cumulative value (total money processed)
        this.paymentAmountTotal = registry.counter("payments.amount.total",
            "currency", "USD");
    }

    public PaymentResult processPayment(PaymentRequest request) {
        try {
            PaymentResult result = executePayment(request);
            paymentsProcessed.increment();
            paymentAmountTotal.increment(request.getAmount().doubleValue());
            return result;
        } catch (PaymentException e) {
            // Tagging failure with reason enables filtering; counters are
            // immutable once registered, so resolve tags at the call site
            registry.counter("payments.failed",
                "service", "payment-service",
                "reason", e.getErrorCode()
            ).increment();
            throw e;
        }
    }
}
```

Never decrement a counter—if you need to track a value that goes up and down, use a gauge. Never use counters for point-in-time values like 'current queue size'. Reset handling is automatic; don't try to persist counter values across restarts.
Consistent metric naming is crucial for discoverability, understandability, and automation. A well-designed naming scheme allows operators to find metrics intuitively and enables programmatic queries across related metrics.
Golden Rules of Metric Naming:
- Pick one separator convention and stick to it (e.g., `http_requests_total` or `http.requests.total`)
- Include the unit in the name: `request_duration_seconds`, `message_size_bytes`, `queue_depth_messages`
- Prefer descriptive, scoped names: `http_server_request_duration_seconds` > `request_time`
- Group related metrics with shared prefixes: `http_server_*`, `http_client_*`
- Suffix counters with `_total`: `http_requests_total`

| Bad Name | Good Name | Why Better |
|---|---|---|
| `reqTime` | `http_request_duration_seconds` | Clear domain, descriptive, includes unit |
| `errors` | `http_requests_failed_total` | Specific to domain, indicates counter |
| `poolSize` | `db_connection_pool_size_connections` | Indicates resource type and unit |
| `cache` | `cache_hit_ratio` | Describes what's being measured |
| `memory` | `jvm_memory_used_bytes` | Scoped, specific, unit included |
Follow this pattern: <namespace>_<subsystem>_<metric>_<unit>. Examples: myapp_database_query_duration_seconds, myapp_http_requests_total, myapp_cache_entries_count. The namespace prevents collisions with library metrics.
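A small helper can enforce this pattern mechanically so bad names never reach the registry. The class below is a hypothetical sketch (the name `MetricName` and its validation regex are assumptions, not part of any metrics library); it assembles `<namespace>_<subsystem>_<metric>_<unit>` names and rejects components that break the lowercase-with-underscores convention.

```java
// Hypothetical helper enforcing the <namespace>_<subsystem>_<metric>_<unit> pattern.
final class MetricName {
    private static final java.util.regex.Pattern COMPONENT =
        java.util.regex.Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    static String of(String namespace, String subsystem, String metric, String unit) {
        for (String part : new String[] {namespace, subsystem, metric, unit}) {
            if (!COMPONENT.matcher(part).matches()) {
                throw new IllegalArgumentException("invalid name component: " + part);
            }
        }
        return namespace + "_" + subsystem + "_" + metric + "_" + unit;
    }
}
```

Centralizing name construction like this also makes it trivial to change the separator convention in one place later.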
Tags (also called labels or dimensions) add contextual dimensions to metrics, enabling filtering and grouping. However, tags are a double-edged sword—each unique combination of tag values creates a new time series, and cardinality explosion can overwhelm monitoring systems.
Understanding Cardinality:
Cardinality is the number of unique time series a metric creates. For a metric with tags {method, status, endpoint}:
- Bounded tags—say 10 methods × 5 statuses × 100 endpoint templates—yield about 5,000 time series ✓
- Adding `{user_id}` with 1M users = 5,000,000,000 time series ❌

This is catastrophic. Storage costs explode, query performance degrades, and monitoring systems collapse.
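The arithmetic above is simple enough to automate as a pre-flight check before adding a tag. This is a minimal sketch (the class and threshold names are assumptions); the 10,000-series guideline comes from the rule of thumb later in this section.

```java
// Back-of-envelope cardinality check before adding a new tag:
// new total series = existing series x unique values of the new tag.
final class CardinalityCheck {
    static long estimate(long existingSeries, long newTagUniqueValues) {
        // multiplyExact throws on overflow instead of silently wrapping
        return Math.multiplyExact(existingSeries, newTagUniqueValues);
    }
    static boolean isSafe(long estimatedSeries) {
        return estimatedSeries <= 10_000; // per-metric guideline
    }
}
```

Running the numbers from the example: 5,000 existing series times 1,000,000 user IDs is five billion series, which fails the check by six orders of magnitude.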
```java
// Cardinality-safe metric design
public class ApiMetrics {
    private static final Logger log = LoggerFactory.getLogger(ApiMetrics.class);
    private final MeterRegistry registry;

    public ApiMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // GOOD: Bounded tag values
    public void recordRequest(String method, int statusCode, String endpoint) {
        // Normalize endpoint to template form
        // e.g., "/users/12345/orders" -> "/users/{id}/orders"
        String normalizedEndpoint = normalizeEndpoint(endpoint);

        registry.counter("http.requests.total",
            "method", method,
            "status", String.valueOf(statusCode),
            "endpoint", normalizedEndpoint
        ).increment();
    }

    // BAD: Unbounded tags - DO NOT DO THIS
    public void recordRequestBad(String userId, String requestId) {
        // This creates a new time series for every user and request!
        registry.counter("http.requests.total",
            "user_id", userId,       // ❌ Millions of values
            "request_id", requestId  // ❌ Unique per request = infinite!
        ).increment();
    }

    // GOOD: Categorize unbounded values into buckets
    public void recordErrorWithReason(Exception e) {
        // Categorize error types, don't use full messages
        String errorCategory = categorizeError(e);

        registry.counter("errors.total",
            "type", e.getClass().getSimpleName(),
            "category", errorCategory // "validation", "timeout", "unknown"
        ).increment();

        // Full error details go to logs, not metrics
        log.error("Request failed", kv("errorMessage", e.getMessage()));
    }

    private String normalizeEndpoint(String endpoint) {
        // Replace path parameters with placeholders
        return endpoint
            .replaceAll("/users/\\d+", "/users/{id}")
            .replaceAll("/orders/[a-f0-9-]+", "/orders/{orderId}");
    }

    private String categorizeError(Exception e) {
        if (e instanceof ValidationException) return "validation";
        if (e instanceof TimeoutException) return "timeout";
        if (e instanceof AuthenticationException) return "auth";
        return "unknown";
    }
}
```

Before adding a tag, estimate: (existing series) × (new tag unique values) = (new total series). If this exceeds 10,000 for a single metric, reconsider. If it exceeds 100,000, you almost certainly have a design problem. Use logs for high-cardinality data.
To maintain clean separation between business logic and instrumentation, classes should expose well-defined monitoring hooks—interfaces that observability infrastructure can plug into without coupling to specific monitoring systems.
```java
/**
 * Monitoring hook interface for order processing.
 * Implementations connect to specific monitoring systems (Prometheus, Datadog, etc.)
 */
public interface OrderProcessingMonitor {
    // Event hooks - called at key lifecycle points
    void onOrderReceived(OrderRequest request);
    void onOrderProcessingStarted(String orderId);
    void onOrderProcessingCompleted(String orderId, Duration duration);
    void onOrderProcessingFailed(String orderId, Exception error);

    // State hooks - called to report current state
    void reportQueueDepth(int depth);
    void reportActiveProcessors(int count);
}

/**
 * Prometheus implementation of order monitoring
 */
public class PrometheusOrderMonitor implements OrderProcessingMonitor {
    private final MeterRegistry registry;
    private final Counter ordersReceived;
    private final Counter ordersCompleted;
    private final Timer processingDuration;
    // Micrometer gauges observe a state object; hold the state, not the Gauge
    private final AtomicInteger queueDepth;
    private final AtomicInteger activeProcessors;

    public PrometheusOrderMonitor(MeterRegistry registry) {
        this.registry = registry;
        this.ordersReceived = registry.counter("orders.received.total");
        this.ordersCompleted = registry.counter("orders.completed.total");
        this.processingDuration = registry.timer("orders.processing.duration");
        this.queueDepth = registry.gauge("orders.queue.depth", new AtomicInteger());
        this.activeProcessors = registry.gauge("orders.processors.active", new AtomicInteger());
    }

    @Override
    public void onOrderReceived(OrderRequest request) {
        ordersReceived.increment();
    }

    @Override
    public void onOrderProcessingStarted(String orderId) {
        // Could track in-flight orders if needed
    }

    @Override
    public void onOrderProcessingCompleted(String orderId, Duration duration) {
        ordersCompleted.increment();
        processingDuration.record(duration);
    }

    @Override
    public void onOrderProcessingFailed(String orderId, Exception error) {
        // Counters are immutable once registered; resolve the error tag per call
        registry.counter("orders.failed.total",
            "error", error.getClass().getSimpleName()).increment();
    }

    @Override
    public void reportQueueDepth(int depth) {
        queueDepth.set(depth);
    }

    @Override
    public void reportActiveProcessors(int count) {
        activeProcessors.set(count);
    }
}

/**
 * No-op implementation for testing or when monitoring is disabled
 */
public class NoOpOrderMonitor implements OrderProcessingMonitor {
    @Override public void onOrderReceived(OrderRequest request) {}
    @Override public void onOrderProcessingStarted(String orderId) {}
    @Override public void onOrderProcessingCompleted(String orderId, Duration d) {}
    @Override public void onOrderProcessingFailed(String orderId, Exception e) {}
    @Override public void reportQueueDepth(int depth) {}
    @Override public void reportActiveProcessors(int count) {}
}

/**
 * Order processor using monitoring hooks
 */
public class OrderProcessor {
    private final OrderProcessingMonitor monitor;
    private final Queue<Order> orderQueue;

    public OrderProcessor(OrderProcessingMonitor monitor) {
        this.monitor = monitor;
        this.orderQueue = new ConcurrentLinkedQueue<>();
    }

    public void submitOrder(OrderRequest request) {
        monitor.onOrderReceived(request);
        orderQueue.offer(new Order(request));
        monitor.reportQueueDepth(orderQueue.size());
    }

    public void processNextOrder() {
        Order order = orderQueue.poll();
        if (order == null) return;

        monitor.reportQueueDepth(orderQueue.size());
        monitor.onOrderProcessingStarted(order.getId());
        Instant start = Instant.now();
        try {
            doProcessOrder(order);
            monitor.onOrderProcessingCompleted(order.getId(),
                Duration.between(start, Instant.now()));
        } catch (Exception e) {
            monitor.onOrderProcessingFailed(order.getId(), e);
            throw e;
        }
    }
}
```

Rather than inventing metrics ad-hoc, mature organizations adopt standardized metric methodologies that ensure consistent coverage. Two widely adopted methods are RED (for services) and USE (for resources).
RED is ideal for request-driven services (APIs, microservices). Every service should emit these three metric types.
Example: For an API endpoint:
- Rate: `http_requests_total` (counter)
- Errors: `http_requests_failed_total` (counter)
- Duration: `http_request_duration_seconds` (histogram)

USE is ideal for resources (queues, pools, caches). Every resource should emit these three metric types.
Example: For a connection pool:
- Utilization: `pool_connections_used / pool_connections_max`
- Saturation: `pool_wait_queue_size` (gauge)
- Errors: `pool_connection_errors_total` (counter)
```java
/**
 * RED method implementation for a service
 */
public class ServiceREDMetrics {
    private final Counter requestsTotal;
    private final Counter errorsTotal;
    private final Timer duration;

    public ServiceREDMetrics(MeterRegistry registry, String serviceName) {
        this.requestsTotal = registry.counter("service.requests.total",
            "service", serviceName);
        this.errorsTotal = registry.counter("service.errors.total",
            "service", serviceName);
        this.duration = Timer.builder("service.request.duration.seconds")
            .tag("service", serviceName)
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }

    public <T> T time(Supplier<T> operation) {
        requestsTotal.increment();
        try {
            return duration.record(operation);
        } catch (Exception e) {
            errorsTotal.increment();
            throw e;
        }
    }
}

/**
 * USE method implementation for a resource (e.g., connection pool)
 */
public class ResourceUSEMetrics {
    private final MeterRegistry registry;
    private final String resourceName;

    public ResourceUSEMetrics(
            MeterRegistry registry,
            String resourceName,
            Supplier<Double> utilizationFn,
            Supplier<Integer> saturationFn) {
        this.registry = registry;
        this.resourceName = resourceName;

        // Utilization: 0.0 to 1.0 representing capacity used
        Gauge.builder("resource.utilization", utilizationFn::get)
            .tag("resource", resourceName)
            .register(registry);

        // Saturation: queue depth or backpressure indicator
        Gauge.builder("resource.saturation", () -> saturationFn.get().doubleValue())
            .tag("resource", resourceName)
            .register(registry);
    }

    // Errors: count of resource-related errors
    public void recordError(String errorType) {
        registry.counter("resource.errors.total",
            "resource", resourceName,
            "type", errorType).increment();
    }
}
```

Google's SRE book extends RED with the 'Four Golden Signals': Latency, Traffic, Errors, and Saturation. For most services, implementing RED + saturation monitoring provides comprehensive visibility. Start with these before adding domain-specific metrics.
Metrics are only valuable when visualized and analyzed. When designing class instrumentation, anticipate how metrics will be aggregated, queried, and displayed on dashboards.
- Design for aggregation: `sum(http_requests_total)` gives total requests across all instances
- Include an `instance` or `pod` tag for drill-down to specific instances
```promql
# Dashboard Query Examples (PromQL)

## Service Request Rate (per second, aggregated across all instances)
sum(rate(http_requests_total{service="order-service"}[5m]))

## Error Rate (percentage of requests that failed)
sum(rate(http_requests_failed_total{service="order-service"}[5m]))
  / sum(rate(http_requests_total{service="order-service"}[5m])) * 100

## P99 Latency (99th percentile across all instances)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m])))

## Connection Pool Utilization (per instance, then averaged)
avg(db_pool_connections_used / db_pool_connections_max)

## Active Orders by Status (using tags)
sum by (status) (orders_active{service="order-service"})

## Saturation Alert (queue backing up)
sum(request_queue_depth) > 100
```

Design metrics with dashboards in mind. For each metric, know: (1) How will this be visualized? (2) What alerting thresholds apply? (3) How does it aggregate across instances? If you can't answer these questions, reconsider whether you need the metric.
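The P99 query above relies on bucket interpolation. To demystify it, here is a hedged sketch of the core idea behind `histogram_quantile`: find the bucket containing the target rank in the cumulative distribution, then interpolate linearly within that bucket. This is a simplified teaching version, not Prometheus's exact implementation (which also handles `+Inf` buckets and rate-adjusted counts).

```java
// Simplified bucket interpolation in the spirit of histogram_quantile().
final class HistogramQuantile {
    // upperBounds ascending; counts are per-bucket observations (not cumulative)
    static double estimate(double q, double[] upperBounds, long[] counts) {
        long total = 0;
        for (long c : counts) total += c;
        double rank = q * total;           // position of the quantile in the CDF
        long cumulative = 0;
        double lower = 0;                  // lower edge of the current bucket
        for (int i = 0; i < counts.length; i++) {
            if (cumulative + counts[i] >= rank) {
                // Linear interpolation inside the bucket that holds the rank
                double fraction = (rank - cumulative) / counts[i];
                return lower + (upperBounds[i] - lower) * fraction;
            }
            cumulative += counts[i];
            lower = upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }
}
```

This also explains a practical consequence: quantile accuracy is bounded by bucket width, so bucket boundaries should be chosen around the latencies you actually alert on.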
We've covered the fundamental techniques for designing classes with built-in metric emission and monitoring hooks. Let's consolidate the key takeaways:
- Name metrics consistently using the `service_operation_metric_unit` pattern

What's Next:
With metrics mastered, we'll explore Tracing Considerations—how to design classes that participate in distributed tracing, propagate context across service boundaries, and enable end-to-end request visibility in microservice architectures.
You now understand how to design classes with comprehensive metric emission, from choosing the right metric types to managing cardinality and creating reusable monitoring hook interfaces. These skills form the quantitative foundation of observable systems.