We've examined each pillar of observability—metrics, logs, and traces—in isolation. Each provides valuable but incomplete visibility into your systems. Metrics show you that something is happening; logs explain why it happened; traces show you where it happened across services.
But the true power of observability emerges when these three signals work together.
Imagine an incident response scenario: An alert fires because your error rate metric crossed a threshold. You open the dashboard and see the spike—but which errors? You query logs filtered by the error timeframe and find stack traces—but from which requests? You find a trace ID in a log, open it in your tracing tool, and see the complete request path—now you understand exactly which service failed, for which requests, and why.
This integrated workflow is modern observability. Each signal answers different questions and guides you to the next. Mastering observability means understanding how to navigate between them fluidly.
By the end of this page, you will understand how to connect metrics, logs, and traces into a cohesive observability strategy. You'll learn correlation techniques, integrated debugging workflows, the role of exemplars, and how to build dashboards and alerts that leverage all three signals.
Understanding why all three pillars are necessary starts with recognizing what each does well—and poorly. No single signal type provides complete visibility; each has blind spots that the others fill.
| Signal | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Metrics | Low cost, aggregatable, alertable, trend analysis, SLO tracking | Lacks context, can't answer 'why', limited cardinality | Dashboards, alerts, capacity planning, SLOs |
| Logs | Rich context, arbitrary data, searchable, audit trail | Expensive at scale, no request flow, hard to aggregate | Debugging, forensics, compliance, error details |
| Traces | Shows request flow, latency breakdown, cross-service visibility | Sampling loses data, complex setup, high overhead | Latency investigation, dependency mapping, error localization |
The key insight is that each signal type answers a different question:
Metrics answer 'WHAT is the system's state?' — Error rate is 5%. Latency p99 is 800ms. CPU is at 80%.
Logs answer 'WHY did something happen?' — NullPointerException at UserService.java:142. Payment gateway returned 'insufficient_funds'. Config reload failed due to syntax error.
Traces answer 'WHERE in the system did it happen?' — The request started at API Gateway, slowed in Order Service (30ms → 800ms), specifically in the inventory check step calling the Warehouse API.
Effective incident response requires all three. The alert (metrics) tells you to look. The trace shows you where. The logs explain why.
When something goes wrong, you're always asking: What (is the symptom?), Where (is it happening?), and Why (is it happening?). Metrics answer What, Traces answer Where, Logs answer Why. Observability expertise is knowing which signal to consult for each question.
For the three pillars to work together, they must be correlated—connected so you can navigate from one to another. The primary mechanism for correlation is shared identifiers, particularly the trace ID.
The trace ID is the universal connector:
When the same trace ID appears in your spans, in your structured log entries, and in your metric exemplars, you can move seamlessly between all three signals for any given request.
```javascript
// Complete correlation: Trace ID in logs, metrics, and traces
import { trace, context, metrics } from '@opentelemetry/api';
import pino from 'pino';

// Create a logger that includes trace context automatically
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        trace_id: spanContext.traceId,
        span_id: spanContext.spanId,
      };
    }
    return {};
  },
});

// Create a histogram; the metrics SDK can attach exemplars from the active trace
const meter = metrics.getMeter('order-service');
const orderLatencyHistogram = meter.createHistogram('order_processing_duration', {
  description: 'Order processing duration in milliseconds',
  unit: 'ms',
});

async function processOrder(order) {
  const startTime = Date.now();

  try {
    // Log with automatic trace context
    // This log will include trace_id and span_id automatically
    logger.info({
      order_id: order.id,
      user_id: order.userId,
      msg: 'Processing order',
    });

    const result = await doOrderProcessing(order);

    // Record the metric; passing the active Context lets the SDK attach an
    // exemplar linking this data point to the current trace (when exemplar
    // collection is enabled in the metrics SDK)
    const duration = Date.now() - startTime;
    orderLatencyHistogram.record(duration, {
      status: 'success',
      payment_method: order.paymentMethod,
    }, context.active());

    logger.info({
      order_id: order.id,
      duration_ms: duration,
      msg: 'Order processed successfully',
    });

    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    orderLatencyHistogram.record(duration, {
      status: 'error',
      error_type: error.constructor.name,
    }, context.active());

    // Log error with full context
    logger.error({
      err: error,
      order_id: order.id,
      duration_ms: duration,
      msg: 'Order processing failed',
    });

    throw error;
  }
}

// Result: For any given order:
// - Trace shows the request flow through services
// - Logs show detailed events (searchable by trace_id)
// - Metrics include exemplars pointing to example traces
```

Exemplars: Connecting metrics to traces
Exemplars are the bridge from aggregate metrics to individual traces. When Prometheus records a histogram bucket, an exemplar stores the trace ID of an example request in that bucket.
For instance, when your p99 latency is 800ms, an exemplar answers: 'Here's a specific request that took 800ms—click to see its trace.'
```text
# Prometheus metric with exemplar

# Standard histogram bucket
http_request_duration_seconds_bucket{le="0.5"} 24054

# Histogram bucket with exemplar
http_request_duration_seconds_bucket{le="1"} 129389 # {trace_id="abc123"} 0.987

# The exemplar data:
# - trace_id="abc123" identifies a specific trace
# - 0.987 is the actual observed value for that trace
# - This request took 987ms and fell into the ≤1s bucket

# In Grafana, clicking on the p99 latency can take you directly
# to the Jaeger/Tempo trace for "abc123"
```

Beyond trace IDs, other shared identifiers enhance correlation: request_id (for non-traced requests), user_id (for user journey analysis), deployment_id (for release correlation), host/pod name (for infrastructure correlation). The more shared context across signals, the easier navigation becomes.
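Correlation like this is usually wired in at the edge of a service. As a minimal sketch (the attribute names such as request.id and deployment.id below are illustrative conventions, not values prescribed by this course), the same identifiers can be attached to the active span and to a request-scoped logger in one place:

```typescript
// Sketch: attach the same shared identifiers to the active span and to a
// request-scoped logger, so traces and logs can be joined on any of them.
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const baseLogger = pino();

// Illustrative shape; in practice these values come from the request,
// the auth layer, and the deployment environment.
interface SharedIdentifiers {
  requestId: string;     // useful when a request was not traced (sampled out)
  userId?: string;       // user journey analysis
  deploymentId: string;  // release correlation
  podName: string;       // infrastructure correlation
}

function loggerWithSharedContext(ids: SharedIdentifiers) {
  // Tag the active span so the identifiers are visible in the trace backend
  trace.getActiveSpan()?.setAttributes({
    'request.id': ids.requestId,
    'deployment.id': ids.deploymentId,
    'host.pod_name': ids.podName,
    ...(ids.userId ? { 'user.id': ids.userId } : {}),
  });

  // A child logger carries the same fields on every log line it emits
  return baseLogger.child({
    request_id: ids.requestId,
    user_id: ids.userId,
    deployment_id: ids.deploymentId,
    pod_name: ids.podName,
  });
}

// Usage inside a request handler:
// const logger = loggerWithSharedContext({ requestId, deploymentId, podName });
// logger.info({ msg: 'cart loaded' }); // searchable by request_id, pod_name, ...
```

Because the span attributes and the log fields come from the same object, a search on any one identifier lands you in the same request on either side.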
Let's walk through a complete incident response using all three pillars. This represents the ideal observability workflow that you should strive to enable in your systems.
Scenario: Support receives reports that order processing is slow. Users are waiting 30+ seconds for checkout to complete.
Step 1: Alert triggers (Metrics)
Before user reports even arrive, an alert fires:
```yaml
# Prometheus alert rule
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(checkout_duration_seconds_bucket[5m])) by (le)
          ) > 10
        for: 5m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout p95 latency exceeds 10 seconds"
          dashboard: "https://grafana.example.com/d/checkout-performance"
          runbook: "https://wiki.example.com/runbooks/checkout-latency"
```

Step 2: Examine dashboard (Metrics)
You open the linked dashboard and see the checkout p95 latency climbing well past the 10-second threshold.
Step 3: Find example traces (Metrics → Traces)
The latency histogram has exemplars. You click on a data point in the spike. It opens a trace with ID 4bf92f3577b34da6...
Step 4: Analyze the trace (Traces)
```text
# Trace waterfall view (simplified)
Timeline: 0ms                                                    15000ms
          |-------------------------------------------------------|

[api-gateway] POST /checkout
├── [auth-service] validate_token           |==| 45ms
├── [cart-service] get_cart                 |===| 120ms
├── [inventory-service] check_availability  |===| 85ms
└── [payment-service] process_payment       |================================| 14200ms ⚠️
    └── [payment-service] call_stripe_api   |================================| 14150ms ⚠️
        └── [event] Retrying after timeout...
        └── [event] Retrying after timeout...
        └── [event] Retry succeeded

# The trace reveals:
# - payment-service is the bottleneck (14.2s of the 15s total)
# - Specifically, the Stripe API call
# - There were multiple retries due to timeouts
```

Step 5: Investigate with logs (Traces → Logs)
You copy the trace ID and query logs:
```text
# Query logs for this trace
{service="payment-service"} | json | trace_id = "4bf92f3577b34da6..."

# Results:
2024-01-08T14:32:15.123Z WARN trace_id=4bf92f3577b34da6 msg="Stripe API timeout, retrying"
    attempt=1 timeout_ms=5000 endpoint="https://api.stripe.com/v1/charges"

2024-01-08T14:32:20.456Z WARN trace_id=4bf92f3577b34da6 msg="Stripe API timeout, retrying"
    attempt=2 timeout_ms=5000 endpoint="https://api.stripe.com/v1/charges"

2024-01-08T14:32:25.789Z INFO trace_id=4bf92f3577b34da6 msg="Stripe API call succeeded after retries"
    attempt=3 total_duration_ms=14150

# The logs reveal:
# - The Stripe API is timing out
# - Retries eventually succeed
# - This explains the ~14s latency (two 5s timeouts plus a final ~4s successful attempt)
```

Step 6: Verify scope and impact (Back to Metrics)
You check if this is happening to all requests or just some:
A quick metrics query filtered by the stripe_api_status=timeout label shows whether the timeouts affect all checkout requests or only a subset, and how long they have been occurring.

Step 7: Resolution

With the root cause identified (Stripe API calls timing out and being retried), the team can mitigate the immediate latency and follow up with the payment provider.
The complete workflow:
Alert (Metrics) → Dashboard (Metrics) → Exemplar (Metrics→Traces) → Trace waterfall → Trace ID (Traces→Logs) → Log details → Verification (Logs→Metrics)
With proper correlation and tooling, what used to take hours of grepping through log files and guessing can now take minutes. The key is having all three signals available and connected, with UI tools that make navigation between them seamless.
The workflow described above requires tools that understand all three signals and can navigate between them. Modern observability platforms provide this unified experience.
The table below compares popular platforms and how each covers metrics, logs, and traces:
| Platform | Metrics | Logs | Traces | Key Strength |
|---|---|---|---|---|
| Grafana Stack | Prometheus/Mimir | Loki | Tempo | Open source, flexible, cost-efficient |
| Elastic Observability | Elastic APM | Elasticsearch | Elastic APM | Full-text log search, single backend |
| Datadog | Datadog Metrics | Datadog Logs | Datadog APM | Polished UX, ML-powered insights |
| Honeycomb | Metrics (limited) | Events | Traces | High-cardinality analysis, BubbleUp |
| Splunk Observability | SignalFx | Splunk | SignalFx | Enterprise features, Splunk ecosystem |
| New Relic | NRDB | NRDB | New Relic APM | All-in-one, straightforward pricing |
| AWS Native | CloudWatch | CloudWatch Logs | X-Ray | AWS integration, serverless-friendly |
The Grafana stack as an example:
Grafana has become a popular choice for unified observability because it integrates multiple specialized backends: Prometheus (or Mimir) for metrics, Loki for logs, and Tempo for traces.
All three integrate in Grafana dashboards with cross-linking capabilities.
```text
# Grafana Explore: Navigating between signals

# Step 1: Start with metrics query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
# Dashboard shows error rate spike for 'checkout-service'

# Step 2: Click on exemplar from the graph
# Opens Tempo trace view for trace_id "abc123..."

# Step 3: From trace view, click "Logs for this span"
# Auto-generates Loki query:
{service="checkout-service"} | json | trace_id="abc123..."

# Step 4: Find root cause in logs
# "Database connection pool exhausted"

# Step 5: Create ad-hoc dashboard panel
# Combining metrics and logs in same view

# Panel 1: Error rate metric (Prometheus)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Panel 2: Error logs count (Loki)
sum(count_over_time({level="error"}[1m])) by (service)

# Both panels correlate automatically by service label
```

OpenTelemetry provides the vendor-neutral data model that makes unified observability possible. When all three signals use OTel's semantic conventions, tools can correlate them automatically. This is why the industry has converged on OpenTelemetry as the foundation for modern observability.
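As a minimal sketch of what that looks like in code (assuming the @opentelemetry/sdk-node and @opentelemetry/resources packages; exact constructor and option names vary between SDK versions), a single resource definition gives every signal the same service.name and deployment attributes, which is what lets a backend join them without extra configuration:

```typescript
// Sketch: one shared Resource for every signal the SDK emits.
// Attribute keys follow OTel semantic conventions; values here are illustrative.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';

const resource = new Resource({
  'service.name': 'checkout-service',
  'service.version': process.env.GIT_SHA ?? 'dev',
  'deployment.environment': process.env.DEPLOY_ENV ?? 'production',
});

// Traces and metrics exported through this SDK (and logs, if routed through
// an OTel log bridge) all carry these resource attributes, so tools can
// correlate them on service.name automatically.
const sdk = new NodeSDK({ resource });
sdk.start();
```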
Effective dashboards leverage multiple signal types to tell a complete story. Rather than having separate 'metrics dashboard' and 'logs dashboard,' design dashboards that integrate all three.
Dashboard design principles:
Example: Service health dashboard structure
```json
// Service Health Dashboard Structure
{
  "title": "Checkout Service Health",
  "templating": {
    "variables": [
      { "name": "service", "default": "checkout-service" },
      { "name": "environment", "default": "production" }
    ]
  },
  "rows": [
    {
      // ROW 1: Golden Signals Overview
      "title": "Overview",
      "panels": [
        { "type": "stat", "title": "Request Rate",
          "query": "rate(http_requests_total{service=\"$service\"}[5m])" },
        { "type": "stat", "title": "Error Rate",
          "query": "rate(http_requests_total{service=\"$service\",status=~\"5..\"}[5m])" },
        { "type": "stat", "title": "P95 Latency",
          "query": "histogram_quantile(0.95, rate(http_request_duration_bucket{service=\"$service\"}[5m]))" },
        { "type": "stat", "title": "Saturation (CPU)",
          "query": "avg(container_cpu_usage_seconds_total{container=\"$service\"})" }
      ]
    },
    {
      // ROW 2: Time Series with Exemplars
      "title": "Latency Over Time",
      "panels": [
        {
          "type": "timeseries",
          "title": "Latency Distribution",
          "query": "histogram_quantile(0.5|0.95|0.99, rate(http_request_duration_bucket{service=\"$service\"}[5m]))",
          "options": {
            "exemplars": true,          // Show exemplar points
            "exemplarLinkTo": "tempo"   // Click opens Tempo trace
          }
        }
      ]
    },
    {
      // ROW 3: Errors - Combined Metrics and Logs
      "title": "Error Analysis",
      "panels": [
        {
          "type": "timeseries",
          "title": "Error Rate by Status Code",
          "query": "sum(rate(http_requests_total{service=\"$service\",status=~\"[45]..\"}[5m])) by (status)"
        },
        {
          "type": "logs",
          "title": "Recent Errors",
          "datasource": "loki",
          "query": "{service=\"$service\"} | json | level = \"ERROR\"",
          "options": { "showLabels": ["trace_id", "error_type"] }
        }
      ]
    },
    {
      // ROW 4: Traces for Slow Requests
      "title": "Slow Requests",
      "panels": [
        {
          "type": "traces",
          "title": "Slowest Traces (Last Hour)",
          "datasource": "tempo",
          "query": "{resource.service.name=\"$service\"} | status = \"OK\" | duration > 1s",
          "options": { "limit": 10 }
        }
      ]
    },
    {
      // ROW 5: Deployments and Events Correlation
      "title": "Events Timeline",
      "panels": [
        {
          "type": "annotations",
          "title": "Deployments, Alerts, Incidents",
          "sources": [
            { "type": "deployment",
              "query": "kube_deployment_status_replicas_updated{deployment=\"$service\"}" },
            { "type": "alert", "query": "ALERTS{service=\"$service\"}" }
          ]
        }
      ]
    }
  ]
}
```

Use dashboard variables (service, environment, time range) that apply across all panels regardless of data source. This ensures your metrics, logs, and traces panels all filter to the same context, making correlation automatic.
Alerts are typically based on metrics—thresholds, anomalies, or SLO burn rates. But effective alerting incorporates all three signals to reduce noise and accelerate response.
Multi-signal alerting patterns:
````yaml
# Alertmanager template with multi-signal enrichment

# Alert configuration
groups:
  - name: checkout-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
           / sum(rate(http_requests_total{service="checkout"}[5m]))) > 0.05
        for: 2m
        labels:
          severity: critical
          service: checkout
        annotations:
          summary: "Checkout service error rate above 5%"
          description: 'Error rate is {{ $value | printf "%.2f" }}%'
          # Link to dashboard with all signals
          dashboard: "https://grafana.example.com/d/checkout?from=now-1h"
          # Link to error logs
          logs_query: |
            https://grafana.example.com/explore?datasource=loki&
            expr={service="checkout"} |= "error" | json
          # Link to recent traces with errors
          traces_query: |
            https://grafana.example.com/explore?datasource=tempo&
            query={service.name="checkout" status=error}
          # Recent error samples (from recording rule)
          recent_errors: "{{ $labels.recent_error_sample }}"

      # Recording rule to capture recent error messages
      - record: checkout:recent_errors:sample
        expr: |
          # This would typically be done via a log-to-metric pipeline
          # showing recent error types for inclusion in alerts

---
# Notification template
{{ define "slack.custom.message" }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
*Status:* {{ .Status }}
*Duration:* {{ .StartsAt | since }}

📊 *Dashboard:* {{ .Annotations.dashboard }}
📜 *Logs:* {{ .Annotations.logs_query }}
🔍 *Traces:* {{ .Annotations.traces_query }}

*Recent Errors:*
```
{{ .Annotations.recent_errors }}
```

*Runbook:* {{ .Annotations.runbook }}
{{ end }}
````

SLO-based alerting with multi-signal context:
The most effective alerting strategy focuses on SLOs—Service Level Objectives that represent what actually matters to users. When an SLO is burning (consuming error budget), alerts should provide immediate access to all signals for diagnosis.
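As a rough sketch of the underlying arithmetic (the 14.4x threshold and window pairing below follow the widely cited multi-window, multi-burn-rate pattern from SRE practice; they are assumptions, not values from this course), burn rate is simply the observed error rate divided by the error budget the SLO allows:

```typescript
// Sketch: multi-window burn-rate check for an availability SLO.
// A 99.9% SLO leaves an error budget of 0.1% of requests.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;      // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / errorBudget; // 1.0 means the budget burns exactly on schedule
}

// A fast window catches sudden breakage; a slow window confirms it is
// sustained. Both must exceed the threshold before paging.
function shouldPage(
  errorRate1h: number,
  errorRate5m: number,
  sloTarget = 0.999,
  threshold = 14.4, // ~2% of a 30-day budget burned in one hour
): boolean {
  return (
    burnRate(errorRate1h, sloTarget) > threshold &&
    burnRate(errorRate5m, sloTarget) > threshold
  );
}

// Example: a sustained 2-3% error rate against a 99.9% SLO is a 20-30x burn
// rate, well above the paging threshold.
console.log(shouldPage(0.02, 0.03)); // true
```

In practice this check is expressed as a Prometheus alert over two rate() windows, and the alert's annotations should link to the pre-filtered dashboard, trace, and log views shown above.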
A well-designed alert should enable resolution within 5 clicks: 1) Read alert → 2) Click dashboard link → 3) Identify anomaly → 4) Click trace exemplar → 5) Understand root cause from trace + linked logs. Each click provides progressive detail, all pre-filtered to the right context.
Even with all three pillars in place, teams often fail to achieve effective observability due to common mistakes. The most fundamental is collecting all three signals without connecting them:
Many organizations believe they have observability because they have metrics AND logs AND traces. But without correlation, they have three separate monitoring systems that happen to coexist. True observability is the ability to ask any question about your system and answer it quickly—this requires integration, not just presence.
We've now covered the three pillars of observability—Metrics, Logs, and Traces—and how they work together. Let's consolidate everything:
The observability mindset:
Observability is not just about having tools; it is about being able to ask any question about your system's behavior and get an answer quickly. The three pillars, properly integrated, give you this capability: you can see what is degrading (metrics), where in the request path it happens (traces), and why it happens (logs).
When you can answer questions like these in minutes instead of hours, you have achieved true observability.
You now have a comprehensive understanding of observability's three pillars and how they work together. You understand metrics for quantitative measurement, logs for event context, traces for request paths, and most importantly—how to correlate and navigate between them for effective incident response and system understanding.
Next in Chapter 27:
With the foundational understanding of observability's three pillars complete, the next modules will dive deeper into practical implementation: metrics collection with Prometheus, distributed tracing systems, logging at scale, alerting design, and building effective dashboards.