We've examined each pillar of observability—metrics, logs, and traces—in isolation. Each provides valuable but incomplete visibility into your systems. Metrics show you that something is happening; logs explain why it happened; traces show you where it happened across services.
But the true power of observability emerges when these three signals work together.
Imagine an incident response scenario: An alert fires because your error rate metric crossed a threshold. You open the dashboard and see the spike—but which errors? You query logs filtered by the error timeframe and find stack traces—but from which requests? You find a trace ID in a log, open it in your tracing tool, and see the complete request path—now you understand exactly which service failed, for which requests, and why.
This integrated workflow is modern observability. Each signal answers different questions and guides you to the next. Mastering observability means understanding how to navigate between them fluidly.
By the end of this page, you will understand how to connect metrics, logs, and traces into a cohesive observability strategy. You'll learn correlation techniques, integrated debugging workflows, the role of exemplars, and how to build dashboards and alerts that leverage all three signals.
Understanding why all three pillars are necessary starts with recognizing what each does well—and poorly. No single signal type provides complete visibility; each has blind spots that the others fill.
| Signal | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Metrics | Low cost, aggregatable, alertable, trend analysis, SLO tracking | Lacks context, can't answer 'why', limited cardinality | Dashboards, alerts, capacity planning, SLOs |
| Logs | Rich context, arbitrary data, searchable, audit trail | Expensive at scale, no request flow, hard to aggregate | Debugging, forensics, compliance, error details |
| Traces | Shows request flow, latency breakdown, cross-service visibility | Sampling loses data, complex setup, high overhead | Latency investigation, dependency mapping, error localization |
The key insight is that each signal type answers a different question:
Metrics answer 'WHAT is the system's state?' — Error rate is 5%. Latency p99 is 800ms. CPU is at 80%.
Logs answer 'WHY did something happen?' — NullPointerException at UserService.java:142. Payment gateway returned 'insufficient_funds'. Config reload failed due to syntax error.
Traces answer 'WHERE in the system did it happen?' — The request started at API Gateway, slowed in Order Service (30ms → 800ms), specifically in the inventory check step calling the Warehouse API.
Effective incident response requires all three. The alert (metrics) tells you to look. The trace shows you where. The logs explain why.
When something goes wrong, you're always asking: What (is the symptom?), Where (is it happening?), and Why (is it happening?). Metrics answer What, Traces answer Where, Logs answer Why. Observability expertise is knowing which signal to consult for each question.
For the three pillars to work together, they must be correlated—connected so you can navigate from one to another. The primary mechanism for correlation is shared identifiers, particularly the trace ID.
The trace ID is the universal connector:
When the same trace ID appears in your spans, in your structured log entries, and in your metric exemplars, you can move seamlessly between all three signals for any given request.
```javascript
// Complete correlation: Trace ID in logs, metrics, and traces
import { trace, context, metrics } from '@opentelemetry/api';
import pino from 'pino';

// Create a logger that includes trace context automatically
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        trace_id: spanContext.traceId,
        span_id: spanContext.spanId,
      };
    }
    return {};
  },
});

// Create a histogram; the metrics SDK can attach exemplars from the active trace
const meter = metrics.getMeter('order-service');
const orderLatencyHistogram = meter.createHistogram('order_processing_duration', {
  description: 'Order processing duration in milliseconds',
  unit: 'ms',
});

async function processOrder(order) {
  const startTime = Date.now();

  try {
    // Log with automatic trace context
    // This log will include trace_id and span_id automatically
    logger.info({
      order_id: order.id,
      user_id: order.userId,
      msg: 'Processing order',
    });

    const result = await doOrderProcessing(order);

    // Record the metric; passing the active Context lets the SDK attach an
    // exemplar linking this data point to the current trace (when exemplar
    // collection is enabled in the metrics SDK)
    const duration = Date.now() - startTime;
    orderLatencyHistogram.record(duration, {
      status: 'success',
      payment_method: order.paymentMethod,
    }, context.active());

    logger.info({
      order_id: order.id,
      duration_ms: duration,
      msg: 'Order processed successfully',
    });

    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    orderLatencyHistogram.record(duration, {
      status: 'error',
      error_type: error.constructor.name,
    }, context.active());

    // Log error with full context
    logger.error({
      err: error,
      order_id: order.id,
      duration_ms: duration,
      msg: 'Order processing failed',
    });

    throw error;
  }
}

// Result: For any given order:
// - Trace shows the request flow through services
// - Logs show detailed events (searchable by trace_id)
// - Metrics include exemplars pointing to example traces
```

Exemplars: Connecting metrics to traces
Exemplars are the bridge from aggregate metrics to individual traces. When Prometheus records a histogram bucket, an exemplar stores the trace ID of an example request in that bucket.
For instance, when your p99 latency is 800ms, an exemplar answers: 'Here's a specific request that took 800ms—click to see its trace.'
```text
# Prometheus metric with exemplar

# Standard histogram bucket
http_request_duration_seconds_bucket{le="0.5"} 24054

# Histogram bucket with exemplar
http_request_duration_seconds_bucket{le="1"} 129389 # {trace_id="abc123"} 0.987

# The exemplar data:
# - trace_id="abc123" identifies a specific trace
# - 0.987 is the actual observed value for that trace
# - This request took 987ms and fell into the ≤1s bucket

# In Grafana, clicking on the p99 latency can take you directly
# to the Jaeger/Tempo trace for "abc123"
```

Beyond trace IDs, other shared identifiers enhance correlation: request_id (for non-traced requests), user_id (for user journey analysis), deployment_id (for release correlation), host/pod name (for infrastructure correlation). The more shared context across signals, the easier navigation becomes.
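Correlation like this is usually wired in at the edge of a service. As a minimal sketch (the attribute names such as request.id and deployment.id below are illustrative conventions, not values prescribed by this course), the same identifiers can be attached to the active span and to a request-scoped logger in one place:

```typescript
// Sketch: attach the same shared identifiers to the active span and to a
// request-scoped logger, so traces and logs can be joined on any of them.
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const baseLogger = pino();

// Illustrative shape; in practice these values come from the request,
// the auth layer, and the deployment environment.
interface SharedIdentifiers {
  requestId: string;     // useful when a request was not traced (sampled out)
  userId?: string;       // user journey analysis
  deploymentId: string;  // release correlation
  podName: string;       // infrastructure correlation
}

function loggerWithSharedContext(ids: SharedIdentifiers) {
  // Tag the active span so the identifiers are visible in the trace backend
  trace.getActiveSpan()?.setAttributes({
    'request.id': ids.requestId,
    'deployment.id': ids.deploymentId,
    'host.pod_name': ids.podName,
    ...(ids.userId ? { 'user.id': ids.userId } : {}),
  });

  // A child logger carries the same fields on every log line it emits
  return baseLogger.child({
    request_id: ids.requestId,
    user_id: ids.userId,
    deployment_id: ids.deploymentId,
    pod_name: ids.podName,
  });
}

// Usage inside a request handler:
// const logger = loggerWithSharedContext({ requestId, deploymentId, podName });
// logger.info({ msg: 'cart loaded' }); // searchable by request_id, pod_name, ...
```

Because the span attributes and the log fields come from the same object, a search on any one identifier lands you in the same request on either side.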
Let's walk through a complete incident response using all three pillars. This represents the ideal observability workflow that you should strive to enable in your systems.
Scenario: Support receives reports that order processing is slow. Users are waiting 30+ seconds for checkout to complete.
Step 1: Alert triggers (Metrics)
Before user reports even arrive, an alert fires:
```yaml
# Prometheus alert rule
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(checkout_duration_seconds_bucket[5m])) by (le)
          ) > 10
        for: 5m
        labels:
          severity: warning
          team: checkout
        annotations:
          summary: "Checkout p95 latency exceeds 10 seconds"
          dashboard: "https://grafana.example.com/d/checkout-performance"
          runbook: "https://wiki.example.com/runbooks/checkout-latency"
```

Step 2: Examine dashboard (Metrics)
You open the linked dashboard and see the checkout p95 latency climbing well past the 10-second threshold.
Step 3: Find example traces (Metrics → Traces)
The latency histogram has exemplars. You click on a data point in the spike. It opens a trace with ID 4bf92f3577b34da6...
Step 4: Analyze the trace (Traces)
```text
# Trace waterfall view (simplified)
Timeline: 0ms                                                    15000ms
          |-------------------------------------------------------|

[api-gateway] POST /checkout
├── [auth-service] validate_token           |==| 45ms
├── [cart-service] get_cart                 |===| 120ms
├── [inventory-service] check_availability  |===| 85ms
└── [payment-service] process_payment       |================================| 14200ms ⚠️
    └── [payment-service] call_stripe_api   |================================| 14150ms ⚠️
        └── [event] Retrying after timeout...
        └── [event] Retrying after timeout...
        └── [event] Retry succeeded

# The trace reveals:
# - payment-service is the bottleneck (14.2s of the 15s total)
# - Specifically, the Stripe API call
# - There were multiple retries due to timeouts
```

Step 5: Investigate with logs (Traces → Logs)
You copy the trace ID and query logs:
```text
# Query logs for this trace
{service="payment-service"} | json | trace_id = "4bf92f3577b34da6..."

# Results:
2024-01-08T14:32:15.123Z WARN trace_id=4bf92f3577b34da6 msg="Stripe API timeout, retrying"
    attempt=1 timeout_ms=5000 endpoint="https://api.stripe.com/v1/charges"

2024-01-08T14:32:20.456Z WARN trace_id=4bf92f3577b34da6 msg="Stripe API timeout, retrying"
    attempt=2 timeout_ms=5000 endpoint="https://api.stripe.com/v1/charges"

2024-01-08T14:32:25.789Z INFO trace_id=4bf92f3577b34da6 msg="Stripe API call succeeded after retries"
    attempt=3 total_duration_ms=14150

# The logs reveal:
# - The Stripe API is timing out
# - Retries eventually succeed
# - This explains the ~14s latency (two 5s timeouts plus a final ~4s successful attempt)
```

Step 6: Verify scope and impact (Back to Metrics)
You check if this is happening to all requests or just some:
A quick metrics query filtered by the stripe_api_status=timeout label shows whether the timeouts affect all checkout requests or only a subset, and how long they have been occurring.

Step 7: Resolution

With the root cause identified (Stripe API calls timing out and being retried), the team can mitigate the immediate latency and follow up with the payment provider.
The complete workflow:
Alert (Metrics) → Dashboard (Metrics) → Exemplar (Metrics→Traces) → Trace waterfall → Trace ID (Traces→Logs) → Log details → Verification (Logs→Metrics)
With proper correlation and tooling, what used to take hours of grepping through log files and guessing can now take minutes. The key is having all three signals available and connected, with UI tools that make navigation between them seamless.
The workflow described above requires tools that understand all three signals and can navigate between them. Modern observability platforms provide this unified experience.
The table below compares popular platforms and how each covers metrics, logs, and traces:
| Platform | Metrics | Logs | Traces | Key Strength |
|---|---|---|---|---|
| Grafana Stack | Prometheus/Mimir | Loki | Tempo | Open source, flexible, cost-efficient |
| Elastic Observability | Elastic APM | Elasticsearch | Elastic APM | Full-text log search, single backend |
| Datadog | Datadog Metrics | Datadog Logs | Datadog APM | Polished UX, ML-powered insights |
| Honeycomb | Metrics (limited) | Events | Traces | High-cardinality analysis, BubbleUp |
| Splunk Observability | SignalFx | Splunk | SignalFx | Enterprise features, Splunk ecosystem |
| New Relic | NRDB | NRDB | New Relic APM | All-in-one, straightforward pricing |
| AWS Native | CloudWatch | CloudWatch Logs | X-Ray | AWS integration, serverless-friendly |
The Grafana stack as an example:
Grafana has become a popular choice for unified observability because it integrates multiple specialized backends: Prometheus (or Mimir) for metrics, Loki for logs, and Tempo for traces.
All three integrate in Grafana dashboards with cross-linking capabilities.
```text
# Grafana Explore: Navigating between signals

# Step 1: Start with metrics query
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
# Dashboard shows error rate spike for 'checkout-service'

# Step 2: Click on exemplar from the graph
# Opens Tempo trace view for trace_id "abc123..."

# Step 3: From trace view, click "Logs for this span"
# Auto-generates Loki query:
{service="checkout-service"} | json | trace_id="abc123..."

# Step 4: Find root cause in logs
# "Database connection pool exhausted"

# Step 5: Create ad-hoc dashboard panel
# Combining metrics and logs in same view

# Panel 1: Error rate metric (Prometheus)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# Panel 2: Error logs count (Loki)
sum(count_over_time({level="error"}[1m])) by (service)

# Both panels correlate automatically by service label
```

OpenTelemetry provides the vendor-neutral data model that makes unified observability possible. When all three signals use OTel's semantic conventions, tools can correlate them automatically. This is why the industry has converged on OpenTelemetry as the foundation for modern observability.
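As a minimal sketch of what that looks like in code (assuming the @opentelemetry/sdk-node and @opentelemetry/resources packages; exact constructor and option names vary between SDK versions), a single resource definition gives every signal the same service.name and deployment attributes, which is what lets a backend join them without extra configuration:

```typescript
// Sketch: one shared Resource for every signal the SDK emits.
// Attribute keys follow OTel semantic conventions; values here are illustrative.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';

const resource = new Resource({
  'service.name': 'checkout-service',
  'service.version': process.env.GIT_SHA ?? 'dev',
  'deployment.environment': process.env.DEPLOY_ENV ?? 'production',
});

// Traces and metrics exported through this SDK (and logs, if routed through
// an OTel log bridge) all carry these resource attributes, so tools can
// correlate them on service.name automatically.
const sdk = new NodeSDK({ resource });
sdk.start();
```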
Effective dashboards leverage multiple signal types to tell a complete story. Rather than having separate 'metrics dashboard' and 'logs dashboard,' design dashboards that integrate all three.
Dashboard design principles:
Example: Service health dashboard structure
```json
// Service Health Dashboard Structure
{
  "title": "Checkout Service Health",
  "templating": {
    "variables": [
      { "name": "service", "default": "checkout-service" },
      { "name": "environment", "default": "production" }
    ]
  },
  "rows": [
    {
      // ROW 1: Golden Signals Overview
      "title": "Overview",
      "panels": [
        { "type": "stat", "title": "Request Rate",
          "query": "rate(http_requests_total{service=\"$service\"}[5m])" },
        { "type": "stat", "title": "Error Rate",
          "query": "rate(http_requests_total{service=\"$service\",status=~\"5..\"}[5m])" },
        { "type": "stat", "title": "P95 Latency",
          "query": "histogram_quantile(0.95, rate(http_request_duration_bucket{service=\"$service\"}[5m]))" },
        { "type": "stat", "title": "Saturation (CPU)",
          "query": "avg(container_cpu_usage_seconds_total{container=\"$service\"})" }
      ]
    },
    {
      // ROW 2: Time Series with Exemplars
      "title": "Latency Over Time",
      "panels": [
        {
          "type": "timeseries",
          "title": "Latency Distribution",
          "query": "histogram_quantile(0.5|0.95|0.99, rate(http_request_duration_bucket{service=\"$service\"}[5m]))",
          "options": {
            "exemplars": true,          // Show exemplar points
            "exemplarLinkTo": "tempo"   // Click opens Tempo trace
          }
        }
      ]
    },
    {
      // ROW 3: Errors - Combined Metrics and Logs
      "title": "Error Analysis",
      "panels": [
        {
          "type": "timeseries",
          "title": "Error Rate by Status Code",
          "query": "sum(rate(http_requests_total{service=\"$service\",status=~\"[45]..\"}[5m])) by (status)"
        },
        {
          "type": "logs",
          "title": "Recent Errors",
          "datasource": "loki",
          "query": "{service=\"$service\"} | json | level = \"ERROR\"",
          "options": { "showLabels": ["trace_id", "error_type"] }
        }
      ]
    },
    {
      // ROW 4: Traces for Slow Requests
      "title": "Slow Requests",
      "panels": [
        {
          "type": "traces",
          "title": "Slowest Traces (Last Hour)",
          "datasource": "tempo",
          "query": "{resource.service.name=\"$service\"} | status = \"OK\" | duration > 1s",
          "options": { "limit": 10 }
        }
      ]
    },
    {
      // ROW 5: Deployments and Events Correlation
      "title": "Events Timeline",
      "panels": [
        {
          "type": "annotations",
          "title": "Deployments, Alerts, Incidents",
          "sources": [
            { "type": "deployment",
              "query": "kube_deployment_status_replicas_updated{deployment=\"$service\"}" },
            { "type": "alert", "query": "ALERTS{service=\"$service\"}" }
          ]
        }
      ]
    }
  ]
}
```

Use dashboard variables (service, environment, time range) that apply across all panels regardless of data source. This ensures your metrics, logs, and traces panels all filter to the same context, making correlation automatic.
Alerts are typically based on metrics—thresholds, anomalies, or SLO burn rates. But effective alerting incorporates all three signals to reduce noise and accelerate response.
Multi-signal alerting patterns:
````yaml
# Alertmanager template with multi-signal enrichment

# Alert configuration
groups:
  - name: checkout-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
           / sum(rate(http_requests_total{service="checkout"}[5m]))) > 0.05
        for: 2m
        labels:
          severity: critical
          service: checkout
        annotations:
          summary: "Checkout service error rate above 5%"
          description: 'Error rate is {{ $value | printf "%.2f" }}%'
          # Link to dashboard with all signals
          dashboard: "https://grafana.example.com/d/checkout?from=now-1h"
          # Link to error logs
          logs_query: |
            https://grafana.example.com/explore?datasource=loki&
            expr={service="checkout"} |= "error" | json
          # Link to recent traces with errors
          traces_query: |
            https://grafana.example.com/explore?datasource=tempo&
            query={service.name="checkout" status=error}
          # Recent error samples (from recording rule)
          recent_errors: "{{ $labels.recent_error_sample }}"

      # Recording rule to capture recent error messages
      - record: checkout:recent_errors:sample
        expr: |
          # This would typically be done via a log-to-metric pipeline
          # showing recent error types for inclusion in alerts

---
# Notification template
{{ define "slack.custom.message" }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Service:* {{ .Labels.service }}
*Status:* {{ .Status }}
*Duration:* {{ .StartsAt | since }}

📊 *Dashboard:* {{ .Annotations.dashboard }}
📜 *Logs:* {{ .Annotations.logs_query }}
🔍 *Traces:* {{ .Annotations.traces_query }}

*Recent Errors:*
```
{{ .Annotations.recent_errors }}
```

*Runbook:* {{ .Annotations.runbook }}
{{ end }}
````

SLO-based alerting with multi-signal context:
The most effective alerting strategy focuses on SLOs—Service Level Objectives that represent what actually matters to users. When an SLO is burning (consuming error budget), alerts should provide immediate access to all signals for diagnosis.
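As a rough sketch of the underlying arithmetic (the 14.4x threshold and window pairing below follow the widely cited multi-window, multi-burn-rate pattern from SRE practice; they are assumptions, not values from this course), burn rate is simply the observed error rate divided by the error budget the SLO allows:

```typescript
// Sketch: multi-window burn-rate check for an availability SLO.
// A 99.9% SLO leaves an error budget of 0.1% of requests.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;      // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / errorBudget; // 1.0 means the budget burns exactly on schedule
}

// A fast window catches sudden breakage; a slow window confirms it is
// sustained. Both must exceed the threshold before paging.
function shouldPage(
  errorRate1h: number,
  errorRate5m: number,
  sloTarget = 0.999,
  threshold = 14.4, // ~2% of a 30-day budget burned in one hour
): boolean {
  return (
    burnRate(errorRate1h, sloTarget) > threshold &&
    burnRate(errorRate5m, sloTarget) > threshold
  );
}

// Example: a sustained 2-3% error rate against a 99.9% SLO is a 20-30x burn
// rate, well above the paging threshold.
console.log(shouldPage(0.02, 0.03)); // true
```

In practice this check is expressed as a Prometheus alert over two rate() windows, and the alert's annotations should link to the pre-filtered dashboard, trace, and log views shown above.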
A well-designed alert should enable resolution within 5 clicks: 1) Read alert → 2) Click dashboard link → 3) Identify anomaly → 4) Click trace exemplar → 5) Understand root cause from trace + linked logs. Each click provides progressive detail, all pre-filtered to the right context.
Even with all three pillars in place, teams often fail to achieve effective observability due to common mistakes. The most fundamental is collecting all three signals without connecting them:
Many organizations believe they have observability because they have metrics AND logs AND traces. But without correlation, they have three separate monitoring systems that happen to coexist. True observability is the ability to ask any question about your system and answer it quickly—this requires integration, not just presence.
We've now covered the three pillars of observability—Metrics, Logs, and Traces—and how they work together. Let's consolidate everything:
The observability mindset:
Observability is not just about having tools; it is about being able to ask any question about your system's behavior and get an answer quickly. The three pillars, properly integrated, give you this capability: you can see what is degrading (metrics), where in the request path it happens (traces), and why it happens (logs).
When you can answer questions like these in minutes instead of hours, you have achieved true observability.
You now have a comprehensive understanding of observability's three pillars and how they work together. You understand metrics for quantitative measurement, logs for event context, traces for request paths, and most importantly—how to correlate and navigate between them for effective incident response and system understanding.
Next in Chapter 27:
With the foundational understanding of observability's three pillars complete, the next modules will dive deeper into practical implementation: metrics collection with Prometheus, distributed tracing systems, logging at scale, alerting design, and building effective dashboards.