Imagine debugging a slow API response in a microservices architecture. The request enters through an API gateway, flows to an authentication service, queries a user service, checks an inventory service, processes payment through a payment service, sends a notification via a messaging service, and finally returns. The total latency is 3 seconds—but which service is responsible?
Your metrics show elevated latency. Your logs show individual events in each service. But neither shows you the complete journey—the actual path the request took, how long each hop took, and where the bottleneck lies.
This is the problem distributed tracing solves.
A trace is a representation of a single request's journey through a distributed system. It captures every service the request touched, the sequence of operations, the time spent in each, and the relationships between them. Traces provide the end-to-end visibility that metrics and logs alone cannot offer.
By the end of this page, you will understand distributed tracing fundamentally—the concepts of traces, spans, and context propagation. You'll learn how to instrument applications for tracing, understand sampling strategies to control costs, and appreciate how traces connect with metrics and logs to form complete observability.
Distributed tracing tracks the flow of requests as they propagate through distributed systems. Unlike logs (individual events) or metrics (aggregated numbers), traces capture the cause-and-effect relationships between operations.
Core concepts:

- Trace: the end-to-end record of a single request, identified by a trace ID shared by everything that happened on the request's behalf.
- Span: a single named, timed operation within a trace, carrying a span ID, start time, duration, status, and attributes.
- Parent/child relationships: every span except the root references a parent span ID, forming a tree that mirrors the call graph.
- Context propagation: the mechanism that carries the trace ID and current span ID across service boundaries so spans from different services join the same trace.

The structure is easiest to see in a simplified example:
```jsonc
// Example trace structure (simplified)
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "span_id": "00f067aa0ba902b7",
      "parent_span_id": null,               // Root span
      "operation_name": "HTTP GET /api/orders/123",
      "service_name": "api-gateway",
      "start_time": "2024-01-08T10:23:45.000Z",
      "duration_ms": 847,
      "status": "OK",
      "tags": {
        "http.method": "GET",
        "http.url": "/api/orders/123",
        "http.status_code": 200
      }
    },
    {
      "span_id": "01a902b7c3d4e5f6",
      "parent_span_id": "00f067aa0ba902b7",  // Child of root
      "operation_name": "authenticate_user",
      "service_name": "auth-service",
      "start_time": "2024-01-08T10:23:45.010Z",
      "duration_ms": 45,
      "status": "OK"
    },
    {
      "span_id": "02b3c4d5e6f7a8b9",
      "parent_span_id": "00f067aa0ba902b7",  // Child of root
      "operation_name": "get_order",
      "service_name": "order-service",
      "start_time": "2024-01-08T10:23:45.060Z",
      "duration_ms": 380,
      "status": "OK"
    },
    {
      "span_id": "03c4d5e6f7a8b9c0",
      "parent_span_id": "02b3c4d5e6f7a8b9",  // Child of order-service
      "operation_name": "SELECT * FROM orders",
      "service_name": "order-service",
      "start_time": "2024-01-08T10:23:45.065Z",
      "duration_ms": 350,
      "status": "OK",
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM orders WHERE id = ?"
      }
    }
  ]
}
```

Visualizing traces:
Traces are typically visualized in two ways:
Timeline view (Waterfall/Gantt chart) — Shows spans laid out horizontally by time. You can see which operations happened in parallel, which were sequential, and where time was spent.
Service graph view — Shows the topology of services involved, with edges representing calls between them and aggregated latency statistics.
The timeline view is essential for debugging individual slow requests—you immediately see that 350ms of a 380ms operation was spent in a database query. The service graph is valuable for understanding system-wide patterns and dependencies.
Modern distributed tracing traces its origins to Google's Dapper paper (2010). Dapper introduced the concepts of trace IDs, span IDs, and context propagation that underpin all modern tracing systems. Systems like Zipkin, Jaeger, and OpenTelemetry all follow this model.
For distributed tracing to work, trace context must propagate from service to service. When Service A calls Service B, it must pass along the trace ID and current span ID so that B's spans can be connected to A's.
Without context propagation, traces would break at every service boundary. You'd have disconnected tree fragments instead of a complete request journey.
How context propagates: the calling service serializes the trace context (trace ID, current span ID, and sampling flags) into carrier metadata such as HTTP headers or message properties; the receiving service extracts that context and starts its own spans as children of the caller's span.
The W3C Trace Context standard:
To ensure interoperability between different tracing systems, the W3C defined a standard trace context format. Two headers carry the essential information:
```text
# W3C Trace Context Headers
# ==========================

# traceparent: The core correlation header
# Format: version-trace_id-parent_id-flags
#
# version:   2 hex digits (currently "00")
# trace_id:  32 hex digits (16 bytes)
# parent_id: 16 hex digits (8 bytes)
# flags:     2 hex digits (sampling flags)

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Breakdown:
# - version:   00
# - trace_id:  4bf92f3577b34da6a3ce929d0e0e4736
# - parent_id: 00f067aa0ba902b7
# - flags:     01 (sampled)

# tracestate: Vendor-specific data
# Format: key=value pairs separated by commas
# Used to pass additional context without breaking compatibility

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

# Example: Datadog trace might add:
tracestate: dd=t.dm:-0
```

Implementing context propagation:
```typescript
// TypeScript example: Context propagation with OpenTelemetry

import express, { Request } from 'express';
import {
  context,
  propagation,
  trace,
  SpanKind,
  SpanStatusCode,
} from '@opentelemetry/api';

const app = express();
const tracer = trace.getTracer('order-service');

// ========== Extracting context from incoming request ==========
function extractContext(req: Request) {
  // Extract trace context (traceparent/tracestate) from incoming HTTP headers
  return propagation.extract(context.active(), req.headers, {
    get(carrier, key) {
      return carrier[key.toLowerCase()];
    },
    keys(carrier) {
      return Object.keys(carrier);
    },
  });
}

// ========== Creating a span for incoming request ==========
app.use((req, res, next) => {
  const extractedContext = extractContext(req);

  // Create a new span within the extracted context
  const span = tracer.startSpan(
    `HTTP ${req.method} ${req.path}`,
    {
      kind: SpanKind.SERVER,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.target': req.path,
      },
    },
    extractedContext
  );

  // Set the span in context for downstream code
  context.with(trace.setSpan(extractedContext, span), () => {
    res.on('finish', () => {
      span.setAttributes({
        'http.status_code': res.statusCode,
      });
      if (res.statusCode >= 400) {
        span.setStatus({ code: SpanStatusCode.ERROR });
      }
      span.end();
    });
    next();
  });
});

// ========== Propagating context to outgoing request ==========
async function callPaymentService(orderId: string) {
  return tracer.startActiveSpan('call_payment_service', async (span) => {
    try {
      // Inject trace context into outgoing headers
      const headers: Record<string, string> = {
        'Content-Type': 'application/json',
      };
      propagation.inject(context.active(), headers, {
        set(carrier, key, value) {
          carrier[key] = value;
        },
      });

      // Now 'headers' contains traceparent and tracestate
      const response = await fetch('http://payment-service/process', {
        method: 'POST',
        headers,
        body: JSON.stringify({ orderId }),
      });

      span.setAttributes({
        'http.status_code': response.status,
        'payment.service.response': response.ok,
      });

      return response.json();
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Context can be lost when crossing async boundaries, thread pools, or message queues. If you spawn a background thread or publish a message without propagating context, the trace breaks. Libraries like OpenTelemetry provide automatic instrumentation for many frameworks, but custom async patterns require explicit handling; the sketch below applies the same inject/extract pattern to message metadata.
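For example, here is a minimal sketch of carrying trace context through a message queue. It assumes a generic publish/consume API: `publish` and `handleMessage` are hypothetical stand-ins for your messaging client, while `propagation.inject`/`extract` are the same OpenTelemetry API used above.

```typescript
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

// Producer side: inject the active trace context into message headers.
// `publish` is a hypothetical messaging-client call, not a real library API.
async function publishOrderCreated(
  publish: (topic: string, payload: string, headers: Record<string, string>) => Promise<void>,
  orderId: string
) {
  await tracer.startActiveSpan(
    'publish order_created',
    { kind: SpanKind.PRODUCER },
    async (span) => {
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers); // adds traceparent/tracestate
      await publish('orders-topic', JSON.stringify({ orderId }), headers);
      span.end();
    }
  );
}

// Consumer side: extract the context from message headers so the consumer's
// span joins the producer's trace instead of starting a new one.
function handleMessage(headers: Record<string, string>, body: string) {
  const parentContext = propagation.extract(context.active(), headers);
  const span = tracer.startSpan(
    'process order_created',
    { kind: SpanKind.CONSUMER },
    parentContext
  );
  context.with(trace.setSpan(parentContext, span), () => {
    // ... business logic runs here with the trace context active ...
    span.end();
  });
}
```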
Beyond the basic timing information, spans carry attributes (tags) and events (logs) that provide rich context for debugging and analysis.
Attributes (Tags)
Attributes are key-value pairs attached to spans that describe the operation. OpenTelemetry defines semantic conventions for common attribute names, ensuring consistency across different services and languages.
| Category | Attribute | Example Value | Purpose |
|---|---|---|---|
| HTTP | http.method | GET, POST | HTTP request method |
| HTTP | http.url | https://api.example.com/users | Full URL of request |
| HTTP | http.status_code | 200, 404, 500 | Response status code |
| HTTP | http.route | /users/:id | Route template (not actual path) |
| Database | db.system | postgresql, mysql, redis | Database type |
| Database | db.statement | SELECT * FROM users | Query (may be sanitized) |
| Database | db.operation | SELECT, INSERT, UPDATE | Operation type |
| Messaging | messaging.system | kafka, rabbitmq | Messaging system |
| Messaging | messaging.destination | orders-topic | Queue or topic name |
| RPC | rpc.system | grpc, jsonrpc | RPC framework |
| RPC | rpc.service | PaymentService | Remote service name |
| RPC | rpc.method | ProcessPayment | Remote method called |
| Error | exception.type | NullPointerException | Exception class name |
| Error | exception.message | Value cannot be null | Exception message |
| Custom | user.id | user_12345 | Application-specific context |
| Custom | order.total | 299.99 | Business-relevant data |
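In practice, applying these conventions is a handful of attribute calls on the active span. Below is a minimal sketch, assuming an OpenTelemetry-instrumented TypeScript service; `lookupUser`, the query text, and the attribute values are illustrative, while the attribute keys follow the semantic conventions listed above.

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

// Hypothetical handler: annotate the span with semantic-convention attributes
// plus an application-specific one.
async function lookupUser(userId: string) {
  return tracer.startActiveSpan('get_user', async (span) => {
    try {
      span.setAttributes({
        'db.system': 'postgresql',                            // semantic convention
        'db.operation': 'SELECT',                             // semantic convention
        'db.statement': 'SELECT * FROM users WHERE id = ?',   // sanitized query
        'user.id': userId,                                    // custom, application-specific
      });
      // ... execute the query here ...
      return { id: userId };
    } finally {
      span.end();
    }
  });
}
```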
Events (Span Logs)
While attributes describe the span as a whole, events capture discrete occurrences during the span's lifetime. Events are timestamped and can have their own attributes.
```typescript
// Using span events for rich context

import { trace } from '@opentelemetry/api';

// validateOrder, processPayment, and ValidationError are defined elsewhere in the service
async function processOrder(order: any) {
  const span = trace.getActiveSpan();

  // Event: Order validation started
  span?.addEvent('order_validation_started', {
    order_id: order.id,
    items_count: order.items.length,
  });

  const validationResult = await validateOrder(order);

  // Event: Order validation completed
  span?.addEvent('order_validation_completed', {
    order_id: order.id,
    valid: validationResult.valid,
    validation_duration_ms: validationResult.durationMs,
  });

  if (!validationResult.valid) {
    // Event: Validation failed with details
    span?.addEvent('validation_failed', {
      reason: validationResult.error,
      field: validationResult.failingField,
    });
    throw new ValidationError(validationResult.error);
  }

  // Event: Payment processing started
  span?.addEvent('payment_processing_started', {
    order_id: order.id,
    amount: order.total,
    payment_method: order.paymentMethod,
  });

  try {
    const paymentResult = await processPayment(order);

    // Event: Payment successful
    span?.addEvent('payment_completed', {
      order_id: order.id,
      transaction_id: paymentResult.transactionId,
    });
  } catch (error: any) {
    // Record exception as an event
    span?.recordException(error);
    span?.addEvent('payment_failed', {
      order_id: order.id,
      error_code: error.code,
    });
    throw error;
  }
}
```

Use spans for operations that have meaningful duration and might fail independently. Use events for point-in-time occurrences within a span. Creating too many spans adds overhead; events are lightweight. Rule of thumb: if you'd write a log line for it, it's probably an event; if you'd measure its latency, it's probably a span.
Tracing every single request in a high-throughput system is prohibitively expensive. A system handling 100,000 requests per second sees roughly 8.6 billion requests per day (100,000 × 86,400 seconds), and with multiple spans per request that means billions upon billions of spans per day. Sampling selectively records a subset of traces, balancing observability with cost and performance.
Types of sampling (a head-based sampler sketch and a tail-based Collector configuration follow the table):
| Strategy | When Decision Made | Pros | Cons |
|---|---|---|---|
| Head-based random | At trace start | Simple, low overhead | May miss interesting traces |
| Head-based rate-limit | At trace start | Predictable volume | May miss bursts of interesting events |
| Tail-based | After trace complete | Captures anomalies | Higher resource usage, complex |
| Priority/Rule-based | At trace start with rules | Targeted sampling | Requires configuration maintenance |
| Adaptive | Dynamic adjustment | Balances coverage and cost | Complex to implement correctly |
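Head-based sampling is configured in the SDK, since the decision is made when the trace starts. A minimal sketch, assuming the OpenTelemetry Node SDK: `ParentBasedSampler` honors an upstream sampling decision, and `TraceIdRatioBasedSampler` keeps a fraction of new root traces; the 10% ratio is illustrative.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Head-based sampling: the decision is made at trace start.
// ParentBasedSampler respects the caller's decision (the traceparent "sampled"
// flag); for brand-new traces, TraceIdRatioBasedSampler keeps ~10% of them.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```

Tail-based sampling, by contrast, lives in the collector, because the decision needs the whole trace: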
```yaml
# OpenTelemetry Collector with tail-based sampling

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s                   # Wait up to 10s for all spans
    num_traces: 100000                   # Buffer capacity
    expected_new_traces_per_sec: 10000
    policies:
      # Always sample errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Always sample slow traces (>1 second)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of everything else
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Always sample specific important operations
      - name: critical-operations
        type: string_attribute
        string_attribute:
          key: operation.critical
          values:
            - "true"

exporters:
  otlp:
    endpoint: jaeger-collector:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp]
```

With 1% sampling, you'll likely miss the one slow request out of 100 that a user complained about. With 100% sampling, your tracing infrastructure costs more than your application infrastructure. There's no perfect answer—tune based on your debugging needs, budget, and the value of captured traces.
A complete tracing infrastructure consists of several components working together:
Architecture components:

- Instrumentation (SDKs and auto-instrumentation): creates spans inside each service and injects/extracts trace context.
- Collector or agent (such as the OpenTelemetry Collector): receives spans, then batches, samples, and routes them to a backend.
- Storage backend: holds trace data in Cassandra, Elasticsearch, object storage, or a managed service, as listed in the table below.
- Query and visualization UI: searches traces and renders the timeline and service-graph views.
Popular tracing systems:
| System | Origin | Key Strengths | Storage Options |
|---|---|---|---|
| Jaeger | Uber | Kubernetes-native, Cassandra scale, mature | Cassandra, Elasticsearch, Kafka |
| Zipkin | Twitter | Simple, widely adopted, low overhead | In-memory, MySQL, Cassandra, Elasticsearch |
| Grafana Tempo | Grafana Labs | Object storage backend, cost-efficient, Grafana integration | S3, GCS, Azure Blob, local disk |
| AWS X-Ray | AWS | AWS native, Lambda integration | AWS managed |
| OpenTelemetry Collector | CNCF | Vendor-neutral, flexible routing | Any backend via exporters |
| Datadog APM | Datadog | Full APM suite, ML analysis | Datadog managed |
| Honeycomb | Honeycomb | High-cardinality analysis, BubbleUp | Honeycomb managed |
OpenTelemetry (OTel) has emerged as the vendor-neutral standard for instrumentation. Instrument your code once with OTel SDKs, and you can send traces to any backend—Jaeger, Tempo, Datadog, or others. This avoids vendor lock-in and provides flexibility to change backends without code changes.
```typescript
// OpenTelemetry Node.js SDK setup

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// Export traces to the OTel Collector via gRPC
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
});

// Configure the SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),

  // Batch spans for efficient export (wraps the OTLP exporter above)
  spanProcessor: new BatchSpanProcessor(traceExporter, {
    maxQueueSize: 1000,
    maxExportBatchSize: 100,
    scheduledDelayMillis: 1000,
  }),

  // Auto-instrument common libraries
  instrumentations: [
    getNodeAutoInstrumentations({
      // Instrument HTTP client/server
      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
          span.setAttribute(
            'custom.request.id',
            String((request as any).headers?.['x-request-id'] ?? '')
          );
        },
      },
      // Instrument Express framework
      '@opentelemetry/instrumentation-express': {},
      // Instrument database clients
      '@opentelemetry/instrumentation-pg': {},
      '@opentelemetry/instrumentation-redis': {},
      // Instrument messaging
      '@opentelemetry/instrumentation-kafkajs': {},
    }),
  ],
});

// Start the SDK before your application code runs
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing shutdown complete'))
    .catch((err) => console.error('Error shutting down tracing', err))
    .finally(() => process.exit(0));
});
```

Effective tracing requires thoughtful implementation. Here are battle-tested practices for getting the most value from your tracing investment:

- Instrument once with OpenTelemetry so you can switch backends without code changes.
- Follow the semantic conventions for attribute names instead of inventing your own.
- Propagate context explicitly across async boundaries, thread pools, and message queues.
- Sample deliberately: always keep errors and slow traces, and sample the routine traffic.
- Link traces to the other signals, for example via exemplars on metrics, so you can pivot between aggregate and per-request views.
Exemplars connect metrics to traces. When Prometheus records a histogram bucket, it can store the trace ID of an example request that fell into that bucket. Clicking on a slow latency percentile in a dashboard can take you directly to a trace exhibiting that latency. This bridges the gap between aggregate views (metrics) and individual request views (traces).
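Conceptually, the metrics library grabs the active span's trace ID at observation time and stores it alongside the histogram bucket. A minimal sketch of that idea, assuming a hypothetical `histogram.observe(value, exemplarLabels)` interface (real clients such as prom-client or the OpenTelemetry metrics SDK expose this differently); `trace.getActiveSpan()` is the real OpenTelemetry API.

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical histogram interface; substitute your metrics client's API.
interface Histogram {
  observe(valueSeconds: number, exemplarLabels?: Record<string, string>): void;
}

// Record a latency observation and, when a sampled span is active, attach its
// trace ID as an exemplar so dashboards can link straight to the trace.
function recordRequestLatency(histogram: Histogram, seconds: number): void {
  const spanContext = trace.getActiveSpan()?.spanContext();
  const exemplar =
    spanContext && (spanContext.traceFlags & 1) // sampled flag set
      ? { trace_id: spanContext.traceId }
      : undefined;
  histogram.observe(seconds, exemplar);
}
```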
The true value of tracing emerges during debugging and incident response. Here's how to leverage traces effectively:
Scenario: Latency investigation
A dashboard shows p99 latency for the /checkout endpoint has doubled. Metrics tell you something is slow. Now what? Pull up recent traces for /checkout sorted by duration (or jump straight from an exemplar), open a slow one in the timeline view, and find the span where the time actually went; then compare it against a fast trace of the same endpoint to see what differs.
Scenario: Error investigation
Users are reporting intermittent failures when submitting orders; support tickets just say "something went wrong." Error rates are elevated but not catastrophic. Search for traces with error status on the order-submission operation, open a failing trace, and inspect the recorded exceptions and events on its spans to see which service first reported the error and why.
One of the most powerful debugging techniques with traces is comparison. Find a failing request and a succeeding one. Diff them. The difference often points directly to the issue—a missing span, an extra retry, a different code path taken, a new slow dependency.
We've explored distributed tracing comprehensively. Let's consolidate the key takeaways:

- A trace records one request's end-to-end journey; spans are its building blocks, linked by parent/child relationships.
- Context propagation (the W3C traceparent/tracestate headers) stitches spans from different services into one trace; it must be carried across every boundary, including async work and message queues.
- Attributes and events give spans the context needed for debugging; follow semantic conventions for naming.
- Sampling keeps tracing affordable: head-based sampling is simple, while tail-based sampling keeps the interesting (slow or failing) traces.
- OpenTelemetry provides vendor-neutral instrumentation, so you can switch backends (Jaeger, Tempo, Datadog, and others) without changing code.
- Exemplars and trace comparison connect traces to metrics and make debugging concrete.
What's next:
We've now covered all three pillars of observability individually—Metrics (quantitative measurements), Logs (event records), and Traces (request paths). But the true power of observability emerges when these three signals work together.
Next, we'll explore how Metrics, Logs, and Traces complement each other, how to connect them effectively, and how to build an observability strategy that leverages all three in a unified approach.
You now understand distributed tracing: the third pillar, the one that shows request journeys through your system. Combined with metrics and logs, traces complete the observability picture.