Imagine debugging a slow API response in a microservices architecture. The request enters through an API gateway, flows to an authentication service, queries a user service, checks an inventory service, processes payment through a payment service, sends a notification via a messaging service, and finally returns. The total latency is 3 seconds—but which service is responsible?
Your metrics show elevated latency. Your logs show individual events in each service. But neither shows you the complete journey—the actual path the request took, how long each hop took, and where the bottleneck lies.
This is the problem distributed tracing solves.
A trace is a representation of a single request's journey through a distributed system. It captures every service the request touched, the sequence of operations, the time spent in each, and the relationships between them. Traces provide the end-to-end visibility that metrics and logs alone cannot offer.
By the end of this page, you will understand distributed tracing fundamentally—the concepts of traces, spans, and context propagation. You'll learn how to instrument applications for tracing, understand sampling strategies to control costs, and appreciate how traces connect with metrics and logs to form complete observability.
Distributed tracing tracks the flow of requests as they propagate through distributed systems. Unlike logs (individual events) or metrics (aggregated numbers), traces capture the cause-and-effect relationships between operations.
Core concepts:

- Trace: the end-to-end record of a single request, identified by a trace ID shared by everything that happened on the request's behalf.
- Span: a single named, timed operation within a trace, carrying a span ID, start time, duration, status, and attributes.
- Parent/child relationships: every span except the root references a parent span ID, forming a tree that mirrors the call graph.
- Context propagation: the mechanism that carries the trace ID and current span ID across service boundaries so spans from different services join the same trace.

The structure is easiest to see in a simplified example:
```jsonc
// Example trace structure (simplified)
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spans": [
    {
      "span_id": "00f067aa0ba902b7",
      "parent_span_id": null,               // Root span
      "operation_name": "HTTP GET /api/orders/123",
      "service_name": "api-gateway",
      "start_time": "2024-01-08T10:23:45.000Z",
      "duration_ms": 847,
      "status": "OK",
      "tags": {
        "http.method": "GET",
        "http.url": "/api/orders/123",
        "http.status_code": 200
      }
    },
    {
      "span_id": "01a902b7c3d4e5f6",
      "parent_span_id": "00f067aa0ba902b7",  // Child of root
      "operation_name": "authenticate_user",
      "service_name": "auth-service",
      "start_time": "2024-01-08T10:23:45.010Z",
      "duration_ms": 45,
      "status": "OK"
    },
    {
      "span_id": "02b3c4d5e6f7a8b9",
      "parent_span_id": "00f067aa0ba902b7",  // Child of root
      "operation_name": "get_order",
      "service_name": "order-service",
      "start_time": "2024-01-08T10:23:45.060Z",
      "duration_ms": 380,
      "status": "OK"
    },
    {
      "span_id": "03c4d5e6f7a8b9c0",
      "parent_span_id": "02b3c4d5e6f7a8b9",  // Child of order-service
      "operation_name": "SELECT * FROM orders",
      "service_name": "order-service",
      "start_time": "2024-01-08T10:23:45.065Z",
      "duration_ms": 350,
      "status": "OK",
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM orders WHERE id = ?"
      }
    }
  ]
}
```

Visualizing traces:
Traces are typically visualized in two ways:
Timeline view (Waterfall/Gantt chart) — Shows spans laid out horizontally by time. You can see which operations happened in parallel, which were sequential, and where time was spent.
Service graph view — Shows the topology of services involved, with edges representing calls between them and aggregated latency statistics.
The timeline view is essential for debugging individual slow requests—you immediately see that 350ms of a 380ms operation was spent in a database query. The service graph is valuable for understanding system-wide patterns and dependencies.
Modern distributed tracing traces its origins to Google's Dapper paper (2010). Dapper introduced the concepts of trace IDs, span IDs, and context propagation that underpin all modern tracing systems. Systems like Zipkin, Jaeger, and OpenTelemetry all follow this model.
For distributed tracing to work, trace context must propagate from service to service. When Service A calls Service B, it must pass along the trace ID and current span ID so that B's spans can be connected to A's.
Without context propagation, traces would break at every service boundary. You'd have disconnected tree fragments instead of a complete request journey.
How context propagates: the calling service serializes the trace context (trace ID, current span ID, and sampling flags) into carrier metadata such as HTTP headers or message properties; the receiving service extracts that context and starts its own spans as children of the caller's span.
The W3C Trace Context standard:
To ensure interoperability between different tracing systems, the W3C defined a standard trace context format. Two headers carry the essential information:
```text
# W3C Trace Context Headers
# ==========================

# traceparent: The core correlation header
# Format: version-trace_id-parent_id-flags
#
# version:   2 hex digits (currently "00")
# trace_id:  32 hex digits (16 bytes)
# parent_id: 16 hex digits (8 bytes)
# flags:     2 hex digits (sampling flags)

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Breakdown:
# - version:   00
# - trace_id:  4bf92f3577b34da6a3ce929d0e0e4736
# - parent_id: 00f067aa0ba902b7
# - flags:     01 (sampled)

# tracestate: Vendor-specific data
# Format: key=value pairs separated by commas
# Used to pass additional context without breaking compatibility

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

# Example: Datadog trace might add:
tracestate: dd=t.dm:-0
```

Implementing context propagation:
```typescript
// TypeScript example: Context propagation with OpenTelemetry

import express, { Request } from 'express';
import {
  context,
  propagation,
  trace,
  SpanKind,
  SpanStatusCode,
} from '@opentelemetry/api';

const app = express();
const tracer = trace.getTracer('order-service');

// ========== Extracting context from incoming request ==========
function extractContext(req: Request) {
  // Extract trace context (traceparent/tracestate) from incoming HTTP headers
  return propagation.extract(context.active(), req.headers, {
    get(carrier, key) {
      return carrier[key.toLowerCase()];
    },
    keys(carrier) {
      return Object.keys(carrier);
    },
  });
}

// ========== Creating a span for incoming request ==========
app.use((req, res, next) => {
  const extractedContext = extractContext(req);

  // Create a new span within the extracted context
  const span = tracer.startSpan(
    `HTTP ${req.method} ${req.path}`,
    {
      kind: SpanKind.SERVER,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.target': req.path,
      },
    },
    extractedContext
  );

  // Set the span in context for downstream code
  context.with(trace.setSpan(extractedContext, span), () => {
    res.on('finish', () => {
      span.setAttributes({
        'http.status_code': res.statusCode,
      });
      if (res.statusCode >= 400) {
        span.setStatus({ code: SpanStatusCode.ERROR });
      }
      span.end();
    });
    next();
  });
});

// ========== Propagating context to outgoing request ==========
async function callPaymentService(orderId: string) {
  return tracer.startActiveSpan('call_payment_service', async (span) => {
    try {
      // Inject trace context into outgoing headers
      const headers: Record<string, string> = {
        'Content-Type': 'application/json',
      };
      propagation.inject(context.active(), headers, {
        set(carrier, key, value) {
          carrier[key] = value;
        },
      });

      // Now 'headers' contains traceparent and tracestate
      const response = await fetch('http://payment-service/process', {
        method: 'POST',
        headers,
        body: JSON.stringify({ orderId }),
      });

      span.setAttributes({
        'http.status_code': response.status,
        'payment.service.response': response.ok,
      });

      return response.json();
    } catch (error: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Context can be lost when crossing async boundaries, thread pools, or message queues. If you spawn a background thread or publish a message without propagating context, the trace breaks. Libraries like OpenTelemetry provide automatic instrumentation for many frameworks, but custom async patterns require explicit handling; the sketch below applies the same inject/extract pattern to message metadata.
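For example, here is a minimal sketch of carrying trace context through a message queue. It assumes a generic publish/consume API: `publish` and `handleMessage` are hypothetical stand-ins for your messaging client, while `propagation.inject`/`extract` are the same OpenTelemetry API used above.

```typescript
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

// Producer side: inject the active trace context into message headers.
// `publish` is a hypothetical messaging-client call, not a real library API.
async function publishOrderCreated(
  publish: (topic: string, payload: string, headers: Record<string, string>) => Promise<void>,
  orderId: string
) {
  await tracer.startActiveSpan(
    'publish order_created',
    { kind: SpanKind.PRODUCER },
    async (span) => {
      const headers: Record<string, string> = {};
      propagation.inject(context.active(), headers); // adds traceparent/tracestate
      await publish('orders-topic', JSON.stringify({ orderId }), headers);
      span.end();
    }
  );
}

// Consumer side: extract the context from message headers so the consumer's
// span joins the producer's trace instead of starting a new one.
function handleMessage(headers: Record<string, string>, body: string) {
  const parentContext = propagation.extract(context.active(), headers);
  const span = tracer.startSpan(
    'process order_created',
    { kind: SpanKind.CONSUMER },
    parentContext
  );
  context.with(trace.setSpan(parentContext, span), () => {
    // ... business logic runs here with the trace context active ...
    span.end();
  });
}
```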
Beyond the basic timing information, spans carry attributes (tags) and events (logs) that provide rich context for debugging and analysis.
Attributes (Tags)
Attributes are key-value pairs attached to spans that describe the operation. OpenTelemetry defines semantic conventions for common attribute names, ensuring consistency across different services and languages.
| Category | Attribute | Example Value | Purpose |
|---|---|---|---|
| HTTP | http.method | GET, POST | HTTP request method |
| HTTP | http.url | https://api.example.com/users | Full URL of request |
| HTTP | http.status_code | 200, 404, 500 | Response status code |
| HTTP | http.route | /users/:id | Route template (not actual path) |
| Database | db.system | postgresql, mysql, redis | Database type |
| Database | db.statement | SELECT * FROM users | Query (may be sanitized) |
| Database | db.operation | SELECT, INSERT, UPDATE | Operation type |
| Messaging | messaging.system | kafka, rabbitmq | Messaging system |
| Messaging | messaging.destination | orders-topic | Queue or topic name |
| RPC | rpc.system | grpc, jsonrpc | RPC framework |
| RPC | rpc.service | PaymentService | Remote service name |
| RPC | rpc.method | ProcessPayment | Remote method called |
| Error | exception.type | NullPointerException | Exception class name |
| Error | exception.message | Value cannot be null | Exception message |
| Custom | user.id | user_12345 | Application-specific context |
| Custom | order.total | 299.99 | Business-relevant data |
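In practice, applying these conventions is a handful of attribute calls on the active span. Below is a minimal sketch, assuming an OpenTelemetry-instrumented TypeScript service; `lookupUser`, the query text, and the attribute values are illustrative, while the attribute keys follow the semantic conventions listed above.

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

// Hypothetical handler: annotate the span with semantic-convention attributes
// plus an application-specific one.
async function lookupUser(userId: string) {
  return tracer.startActiveSpan('get_user', async (span) => {
    try {
      span.setAttributes({
        'db.system': 'postgresql',                            // semantic convention
        'db.operation': 'SELECT',                             // semantic convention
        'db.statement': 'SELECT * FROM users WHERE id = ?',   // sanitized query
        'user.id': userId,                                    // custom, application-specific
      });
      // ... execute the query here ...
      return { id: userId };
    } finally {
      span.end();
    }
  });
}
```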
Events (Span Logs)
While attributes describe the span as a whole, events capture discrete occurrences during the span's lifetime. Events are timestamped and can have their own attributes.
```typescript
// Using span events for rich context

import { trace } from '@opentelemetry/api';

// validateOrder, processPayment, and ValidationError are defined elsewhere in the service
async function processOrder(order: any) {
  const span = trace.getActiveSpan();

  // Event: Order validation started
  span?.addEvent('order_validation_started', {
    order_id: order.id,
    items_count: order.items.length,
  });

  const validationResult = await validateOrder(order);

  // Event: Order validation completed
  span?.addEvent('order_validation_completed', {
    order_id: order.id,
    valid: validationResult.valid,
    validation_duration_ms: validationResult.durationMs,
  });

  if (!validationResult.valid) {
    // Event: Validation failed with details
    span?.addEvent('validation_failed', {
      reason: validationResult.error,
      field: validationResult.failingField,
    });
    throw new ValidationError(validationResult.error);
  }

  // Event: Payment processing started
  span?.addEvent('payment_processing_started', {
    order_id: order.id,
    amount: order.total,
    payment_method: order.paymentMethod,
  });

  try {
    const paymentResult = await processPayment(order);

    // Event: Payment successful
    span?.addEvent('payment_completed', {
      order_id: order.id,
      transaction_id: paymentResult.transactionId,
    });
  } catch (error: any) {
    // Record exception as an event
    span?.recordException(error);
    span?.addEvent('payment_failed', {
      order_id: order.id,
      error_code: error.code,
    });
    throw error;
  }
}
```

Use spans for operations that have meaningful duration and might fail independently. Use events for point-in-time occurrences within a span. Creating too many spans adds overhead; events are lightweight. Rule of thumb: if you'd write a log line for it, it's probably an event; if you'd measure its latency, it's probably a span.
Tracing every single request in a high-throughput system is prohibitively expensive. A system handling 100,000 requests per second sees roughly 8.6 billion requests per day (100,000 × 86,400 seconds), and with multiple spans per request that means billions upon billions of spans per day. Sampling selectively records a subset of traces, balancing observability with cost and performance.
Types of sampling (a head-based sampler sketch and a tail-based Collector configuration follow the table):
| Strategy | When Decision Made | Pros | Cons |
|---|---|---|---|
| Head-based random | At trace start | Simple, low overhead | May miss interesting traces |
| Head-based rate-limit | At trace start | Predictable volume | May miss bursts of interesting events |
| Tail-based | After trace complete | Captures anomalies | Higher resource usage, complex |
| Priority/Rule-based | At trace start with rules | Targeted sampling | Requires configuration maintenance |
| Adaptive | Dynamic adjustment | Balances coverage and cost | Complex to implement correctly |
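Head-based sampling is configured in the SDK, since the decision is made when the trace starts. A minimal sketch, assuming the OpenTelemetry Node SDK: `ParentBasedSampler` honors an upstream sampling decision, and `TraceIdRatioBasedSampler` keeps a fraction of new root traces; the 10% ratio is illustrative.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Head-based sampling: the decision is made at trace start.
// ParentBasedSampler respects the caller's decision (the traceparent "sampled"
// flag); for brand-new traces, TraceIdRatioBasedSampler keeps ~10% of them.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```

Tail-based sampling, by contrast, lives in the collector, because the decision needs the whole trace: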
```yaml
# OpenTelemetry Collector with tail-based sampling

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

  # Tail-based sampling processor
  tail_sampling:
    decision_wait: 10s                   # Wait up to 10s for all spans
    num_traces: 100000                   # Buffer capacity
    expected_new_traces_per_sec: 10000
    policies:
      # Always sample errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes:
            - ERROR

      # Always sample slow traces (>1 second)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of everything else
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Always sample specific important operations
      - name: critical-operations
        type: string_attribute
        string_attribute:
          key: operation.critical
          values:
            - "true"

exporters:
  otlp:
    endpoint: jaeger-collector:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp]
```

With 1% sampling, you'll likely miss the one slow request out of 100 that a user complained about. With 100% sampling, your tracing infrastructure costs more than your application infrastructure. There's no perfect answer—tune based on your debugging needs, budget, and the value of captured traces.
A complete tracing infrastructure consists of several components working together:
Architecture components:

- Instrumentation (SDKs and auto-instrumentation): creates spans inside each service and injects/extracts trace context.
- Collector or agent (such as the OpenTelemetry Collector): receives spans, then batches, samples, and routes them to a backend.
- Storage backend: holds trace data in Cassandra, Elasticsearch, object storage, or a managed service, as listed in the table below.
- Query and visualization UI: searches traces and renders the timeline and service-graph views.
Popular tracing systems:
| System | Origin | Key Strengths | Storage Options |
|---|---|---|---|
| Jaeger | Uber | Kubernetes-native, Cassandra scale, mature | Cassandra, Elasticsearch, Kafka |
| Zipkin | Twitter | Simple, widely adopted, low overhead | In-memory, MySQL, Cassandra, Elasticsearch |
| Grafana Tempo | Grafana Labs | Object storage backend, cost-efficient, Grafana integration | S3, GCS, Azure Blob, local disk |
| AWS X-Ray | AWS | AWS native, Lambda integration | AWS managed |
| OpenTelemetry Collector | CNCF | Vendor-neutral, flexible routing | Any backend via exporters |
| Datadog APM | Datadog | Full APM suite, ML analysis | Datadog managed |
| Honeycomb | Honeycomb | High-cardinality analysis, BubbleUp | Honeycomb managed |
OpenTelemetry (OTel) has emerged as the vendor-neutral standard for instrumentation. Instrument your code once with OTel SDKs, and you can send traces to any backend—Jaeger, Tempo, Datadog, or others. This avoids vendor lock-in and provides flexibility to change backends without code changes.
```typescript
// OpenTelemetry Node.js SDK setup

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

// Export traces to the OTel Collector via gRPC
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
});

// Configure the SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),

  // Batch spans for efficient export (wraps the OTLP exporter above)
  spanProcessor: new BatchSpanProcessor(traceExporter, {
    maxQueueSize: 1000,
    maxExportBatchSize: 100,
    scheduledDelayMillis: 1000,
  }),

  // Auto-instrument common libraries
  instrumentations: [
    getNodeAutoInstrumentations({
      // Instrument HTTP client/server
      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
          span.setAttribute(
            'custom.request.id',
            String((request as any).headers?.['x-request-id'] ?? '')
          );
        },
      },
      // Instrument Express framework
      '@opentelemetry/instrumentation-express': {},
      // Instrument database clients
      '@opentelemetry/instrumentation-pg': {},
      '@opentelemetry/instrumentation-redis': {},
      // Instrument messaging
      '@opentelemetry/instrumentation-kafkajs': {},
    }),
  ],
});

// Start the SDK before your application code runs
sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing shutdown complete'))
    .catch((err) => console.error('Error shutting down tracing', err))
    .finally(() => process.exit(0));
});
```

Effective tracing requires thoughtful implementation. Here are battle-tested practices for getting the most value from your tracing investment:

- Instrument once with OpenTelemetry so you can switch backends without code changes.
- Follow the semantic conventions for attribute names instead of inventing your own.
- Propagate context explicitly across async boundaries, thread pools, and message queues.
- Sample deliberately: always keep errors and slow traces, and sample the routine traffic.
- Link traces to the other signals, for example via exemplars on metrics, so you can pivot between aggregate and per-request views.
Exemplars connect metrics to traces. When Prometheus records a histogram bucket, it can store the trace ID of an example request that fell into that bucket. Clicking on a slow latency percentile in a dashboard can take you directly to a trace exhibiting that latency. This bridges the gap between aggregate views (metrics) and individual request views (traces).
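Conceptually, the metrics library grabs the active span's trace ID at observation time and stores it alongside the histogram bucket. A minimal sketch of that idea, assuming a hypothetical `histogram.observe(value, exemplarLabels)` interface (real clients such as prom-client or the OpenTelemetry metrics SDK expose this differently); `trace.getActiveSpan()` is the real OpenTelemetry API.

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical histogram interface; substitute your metrics client's API.
interface Histogram {
  observe(valueSeconds: number, exemplarLabels?: Record<string, string>): void;
}

// Record a latency observation and, when a sampled span is active, attach its
// trace ID as an exemplar so dashboards can link straight to the trace.
function recordRequestLatency(histogram: Histogram, seconds: number): void {
  const spanContext = trace.getActiveSpan()?.spanContext();
  const exemplar =
    spanContext && (spanContext.traceFlags & 1) // sampled flag set
      ? { trace_id: spanContext.traceId }
      : undefined;
  histogram.observe(seconds, exemplar);
}
```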
The true value of tracing emerges during debugging and incident response. Here's how to leverage traces effectively:
Scenario: Latency investigation
A dashboard shows p99 latency for the /checkout endpoint has doubled. Metrics tell you something is slow. Now what? Pull up recent traces for /checkout sorted by duration (or jump straight from an exemplar), open a slow one in the timeline view, and find the span where the time actually went; then compare it against a fast trace of the same endpoint to see what differs.
Scenario: Error investigation
Users are reporting intermittent failures when submitting orders; support tickets just say "something went wrong." Error rates are elevated but not catastrophic. Search for traces with error status on the order-submission operation, open a failing trace, and inspect the recorded exceptions and events on its spans to see which service first reported the error and why.
One of the most powerful debugging techniques with traces is comparison. Find a failing request and a succeeding one. Diff them. The difference often points directly to the issue—a missing span, an extra retry, a different code path taken, a new slow dependency.
We've explored distributed tracing comprehensively. Let's consolidate the key takeaways:

- A trace records one request's end-to-end journey; spans are its building blocks, linked by parent/child relationships.
- Context propagation (the W3C traceparent/tracestate headers) stitches spans from different services into one trace; it must be carried across every boundary, including async work and message queues.
- Attributes and events give spans the context needed for debugging; follow semantic conventions for naming.
- Sampling keeps tracing affordable: head-based sampling is simple, while tail-based sampling keeps the interesting (slow or failing) traces.
- OpenTelemetry provides vendor-neutral instrumentation, so you can switch backends (Jaeger, Tempo, Datadog, and others) without changing code.
- Exemplars and trace comparison connect traces to metrics and make debugging concrete.
What's next:
We've now covered all three pillars of observability individually—Metrics (quantitative measurements), Logs (event records), and Traces (request paths). But the true power of observability emerges when these three signals work together.
Next, we'll explore how Metrics, Logs, and Traces complement each other, how to connect them effectively, and how to build an observability strategy that leverages all three in a unified approach.
You now understand distributed tracing: the third pillar, the one that shows request journeys through your system. Combined with metrics and logs, traces complete the observability picture.