A user reports: "My order failed at 2:15 PM."
In a monolithic application, debugging is straightforward: search logs for the user ID around that time, follow the single thread of execution, find the error.
In a distributed system, a single order might touch 15 microservices: API gateway → auth → user → inventory → pricing → tax → payment → fulfillment → notification. Each service logs independently. The user's request generates hundreds of log entries scattered across services, interleaved with logs from thousands of concurrent requests.
Without correlation IDs, finding all logs for a single request is nearly impossible.
Correlation IDs solve this by assigning a unique identifier to each request at the edge and propagating it through every service. Query for that ID, and you instantly retrieve every log from the entire request flow, in order, across all services.
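The mechanism fits in a few lines. The sketch below is a toy illustration, not any specific library's API — `handle_incoming_request`, the `X-Trace-Id` header name, and the dict-based log records are all stand-ins:

```python
import uuid

def handle_incoming_request(headers: dict) -> dict:
    """Reuse an incoming correlation ID, or mint one at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    return {"trace_id": trace_id}

def log(record: dict, ctx: dict) -> dict:
    """Stamp every log entry with the request's trace ID."""
    return {**record, "trace_id": ctx["trace_id"]}

# The edge mints the ID once; every downstream log carries it.
ctx = handle_incoming_request({})
entries = [
    log({"service": "auth", "event": "token_valid"}, ctx),
    log({"service": "inventory", "event": "stock_checked"}, ctx),
]

# Querying by that one ID retrieves the whole request flow:
matching = [e for e in entries if e["trace_id"] == ctx["trace_id"]]
assert len(matching) == 2
```

In a real system the "query" is a search in your log backend rather than a list comprehension, but the contract is the same: one ID, generated once, present everywhere.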
By the end of this page, you'll understand correlation ID semantics and standards (W3C Trace Context), implement ID generation and propagation across services, integrate correlation IDs with logging frameworks, design correlation hierarchies for complex request flows, and troubleshoot distributed transactions using correlation.
When a request traverses multiple services, several correlation challenges emerge:
Challenge 1: Request Fragmentation
A single user action becomes multiple internal requests. "Place order" triggers parallel calls to inventory and payment, then sequential calls to fulfillment and notification. Each service sees its own request; none sees the complete picture.
Challenge 2: Async Processing
Parts of the request are processed asynchronously. Payment confirmation triggers a Kafka event; a worker picks it up minutes later. How do you connect the original request to the async processing?
Challenge 3: Fan-Out and Fan-In
A service might call 10 other services in parallel (fan-out), aggregate responses, then call 3 more (fan-in). The request tree becomes complex, hard to visualize from logs alone.
Challenge 4: Time Ordering
Services run on different machines with clock skew. Log timestamps may not reflect actual execution order. Correlation IDs with sequence or hierarchy solve this.
```
User: "Place Order"
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway                                                     │
│ Log: "Received POST /orders"   ← Which user? Which request?     │
└─────────────────────────────────────────────────────────────────┘
         │
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  Auth   │ │Inventory│ │ Pricing │ │  User   │
│ Service │ │ Service │ │ Service │ │ Service │
│         │ │         │ │         │ │         │
│ Log:    │ │ Log:    │ │ Log:    │ │ Log:    │
│ "Token  │ │ "Check  │ │ "Get    │ │ "Fetch  │
│ valid"  │ │ stock"  │ │ price"  │ │ user"   │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Question: Which of the 10,000 "Check stock" logs from this second
is related to THIS user's order?
Answer: Without a correlation ID... grep and hope.
```
```
User: "Place Order"
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway                                                     │
│ Generate: trace_id = "abc123"                                   │
│ Log: {"trace_id":"abc123", "event":"request_received"}          │
└─────────────────────────────────────────────────────────────────┘
         │
         │ trace_id: abc123 (in headers)
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  Auth   │ │Inventory│ │ Pricing │ │  User   │
│         │ │         │ │         │ │         │
│ Log:    │ │ Log:    │ │ Log:    │ │ Log:    │
│trace_id │ │trace_id │ │trace_id │ │trace_id │
│ abc123  │ │ abc123  │ │ abc123  │ │ abc123  │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Query: trace_id:"abc123"
→ Returns all 47 logs from this request, in order, across all services.
```

These terms are often used interchangeably, but have nuances: Trace ID (OpenTelemetry/W3C) spans the entire distributed transaction. Span ID identifies each operation within a trace. Request ID often refers to a single HTTP request. Correlation ID is an umbrella term for any linking identifier. We'll use "trace ID" following W3C standards.
The W3C Trace Context specification is the industry standard for correlation ID propagation. Adopting it ensures interoperability with distributed tracing tools (Jaeger, Zipkin, OpenTelemetry) and third-party services.
The traceparent Header:
The primary propagation mechanism is the traceparent HTTP header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Format: {version}-{trace-id}-{parent-id}-{trace-flags}
| Field | Length | Example | Meaning |
|---|---|---|---|
| version | 2 hex chars | 00 | Spec version (always 00 currently) |
| trace-id | 32 hex chars | 4bf92f3577b34da6a3ce929d0e0e4736 | Unique ID for entire distributed transaction |
| parent-id | 16 hex chars | 00f067aa0ba902b7 | ID of the immediate parent span |
| trace-flags | 2 hex chars | 01 | Flags: 01 = sampled (recorded) |
```typescript
import { randomBytes } from 'crypto';

interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  traceFlags: string;
  isSampled: boolean;
}

/**
 * Generate a new trace ID (32 hex characters = 128 bits)
 */
function generateTraceId(): string {
  return randomBytes(16).toString('hex');
}

/**
 * Generate a new span/parent ID (16 hex characters = 64 bits)
 */
function generateSpanId(): string {
  return randomBytes(8).toString('hex');
}

/**
 * Parse W3C traceparent header
 */
function parseTraceParent(header: string): TraceContext | null {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = header.toLowerCase().match(pattern);
  if (!match) return null;

  return {
    version: match[1],
    traceId: match[2],
    parentId: match[3],
    traceFlags: match[4],
    isSampled: (parseInt(match[4], 16) & 0x01) === 1,
  };
}

/**
 * Create W3C traceparent header
 */
function createTraceParent(
  traceId: string = generateTraceId(),
  parentId: string = generateSpanId(),
  sampled: boolean = true
): string {
  const flags = sampled ? '01' : '00';
  return `00-${traceId}-${parentId}-${flags}`;
}

/**
 * Create child span context (inherits trace ID, new span ID)
 */
function createChildContext(parentContext: TraceContext): TraceContext {
  return {
    ...parentContext,
    parentId: generateSpanId(), // New span ID for this operation
  };
}

// Example usage
const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const context = parseTraceParent(incomingHeader);

if (context) {
  console.log(`Trace ID: ${context.traceId}`);
  console.log(`Parent Span: ${context.parentId}`);

  // Create child context for outgoing request
  const childContext = createChildContext(context);
  const outgoingHeader = createTraceParent(
    context.traceId,        // Same trace ID
    childContext.parentId,  // New span ID
    context.isSampled
  );
  console.log(`Outgoing header: ${outgoingHeader}`);
}
```

The spec also defines a companion tracestate header that carries vendor-specific key-value pairs in the form vendor1=value1,vendor2=value2. It is used by tools like Datadog and Lightstep for their own correlation data.

Before W3C standardization, each tracing system had its own format: Zipkin B3 headers (X-B3-TraceId), Jaeger headers (uber-trace-id), AWS X-Ray headers (X-Amzn-Trace-Id). Modern OpenTelemetry bridges these via configuration. When integrating with older services, check which format they expect.
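As a rough sketch of such a bridge, a service can accept either format on ingress. This is a simplified illustration — `extract_trace_context` is a hypothetical helper written for this page; only the header names themselves come from the W3C and Zipkin B3 conventions:

```python
import re

def extract_trace_context(headers):
    """Accept either a W3C traceparent or legacy Zipkin B3 headers.

    Returns a dict with trace_id, parent_id, sampled — or None.
    """
    # Header names are case-insensitive; normalize for lookup
    h = {k.lower(): v for k, v in headers.items()}

    # Preferred: W3C traceparent
    tp = h.get("traceparent")
    if tp:
        m = re.fullmatch(
            r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
            tp.lower(),
        )
        if m:
            return {
                "trace_id": m.group(2),
                "parent_id": m.group(3),
                "sampled": bool(int(m.group(4), 16) & 0x01),
            }

    # Fallback: Zipkin B3 multi-header format
    b3_trace, b3_span = h.get("x-b3-traceid"), h.get("x-b3-spanid")
    if b3_trace and b3_span:
        return {
            "trace_id": b3_trace.lower().zfill(32),  # B3 trace IDs may be 64-bit
            "parent_id": b3_span.lower(),
            "sampled": h.get("x-b3-sampled", "1") == "1",
        }
    return None

# A request from a legacy Zipkin-instrumented service:
ctx = extract_trace_context({
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
})
assert ctx and ctx["trace_id"].endswith("463ac35c9f6413ad")
```

In practice, prefer OpenTelemetry's configurable propagators over hand-rolled parsing; this only shows why the bridge is straightforward.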
Correlation IDs must be propagated through every boundary the request crosses. Incomplete propagation creates gaps in your ability to trace requests.
Propagation Boundaries:

- HTTP and gRPC calls — traceparent header / gRPC metadata
- Message queues — message headers (Kafka, RabbitMQ, SQS)
- Background jobs — context passed in job metadata
- External and third-party calls — forward the header where supported
```typescript
import express, { Request, Response, NextFunction } from 'express';
import { AsyncLocalStorage } from 'async_hooks';

// Type for trace context
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  isSampled: boolean;
}

// Async local storage for context propagation across async boundaries
const traceStorage = new AsyncLocalStorage<TraceContext>();

// Middleware: Extract or create trace context
function traceContextMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceparentHeader = req.headers['traceparent'] as string;

  let context: TraceContext;

  if (traceparentHeader) {
    // Parse incoming trace context
    const parsed = parseTraceParent(traceparentHeader);
    if (parsed) {
      context = {
        traceId: parsed.traceId,
        spanId: generateSpanId(), // New span for this service
        parentSpanId: parsed.parentId,
        isSampled: parsed.isSampled,
      };
    } else {
      context = createNewTraceContext();
    }
  } else {
    // No incoming context - this is the trace origin
    context = createNewTraceContext();
  }

  // Add trace context to response headers (for debugging)
  res.setHeader('X-Trace-Id', context.traceId);

  // Run request handler in trace context
  traceStorage.run(context, () => {
    next();
  });
}

function createNewTraceContext(): TraceContext {
  return {
    traceId: generateTraceId(),
    spanId: generateSpanId(),
    isSampled: Math.random() < 0.1, // 10% sampling rate
  };
}

// Helper: Get current trace context (from anywhere in async stack)
export function getCurrentTraceContext(): TraceContext | undefined {
  return traceStorage.getStore();
}

// Helper: Create outgoing headers for downstream calls
export function getOutgoingTraceHeaders(): Record<string, string> {
  const context = getCurrentTraceContext();
  if (!context) return {};

  return {
    'traceparent': `00-${context.traceId}-${context.spanId}-${context.isSampled ? '01' : '00'}`,
  };
}

// Usage in route handler
const app = express();
app.use(traceContextMiddleware);

app.post('/orders', async (req, res) => {
  const context = getCurrentTraceContext();

  // Log with trace context
  logger.info({
    trace_id: context?.traceId,
    span_id: context?.spanId,
    event: 'order_received',
  });

  // Call downstream service with propagated context
  const response = await fetch('http://inventory-service/check', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...getOutgoingTraceHeaders(), // Propagate trace context
    },
    body: JSON.stringify(req.body),
  });

  res.json({ success: true });
});
```
```python
from kafka import KafkaProducer, KafkaConsumer
import json
from contextvars import ContextVar

# Context variable for trace context
current_trace_context: ContextVar[dict] = ContextVar('trace_context', default={})

def produce_with_trace_context(topic: str, message: dict):
    """Produce Kafka message with trace context in headers."""
    producer = KafkaProducer(bootstrap_servers='kafka:9092')

    # Get current trace context
    ctx = current_trace_context.get()
    trace_id = ctx.get('trace_id', generate_trace_id())
    span_id = generate_span_id()

    # Encode trace context as Kafka headers
    headers = [
        ('traceparent', f'00-{trace_id}-{span_id}-01'.encode('utf-8')),
    ]

    # Log the produce event
    logger.info({
        'trace_id': trace_id,
        'span_id': span_id,
        'event': 'kafka_produce',
        'topic': topic,
    })

    producer.send(
        topic,
        value=json.dumps(message).encode('utf-8'),
        headers=headers
    )
    producer.flush()

def consume_with_trace_context(topic: str, handler):
    """Consume Kafka messages with trace context extraction."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers='kafka:9092',
        group_id='order-processor'
    )

    for message in consumer:
        # Extract trace context from headers
        trace_context = {}
        for key, value in message.headers or []:
            if key == 'traceparent':
                parsed = parse_traceparent(value.decode('utf-8'))
                if parsed:
                    trace_context = {
                        'trace_id': parsed['trace_id'],
                        'parent_span_id': parsed['parent_id'],
                        'span_id': generate_span_id(),  # New span for consumer
                    }

        # Set context for this message processing
        token = current_trace_context.set(trace_context)
        try:
            # Log consume event
            logger.info({
                'trace_id': trace_context.get('trace_id'),
                'span_id': trace_context.get('span_id'),
                'event': 'kafka_consume',
                'topic': topic,
                'partition': message.partition,
                'offset': message.offset,
            })

            # Process message
            handler(json.loads(message.value))
        finally:
            current_trace_context.reset(token)

# Usage
def handle_order_event(order_data):
    ctx = current_trace_context.get()
    logger.info({
        'trace_id': ctx.get('trace_id'),
        'event': 'processing_order',
        'order_id': order_data['order_id'],
    })
    # Process order...

consume_with_trace_context('orders', handle_order_event)
```

Thread-local storage (MDC, ThreadLocal, contextvars) only works within the same thread. Async/await, goroutines, and thread pools break propagation. Use AsyncLocalStorage (Node.js), contextvars (Python), or explicit context passing in these scenarios. Many frameworks now have async-aware context propagation—use them.
For correlation IDs to be useful, they must appear in every log entry. Manual addition is error-prone; automatic injection is essential.
The Goal:
Every log entry should automatically include:
- trace_id — Links logs across all services in the request
- span_id — Identifies the current operation
- parent_span_id — Links to the calling operation (optional but useful)

This happens at the logger configuration level, not at each log call site.
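With Python's standard logging module, this auto-injection can be sketched with a logging.Filter plus a contextvars variable. A minimal illustration — `TraceContextFilter` and `trace_id_var` are names invented here, not a library API:

```python
import io
import json
import logging
from contextvars import ContextVar

# Middleware sets this once per request; async-safe by construction
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every record automatically."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only enrich them

logger = logging.getLogger("orders")
stream = io.StringIO()  # stand-in for stdout, to capture output
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('{"trace_id":"%(trace_id)s","message":"%(message)s"}')
)
handler.addFilter(TraceContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A request-scoped middleware would do this; log call sites stay unchanged.
trace_id_var.set("abc123")
logger.info("order_received")

entry = json.loads(stream.getvalue())
assert entry["trace_id"] == "abc123"
```

The point is the same as with MDC or structlog below: the call site says `logger.info("order_received")` and nothing else; the trace ID arrives via configuration.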
```java
import org.slf4j.MDC;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

/**
 * Servlet filter that extracts trace context and populates MDC.
 * MDC values automatically appear in all logs within the request.
 */
public class TraceContextFilter implements Filter {

    private static final Logger logger = LoggerFactory.getLogger(TraceContextFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest httpRequest = (HttpServletRequest) request;

        try {
            // Extract or generate trace context
            String traceParent = httpRequest.getHeader("traceparent");
            TraceContext context = traceParent != null
                ? parseTraceParent(traceParent)
                : newTraceContext();

            // Populate MDC - these will appear in every log automatically
            MDC.put("trace_id", context.getTraceId());
            MDC.put("span_id", context.getSpanId());
            if (context.getParentSpanId() != null) {
                MDC.put("parent_span_id", context.getParentSpanId());
            }

            // Also set on response for debugging
            ((HttpServletResponse) response).setHeader("X-Trace-Id", context.getTraceId());

            // Continue processing
            chain.doFilter(request, response);

        } finally {
            // Always clean up MDC to prevent context leakage
            MDC.clear();
        }
    }
}

// Logback configuration (logback.xml)
/*
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeMdcKeyName>trace_id</includeMdcKeyName>
      <includeMdcKeyName>span_id</includeMdcKeyName>
      <includeMdcKeyName>parent_span_id</includeMdcKeyName>
    </encoder>
  </appender>
</configuration>
*/

// Now ALL logs in the request automatically include trace context:
public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public void processOrder(Order order) {
        // No need to manually add trace_id - MDC handles it
        logger.info("Processing order: {}", order.getId());
        // Output: {"trace_id":"abc123","span_id":"def456",...,"message":"Processing order: ord_123"}
    }
}
```
```python
import structlog
from contextvars import ContextVar

# Context variable for current trace
trace_context: ContextVar[dict] = ContextVar('trace_context', default={})

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that adds trace context to every log."""
    ctx = trace_context.get()
    if ctx:
        event_dict['trace_id'] = ctx.get('trace_id')
        event_dict['span_id'] = ctx.get('span_id')
        if 'parent_span_id' in ctx:
            event_dict['parent_span_id'] = ctx['parent_span_id']
    return event_dict

# Configure structlog with trace context processor
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        add_trace_context,  # <-- Our custom processor
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
)

# Flask middleware
from flask import Flask, request, g

app = Flask(__name__)

@app.before_request
def extract_trace_context():
    traceparent = request.headers.get('traceparent')
    if traceparent:
        parsed = parse_traceparent(traceparent)
        if parsed:
            ctx = {
                'trace_id': parsed['trace_id'],
                'span_id': generate_span_id(),
                'parent_span_id': parsed['parent_id'],
            }
        else:
            ctx = new_trace_context()
    else:
        ctx = new_trace_context()

    # Set in context var (accessible from structlog processor)
    trace_context.set(ctx)
    # Also set in Flask g for easy access in views
    g.trace_context = ctx

# Usage - trace context automatically included
logger = structlog.get_logger()

@app.route('/orders', methods=['POST'])
def create_order():
    logger.info("order_received", order_data=request.json)
    # Output: {"trace_id":"abc123","span_id":"def456",...,"event":"order_received",...}
    process_order(request.json)
    return {'status': 'success'}
```

| Language/Framework | Mechanism | Async-Safe | Usage |
|---|---|---|---|
| Java/Spring | MDC (ThreadLocal) | No (needs propagation helpers) | Filter populates, Logback includes |
| Java/Reactive | Reactor Context | Yes | contextWrite() in reactive chains |
| Python | contextvars | Yes | Native since 3.7, async-aware |
| Node.js | AsyncLocalStorage | Yes | Native since Node 12.17 |
| Go | context.Context | Yes | Explicit passing, standard library |
| .NET | Activity.Current | Yes | DiagnosticSource integration |
OpenTelemetry provides auto-instrumentation for most frameworks, automatically handling trace context extraction, propagation, and injection. Consider using it instead of manual implementation. Commands like `opentelemetry-instrument python app.py` add tracing without code changes.
Simple trace IDs link all logs for a request. But complex systems benefit from hierarchical identifiers that represent the request structure.
Hierarchy Levels:

- Trace — the entire distributed transaction, identified by a single trace ID
- Span — one operation within the trace (a service call, a query), with its own span ID
- Parent link — each span records its parent span ID, forming a tree
This tree structure enables visualization of the request flow and identification of where time was spent.
```
Trace ID: abc123
│
├─► Span: xyz001 (API Gateway - POST /orders)
│   │   Duration: 1250ms
│   │
│   ├─► Span: xyz002 (Auth Service - validate token)
│   │       Duration: 45ms
│   │       Parent: xyz001
│   │
│   ├─► Span: xyz003 (Order Service - create order)
│   │   │   Duration: 1100ms
│   │   │   Parent: xyz001
│   │   │
│   │   ├─► Span: xyz004 (Inventory - check stock)
│   │   │       Duration: 120ms
│   │   │       Parent: xyz003
│   │   │
│   │   ├─► Span: xyz005 (Payment - charge)
│   │   │   │   Duration: 850ms
│   │   │   │   Parent: xyz003
│   │   │   │
│   │   │   └─► Span: xyz006 (Payment Gateway - external call)
│   │   │           Duration: 780ms
│   │   │           Parent: xyz005
│   │   │
│   │   └─► Span: xyz007 (Notification - send email)
│   │           Duration: 95ms
│   │           Parent: xyz003
│   │
│   └─► Span: xyz008 (Response serialization)
│           Duration: 15ms
│           Parent: xyz001
```

Querying all spans for trace abc123 returns this complete tree. A timeline view shows where the 850ms payment call dominated latency. For async flows that start a new trace, preserve the link back: record a caused_by_event_id in event metadata, and an originating_trace_id in the async worker's new trace.
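Rebuilding this tree from flat, queried span records is a small exercise in following parent_span_id links. A sketch under invented data that mirrors the diagram above:

```python
from collections import defaultdict

# Flat span records, as they might come back from a log query
spans = [
    {"span_id": "xyz001", "parent": None,     "name": "API Gateway",     "ms": 1250},
    {"span_id": "xyz002", "parent": "xyz001", "name": "Auth",            "ms": 45},
    {"span_id": "xyz003", "parent": "xyz001", "name": "Order Service",   "ms": 1100},
    {"span_id": "xyz005", "parent": "xyz003", "name": "Payment",         "ms": 850},
    {"span_id": "xyz006", "parent": "xyz005", "name": "Payment Gateway", "ms": 780},
]

# Index children by parent span ID
children = defaultdict(list)
for s in spans:
    children[s["parent"]].append(s)

def render(parent_id=None, depth=0, out=None):
    """Depth-first walk producing an indented tree of span names."""
    out = out if out is not None else []
    for s in children[parent_id]:
        out.append("  " * depth + f"{s['name']} ({s['ms']}ms)")
        render(s["span_id"], depth + 1, out)
    return out

tree = render()
assert tree[0].startswith("API Gateway")

# Leaf spans (no children) show where time was actually spent:
leaves = [s for s in spans if not children[s["span_id"]]]
assert max(leaves, key=lambda s: s["ms"])["name"] == "Payment Gateway"
```

Tracing backends like Jaeger do exactly this walk (plus timing alignment) to draw their waterfall views.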
```json
{
  "timestamp": "2024-01-15T14:32:07.123456Z",
  "level": "INFO",
  "service": "payment-service",
  "message": "Payment processed successfully",
  "correlation": {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "83d9c3e5a8d2f71c"
  },
  "context": {
    "session_id": "sess_m9n8o7p6",
    "workflow_id": "wf_checkout_a1b2c3",
    "tenant_id": "acme-corp",
    "experiment_variant": "checkout-v2"
  },
  "causation": {
    "triggered_by_event": "evt_inventory_reserved_xyz789",
    "triggered_by_trace": "abc123def456"
  },
  "event": {
    "type": "payment_success",
    "order_id": "ord_123456",
    "amount_cents": 9999
  }
}
```

Every correlation ID adds propagation complexity and log storage cost. Start with trace ID and span ID. Add session, workflow, and causation IDs when specific debugging scenarios require them. Over-correlation creates more confusion than it solves.
The purpose of correlation IDs is faster debugging. Here's how to leverage them effectively during incident response.
The Debugging Workflow:
Start from a single identifier: query trace_id:"abc123" and you see the entire request flow.
```
INCIDENT: User reports "payment stuck" at 14:32

STEP 1: Get trace ID from user
========================================
User provides: "X-Trace-Id header said 4bf92f3577b34da6a3ce929d0e0e4736"
(Or: Look up in access logs by user_id + timestamp)

STEP 2: Query all logs for this trace
========================================
Kibana Query: trace_id:"4bf92f3577b34da6a3ce929d0e0e4736"
Sort by: @timestamp ascending

Results (14 log entries):
14:32:07.001 api-gateway     INFO  "Request received"
14:32:07.015 auth-service    INFO  "Token validated"
14:32:07.045 order-service   INFO  "Creating order"
14:32:07.150 inventory-svc   INFO  "Stock reserved"
14:32:07.200 payment-service INFO  "Initiating payment"
14:32:07.210 payment-service DEBUG "Calling payment gateway"
14:32:37.210 payment-service ERROR "Payment gateway timeout"   ← FOUND IT
14:32:37.215 payment-service WARN  "Retry attempt 1"
14:33:07.220 payment-service ERROR "Payment gateway timeout"
14:33:07.225 payment-service WARN  "Retry attempt 2"
14:33:37.230 payment-service ERROR "Payment gateway timeout"
14:33:37.235 payment-service ERROR "Max retries exceeded"      ← Root cause
14:33:37.240 order-service   ERROR "Payment failed"
14:33:37.245 api-gateway     ERROR "Request failed: payment_error"

STEP 3: Investigate payment gateway issue
========================================
Query: service:"payment-service" AND message:"gateway timeout" last 1h
→ Found 847 timeout errors in the last hour
→ Payment gateway is experiencing widespread issues

STEP 4: Verify with payment gateway status
========================================
Dashboard shows: Payment gateway latency p99 went from 200ms to 45,000ms at 14:20
Root cause: Payment gateway infrastructure issue (external dependency)

RESOLUTION:
- Notify users of payment delays
- Enable circuit breaker for payment service
- Monitor for gateway recovery
```

| Practice | Why It Matters | Implementation |
|---|---|---|
| Generate at the edge | Single source of truth | API gateway/load balancer generates if absent |
| Propagate everywhere | Complete visibility | HTTP headers, gRPC metadata, message queues, jobs |
| Include in ALL logs | No log without context | Logging framework auto-injection via MDC/context |
| Return to client | User can report it | X-Trace-Id response header |
| Use W3C format | Interoperability | Standard traceparent header format |
| Link async processing | Doesn't break on queues | Include in message metadata, create child spans |
For user-facing errors, display trace ID on error pages: 'Error ID: abc123'. Users naturally include this when reporting issues. This single practice reduces debugging time dramatically—you skip the 'when did this happen? what were you doing?' questions.
Implementing correlation IDs organization-wide requires coordination. Use this checklist to ensure complete coverage:
Phase 1: Foundation — standardize on the W3C traceparent format and give every service a common implementation template:
```yaml
# Trace Context Implementation Template for New Services

## 1. Dependencies
dependencies:
  - opentelemetry-api              # Or your chosen tracing library
  - opentelemetry-instrumentation-http
  - structured-logging-library     # e.g., structlog, logstash-logback-encoder

## 2. Configuration
config:
  OTEL_SERVICE_NAME: "{{ service-name }}"
  OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
  OTEL_TRACES_SAMPLER_ARG: "0.1"   # 10% sampling

## 3. Middleware Setup
middleware:
  - extract_trace_context_from_headers
  - create_child_span_for_request
  - inject_trace_context_to_logging
  - propagate_trace_context_to_outgoing_calls

## 4. Logging Configuration
logging:
  format: json
  auto_include:
    - trace_id
    - span_id
    - parent_span_id
    - service_name
    - service_version

## 5. Outgoing Call Instrumentation
outgoing_calls:
  http: auto-inject-traceparent-header
  grpc: auto-inject-metadata
  kafka: include-in-message-headers
  background_jobs: pass-context-to-job-metadata

## 6. Response Headers
response_headers:
  - "X-Trace-Id: {{ trace_id }}"
```

Don't try to instrument everything at once. Start with critical user paths (checkout, authentication). Add instrumentation progressively. Each service added improves visibility for all traces passing through it.
Correlation IDs transform distributed system debugging from guesswork to science. A single identifier links every log, span, and event across your entire service mesh.
Key Takeaways:

- Generate a trace ID at the edge and propagate it through every service, queue, and background job.
- Follow the W3C Trace Context standard (traceparent) for interoperability with tracing tools.
- Inject trace and span IDs into every log automatically at the logging-framework level, never per call site.
- Use span hierarchies to reconstruct the request tree and see where time was spent.
- Return the trace ID to clients (X-Trace-Id) so users can report it with issues.
Module Complete:
You've now mastered Logging at Scale—the fourth pillar of observability. You understand structured logging formats and schema design, log level semantics and organizational standards, log aggregation with ELK, OpenSearch, and Loki, retention strategies and cost optimization, and correlation IDs for distributed debugging.
With metrics, traces, and logs working together through correlation IDs, you have complete observability over distributed systems.
You've completed the Logging at Scale module! You can now design production-grade logging systems that are structured, cost-efficient, and enable rapid debugging across distributed services. Combined with metrics and traces, you have the full observability toolkit for operating complex systems at scale.