A user reports: "My order failed at 2:15 PM."
In a monolithic application, debugging is straightforward: search logs for the user ID around that time, follow the single thread of execution, find the error.
In a distributed system, a single order might touch 15 microservices: API gateway → auth → user → inventory → pricing → tax → payment → fulfillment → notification. Each service logs independently. The user's request generates hundreds of log entries scattered across services, interleaved with logs from thousands of concurrent requests.
Without correlation IDs, finding all logs for a single request is nearly impossible.
Correlation IDs solve this by assigning a unique identifier to each request at the edge and propagating it through every service. Query for that ID, and you instantly retrieve every log from the entire request flow, in order, across all services.
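The mechanism fits in a few lines. The sketch below is a toy illustration, not any specific library's API — `handle_incoming_request`, the `X-Trace-Id` header name, and the dict-based log records are all stand-ins:

```python
import uuid

def handle_incoming_request(headers: dict) -> dict:
    """Reuse an incoming correlation ID, or mint one at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    return {"trace_id": trace_id}

def log(record: dict, ctx: dict) -> dict:
    """Stamp every log entry with the request's trace ID."""
    return {**record, "trace_id": ctx["trace_id"]}

# The edge mints the ID once; every downstream log carries it.
ctx = handle_incoming_request({})
entries = [
    log({"service": "auth", "event": "token_valid"}, ctx),
    log({"service": "inventory", "event": "stock_checked"}, ctx),
]

# Querying by that one ID retrieves the whole request flow:
matching = [e for e in entries if e["trace_id"] == ctx["trace_id"]]
assert len(matching) == 2
```

In a real system the "query" is a search in your log backend rather than a list comprehension, but the contract is the same: one ID, generated once, present everywhere.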
By the end of this page, you'll understand correlation ID semantics and standards (W3C Trace Context), implement ID generation and propagation across services, integrate correlation IDs with logging frameworks, design correlation hierarchies for complex request flows, and troubleshoot distributed transactions using correlation.
When a request traverses multiple services, several correlation challenges emerge:
Challenge 1: Request Fragmentation
A single user action becomes multiple internal requests. "Place order" triggers parallel calls to inventory and payment, then sequential calls to fulfillment and notification. Each service sees its own request; none sees the complete picture.
Challenge 2: Async Processing
Parts of the request are processed asynchronously. Payment confirmation triggers a Kafka event; a worker picks it up minutes later. How do you connect the original request to the async processing?
Challenge 3: Fan-Out and Fan-In
A service might call 10 other services in parallel (fan-out), aggregate responses, then call 3 more (fan-in). The request tree becomes complex, hard to visualize from logs alone.
Challenge 4: Time Ordering
Services run on different machines with clock skew. Log timestamps may not reflect actual execution order. Correlation IDs with sequence or hierarchy solve this.
```
User: "Place Order"
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway                                                     │
│ Log: "Received POST /orders"   ← Which user? Which request?     │
└─────────────────────────────────────────────────────────────────┘
         │
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  Auth   │ │Inventory│ │ Pricing │ │  User   │
│ Service │ │ Service │ │ Service │ │ Service │
│         │ │         │ │         │ │         │
│ Log:    │ │ Log:    │ │ Log:    │ │ Log:    │
│ "Token  │ │ "Check  │ │ "Get    │ │ "Fetch  │
│ valid"  │ │ stock"  │ │ price"  │ │ user"   │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Question: Which of the 10,000 "Check stock" logs from this second
is related to THIS user's order?
Answer: Without a correlation ID... grep and hope.
```
```
User: "Place Order"
         │
         ▼
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway                                                     │
│ Generate: trace_id = "abc123"                                   │
│ Log: {"trace_id":"abc123", "event":"request_received"}          │
└─────────────────────────────────────────────────────────────────┘
         │
         │ trace_id: abc123 (in headers)
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  Auth   │ │Inventory│ │ Pricing │ │  User   │
│         │ │         │ │         │ │         │
│ Log:    │ │ Log:    │ │ Log:    │ │ Log:    │
│trace_id │ │trace_id │ │trace_id │ │trace_id │
│ abc123  │ │ abc123  │ │ abc123  │ │ abc123  │
└─────────┘ └─────────┘ └─────────┘ └─────────┘

Query: trace_id:"abc123"
→ Returns all 47 logs from this request, in order, across all services.
```

These terms are often used interchangeably, but have nuances: Trace ID (OpenTelemetry/W3C) spans the entire distributed transaction. Span ID identifies each operation within a trace. Request ID often refers to a single HTTP request. Correlation ID is an umbrella term for any linking identifier. We'll use "trace ID" following W3C standards.
The W3C Trace Context specification is the industry standard for correlation ID propagation. Adopting it ensures interoperability with distributed tracing tools (Jaeger, Zipkin, OpenTelemetry) and third-party services.
The traceparent Header:
The primary propagation mechanism is the traceparent HTTP header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Format: {version}-{trace-id}-{parent-id}-{trace-flags}
| Field | Length | Example | Meaning |
|---|---|---|---|
| version | 2 hex chars | 00 | Spec version (always 00 currently) |
| trace-id | 32 hex chars | 4bf92f3577b34da6a3ce929d0e0e4736 | Unique ID for entire distributed transaction |
| parent-id | 16 hex chars | 00f067aa0ba902b7 | ID of the immediate parent span |
| trace-flags | 2 hex chars | 01 | Flags: 01 = sampled (recorded) |
```typescript
import { randomBytes } from 'crypto';

interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  traceFlags: string;
  isSampled: boolean;
}

/**
 * Generate a new trace ID (32 hex characters = 128 bits)
 */
function generateTraceId(): string {
  return randomBytes(16).toString('hex');
}

/**
 * Generate a new span/parent ID (16 hex characters = 64 bits)
 */
function generateSpanId(): string {
  return randomBytes(8).toString('hex');
}

/**
 * Parse W3C traceparent header
 */
function parseTraceParent(header: string): TraceContext | null {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = header.toLowerCase().match(pattern);
  if (!match) return null;

  return {
    version: match[1],
    traceId: match[2],
    parentId: match[3],
    traceFlags: match[4],
    isSampled: (parseInt(match[4], 16) & 0x01) === 1,
  };
}

/**
 * Create W3C traceparent header
 */
function createTraceParent(
  traceId: string = generateTraceId(),
  parentId: string = generateSpanId(),
  sampled: boolean = true
): string {
  const flags = sampled ? '01' : '00';
  return `00-${traceId}-${parentId}-${flags}`;
}

/**
 * Create child span context (inherits trace ID, new span ID)
 */
function createChildContext(parentContext: TraceContext): TraceContext {
  return {
    ...parentContext,
    parentId: generateSpanId(), // New span ID for this operation
  };
}

// Example usage
const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const context = parseTraceParent(incomingHeader);

if (context) {
  console.log(`Trace ID: ${context.traceId}`);
  console.log(`Parent Span: ${context.parentId}`);

  // Create child context for outgoing request
  const childContext = createChildContext(context);
  const outgoingHeader = createTraceParent(
    context.traceId,        // Same trace ID
    childContext.parentId,  // New span ID
    context.isSampled
  );
  console.log(`Outgoing header: ${outgoingHeader}`);
}
```

The spec also defines a companion tracestate header that carries vendor-specific key-value pairs in the form vendor1=value1,vendor2=value2. It is used by tools like Datadog and Lightstep for their own correlation data.

Before W3C standardization, each tracing system had its own format: Zipkin B3 headers (X-B3-TraceId), Jaeger headers (uber-trace-id), AWS X-Ray headers (X-Amzn-Trace-Id). Modern OpenTelemetry bridges these via configuration. When integrating with older services, check which format they expect.
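As a rough sketch of such a bridge, a service can accept either format on ingress. This is a simplified illustration — `extract_trace_context` is a hypothetical helper written for this page; only the header names themselves come from the W3C and Zipkin B3 conventions:

```python
import re

def extract_trace_context(headers):
    """Accept either a W3C traceparent or legacy Zipkin B3 headers.

    Returns a dict with trace_id, parent_id, sampled — or None.
    """
    # Header names are case-insensitive; normalize for lookup
    h = {k.lower(): v for k, v in headers.items()}

    # Preferred: W3C traceparent
    tp = h.get("traceparent")
    if tp:
        m = re.fullmatch(
            r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})",
            tp.lower(),
        )
        if m:
            return {
                "trace_id": m.group(2),
                "parent_id": m.group(3),
                "sampled": bool(int(m.group(4), 16) & 0x01),
            }

    # Fallback: Zipkin B3 multi-header format
    b3_trace, b3_span = h.get("x-b3-traceid"), h.get("x-b3-spanid")
    if b3_trace and b3_span:
        return {
            "trace_id": b3_trace.lower().zfill(32),  # B3 trace IDs may be 64-bit
            "parent_id": b3_span.lower(),
            "sampled": h.get("x-b3-sampled", "1") == "1",
        }
    return None

# A request from a legacy Zipkin-instrumented service:
ctx = extract_trace_context({
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
})
assert ctx and ctx["trace_id"].endswith("463ac35c9f6413ad")
```

In practice, prefer OpenTelemetry's configurable propagators over hand-rolled parsing; this only shows why the bridge is straightforward.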
Correlation IDs must be propagated through every boundary the request crosses. Incomplete propagation creates gaps in your ability to trace requests.
Propagation Boundaries:

- HTTP and gRPC calls — traceparent header / gRPC metadata
- Message queues — message headers (Kafka, RabbitMQ, SQS)
- Background jobs — context passed in job metadata
- External and third-party calls — forward the header where supported
```typescript
import express, { Request, Response, NextFunction } from 'express';
import { AsyncLocalStorage } from 'async_hooks';

// Type for trace context
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  isSampled: boolean;
}

// Async local storage for context propagation across async boundaries
const traceStorage = new AsyncLocalStorage<TraceContext>();

// Middleware: Extract or create trace context
function traceContextMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceparentHeader = req.headers['traceparent'] as string;

  let context: TraceContext;

  if (traceparentHeader) {
    // Parse incoming trace context
    const parsed = parseTraceParent(traceparentHeader);
    if (parsed) {
      context = {
        traceId: parsed.traceId,
        spanId: generateSpanId(), // New span for this service
        parentSpanId: parsed.parentId,
        isSampled: parsed.isSampled,
      };
    } else {
      context = createNewTraceContext();
    }
  } else {
    // No incoming context - this is the trace origin
    context = createNewTraceContext();
  }

  // Add trace context to response headers (for debugging)
  res.setHeader('X-Trace-Id', context.traceId);

  // Run request handler in trace context
  traceStorage.run(context, () => {
    next();
  });
}

function createNewTraceContext(): TraceContext {
  return {
    traceId: generateTraceId(),
    spanId: generateSpanId(),
    isSampled: Math.random() < 0.1, // 10% sampling rate
  };
}

// Helper: Get current trace context (from anywhere in async stack)
export function getCurrentTraceContext(): TraceContext | undefined {
  return traceStorage.getStore();
}

// Helper: Create outgoing headers for downstream calls
export function getOutgoingTraceHeaders(): Record<string, string> {
  const context = getCurrentTraceContext();
  if (!context) return {};

  return {
    'traceparent': `00-${context.traceId}-${context.spanId}-${context.isSampled ? '01' : '00'}`,
  };
}

// Usage in route handler
const app = express();
app.use(traceContextMiddleware);

app.post('/orders', async (req, res) => {
  const context = getCurrentTraceContext();

  // Log with trace context
  logger.info({
    trace_id: context?.traceId,
    span_id: context?.spanId,
    event: 'order_received',
  });

  // Call downstream service with propagated context
  const response = await fetch('http://inventory-service/check', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...getOutgoingTraceHeaders(), // Propagate trace context
    },
    body: JSON.stringify(req.body),
  });

  res.json({ success: true });
});
```
```python
from kafka import KafkaProducer, KafkaConsumer
import json
from contextvars import ContextVar

# Context variable for trace context
current_trace_context: ContextVar[dict] = ContextVar('trace_context', default={})

def produce_with_trace_context(topic: str, message: dict):
    """Produce Kafka message with trace context in headers."""
    producer = KafkaProducer(bootstrap_servers='kafka:9092')

    # Get current trace context
    ctx = current_trace_context.get()
    trace_id = ctx.get('trace_id', generate_trace_id())
    span_id = generate_span_id()

    # Encode trace context as Kafka headers
    headers = [
        ('traceparent', f'00-{trace_id}-{span_id}-01'.encode('utf-8')),
    ]

    # Log the produce event
    logger.info({
        'trace_id': trace_id,
        'span_id': span_id,
        'event': 'kafka_produce',
        'topic': topic,
    })

    producer.send(
        topic,
        value=json.dumps(message).encode('utf-8'),
        headers=headers
    )
    producer.flush()

def consume_with_trace_context(topic: str, handler):
    """Consume Kafka messages with trace context extraction."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers='kafka:9092',
        group_id='order-processor'
    )

    for message in consumer:
        # Extract trace context from headers
        trace_context = {}
        for key, value in message.headers or []:
            if key == 'traceparent':
                parsed = parse_traceparent(value.decode('utf-8'))
                if parsed:
                    trace_context = {
                        'trace_id': parsed['trace_id'],
                        'parent_span_id': parsed['parent_id'],
                        'span_id': generate_span_id(),  # New span for consumer
                    }

        # Set context for this message processing
        token = current_trace_context.set(trace_context)
        try:
            # Log consume event
            logger.info({
                'trace_id': trace_context.get('trace_id'),
                'span_id': trace_context.get('span_id'),
                'event': 'kafka_consume',
                'topic': topic,
                'partition': message.partition,
                'offset': message.offset,
            })

            # Process message
            handler(json.loads(message.value))
        finally:
            current_trace_context.reset(token)

# Usage
def handle_order_event(order_data):
    ctx = current_trace_context.get()
    logger.info({
        'trace_id': ctx.get('trace_id'),
        'event': 'processing_order',
        'order_id': order_data['order_id'],
    })
    # Process order...

consume_with_trace_context('orders', handle_order_event)
```

Thread-local storage (MDC, ThreadLocal, contextvars) only works within the same thread. Async/await, goroutines, and thread pools break propagation. Use AsyncLocalStorage (Node.js), contextvars (Python), or explicit context passing in these scenarios. Many frameworks now have async-aware context propagation—use them.
For correlation IDs to be useful, they must appear in every log entry. Manual addition is error-prone; automatic injection is essential.
The Goal:
Every log entry should automatically include:
- trace_id — Links logs across all services in the request
- span_id — Identifies the current operation
- parent_span_id — Links to the calling operation (optional but useful)

This happens at the logger configuration level, not at each log call site.
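With Python's standard logging module, this auto-injection can be sketched with a logging.Filter plus a contextvars variable. A minimal illustration — `TraceContextFilter` and `trace_id_var` are names invented here, not a library API:

```python
import io
import json
import logging
from contextvars import ContextVar

# Middleware sets this once per request; async-safe by construction
trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every record automatically."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only enrich them

logger = logging.getLogger("orders")
stream = io.StringIO()  # stand-in for stdout, to capture output
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter('{"trace_id":"%(trace_id)s","message":"%(message)s"}')
)
handler.addFilter(TraceContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A request-scoped middleware would do this; log call sites stay unchanged.
trace_id_var.set("abc123")
logger.info("order_received")

entry = json.loads(stream.getvalue())
assert entry["trace_id"] == "abc123"
```

The point is the same as with MDC or structlog below: the call site says `logger.info("order_received")` and nothing else; the trace ID arrives via configuration.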
```java
import org.slf4j.MDC;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.*;

/**
 * Servlet filter that extracts trace context and populates MDC.
 * MDC values automatically appear in all logs within the request.
 */
public class TraceContextFilter implements Filter {

    private static final Logger logger = LoggerFactory.getLogger(TraceContextFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest httpRequest = (HttpServletRequest) request;

        try {
            // Extract or generate trace context
            String traceParent = httpRequest.getHeader("traceparent");
            TraceContext context = traceParent != null
                ? parseTraceParent(traceParent)
                : newTraceContext();

            // Populate MDC - these will appear in every log automatically
            MDC.put("trace_id", context.getTraceId());
            MDC.put("span_id", context.getSpanId());
            if (context.getParentSpanId() != null) {
                MDC.put("parent_span_id", context.getParentSpanId());
            }

            // Also set on response for debugging
            ((HttpServletResponse) response).setHeader("X-Trace-Id", context.getTraceId());

            // Continue processing
            chain.doFilter(request, response);

        } finally {
            // Always clean up MDC to prevent context leakage
            MDC.clear();
        }
    }
}

// Logback configuration (logback.xml)
/*
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeMdcKeyName>trace_id</includeMdcKeyName>
      <includeMdcKeyName>span_id</includeMdcKeyName>
      <includeMdcKeyName>parent_span_id</includeMdcKeyName>
    </encoder>
  </appender>
</configuration>
*/

// Now ALL logs in the request automatically include trace context:
public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public void processOrder(Order order) {
        // No need to manually add trace_id - MDC handles it
        logger.info("Processing order: {}", order.getId());
        // Output: {"trace_id":"abc123","span_id":"def456",...,"message":"Processing order: ord_123"}
    }
}
```
```python
import structlog
from contextvars import ContextVar

# Context variable for current trace
trace_context: ContextVar[dict] = ContextVar('trace_context', default={})

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that adds trace context to every log."""
    ctx = trace_context.get()
    if ctx:
        event_dict['trace_id'] = ctx.get('trace_id')
        event_dict['span_id'] = ctx.get('span_id')
        if 'parent_span_id' in ctx:
            event_dict['parent_span_id'] = ctx['parent_span_id']
    return event_dict

# Configure structlog with trace context processor
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        add_trace_context,  # <-- Our custom processor
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
)

# Flask middleware
from flask import Flask, request, g

app = Flask(__name__)

@app.before_request
def extract_trace_context():
    traceparent = request.headers.get('traceparent')
    if traceparent:
        parsed = parse_traceparent(traceparent)
        if parsed:
            ctx = {
                'trace_id': parsed['trace_id'],
                'span_id': generate_span_id(),
                'parent_span_id': parsed['parent_id'],
            }
        else:
            ctx = new_trace_context()
    else:
        ctx = new_trace_context()

    # Set in context var (accessible from structlog processor)
    trace_context.set(ctx)
    # Also set in Flask g for easy access in views
    g.trace_context = ctx

# Usage - trace context automatically included
logger = structlog.get_logger()

@app.route('/orders', methods=['POST'])
def create_order():
    logger.info("order_received", order_data=request.json)
    # Output: {"trace_id":"abc123","span_id":"def456",...,"event":"order_received",...}
    process_order(request.json)
    return {'status': 'success'}
```

| Language/Framework | Mechanism | Async-Safe | Usage |
|---|---|---|---|
| Java/Spring | MDC (ThreadLocal) | No (needs propagation helpers) | Filter populates, Logback includes |
| Java/Reactive | Reactor Context | Yes | contextWrite() in reactive chains |
| Python | contextvars | Yes | Native since 3.7, async-aware |
| Node.js | AsyncLocalStorage | Yes | Native since Node 12.17 |
| Go | context.Context | Yes | Explicit passing, standard library |
| .NET | Activity.Current | Yes | DiagnosticSource integration |
OpenTelemetry provides auto-instrumentation for most frameworks, automatically handling trace context extraction, propagation, and injection. Consider using it instead of manual implementation. Commands like `opentelemetry-instrument python app.py` add tracing without code changes.
Simple trace IDs link all logs for a request. But complex systems benefit from hierarchical identifiers that represent the request structure.
Hierarchy Levels:

- Trace — the entire distributed transaction, identified by a single trace ID
- Span — one operation within the trace (a service call, a query), with its own span ID
- Parent link — each span records its parent span ID, forming a tree
This tree structure enables visualization of the request flow and identification of where time was spent.
```
Trace ID: abc123
│
├─► Span: xyz001 (API Gateway - POST /orders)
│   │   Duration: 1250ms
│   │
│   ├─► Span: xyz002 (Auth Service - validate token)
│   │       Duration: 45ms
│   │       Parent: xyz001
│   │
│   ├─► Span: xyz003 (Order Service - create order)
│   │   │   Duration: 1100ms
│   │   │   Parent: xyz001
│   │   │
│   │   ├─► Span: xyz004 (Inventory - check stock)
│   │   │       Duration: 120ms
│   │   │       Parent: xyz003
│   │   │
│   │   ├─► Span: xyz005 (Payment - charge)
│   │   │   │   Duration: 850ms
│   │   │   │   Parent: xyz003
│   │   │   │
│   │   │   └─► Span: xyz006 (Payment Gateway - external call)
│   │   │           Duration: 780ms
│   │   │           Parent: xyz005
│   │   │
│   │   └─► Span: xyz007 (Notification - send email)
│   │           Duration: 95ms
│   │           Parent: xyz003
│   │
│   └─► Span: xyz008 (Response serialization)
│           Duration: 15ms
│           Parent: xyz001
```

Querying all spans for trace abc123 returns this complete tree. A timeline view shows where the 850ms payment call dominated latency. For async flows that start a new trace, preserve the link back: record a caused_by_event_id in event metadata, and an originating_trace_id in the async worker's new trace.
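Rebuilding this tree from flat, queried span records is a small exercise in following parent_span_id links. A sketch under invented data that mirrors the diagram above:

```python
from collections import defaultdict

# Flat span records, as they might come back from a log query
spans = [
    {"span_id": "xyz001", "parent": None,     "name": "API Gateway",     "ms": 1250},
    {"span_id": "xyz002", "parent": "xyz001", "name": "Auth",            "ms": 45},
    {"span_id": "xyz003", "parent": "xyz001", "name": "Order Service",   "ms": 1100},
    {"span_id": "xyz005", "parent": "xyz003", "name": "Payment",         "ms": 850},
    {"span_id": "xyz006", "parent": "xyz005", "name": "Payment Gateway", "ms": 780},
]

# Index children by parent span ID
children = defaultdict(list)
for s in spans:
    children[s["parent"]].append(s)

def render(parent_id=None, depth=0, out=None):
    """Depth-first walk producing an indented tree of span names."""
    out = out if out is not None else []
    for s in children[parent_id]:
        out.append("  " * depth + f"{s['name']} ({s['ms']}ms)")
        render(s["span_id"], depth + 1, out)
    return out

tree = render()
assert tree[0].startswith("API Gateway")

# Leaf spans (no children) show where time was actually spent:
leaves = [s for s in spans if not children[s["span_id"]]]
assert max(leaves, key=lambda s: s["ms"])["name"] == "Payment Gateway"
```

Tracing backends like Jaeger do exactly this walk (plus timing alignment) to draw their waterfall views.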
```json
{
  "timestamp": "2024-01-15T14:32:07.123456Z",
  "level": "INFO",
  "service": "payment-service",
  "message": "Payment processed successfully",
  "correlation": {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "83d9c3e5a8d2f71c"
  },
  "context": {
    "session_id": "sess_m9n8o7p6",
    "workflow_id": "wf_checkout_a1b2c3",
    "tenant_id": "acme-corp",
    "experiment_variant": "checkout-v2"
  },
  "causation": {
    "triggered_by_event": "evt_inventory_reserved_xyz789",
    "triggered_by_trace": "abc123def456"
  },
  "event": {
    "type": "payment_success",
    "order_id": "ord_123456",
    "amount_cents": 9999
  }
}
```

Every correlation ID adds propagation complexity and log storage cost. Start with trace ID and span ID. Add session, workflow, and causation IDs when specific debugging scenarios require them. Over-correlation creates more confusion than it solves.
The purpose of correlation IDs is faster debugging. Here's how to leverage them effectively during incident response.
The Debugging Workflow:
Start from a single identifier: query trace_id:"abc123" and you see the entire request flow.
```
INCIDENT: User reports "payment stuck" at 14:32

STEP 1: Get trace ID from user
========================================
User provides: "X-Trace-Id header said 4bf92f3577b34da6a3ce929d0e0e4736"
(Or: Look up in access logs by user_id + timestamp)

STEP 2: Query all logs for this trace
========================================
Kibana Query: trace_id:"4bf92f3577b34da6a3ce929d0e0e4736"
Sort by: @timestamp ascending

Results (14 log entries):
14:32:07.001 api-gateway     INFO  "Request received"
14:32:07.015 auth-service    INFO  "Token validated"
14:32:07.045 order-service   INFO  "Creating order"
14:32:07.150 inventory-svc   INFO  "Stock reserved"
14:32:07.200 payment-service INFO  "Initiating payment"
14:32:07.210 payment-service DEBUG "Calling payment gateway"
14:32:37.210 payment-service ERROR "Payment gateway timeout"   ← FOUND IT
14:32:37.215 payment-service WARN  "Retry attempt 1"
14:33:07.220 payment-service ERROR "Payment gateway timeout"
14:33:07.225 payment-service WARN  "Retry attempt 2"
14:33:37.230 payment-service ERROR "Payment gateway timeout"
14:33:37.235 payment-service ERROR "Max retries exceeded"      ← Root cause
14:33:37.240 order-service   ERROR "Payment failed"
14:33:37.245 api-gateway     ERROR "Request failed: payment_error"

STEP 3: Investigate payment gateway issue
========================================
Query: service:"payment-service" AND message:"gateway timeout" last 1h
→ Found 847 timeout errors in the last hour
→ Payment gateway is experiencing widespread issues

STEP 4: Verify with payment gateway status
========================================
Dashboard shows: Payment gateway latency p99 went from 200ms to 45,000ms at 14:20
Root cause: Payment gateway infrastructure issue (external dependency)

RESOLUTION:
- Notify users of payment delays
- Enable circuit breaker for payment service
- Monitor for gateway recovery
```

| Practice | Why It Matters | Implementation |
|---|---|---|
| Generate at the edge | Single source of truth | API gateway/load balancer generates if absent |
| Propagate everywhere | Complete visibility | HTTP headers, gRPC metadata, message queues, jobs |
| Include in ALL logs | No log without context | Logging framework auto-injection via MDC/context |
| Return to client | User can report it | X-Trace-Id response header |
| Use W3C format | Interoperability | Standard traceparent header format |
| Link async processing | Doesn't break on queues | Include in message metadata, create child spans |
For user-facing errors, display trace ID on error pages: 'Error ID: abc123'. Users naturally include this when reporting issues. This single practice reduces debugging time dramatically—you skip the 'when did this happen? what were you doing?' questions.
Implementing correlation IDs organization-wide requires coordination. Use this checklist to ensure complete coverage:
Phase 1: Foundation — standardize on the W3C traceparent format and give every service a common implementation template:
```yaml
# Trace Context Implementation Template for New Services

## 1. Dependencies
dependencies:
  - opentelemetry-api              # Or your chosen tracing library
  - opentelemetry-instrumentation-http
  - structured-logging-library     # e.g., structlog, logstash-logback-encoder

## 2. Configuration
config:
  OTEL_SERVICE_NAME: "{{ service-name }}"
  OTEL_TRACES_SAMPLER: "parentbased_traceidratio"
  OTEL_TRACES_SAMPLER_ARG: "0.1"   # 10% sampling

## 3. Middleware Setup
middleware:
  - extract_trace_context_from_headers
  - create_child_span_for_request
  - inject_trace_context_to_logging
  - propagate_trace_context_to_outgoing_calls

## 4. Logging Configuration
logging:
  format: json
  auto_include:
    - trace_id
    - span_id
    - parent_span_id
    - service_name
    - service_version

## 5. Outgoing Call Instrumentation
outgoing_calls:
  http: auto-inject-traceparent-header
  grpc: auto-inject-metadata
  kafka: include-in-message-headers
  background_jobs: pass-context-to-job-metadata

## 6. Response Headers
response_headers:
  - "X-Trace-Id: {{ trace_id }}"
```

Don't try to instrument everything at once. Start with critical user paths (checkout, authentication). Add instrumentation progressively. Each service added improves visibility for all traces passing through it.
Correlation IDs transform distributed system debugging from guesswork to science. A single identifier links every log, span, and event across your entire service mesh.
Key Takeaways:

- Generate a trace ID at the edge and propagate it through every service, queue, and background job.
- Follow the W3C Trace Context standard (traceparent) for interoperability with tracing tools.
- Inject trace and span IDs into every log automatically at the logging-framework level, never per call site.
- Use span hierarchies to reconstruct the request tree and see where time was spent.
- Return the trace ID to clients (X-Trace-Id) so users can report it with issues.
Module Complete:
You've now mastered Logging at Scale—the fourth pillar of observability. You understand structured logging formats and schema design, log level semantics and organizational standards, log aggregation with ELK, OpenSearch, and Loki, retention strategies and cost optimization, and correlation IDs for distributed debugging.
With metrics, traces, and logs working together through correlation IDs, you have complete observability over distributed systems.
You've completed the Logging at Scale module! You can now design production-grade logging systems that are structured, cost-efficient, and enable rapid debugging across distributed services. Combined with metrics and traces, you have the full observability toolkit for operating complex systems at scale.