Observability Design

Level: Advanced

Duration: 60 mins

Topic: Observability Design

Tracing Considerations

Following the Thread Through the Maze

In a monolithic application, understanding a request's journey is straightforward—you can trace execution through a single call stack. But in distributed systems, a single user action may trigger dozens of services, each adding latency, potentially failing, and passing work to others. Without distributed tracing, debugging production issues becomes a nightmare of correlating timestamps across logs from different services.

Distributed tracing solves this by propagating context across service boundaries, creating a connected graph of operations that reconstructs a request's complete journey. When a user reports that their checkout took 15 seconds, a trace can show, for example, that 12 of those seconds were spent waiting for the inventory service to respond to the payment service's callback.

This page explores how to design classes that participate effectively in distributed tracing—creating spans, propagating context, and adding the contextual information that makes traces genuinely useful for debugging.

What You Will Learn

By the end of this page, you will understand the core concepts of distributed tracing (traces, spans, context), master techniques for instrumenting classes for tracing, learn context propagation patterns across synchronous and asynchronous boundaries, and develop the ability to add meaningful attributes that accelerate debugging.

Distributed Tracing Fundamentals

Before diving into implementation, let's establish a solid understanding of distributed tracing concepts and terminology. These concepts form the mental model for designing traceable classes.

Core Tracing Concepts
Concept	Definition	Analogy
Trace	The complete journey of a request across all services	A detective case file covering all evidence
Span	A single unit of work within a trace (one operation)	A single piece of evidence in the case file
Trace ID	Unique identifier connecting all spans in a trace	The case file number
Span ID	Unique identifier for a specific span	Individual evidence item number
Parent Span ID	Links a span to its caller, forming a tree	Chain of custody between evidence items
Span Context	Trace ID + Span ID + flags, propagated across boundaries	The reference sheet passed between detectives
Baggage	User-defined key-value pairs propagated with context	Notes attached to the case file

The Trace Tree Structure:

A trace forms a directed acyclic graph (typically a tree) where each span has at most one parent. The root span represents the initial request entry point.

Trace: checkout-request-12345
│
├── [Span: api-gateway/handleCheckout] 150ms
│   │
│   ├── [Span: user-service/getUser] 20ms
│   │
│   ├── [Span: cart-service/getCart] 35ms
│   │
│   ├── [Span: inventory-service/reserve] 45ms
│   │   │
│   │   └── [Span: db/query] 30ms
│   │
│   └── [Span: payment-service/charge] 50ms
│       │
│       ├── [Span: stripe-client/createCharge] 40ms
│       │
│       └── [Span: db/saveTransaction] 8ms

This structure immediately reveals that the checkout request spent most time in payment and inventory services, with the Stripe API call being the single largest contributor.
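One way to check a reading like this is to compute each span's self time, meaning its duration minus the time spent in its direct children. The sketch below does that for the tree above using only the JDK. `SpanRecord` and `TraceTreeAnalyzer` are hypothetical names introduced for illustration, and the calculation assumes children run one after another so that their durations can simply be summed.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified span: real spans carry timestamps and attributes. */
final class SpanRecord {
    final String id;
    final String parentId; // null for the root span
    final String name;
    final long durationMs;

    SpanRecord(String id, String parentId, String name, long durationMs) {
        this.id = id;
        this.parentId = parentId;
        this.name = name;
        this.durationMs = durationMs;
    }
}

final class TraceTreeAnalyzer {

    /** The checkout trace from the diagram above, flattened into parent-linked spans. */
    static List<SpanRecord> checkoutTrace() {
        return List.of(
            new SpanRecord("1", null, "api-gateway/handleCheckout", 150),
            new SpanRecord("2", "1", "user-service/getUser", 20),
            new SpanRecord("3", "1", "cart-service/getCart", 35),
            new SpanRecord("4", "1", "inventory-service/reserve", 45),
            new SpanRecord("5", "4", "db/query", 30),
            new SpanRecord("6", "1", "payment-service/charge", 50),
            new SpanRecord("7", "6", "stripe-client/createCharge", 40),
            new SpanRecord("8", "6", "db/saveTransaction", 8));
    }

    /** Self time per span name: duration minus the summed durations of direct children. */
    static Map<String, Long> selfTimes(List<SpanRecord> spans) {
        Map<String, Long> childTotals = new HashMap<>();
        for (SpanRecord span : spans) {
            if (span.parentId != null) {
                childTotals.merge(span.parentId, span.durationMs, Long::sum);
            }
        }
        Map<String, Long> selfTimes = new LinkedHashMap<>();
        for (SpanRecord span : spans) {
            selfTimes.put(span.name, span.durationMs - childTotals.getOrDefault(span.id, 0L));
        }
        return selfTimes;
    }

    /** Name of the span with the largest self time. */
    static String largestContributor(List<SpanRecord> spans) {
        String best = null;
        long bestSelfTime = Long.MIN_VALUE;
        for (Map.Entry<String, Long> entry : selfTimes(spans).entrySet()) {
            if (entry.getValue() > bestSelfTime) {
                best = entry.getKey();
                bestSelfTime = entry.getValue();
            }
        }
        return best;
    }
}
```

For the example trace, `largestContributor(checkoutTrace())` returns `stripe-client/createCharge` with 40 ms of self time, while the gateway span is left with 0 ms of its own once its four children are subtracted.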

OpenTelemetry: The Standard

OpenTelemetry (OTel) has emerged as the industry standard for distributed tracing. It was formed by merging the earlier OpenTracing and OpenCensus projects, which it supersedes. Design your instrumentation around OpenTelemetry concepts and APIs for maximum portability across tracing backends (Jaeger, Zipkin, Datadog, etc.).

Designing Span-Aware Classes

A span-aware class creates meaningful spans for its operations and enriches them with contextual attributes. The goal is to make the class's behavior visible in traces without overwhelming them with noise.

TracedOrderService.java
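import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.ArrayList;
import java.util.List;

public class TracedOrderService {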
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer(
        "order-service",
        "1.0.0"
    );
    
    private final InventoryClient inventory;
    private final PaymentClient payments;
    private final OrderRepository repository;
    
    public Order processOrder(OrderRequest request) {
        // Create a span for the overall operation
        Span span = tracer.spanBuilder("OrderService.processOrder")
            .setSpanKind(SpanKind.INTERNAL)
            // Add attributes that help with debugging
            .setAttribute("order.customer_id", request.getCustomerId())
            .setAttribute("order.item_count", request.getItems().size())
            .setAttribute("order.total_amount", request.getTotal().doubleValue())
            .setAttribute("order.currency", request.getCurrency())
            .startSpan();
        
        // Make this span the current context
        try (Scope scope = span.makeCurrent()) {
            // Child operations will automatically become child spans
            
            // Step 1: Reserve inventory
            List<ReservationResult> reservations = reserveInventory(request);
            span.setAttribute("order.reservations_count", reservations.size());
            
            // Step 2: Process payment
            PaymentResult payment = processPayment(request);
            span.setAttribute("order.payment_id", payment.getTransactionId());
            
            // Step 3: Create order
            Order order = createOrder(request, reservations, payment);
            span.setAttribute("order.id", order.getId());
            span.setStatus(StatusCode.OK);
            
            return order;
            
        } catch (InventoryException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Inventory reservation failed");
            span.setAttribute("error.type", "inventory");
            throw e;
            
        } catch (PaymentException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Payment processing failed");
            span.setAttribute("error.type", "payment");
            // Attempt to release inventory reservations
            releaseReservations(request);
            throw e;
            
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Unexpected error");
            span.setAttribute("error.type", "unexpected");
            throw e;
            
        } finally {
            // Always end the span
            span.end();
        }
    }
    
    private List<ReservationResult> reserveInventory(OrderRequest request) {
        // This creates a child span (because parent span is current)
        Span span = tracer.spanBuilder("OrderService.reserveInventory")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            List<ReservationResult> results = new ArrayList<>();
            
            for (OrderItem item : request.getItems()) {
                // Each item reservation could be its own span for fine-grained tracing
                ReservationResult result = inventory.reserve(
                    item.getProductId(), 
                    item.getQuantity()
                );
                results.add(result);
            }
            
            span.setAttribute("reservations.total", results.size());
            span.setStatus(StatusCode.OK);
            return results;
            
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
    
    // Similar pattern for other methods...
}

Span Design Best Practices

•Name Spans Descriptively — Use ClassName.methodName or operation-name format; avoid generic names like 'process'
•Set Appropriate SpanKind — SERVER (incoming requests), CLIENT (outgoing calls), INTERNAL (local operations), PRODUCER/CONSUMER (async messaging)
•Add Relevant Attributes — Include IDs, counts, and key parameters that aid debugging
•Record Exceptions — Use span.recordException(e) to capture stack traces
•Set Status — Mark success/failure with setStatus(StatusCode.OK/ERROR)
•Always End Spans — Use try-finally to ensure spans end even on exceptions
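The record, set-status, and end sequence repeats in every traced method, so one option is to centralize it in a small helper. The sketch below shows the shape of such a helper using only the JDK. `SketchSpan` and `SketchTracer` are hypothetical stand-ins invented for this example, not OpenTelemetry types; with the real API the same structure would wrap `spanBuilder(...).startSpan()`, `makeCurrent()`, and `end()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a span; it only tracks what this sketch needs. */
final class SketchSpan {
    final String name;
    String status = "UNSET";
    Throwable recordedException;
    boolean ended;

    SketchSpan(String name) {
        this.name = name;
    }
}

/** Hypothetical tracer that keeps finished spans in memory (not thread-safe). */
final class SketchTracer {
    final List<SketchSpan> finished = new ArrayList<>();

    /** Runs the body inside a span, recording failures and always ending the span. */
    <T> T inSpan(String name, Supplier<T> body) {
        SketchSpan span = new SketchSpan(name);
        try {
            T result = body.get();
            span.status = "OK";
            return result;
        } catch (RuntimeException e) {
            // Record the failure, mark the span, then let the caller see the exception
            span.recordedException = e;
            span.status = "ERROR";
            throw e;
        } finally {
            // Runs on both paths, so a span is never left open
            span.ended = true;
            finished.add(span);
        }
    }
}
```

A call such as `tracer.inSpan("OrderService.loadOrder", () -> repository.load(id))` then keeps the business method free of tracing boilerplate; `repository.load` is a placeholder for whatever work the span should cover.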

Context Propagation Patterns

The magic of distributed tracing lies in context propagation—passing the trace context (trace ID, span ID, flags) across service boundaries so that spans from different services can be connected into a single trace. Different communication patterns require different propagation approaches.

For HTTP calls, trace context is propagated via headers. The W3C Trace Context standard defines traceparent and tracestate headers.

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                  │
           version   trace-id                  span-id           flags
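To make the header layout concrete, here is a minimal parser for the four fields shown above. It is a sketch that handles only the version 00 layout; `TraceParent` is a hypothetical class written for this example, and in practice an OpenTelemetry propagator does this parsing for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class for a parsed traceparent header (version 00 layout only). */
final class TraceParent {
    // version "-" trace-id "-" span-id "-" flags, all lowercase hex
    private static final Pattern FORMAT = Pattern.compile(
        "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String version;
    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParent(String version, String traceId, String spanId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or null when the value is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String spanId = m.group(3);
        // The W3C format treats version ff and all-zero IDs as invalid
        if (version.equals("ff") || isAllZeros(traceId) || isAllZeros(spanId)) {
            return null;
        }
        int flags = Integer.parseInt(m.group(4), 16);
        // Bit 0 of the flags byte is the "sampled" flag
        return new TraceParent(version, traceId, spanId, (flags & 0x01) != 0);
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

Parsing the example header above yields trace ID `0af7651916cd43dd8448eb211c80319c`, span ID `b7ad6b7169203331`, and `sampled == true`, because the flags byte `01` has its lowest bit set.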

HttpContextPropagation.java
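import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Outgoing HTTP call - inject context into headers
public class TracedHttpClient {
    private final HttpClient client;
    private final TextMapPropagator propagator;
    private final Tracer tracer;

    public TracedHttpClient(HttpClient client, TextMapPropagator propagator, Tracer tracer) {
        this.client = client;
        this.propagator = propagator;
        this.tracer = tracer;
    }

    public <T> T execute(String url, Class<T> responseType)
            throws IOException, InterruptedException {
        // This client only issues GET requests, so the span name is fixed
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                .uri(URI.create(url));

            // Inject trace context into HTTP headers
            propagator.inject(
                Context.current(),
                requestBuilder,
                (builder, key, value) -> builder.header(key, value)
            );

            HttpResponse<String> response = client.send(
                requestBuilder.build(),
                HttpResponse.BodyHandlers.ofString()
            );

            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.statusCode());

            if (response.statusCode() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.statusCode());
            }

            // deserialize(...) is an application-specific helper, not shown
            return deserialize(response.body(), responseType);

        } catch (IOException | InterruptedException e) {
            // Record transport failures so the span does not look successful
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "HTTP request failed");
            throw e;
        } finally {
            span.end();
        }
    }
}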
 
// Incoming HTTP request - extract context from headers
public class TracingFilter implements Filter {
    private final TextMapPropagator propagator;
    private final Tracer tracer;
    
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        
        // Extract context from incoming headers
        Context extractedContext = propagator.extract(
            Context.current(),
            request,
            new TextMapGetter<HttpServletRequest>() {
                @Override
                public Iterable<String> keys(HttpServletRequest carrier) {
                    return Collections.list(carrier.getHeaderNames());
                }
                
                @Override
                public String get(HttpServletRequest carrier, String key) {
                    return carrier.getHeader(key);
                }
            }
        );
        
        // Create server span as child of extracted context
        Span span = tracer.spanBuilder(request.getMethod() + " " + request.getRequestURI())
            .setParent(extractedContext)
            .setSpanKind(SpanKind.SERVER)
            .setAttribute(SemanticAttributes.HTTP_METHOD, request.getMethod())
            .setAttribute(SemanticAttributes.HTTP_URL, request.getRequestURL().toString())
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            chain.doFilter(request, res);
            span.setStatus(StatusCode.OK);
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
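Context also has to cross thread boundaries inside a single process. The current context is typically held in a thread-local, so a task handed to a thread pool does not see the caller's context unless it is captured and restored explicitly. The JDK-only sketch below demonstrates the principle; `TraceContextHolder` and `ThreadBoundaryDemo` are hypothetical classes written for this example. OpenTelemetry's Java API offers the same idea through `Context.current().wrap(...)` and `Context.taskWrapping(...)`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical thread-local holder standing in for a tracing library's current context. */
final class TraceContextHolder {
    private static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    static String current() {
        return CURRENT_TRACE_ID.get();
    }

    static void set(String traceId) {
        CURRENT_TRACE_ID.set(traceId);
    }

    static void clear() {
        CURRENT_TRACE_ID.remove();
    }

    /** Captures the caller's trace ID now and restores it around the task on the worker thread. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = current(); // read on the submitting thread
        return () -> {
            String previous = current(); // whatever the worker thread held before
            set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker thread back the way it was
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }
}

final class ThreadBoundaryDemo {
    /** Returns {trace ID seen by an unwrapped task, trace ID seen by a wrapped task}. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TraceContextHolder.set("0af7651916cd43dd8448eb211c80319c");
        try {
            Callable<String> readTraceId = TraceContextHolder::current;
            String unwrapped = pool.submit(readTraceId).get();
            String wrapped = pool.submit(TraceContextHolder.wrap(readTraceId)).get();
            return new String[] { unwrapped, wrapped };
        } catch (Exception e) {
            throw new IllegalStateException("worker task failed", e);
        } finally {
            TraceContextHolder.clear();
            pool.shutdown();
        }
    }
}
```

`ThreadBoundaryDemo.run()` returns `null` for the unwrapped task and the caller's trace ID for the wrapped one, which is the difference between a span that starts a disconnected trace and one that joins the existing trace.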

Semantic Attributes: Making Traces Useful

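Spans without context are nearly useless. The power of tracing comes from attributes: key-value pairs that provide meaning and enable filtering. OpenTelemetry defines Semantic Conventions for common attributes, ensuring consistency across languages and systems. Attribute names have changed between convention versions (for example, the stable HTTP conventions rename http.method to http.request.method and http.status_code to http.response.status_code), so check which version your instrumentation libraries target.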

Common Semantic Attributes
Category	Attribute	Example
HTTP	`http.method`	GET, POST, PUT
HTTP	`http.status_code`	200, 404, 500
HTTP	`http.url`	https://api.example.com/users
Database	`db.system`	postgresql, mysql, mongodb
Database	`db.statement`	SELECT * FROM users WHERE id = ?
Database	`db.operation`	SELECT, INSERT, UPDATE
Messaging	`messaging.system`	kafka, rabbitmq, sqs
Messaging	`messaging.destination`	orders-topic
RPC	`rpc.system`	grpc, aws-api
RPC	`rpc.service`	UserService
RPC	`rpc.method`	GetUser
Exception	`exception.type`	java.lang.NullPointerException
Exception	`exception.message`	User ID cannot be null

SemanticAttributeUsage.java
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
 
// Use semantic conventions for standard attributes
public class DatabaseRepository {
    
    public User findById(String userId) {
        Span span = tracer.spanBuilder("SELECT users")
            // Database calls leave the process, so mark the span as CLIENT
            .setSpanKind(SpanKind.CLIENT)
            // Standard database semantic attributes
            .setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
            .setAttribute(SemanticAttributes.DB_NAME, "myapp")
            .setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
            .setAttribute(SemanticAttributes.DB_SQL_TABLE, "users")
            // Use a template to avoid high cardinality
            .setAttribute(SemanticAttributes.DB_STATEMENT, 
                "SELECT * FROM users WHERE id = ?")
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            // Custom domain-specific attributes, namespaced so they do not
            // collide with the db.* semantic conventions
            span.setAttribute("myapp.user.id", userId);
            
            User user = executeQuery(userId);
            
            // Add result attributes for debugging
            span.setAttribute("myapp.user.found", user != null);
            if (user != null) {
                span.setAttribute("myapp.user.status", user.getStatus());
            }
            
            return user;
        } finally {
            span.end();
        }
    }
}
 
// HTTP client with semantic attributes
public class TracedApiClient {
    
    public <T> T get(String url, Class<T> responseType) {
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            // HTTP semantic attributes
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            .setAttribute(SemanticAttributes.HTTP_SCHEME, "https")
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            Response response = httpClient.get(url);
            
            // Always record response attributes
            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.code());
            span.setAttribute(SemanticAttributes.HTTP_RESPONSE_CONTENT_LENGTH, 
                response.body().length());
            
            if (response.code() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.code());
            }
            
            return deserialize(response.body(), responseType);
        } finally {
            span.end();
        }
    }
}

Attribute Best Practices

Prefer semantic convention attributes over custom ones for standard operations. For domain-specific data, use a namespace: myapp.order.id, myapp.user.tier. Avoid sensitive data (passwords, tokens) in attributes—they're often stored in plaintext. Use templates for SQL to avoid cardinality explosion.
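As an illustration of the SQL template advice, the sketch below replaces literal values in a statement with placeholders before the text is attached to a span. It is a deliberately naive, regex-based approach that assumes single-quoted string literals and plain numeric literals; `SqlTemplate` is a hypothetical helper, and a real system would more likely record the parameterized statement it already passes to the driver.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that strips literal values from SQL before it is recorded. */
final class SqlTemplate {
    // Single-quoted string literal, allowing '' as an escaped quote
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
    // Standalone integer or decimal literal (digits inside identifiers are left alone)
    private static final Pattern NUMBER_LITERAL = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    static String of(String sql) {
        // Strings first, so digits inside quoted values are not treated as numbers
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }
}
```

With this helper, `SqlTemplate.of("SELECT * FROM users WHERE id = 42 AND name = 'O''Brien'")` produces `SELECT * FROM users WHERE id = ? AND name = ?`, so every lookup shares one statement value and no user data ends up in the trace backend.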

Span Linking and Complex Relationships

Parent-child relationships aren't always sufficient. Some scenarios require links between spans—causal relationships that aren't hierarchical. Links connect spans from different traces or establish non-parent relationships.

When to Use Span Links

•Batch Processing — A single processing span links to multiple triggering requests
•Fan-Out/Fan-In — One request triggers multiple parallel operations that later converge
•Retry Operations — A retry links to the original failed attempt
•Cross-Trace References — A span references related work in a different trace
•Scheduled Jobs — A scheduled job links to the configuration change that scheduled it

SpanLinking.java
// Example: Batch processor linking to all source messages
public class BatchProcessor {
    
    public void processBatch(List<Message> messages) {
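        // Collect valid span contexts from all messages
        // (extractContext is an application-specific helper returning a Context)
        List<SpanContext> sourceContexts = messages.stream()
            .map(m -> extractContext(m))
            .filter(Objects::nonNull)
            .map(Span::fromContext)
            .map(Span::getSpanContext)
            .filter(SpanContext::isValid)
            .collect(Collectors.toList());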
        
        // Create batch span with links to all source spans
        SpanBuilder builder = tracer.spanBuilder("processBatch")
            .setSpanKind(SpanKind.CONSUMER)
            .setAttribute("batch.size", messages.size());
        
        // Add links to all source message spans
        for (SpanContext source : sourceContexts) {
            builder.addLink(source, Attributes.of(
                AttributeKey.stringKey("link.type"), "source_message"
            ));
        }
        
        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            for (Message message : messages) {
                processMessage(message);
            }
        } finally {
            span.end();
        }
    }
}
 
// Example: Retry with link to original attempt
public class RetryableClient {
    
    public Response callWithRetry(Request request, int maxRetries) {
        SpanContext lastAttemptContext = null;
        
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            SpanBuilder builder = tracer.spanBuilder("http.request")
                .setAttribute("http.attempt", attempt + 1)
                .setAttribute("http.max_retries", maxRetries);
            
            // Link each retry to the previous attempt
            if (lastAttemptContext != null) {
                builder.addLink(lastAttemptContext, Attributes.of(
                    AttributeKey.stringKey("link.type"), "retry",
                    AttributeKey.longKey("previous_attempt"), (long) attempt
                ));
            }
            
            Span span = builder.startSpan();
            lastAttemptContext = span.getSpanContext();
            
            try (Scope scope = span.makeCurrent()) {
                Response response = httpClient.execute(request);
                span.setStatus(StatusCode.OK);
                return response;
            } catch (RetryableException e) {
                span.recordException(e);
                span.setStatus(StatusCode.ERROR, "Attempt " + (attempt + 1) + " failed");
                if (attempt == maxRetries) {
                    throw e;
                }
            } finally {
                span.end();
            }
        }
        throw new IllegalStateException("Unreachable");
    }
}
 
// Example: Request that triggers async work in a different trace
public class AsyncWorkTrigger {
    
    public void triggerAsyncWork(WorkRequest request) {
        Span triggerSpan = tracer.spanBuilder("triggerAsyncWork")
            .startSpan();
        
        try (Scope scope = triggerSpan.makeCurrent()) {
            // Store the trigger span context for the async worker to link back
            request.setTriggerTraceId(triggerSpan.getSpanContext().getTraceId());
            request.setTriggerSpanId(triggerSpan.getSpanContext().getSpanId());
            
            workQueue.enqueue(request);
            triggerSpan.setStatus(StatusCode.OK);
        } finally {
            triggerSpan.end();
        }
    }
}
 
public class AsyncWorker {
    
    public void processWork(WorkRequest request) {
        // Start a new trace for the async work rather than inheriting any current context
        SpanBuilder builder = tracer.spanBuilder("processAsyncWork")
            .setNoParent();
        
        // Link back to the triggering span
        if (request.getTriggerTraceId() != null) {
            SpanContext triggerContext = SpanContext.createFromRemoteParent(
                request.getTriggerTraceId(),
                request.getTriggerSpanId(),
                TraceFlags.getSampled(),
                TraceState.getDefault()
            );
            
            builder.addLink(triggerContext, Attributes.of(
                AttributeKey.stringKey("link.type"), "triggered_by"
            ));
        }
        
        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Process...
        } finally {
            span.end();
        }
    }
}

Sampling Strategies for High-Volume Services

At scale, tracing every request becomes prohibitively expensive. Sampling strategies determine which traces are recorded, balancing visibility with cost.

Key Sampling Approaches:

Sampling Strategies Comparison
Strategy	Description	Pros	Cons
Always On	Trace every request	Complete visibility	Expensive at scale
Probabilistic	Sample X% of traces	Predictable cost	May miss rare events
Rate Limited	Sample up to N traces/sec	Bounded cost	Loses visibility under load
Tail-Based	Decide after trace completes	Keeps interesting traces	Complex, high memory
Error-Based	Always sample errors	Captures failures	Still misses successful edge cases
Head-Based Priority	Higher priority = higher sample rate	Prioritizes important work	Requires classification
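To show why the rate-limited row has bounded cost but loses visibility under load, here is a minimal fixed-window limiter. It is a sketch only: `RateLimitedSampler` is a hypothetical class, not an OpenTelemetry sampler, and the injected clock exists so the behavior can be exercised without waiting for real time to pass.

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window limiter: at most maxPerSecond sampled traces per window. */
final class RateLimitedSampler {
    private final long maxPerSecond;
    private final LongSupplier clockMillis;
    private long windowStartMillis;
    private long sampledInWindow;

    RateLimitedSampler(long maxPerSecond, LongSupplier clockMillis) {
        this.maxPerSecond = maxPerSecond;
        this.clockMillis = clockMillis;
        this.windowStartMillis = clockMillis.getAsLong();
    }

    /** Returns true while the current one-second window still has budget left. */
    synchronized boolean shouldSample() {
        long now = clockMillis.getAsLong();
        if (now - windowStartMillis >= 1000) {
            // A new window begins: reset the budget
            windowStartMillis = now;
            sampledInWindow = 0;
        }
        if (sampledInWindow < maxPerSecond) {
            sampledInWindow++;
            return true;
        }
        // Budget exhausted: everything else in this window is dropped
        return false;
    }
}
```

Constructed as `new RateLimitedSampler(100, System::currentTimeMillis)`, the limiter keeps the first 100 traces of each second. During a traffic spike the kept fraction shrinks as volume grows, which is the visibility loss the table refers to.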

SamplingConfiguration.java
// Composite head sampler combining parent-based, priority, and probabilistic strategies.
// Errors and latency are not known when a span starts, so error-based and
// latency-based decisions belong in the tail-based collector below.
public class CompositeSampler implements Sampler {
    private final double baseSampleRate = 0.01;  // 1% base rate
    
    @Override
    public SamplingResult shouldSample(
        Context parentContext,
        String traceId,
        String name,
        SpanKind spanKind,
        Attributes attributes,
        List<LinkData> parentLinks
    ) {
        // Follow the parent's decision so a trace is never partially sampled
        SpanContext parentSpanContext = Span.fromContext(parentContext).getSpanContext();
        if (parentSpanContext.isValid()) {
            return parentSpanContext.isSampled()
                ? SamplingResult.recordAndSample()
                : SamplingResult.drop();
        }
        
        // Priority-based sampling for new traces
        String priority = attributes.get(AttributeKey.stringKey("request.priority"));
        if ("high".equals(priority)) {
            return SamplingResult.recordAndSample();
        }
        
        // Probabilistic sampling for new traces
        if (shouldProbabilisticSample(traceId)) {
            return SamplingResult.recordAndSample();
        }
        
        return SamplingResult.drop();
    }
    
    @Override
    public String getDescription() {
        return "CompositeSampler{baseSampleRate=" + baseSampleRate + "}";
    }
    
    private boolean shouldProbabilisticSample(String traceId) {
        // Derive the decision from the trace ID so that every service applying
        // the same rate reaches the same answer for a given trace.
        // The low 64 bits of the 32-hex-character trace ID act as the random value;
        // String.hashCode() is only 32 bits wide, so comparing it against a
        // Long-sized threshold would sample every trace.
        long randomPart = Long.parseUnsignedLong(traceId.substring(16), 16) & Long.MAX_VALUE;
        long threshold = (long) (baseSampleRate * Long.MAX_VALUE);
        return randomPart < threshold;
    }
}
 
// Tail-based sampling collector (conceptual)
// Works on SpanData, the SDK's read-only view of a finished span
public class TailBasedSamplingCollector {
    private final Map<String, List<SpanData>> traceBuffer = new ConcurrentHashMap<>();
    private final Duration traceTimeout = Duration.ofSeconds(30);
    
    public void receiveSpan(SpanData span) {
        String traceId = span.getSpanContext().getTraceId();
        traceBuffer.computeIfAbsent(traceId, k -> new CopyOnWriteArrayList<>())
            .add(span);
    }
    
    // Called when trace is complete or times out
    public void evaluateTrace(String traceId) {
        List<SpanData> spans = traceBuffer.remove(traceId);
        if (spans == null) return;
        
        // Export interesting traces; everything else is discarded
        if (shouldKeepTrace(spans)) {
            exportToBackend(spans);
        }
    }
    
    private boolean shouldKeepTrace(List<SpanData> spans) {
        // Keep if any span has an error
        if (spans.stream().anyMatch(s -> s.getStatus().getStatusCode() == StatusCode.ERROR)) {
            return true;
        }
        
        // Keep if total duration exceeds threshold
        Duration totalDuration = calculateTotalDuration(spans);
        if (totalDuration.compareTo(Duration.ofSeconds(5)) > 0) {
            return true;
        }
        
        // Keep 1% of remaining traces for baseline
        return Math.random() < 0.01;
    }
}

Consistent Sampling Is Critical

When a trace is sampled, ALL spans in that trace must be sampled. If the parent span is sampled but child spans are dropped, the trace becomes fragmented and useless. Use trace ID for deterministic sampling decisions—the same trace ID should produce the same sampling decision in all services.
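A small, dependency-free sketch of such a deterministic decision follows. It treats the low 64 bits of the trace ID as a random number and compares it with a threshold derived from the sampling ratio. `TraceIdRatioSketch` is a hypothetical name, and the approach assumes trace IDs are 32 hex characters with uniformly random low bits.

```java
/** Hypothetical deterministic sampler: the decision depends only on trace ID and ratio. */
final class TraceIdRatioSketch {

    static boolean shouldSample(String traceId, double ratio) {
        if (traceId == null || traceId.length() != 32) {
            throw new IllegalArgumentException("expected a 32-character hex trace ID");
        }
        // Low 64 bits of the trace ID, with the sign bit cleared so the value is non-negative
        long randomPart = Long.parseUnsignedLong(traceId.substring(16), 16) & Long.MAX_VALUE;
        // A ratio of 0.0 gives threshold 0 (never sample); 1.0 gives Long.MAX_VALUE
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return randomPart < threshold;
    }
}
```

Because the function has no hidden state, two services that call it with the same trace ID and the same ratio always agree. For the trace ID used in the traceparent example, the low bits fall at roughly 3.3% of the range, so the trace is dropped at a 1% ratio and kept at 5%.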

Summary: Tracing Considerations

We've covered the essential concepts and techniques for designing classes that participate effectively in distributed tracing. Let's consolidate the key takeaways:

Key Takeaways

•Traces Are Trees of Spans — Understand the hierarchy: traces contain spans, spans have parents, context connects them
•Context Propagation Is Essential — Inject/extract context across HTTP, messages, and thread boundaries
•Use Semantic Attributes — Follow OpenTelemetry conventions for standard data; namespace custom attributes
•Always End Spans — Use try-finally patterns; spans without ends leak resources and confuse visualization
•Link Non-Hierarchical Relationships — Batches, retries, and fan-out/fan-in need span links
•Sample Strategically — Balance visibility and cost; always sample errors; maintain consistency
•Thread Boundaries Break Context — Explicitly propagate context when using thread pools or async frameworks

What's Next:

With tracing mastered, we'll explore Debug-Friendly Design—techniques for designing classes that are easy to debug in development and production, including meaningful toString() implementations, debug endpoints, and diagnostic interfaces.

Page Complete

You now understand how to design classes that participate effectively in distributed tracing. These skills enable debugging complex production issues by following requests across service boundaries—a superpower in microservice architectures.

            span.setStatus(StatusCode.OK);
            return results;
            
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
    
    // Similar pattern for other methods...
}

Span Design Best Practices

•Name Spans Descriptively — Use ClassName.methodName or operation-name format; avoid generic names like 'process'
•Set Appropriate SpanKind — SERVER (incoming requests), CLIENT (outgoing calls), INTERNAL (local operations), PRODUCER/CONSUMER (async messaging)
•Add Relevant Attributes — Include IDs, counts, and key parameters that aid debugging
•Record Exceptions — Use span.recordException(e) to capture stack traces
•Set Status — Mark success/failure with setStatus(StatusCode.OK/ERROR)
•Always End Spans — Use try-finally to ensure spans end even on exceptions
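
The record-the-exception and end-the-span rules above are easy to forget when every method repeats the same boilerplate, so one option is to centralize them in a single helper. The following is a minimal sketch rather than OpenTelemetry code: `DemoSpan`, `Traced`, and `inSpan` are hypothetical names, and `DemoSpan` is a dependency-free stand-in for a real span so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a tracing span; this is NOT the OpenTelemetry Span type. */
final class DemoSpan {
    final String name;
    String status = "UNSET";
    Throwable recordedException;
    boolean ended;

    DemoSpan(String name) { this.name = name; }

    void recordException(Throwable t) { this.recordedException = t; }
    void setStatus(String status) { this.status = status; }
    void end() { this.ended = true; }
}

/** Hypothetical helper that applies the span lifecycle rules in one place. */
final class Traced {
    /** Finished spans are kept here only so the example can be inspected. */
    static final List<DemoSpan> finished = new ArrayList<>();

    static <T> T inSpan(String name, Supplier<T> work) {
        DemoSpan span = new DemoSpan(name);
        try {
            T result = work.get();
            span.setStatus("OK");
            return result;
        } catch (RuntimeException e) {
            // Record the failure on the span, then let the caller handle it
            span.recordException(e);
            span.setStatus("ERROR");
            throw e;
        } finally {
            // Runs on both paths, so the span always ends
            span.end();
            finished.add(span);
        }
    }
}
```

A call such as `Traced.inSpan("OrderService.createOrder", () -> repository.save(order))` (with `repository` and `order` supplied by the caller) would then get consistent status and exception handling. With the real API the same shape applies, with `Span` and `Scope` in place of the stand-in.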

Context Propagation Patterns

For HTTP calls, trace context is propagated via headers. The W3C Trace Context standard defines traceparent and tracestate headers.

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                  │
           version   trace-id                  span-id           flags
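
To make the header layout concrete, here is a small sketch of a parser for the version `00` format shown above. It is illustrative only: `TraceParent` and `parse` are hypothetical names, not part of any library, and in practice the OpenTelemetry propagator handles this parsing. The sketch rejects malformed headers, the reserved version `ff`, and all-zero IDs, and reads the sampled bit from the flags field.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed fields of a traceparent header (illustrative type, not a library class). */
final class TraceParent {
    // Version 00 layout: 2 hex, 32 hex, 16 hex, 2 hex, all lowercase, dash separated
    private static final Pattern FORMAT = Pattern.compile(
        "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String version;
    final String traceId;
    final String parentId;   // span ID of the caller
    final boolean sampled;   // lowest bit of the flags field

    private TraceParent(String version, String traceId, String parentId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or empty if the value is missing or invalid. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String parentId = m.group(3);
        // Version ff and all-zero IDs are invalid
        if (version.equals("ff") || traceId.matches("0+") || parentId.matches("0+")) {
            return Optional.empty();
        }
        int flags = Integer.parseInt(m.group(4), 16);
        return Optional.of(new TraceParent(version, traceId, parentId, (flags & 0x01) != 0));
    }
}
```

Returning an empty `Optional` for invalid input mirrors the usual behavior for a bad header: the receiver starts a new trace instead of failing the request.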

HttpContextPropagation.java
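import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import java.util.Collections;
// Use jakarta.servlet.* instead on Jakarta EE 9+ containers
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Outgoing HTTP call - inject context into headers
public class TracedHttpClient {
    private final HttpClient client;
    private final TextMapPropagator propagator;
    private final Tracer tracer;
    
    public TracedHttpClient(HttpClient client, TextMapPropagator propagator, Tracer tracer) {
        this.client = client;
        this.propagator = propagator;
        this.tracer = tracer;
    }
    
    public <T> T execute(String url, Class<T> responseType)
            throws IOException, InterruptedException {
        // This client only issues GET requests, so the method is fixed in the span name
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                .uri(URI.create(url));
            
            // Inject trace context into HTTP headers
            propagator.inject(
                Context.current(), 
                requestBuilder, 
                (builder, key, value) -> builder.header(key, value)
            );
            
            HttpResponse<String> response = client.send(
                requestBuilder.build(), 
                BodyHandlers.ofString()
            );
            
            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.statusCode());
            
            if (response.statusCode() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.statusCode());
            }
            
            return deserialize(response.body(), responseType);
            
        } catch (IOException | InterruptedException e) {
            // Record transport failures so the span does not look successful
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "HTTP request failed");
            throw e;
            
        } finally {
            span.end();
        }
    }
}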
 
// Incoming HTTP request - extract context from headers
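public class TracingFilter implements Filter {
    private final TextMapPropagator propagator;
    private final Tracer tracer;
    
    public TracingFilter(TextMapPropagator propagator, Tracer tracer) {
        this.propagator = propagator;
        this.tracer = tracer;
    }
    
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {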
        HttpServletRequest request = (HttpServletRequest) req;
        
        // Extract context from incoming headers
        Context extractedContext = propagator.extract(
            Context.current(),
            request,
            new TextMapGetter<HttpServletRequest>() {
                @Override
                public Iterable<String> keys(HttpServletRequest carrier) {
                    return Collections.list(carrier.getHeaderNames());
                }
                
                @Override
                public String get(HttpServletRequest carrier, String key) {
                    return carrier.getHeader(key);
                }
            }
        );
        
        // Create server span as child of extracted context
        Span span = tracer.spanBuilder(request.getMethod() + " " + request.getRequestURI())
            .setParent(extractedContext)
            .setSpanKind(SpanKind.SERVER)
            .setAttribute(SemanticAttributes.HTTP_METHOD, request.getMethod())
            .setAttribute(SemanticAttributes.HTTP_URL, request.getRequestURL().toString())
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            chain.doFilter(request, res);
            span.setStatus(StatusCode.OK);
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}

Semantic Attributes: Making Traces Useful

Common Semantic Attributes
Category	Attribute	Example
HTTP	`http.method`	GET, POST, PUT
HTTP	`http.status_code`	200, 404, 500
HTTP	`http.url`	https://api.example.com/users
Database	`db.system`	postgresql, mysql, mongodb
Database	`db.statement`	SELECT * FROM users WHERE id = ?
Database	`db.operation`	SELECT, INSERT, UPDATE
Messaging	`messaging.system`	kafka, rabbitmq, sqs
Messaging	`messaging.destination`	orders-topic
RPC	`rpc.system`	grpc, aws-api
RPC	`rpc.service`	UserService
RPC	`rpc.method`	GetUser
Exception	`exception.type`	java.lang.NullPointerException
Exception	`exception.message`	User ID cannot be null

SemanticAttributeUsage.java
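import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
 
// Use semantic conventions for standard attributes
public class DatabaseRepository {
    private final Tracer tracer;
    
    public DatabaseRepository(Tracer tracer) {
        this.tracer = tracer;
    }
    
    public User findById(String userId) {
        Span span = tracer.spanBuilder("SELECT users")
            // A database query is an outgoing call, so mark the span as CLIENT
            .setSpanKind(SpanKind.CLIENT)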
            // Standard database semantic attributes
            .setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
            .setAttribute(SemanticAttributes.DB_NAME, "myapp")
            .setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
            .setAttribute(SemanticAttributes.DB_SQL_TABLE, "users")
            // Use a template to avoid high cardinality
            .setAttribute(SemanticAttributes.DB_STATEMENT, 
                "SELECT * FROM users WHERE id = ?")
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            // Custom domain-specific attributes
            span.setAttribute("db.query.user_id", userId);
            
            User user = executeQuery(userId);
            
            // Add result attributes for debugging
            span.setAttribute("db.query.found", user != null);
            if (user != null) {
                span.setAttribute("db.query.user_status", user.getStatus());
            }
            
            return user;
        } finally {
            span.end();
        }
    }
}
 
// HTTP client with semantic attributes
public class TracedApiClient {
    
    public <T> T get(String url, Class<T> responseType) {
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            // HTTP semantic attributes
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            .setAttribute(SemanticAttributes.HTTP_SCHEME, "https")
            .startSpan();
        
        try (Scope scope = span.makeCurrent()) {
            Response response = httpClient.get(url);
            
            // Always record response attributes
            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.code());
            span.setAttribute(SemanticAttributes.HTTP_RESPONSE_CONTENT_LENGTH, 
                response.body().length());
            
            if (response.code() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.code());
            }
            
            return deserialize(response.body(), responseType);
        } finally {
            span.end();
        }
    }
}

Attribute Best Practices

Span Linking and Complex Relationships

When to Use Span Links

•Batch Processing — A single processing span links to multiple triggering requests
•Fan-Out/Fan-In — One request triggers multiple parallel operations that later converge
•Retry Operations — A retry links to the original failed attempt
•Cross-Trace References — A span references related work in a different trace
•Scheduled Jobs — A scheduled job links to the configuration change that scheduled it

SpanLinking.java
// Example: Batch processor linking to all source messages
public class BatchProcessor {
    
    public void processBatch(List<Message> messages) {
        // Collect span contexts from all messages
        List<SpanContext> sourceContexts = messages.stream()
            .map(m -> extractContext(m))
            .filter(Objects::nonNull)
            .map(Span::fromContext)
            .map(Span::getSpanContext)
            // Skip messages that carried no usable trace context
            .filter(SpanContext::isValid)
            .collect(Collectors.toList());
        
        // Create batch span with links to all source spans
        SpanBuilder builder = tracer.spanBuilder("processBatch")
            .setSpanKind(SpanKind.CONSUMER)
            .setAttribute("batch.size", messages.size());
        
        // Add links to all source message spans
        for (SpanContext source : sourceContexts) {
            builder.addLink(source, Attributes.of(
                AttributeKey.stringKey("link.type"), "source_message"
            ));
        }
        
        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            for (Message message : messages) {
                processMessage(message);
            }
        } finally {
            span.end();
        }
    }
}
 
// Example: Retry with link to original attempt
public class RetryableClient {
    
    public Response callWithRetry(Request request, int maxRetries) {
        SpanContext lastAttemptContext = null;
        
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            SpanBuilder builder = tracer.spanBuilder("http.request")
                .setAttribute("http.attempt", attempt + 1)
                .setAttribute("http.max_retries", maxRetries);
            
            // Link each retry to the attempt that preceded it
            if (lastAttemptContext != null) {
                builder.addLink(lastAttemptContext, Attributes.of(
                    AttributeKey.stringKey("link.type"), "retry",
                    AttributeKey.longKey("previous_attempt"), (long) attempt
                ));
            }
            
            Span span = builder.startSpan();
            lastAttemptContext = span.getSpanContext();
            
            try (Scope scope = span.makeCurrent()) {
                Response response = httpClient.execute(request);
                span.setStatus(StatusCode.OK);
                return response;
            } catch (RetryableException e) {
                span.recordException(e);
                span.setStatus(StatusCode.ERROR, "Attempt " + (attempt + 1) + " failed");
                if (attempt == maxRetries) {
                    throw e;
                }
            } finally {
                span.end();
            }
        }
        throw new IllegalStateException("Unreachable");
    }
}
 
// Example: Request that triggers async work in a different trace
public class AsyncWorkTrigger {
    
    public void triggerAsyncWork(WorkRequest request) {
        Span triggerSpan = tracer.spanBuilder("triggerAsyncWork")
            .startSpan();
        
        try (Scope scope = triggerSpan.makeCurrent()) {
            // Store the trigger span context for the async worker to link back
            request.setTriggerTraceId(triggerSpan.getSpanContext().getTraceId());
            request.setTriggerSpanId(triggerSpan.getSpanContext().getSpanId());
            
            workQueue.enqueue(request);
            triggerSpan.setStatus(StatusCode.OK);
        } finally {
            triggerSpan.end();
        }
    }
}
 
public class AsyncWorker {
    
    public void processWork(WorkRequest request) {
        // Create new trace for async work; setNoParent() makes the new root explicit
        SpanBuilder builder = tracer.spanBuilder("processAsyncWork")
            .setNoParent();
        
        // Link back to the triggering span
        if (request.getTriggerTraceId() != null) {
            SpanContext triggerContext = SpanContext.createFromRemoteParent(
                request.getTriggerTraceId(),
                request.getTriggerSpanId(),
                TraceFlags.getSampled(),
                TraceState.getDefault()
            );
            
            builder.addLink(triggerContext, Attributes.of(
                AttributeKey.stringKey("link.type"), "triggered_by"
            ));
        }
        
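        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Process...
        } finally {
            // End the worker span even if processing throws
            span.end();
        }
    }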
}

Sampling Strategies for High-Volume Services

At scale, tracing every request becomes prohibitively expensive. Sampling strategies determine which traces are recorded, balancing visibility with cost.

Key Sampling Approaches:

Sampling Strategies Comparison
Strategy	Description	Pros	Cons
Always On	Trace every request	Complete visibility	Expensive at scale
Probabilistic	Sample X% of traces	Predictable cost	May miss rare events
Rate Limited	Sample up to N traces/sec	Bounded cost	Loses visibility under load
Tail-Based	Decide after trace completes	Keeps interesting traces	Complex, high memory
Error-Based	Always sample errors	Captures failures	Still misses successful edge cases
Head-Based Priority	Higher priority = higher sample rate	Prioritizes important work	Requires classification
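
The rate-limited row is commonly implemented as a token bucket: tokens refill at N per second, and each sampled trace spends one. The sketch below is a dependency-free illustration, not an OpenTelemetry class. `RateLimitedSampler` and its injectable `nanoClock` are hypothetical names, and the burst capacity of one second's worth of tokens is an assumption made for the example.

```java
import java.util.function.LongSupplier;

/** Token-bucket limiter that admits at most N sampled traces per second (illustrative). */
final class RateLimitedSampler {
    private static final double NANOS_PER_SECOND = 1_000_000_000.0;

    private final double tracesPerSecond;
    private final double maxTokens;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefillNanos;

    RateLimitedSampler(double tracesPerSecond, LongSupplier nanoClock) {
        if (tracesPerSecond <= 0) {
            throw new IllegalArgumentException("tracesPerSecond must be positive");
        }
        this.tracesPerSecond = tracesPerSecond;
        // Assumption: allow a burst of up to one second's worth of traces
        this.maxTokens = Math.max(1.0, tracesPerSecond);
        this.nanoClock = nanoClock;
        this.tokens = maxTokens;
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if this trace fits in the budget; false means drop it. */
    synchronized boolean shouldSample() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / NANOS_PER_SECOND;
        lastRefillNanos = now;
        // Refill in proportion to elapsed time, capped at the bucket size
        tokens = Math.min(maxTokens, tokens + elapsedSeconds * tracesPerSecond);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production the clock would typically be `System::nanoTime`, for example `new RateLimitedSampler(100.0, System::nanoTime)`. Injecting it keeps the limiter testable, and the drop path shows the trade-off from the table: once the budget is spent, additional traffic is invisible until tokens refill.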

SamplingConfiguration.java
// Composite head-based sampler combining parent-based, priority, and probabilistic rules.
// Errors and latency are unknown when a span starts, so error- and latency-based
// decisions belong in the tail-based collector, not here.
public class CompositeSampler implements Sampler {
    private final double baseSampleRate = 0.01;  // 1% base rate
    
    @Override
    public SamplingResult shouldSample(
        Context parentContext,
        String traceId,
        String name,
        SpanKind spanKind,
        Attributes attributes,
        List<LinkData> parentLinks
    ) {
        // Follow the parent's decision so a trace is kept or dropped as a whole
        SpanContext parentSpanContext = Span.fromContext(parentContext).getSpanContext();
        if (parentSpanContext.isValid()) {
            return parentSpanContext.isSampled()
                ? SamplingResult.recordAndSample()
                : SamplingResult.drop();
        }
        
        // Priority-based sampling
        String priority = attributes.get(AttributeKey.stringKey("request.priority"));
        if ("high".equals(priority)) {
            return SamplingResult.recordAndSample();
        }
        
        // Probabilistic sampling for new traces
        if (shouldProbabilisticSample(traceId)) {
            return SamplingResult.recordAndSample();
        }
        
        return SamplingResult.drop();
    }
    
    @Override
    public String getDescription() {
        return "CompositeSampler{baseSampleRate=" + baseSampleRate + "}";
    }
    
    private boolean shouldProbabilisticSample(String traceId) {
        // Use trace ID for deterministic sampling
        // Same trace ID = same decision across all services
        // The low 64 bits of the 32-hex-character trace ID serve as the random value;
        // masking the sign bit keeps it in the range [0, Long.MAX_VALUE]
        long randomPart = Long.parseUnsignedLong(traceId.substring(16), 16) & Long.MAX_VALUE;
        long threshold = (long) (baseSampleRate * Long.MAX_VALUE);
        return randomPart < threshold;
    }
}
 
// Tail-based sampling collector (conceptual)
// Buffers finished spans as SpanData, since only completed spans expose status and timing
public class TailBasedSamplingCollector {
    private final Map<String, List<SpanData>> traceBuffer = new ConcurrentHashMap<>();
    private final Duration traceTimeout = Duration.ofSeconds(30);
    
    public void receiveSpan(SpanData span) {
        String traceId = span.getSpanContext().getTraceId();
        traceBuffer.computeIfAbsent(traceId, k -> new CopyOnWriteArrayList<>())
            .add(span);
    }
    
    // Called when trace is complete or times out
    public void evaluateTrace(String traceId) {
        List<SpanData> spans = traceBuffer.remove(traceId);
        if (spans == null) return;
        
        // Keep interesting traces
        if (shouldKeepTrace(spans)) {
            exportToBackend(spans);
        } else {
            // Discard the trace
        }
    }
    
    private boolean shouldKeepTrace(List<SpanData> spans) {
        // Keep if any span has an error
        if (spans.stream().anyMatch(s -> s.getStatus().getStatusCode() == StatusCode.ERROR)) {
            return true;
        }
        
        // Keep if total duration exceeds threshold
        Duration totalDuration = calculateTotalDuration(spans);
        if (totalDuration.compareTo(Duration.ofSeconds(5)) > 0) {
            return true;
        }
        
        // Keep 1% of remaining traces for baseline
        return Math.random() < 0.01;
    }
}

Consistent Sampling Is Critical
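
If services decide independently and at random, a trace can end up with some spans recorded and others missing. Deriving the decision from the trace ID avoids this, because every service that sees the same ID with the same ratio reaches the same answer. The following dependency-free sketch illustrates the idea. `TraceIdRatioDecision` is a hypothetical name, and using the low 64 bits of the trace ID as the random value is an assumption of this example.

```java
/** Deterministic sampling decision derived from the trace ID (illustrative). */
final class TraceIdRatioDecision {
    private final double ratio;
    private final long upperBound;

    TraceIdRatioDecision(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be between 0.0 and 1.0");
        }
        this.ratio = ratio;
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** traceIdHex is the 32-character lowercase hex trace ID. */
    boolean shouldSample(String traceIdHex) {
        if (traceIdHex == null || traceIdHex.length() != 32) {
            throw new IllegalArgumentException("trace ID must be 32 hex characters");
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Low 64 bits of the ID; clearing the sign bit maps them into [0, Long.MAX_VALUE]
        long randomPart = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return randomPart < upperBound;
    }
}
```

Because the decision is a pure function of the trace ID and the ratio, two services configured with the same ratio agree without any coordination. Services with different ratios can still disagree, which is one reason to also honor the sampled flag received from the parent.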

Summary: Tracing Considerations

We've covered the essential concepts and techniques for designing classes that participate effectively in distributed tracing. Let's consolidate the key takeaways:

Key Takeaways

•Traces Are Trees of Spans — Understand the hierarchy: traces contain spans, spans have parents, context connects them
•Context Propagation Is Essential — Inject/extract context across HTTP, messages, and thread boundaries
•Use Semantic Attributes — Follow OpenTelemetry conventions for standard data; namespace custom attributes
•Always End Spans — Use try-finally patterns; spans without ends leak resources and confuse visualization
•Link Non-Hierarchical Relationships — Batches, retries, and fan-out/fan-in need span links
•Sample Strategically — Balance visibility and cost; always sample errors; maintain consistency
•Thread Boundaries Break Context — Explicitly propagate context when using thread pools or async frameworks
