In a monolithic application, understanding a request's journey is straightforward—you can trace execution through a single call stack. But in distributed systems, a single user action may trigger dozens of services, each adding latency, potentially failing, and passing work to others. Without distributed tracing, debugging production issues becomes a nightmare of correlating timestamps across logs from different services.
Distributed tracing solves this by propagating context through service boundaries, creating a connected graph of operations that reconstruct a request's complete journey. When a user reports that their checkout took 15 seconds, tracing reveals that 12 of those seconds were spent waiting for the inventory service to respond to the payment service's callback.
This page explores how to design classes that participate effectively in distributed tracing—creating spans, propagating context, and adding the contextual information that makes traces genuinely useful for debugging.
By the end of this page, you will understand the core concepts of distributed tracing (traces, spans, context), master techniques for instrumenting classes for tracing, learn context propagation patterns across synchronous and asynchronous boundaries, and develop the ability to add meaningful attributes that accelerate debugging.
Before diving into implementation, let's establish a solid understanding of distributed tracing concepts and terminology. These concepts form the mental model for designing traceable classes.
| Concept | Definition | Analogy |
|---|---|---|
| Trace | The complete journey of a request across all services | A detective case file covering all evidence |
| Span | A single unit of work within a trace (one operation) | A single piece of evidence in the case file |
| Trace ID | Unique identifier connecting all spans in a trace | The case file number |
| Span ID | Unique identifier for a specific span | Individual evidence item number |
| Parent Span ID | Links a span to its caller, forming a tree | Chain of custody between evidence items |
| Span Context | Trace ID + Span ID + flags, propagated across boundaries | The reference sheet passed between detectives |
| Baggage | User-defined key-value pairs propagated with context | Notes attached to the case file |
The Trace Tree Structure:
A trace forms a directed acyclic graph (typically a tree) where each span has at most one parent. The root span represents the initial request entry point.
```
Trace: checkout-request-12345
│
├── [Span: api-gateway/handleCheckout] 150ms
│   │
│   ├── [Span: user-service/getUser] 20ms
│   │
│   ├── [Span: cart-service/getCart] 35ms
│   │
│   ├── [Span: inventory-service/reserve] 45ms
│   │   │
│   │   └── [Span: db/query] 30ms
│   │
│   └── [Span: payment-service/charge] 50ms
│       │
│       ├── [Span: stripe-client/createCharge] 40ms
│       │
│       └── [Span: db/saveTransaction] 8ms
```
This structure immediately reveals that the checkout request spent most time in payment and inventory services, with the Stripe API call being the single largest contributor.
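To make the role of the parent span ID concrete, here is a minimal sketch that rebuilds the tree from flat span records and computes each span's self time (its duration minus its direct children). The `SpanRecord` and `TraceTree` classes and the short span IDs are hypothetical, not part of any tracing library, and the self-time calculation assumes child spans run one after another rather than in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, stripped-down span record: only what is needed to rebuild the tree. */
final class SpanRecord {
    final String spanId;
    final String parentSpanId; // null marks the root span
    final String name;
    final long durationMs;

    SpanRecord(String spanId, String parentSpanId, String name, long durationMs) {
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
        this.durationMs = durationMs;
    }
}

final class TraceTree {
    private final Map<String, SpanRecord> byId = new LinkedHashMap<>();
    private final Map<String, List<SpanRecord>> childrenOf = new LinkedHashMap<>();

    TraceTree(List<SpanRecord> spans) {
        for (SpanRecord span : spans) {
            byId.put(span.spanId, span);
            if (span.parentSpanId != null) {
                // The parent span ID is the only link needed to recover the hierarchy
                childrenOf.computeIfAbsent(span.parentSpanId, k -> new ArrayList<>()).add(span);
            }
        }
    }

    /** Duration of a span minus its direct children (assumes children run sequentially). */
    long selfTimeMs(String spanId) {
        long childTotal = 0;
        for (SpanRecord child : childrenOf.getOrDefault(spanId, List.of())) {
            childTotal += child.durationMs;
        }
        return byId.get(spanId).durationMs - childTotal;
    }

    /** Name of the span with the largest self time. */
    String slowestBySelfTime() {
        String slowest = null;
        long slowestMs = Long.MIN_VALUE;
        for (SpanRecord span : byId.values()) {
            long selfMs = selfTimeMs(span.spanId);
            if (selfMs > slowestMs) {
                slowestMs = selfMs;
                slowest = span.name;
            }
        }
        return slowest;
    }

    /** The checkout trace from the diagram; the short span IDs are made up. */
    static TraceTree checkoutExample() {
        return new TraceTree(List.of(
            new SpanRecord("gw", null, "api-gateway/handleCheckout", 150),
            new SpanRecord("usr", "gw", "user-service/getUser", 20),
            new SpanRecord("cart", "gw", "cart-service/getCart", 35),
            new SpanRecord("inv", "gw", "inventory-service/reserve", 45),
            new SpanRecord("invdb", "inv", "db/query", 30),
            new SpanRecord("pay", "gw", "payment-service/charge", 50),
            new SpanRecord("stripe", "pay", "stripe-client/createCharge", 40),
            new SpanRecord("paydb", "pay", "db/saveTransaction", 8)
        ));
    }
}
```

On the checkout example, the gateway span has a self time of zero because its four children account for all 150ms, and the Stripe call comes out as the largest self-time contributor, which matches the reading of the diagram.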
OpenTelemetry (OTel) has emerged as the industry standard for distributed tracing. It was formed by merging the earlier OpenTracing and OpenCensus projects and supersedes both. Design your instrumentation around OpenTelemetry concepts and APIs for maximum portability across tracing backends (Jaeger, Zipkin, Datadog, etc.).
A span-aware class creates meaningful spans for its operations and enriches them with contextual attributes. The goal is to make the class's behavior visible in traces without overwhelming them with noise.
```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.ArrayList;
import java.util.List;

public class TracedOrderService {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("order-service", "1.0.0");

    private final InventoryClient inventory;
    private final PaymentClient payments;
    private final OrderRepository repository;

    public TracedOrderService(InventoryClient inventory, PaymentClient payments,
                              OrderRepository repository) {
        this.inventory = inventory;
        this.payments = payments;
        this.repository = repository;
    }

    public Order processOrder(OrderRequest request) {
        // Create a span for the overall operation
        Span span = tracer.spanBuilder("OrderService.processOrder")
            .setSpanKind(SpanKind.INTERNAL)
            // Add attributes that help with debugging
            .setAttribute("order.customer_id", request.getCustomerId())
            .setAttribute("order.item_count", request.getItems().size())
            .setAttribute("order.total_amount", request.getTotal().doubleValue())
            .setAttribute("order.currency", request.getCurrency())
            .startSpan();

        // Make this span the current context
        try (Scope scope = span.makeCurrent()) {
            // Child operations will automatically become child spans

            // Step 1: Reserve inventory
            List<ReservationResult> reservations = reserveInventory(request);
            span.setAttribute("order.reservations_count", reservations.size());

            // Step 2: Process payment
            PaymentResult payment = processPayment(request);
            span.setAttribute("order.payment_id", payment.getTransactionId());

            // Step 3: Create order
            Order order = createOrder(request, reservations, payment);
            span.setAttribute("order.id", order.getId());

            span.setStatus(StatusCode.OK);
            return order;
        } catch (InventoryException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Inventory reservation failed");
            span.setAttribute("error.type", "inventory");
            throw e;
        } catch (PaymentException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Payment processing failed");
            span.setAttribute("error.type", "payment");
            // Attempt to release inventory reservations
            releaseReservations(request);
            throw e;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, "Unexpected error");
            span.setAttribute("error.type", "unexpected");
            throw e;
        } finally {
            // Always end the span
            span.end();
        }
    }

    private List<ReservationResult> reserveInventory(OrderRequest request) {
        // This creates a child span (because the parent span is current)
        Span span = tracer.spanBuilder("OrderService.reserveInventory")
            .setSpanKind(SpanKind.CLIENT)
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            List<ReservationResult> results = new ArrayList<>();
            for (OrderItem item : request.getItems()) {
                // Each item reservation could be its own span for fine-grained tracing
                ReservationResult result = inventory.reserve(
                    item.getProductId(),
                    item.getQuantity()
                );
                results.add(result);
            }
            span.setAttribute("reservations.total", results.size());
            span.setStatus(StatusCode.OK);
            return results;
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    // Similar pattern for other methods...
}
```

- Use ClassName.methodName or operation-name format for span names; avoid generic names like 'process'.
- Use span.recordException(e) to capture stack traces.
- Set the outcome explicitly with setStatus(StatusCode.OK/ERROR).

The magic of distributed tracing lies in context propagation: passing the trace context (trace ID, span ID, flags) across service boundaries so that spans from different services can be connected into a single trace. Different communication patterns require different propagation approaches.
For HTTP calls, trace context is propagated via headers. The W3C Trace Context standard defines traceparent and tracestate headers.
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  trace-id                         span-id          flags
             version
```
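The header layout above can be read with a few lines of plain Java. The following is a minimal sketch, not the OpenTelemetry propagator: the `TraceParent` class is a hypothetical name, it accepts only version 00 with exactly four fields, and it ignores tracestate entirely. In real services the SDK's propagator does this work for you.

```java
import java.util.regex.Pattern;

/** Hypothetical parser for a version-00 traceparent header value. */
final class TraceParent {
    private static final Pattern LOWER_HEX = Pattern.compile("[0-9a-f]+");

    final String traceId;
    final String spanId;
    final int flags;

    private TraceParent(String traceId, String spanId, int flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4 || !"00".equals(parts[0])) {
            throw new IllegalArgumentException("unsupported traceparent: " + header);
        }
        String traceId = requireHex(parts[1], 32, "trace-id");
        String spanId = requireHex(parts[2], 16, "span-id");
        int flags = Integer.parseInt(requireHex(parts[3], 2, "flags"), 16);
        // All-zero identifiers are defined as invalid
        if (isAllZero(traceId) || isAllZero(spanId)) {
            throw new IllegalArgumentException("all-zero id in traceparent: " + header);
        }
        return new TraceParent(traceId, spanId, flags);
    }

    /** The lowest flag bit records whether the caller sampled this trace. */
    boolean isSampled() {
        return (flags & 0x01) != 0;
    }

    private static String requireHex(String value, int length, String field) {
        if (value.length() != length || !LOWER_HEX.matcher(value).matches()) {
            throw new IllegalArgumentException("malformed " + field + ": " + value);
        }
        return value;
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

Parsing the example header yields the trace ID `0af7651916cd43dd8448eb211c80319c`, the span ID `b7ad6b7169203331`, and a sampled flag of true, since the trailing `01` has the lowest bit set.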
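```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
// Servlet imports (javax.servlet or jakarta.servlet) depend on your container version

// Outgoing HTTP call - inject context into headers
public class TracedHttpClient {
    private final HttpClient client;
    private final TextMapPropagator propagator;
    private final Tracer tracer;

    public TracedHttpClient(HttpClient client, TextMapPropagator propagator, Tracer tracer) {
        this.client = client;
        this.propagator = propagator;
        this.tracer = tracer;
    }

    public <T> T execute(String url, Class<T> responseType)
            throws IOException, InterruptedException {
        // The request is always a GET, so the span name says so as well
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                .uri(URI.create(url));

            // Inject trace context into HTTP headers
            propagator.inject(
                Context.current(),
                requestBuilder,
                (builder, key, value) -> builder.header(key, value)
            );

            HttpResponse<String> response = client.send(
                requestBuilder.build(),
                HttpResponse.BodyHandlers.ofString()
            );

            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.statusCode());
            if (response.statusCode() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.statusCode());
            }
            return deserialize(response.body(), responseType);
        } catch (IOException | InterruptedException | RuntimeException e) {
            // Network failures never produce a status code, so record them explicitly
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}

// Incoming HTTP request - extract context from headers
public class TracingFilter implements Filter {
    private final TextMapPropagator propagator;
    private final Tracer tracer;

    public TracingFilter(TextMapPropagator propagator, Tracer tracer) {
        this.propagator = propagator;
        this.tracer = tracer;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;

        // Extract context from incoming headers
        Context extractedContext = propagator.extract(
            Context.current(),
            request,
            new TextMapGetter<HttpServletRequest>() {
                @Override
                public Iterable<String> keys(HttpServletRequest carrier) {
                    return Collections.list(carrier.getHeaderNames());
                }

                @Override
                public String get(HttpServletRequest carrier, String key) {
                    return carrier.getHeader(key);
                }
            }
        );

        // Create server span as child of extracted context
        Span span = tracer.spanBuilder(request.getMethod() + " " + request.getRequestURI())
            .setParent(extractedContext)
            .setSpanKind(SpanKind.SERVER)
            .setAttribute(SemanticAttributes.HTTP_METHOD, request.getMethod())
            .setAttribute(SemanticAttributes.HTTP_URL, request.getRequestURL().toString())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            chain.doFilter(request, res);
            span.setStatus(StatusCode.OK);
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```

Spans without descriptive context are nearly useless. The power of tracing comes from attributes: key-value pairs that provide meaning and enable filtering. OpenTelemetry defines Semantic Conventions for common attributes, ensuring consistency across languages and systems. The attribute names in the table below match the older convention set exposed by the SemanticAttributes class used in this page's examples; later versions of the conventions rename several of them (for example, http.method became http.request.method), so check which version your SDK targets.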
| Category | Attribute | Example |
|---|---|---|
| HTTP | http.method | GET, POST, PUT |
| HTTP | http.status_code | 200, 404, 500 |
| HTTP | http.url | https://api.example.com/users |
| Database | db.system | postgresql, mysql, mongodb |
| Database | db.statement | SELECT * FROM users WHERE id = ? |
| Database | db.operation | SELECT, INSERT, UPDATE |
| Messaging | messaging.system | kafka, rabbitmq, sqs |
| Messaging | messaging.destination | orders-topic |
| RPC | rpc.system | grpc, aws-api |
| RPC | rpc.service | UserService |
| RPC | rpc.method | GetUser |
| Exception | exception.type | java.lang.NullPointerException |
| Exception | exception.message | User ID cannot be null |
```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.semconv.trace.attributes.SemanticAttributes;
import java.net.URI;

// Use semantic conventions for standard attributes
public class DatabaseRepository {
    // Scope name is illustrative
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("database-repository");

    public User findById(String userId) {
        Span span = tracer.spanBuilder("SELECT users")
            // A database call is an outbound request, so mark it as CLIENT
            .setSpanKind(SpanKind.CLIENT)
            // Standard database semantic attributes
            .setAttribute(SemanticAttributes.DB_SYSTEM, "postgresql")
            .setAttribute(SemanticAttributes.DB_NAME, "myapp")
            .setAttribute(SemanticAttributes.DB_OPERATION, "SELECT")
            .setAttribute(SemanticAttributes.DB_SQL_TABLE, "users")
            // Use a template to avoid high cardinality
            .setAttribute(SemanticAttributes.DB_STATEMENT,
                "SELECT * FROM users WHERE id = ?")
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            // Custom domain-specific attributes
            span.setAttribute("db.query.user_id", userId);

            User user = executeQuery(userId);

            // Add result attributes for debugging
            span.setAttribute("db.query.found", user != null);
            if (user != null) {
                span.setAttribute("db.query.user_status", user.getStatus());
            }
            return user;
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}

// HTTP client with semantic attributes
public class TracedApiClient {
    // Scope name is illustrative
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("api-client");

    public <T> T get(String url, Class<T> responseType) {
        Span span = tracer.spanBuilder("HTTP GET")
            .setSpanKind(SpanKind.CLIENT)
            // HTTP semantic attributes
            .setAttribute(SemanticAttributes.HTTP_METHOD, "GET")
            .setAttribute(SemanticAttributes.HTTP_URL, url)
            // Take the scheme from the URL instead of assuming https
            .setAttribute(SemanticAttributes.HTTP_SCHEME, URI.create(url).getScheme())
            .startSpan();

        try (Scope scope = span.makeCurrent()) {
            Response response = httpClient.get(url);

            // Always record response attributes
            span.setAttribute(SemanticAttributes.HTTP_STATUS_CODE, response.code());
            span.setAttribute(SemanticAttributes.HTTP_RESPONSE_CONTENT_LENGTH,
                response.body().length());

            if (response.code() >= 400) {
                span.setStatus(StatusCode.ERROR, "HTTP " + response.code());
            }
            return deserialize(response.body(), responseType);
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```

Prefer semantic convention attributes over custom ones for standard operations. For domain-specific data, use a namespace such as myapp.order.id or myapp.user.tier. Avoid sensitive data (passwords, tokens) in attributes, because tracing backends often store them in plaintext. Use templates for SQL to avoid cardinality explosion.
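One way to apply the namespacing and sensitive-data advice consistently is to pass custom attributes through a small helper before they reach the span. The sketch below is a hypothetical utility, not an OpenTelemetry API: the `AttributeSanitizer` name and the marker list are assumptions, and a substring match on key names is only a coarse first line of defense.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that namespaces custom attributes and drops sensitive ones. */
final class AttributeSanitizer {
    // Key fragments treated as sensitive; extend to match your own naming
    private static final Set<String> SENSITIVE_MARKERS =
        Set.of("password", "token", "secret", "authorization");

    private AttributeSanitizer() {}

    static Map<String, String> sanitize(String namespace, Map<String, String> raw) {
        Map<String, String> safe = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String lowerKey = entry.getKey().toLowerCase(Locale.ROOT);
            boolean sensitive = SENSITIVE_MARKERS.stream().anyMatch(lowerKey::contains);
            if (!sensitive) {
                // Prefix with the application namespace, e.g. myapp.order.id
                safe.put(namespace + "." + entry.getKey(), entry.getValue());
            }
        }
        return safe;
    }
}
```

Calling `AttributeSanitizer.sanitize("myapp", attrs)` on a map containing `order.id`, `user.password`, and `api_token` keeps only `myapp.order.id`; the result can then be copied onto the span with setAttribute.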
Parent-child relationships aren't always sufficient. Some scenarios require links between spans—causal relationships that aren't hierarchical. Links connect spans from different traces or establish non-parent relationships.
```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanBuilder;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.TraceFlags;
import io.opentelemetry.api.trace.TraceState;
import io.opentelemetry.context.Scope;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Each class below is assumed to hold a Tracer field named tracer,
// as in the earlier examples.

// Example: Batch processor linking to all source messages
public class BatchProcessor {
    public void processBatch(List<Message> messages) {
        // Collect span contexts from all messages
        List<SpanContext> sourceContexts = messages.stream()
            .map(m -> extractContext(m))
            .filter(Objects::nonNull)
            .map(Span::fromContext)
            .map(Span::getSpanContext)
            .filter(SpanContext::isValid)
            .collect(Collectors.toList());

        // Create batch span with links to all source spans
        SpanBuilder builder = tracer.spanBuilder("processBatch")
            .setSpanKind(SpanKind.CONSUMER)
            .setAttribute("batch.size", messages.size());

        // Add links to all source message spans
        for (SpanContext source : sourceContexts) {
            builder.addLink(source, Attributes.of(
                AttributeKey.stringKey("link.type"), "source_message"
            ));
        }

        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            for (Message message : messages) {
                processMessage(message);
            }
        } finally {
            span.end();
        }
    }
}

// Example: Retry with link to the previous attempt
public class RetryableClient {
    public Response callWithRetry(Request request, int maxRetries) {
        SpanContext lastAttemptContext = null;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            SpanBuilder builder = tracer.spanBuilder("http.request")
                .setAttribute("http.attempt", attempt + 1)
                .setAttribute("http.max_retries", maxRetries);

            // Link each retry to the attempt that preceded it
            if (lastAttemptContext != null) {
                builder.addLink(lastAttemptContext, Attributes.of(
                    AttributeKey.stringKey("link.type"), "retry",
                    AttributeKey.longKey("previous_attempt"), (long) attempt
                ));
            }

            Span span = builder.startSpan();
            lastAttemptContext = span.getSpanContext();

            try (Scope scope = span.makeCurrent()) {
                Response response = httpClient.execute(request);
                span.setStatus(StatusCode.OK);
                return response;
            } catch (RetryableException e) {
                span.recordException(e);
                span.setStatus(StatusCode.ERROR, "Attempt " + (attempt + 1) + " failed");
                if (attempt == maxRetries) {
                    throw e;
                }
            } finally {
                span.end();
            }
        }
        throw new IllegalStateException("Unreachable");
    }
}

// Example: Request that triggers async work in a different trace
public class AsyncWorkTrigger {
    public void triggerAsyncWork(WorkRequest request) {
        Span triggerSpan = tracer.spanBuilder("triggerAsyncWork")
            .startSpan();

        try (Scope scope = triggerSpan.makeCurrent()) {
            // Store the trigger span context for the async worker to link back
            request.setTriggerTraceId(triggerSpan.getSpanContext().getTraceId());
            request.setTriggerSpanId(triggerSpan.getSpanContext().getSpanId());

            workQueue.enqueue(request);
            triggerSpan.setStatus(StatusCode.OK);
        } finally {
            triggerSpan.end();
        }
    }
}

public class AsyncWorker {
    public void processWork(WorkRequest request) {
        // Create a new trace for async work: setNoParent() makes this a root span
        // even if some context happens to be current on the worker thread
        SpanBuilder builder = tracer.spanBuilder("processAsyncWork")
            .setNoParent();

        // Link back to the triggering span
        if (request.getTriggerTraceId() != null) {
            SpanContext triggerContext = SpanContext.createFromRemoteParent(
                request.getTriggerTraceId(),
                request.getTriggerSpanId(),
                // Assumes the trigger was sampled; carry the real flags if you store them
                TraceFlags.getSampled(),
                TraceState.getDefault()
            );
            builder.addLink(triggerContext, Attributes.of(
                AttributeKey.stringKey("link.type"), "triggered_by"
            ));
        }

        Span span = builder.startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Process...
        } finally {
            span.end();
        }
    }
}
```

At scale, tracing every request becomes prohibitively expensive. Sampling strategies determine which traces are recorded, balancing visibility with cost.
Key Sampling Approaches:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Always On | Trace every request | Complete visibility | Expensive at scale |
| Probabilistic | Sample X% of traces | Predictable cost | May miss rare events |
| Rate Limited | Sample up to N traces/sec | Bounded cost | Loses visibility under load |
| Tail-Based | Decide after trace completes | Keeps interesting traces | Complex, high memory |
| Error-Based | Always sample errors | Captures failures | Still misses successful edge cases |
| Head-Based Priority | Higher priority = higher sample rate | Prioritizes important work | Requires classification |
```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Composite sampler combining multiple head-based strategies.
// Error- and latency-based decisions need the finished trace, so they
// belong in the tail-based collector further down.
public class CompositeSampler implements Sampler {
    private final double baseSampleRate = 0.01; // 1% base rate

    @Override
    public SamplingResult shouldSample(
        Context parentContext,
        String traceId,
        String name,
        SpanKind spanKind,
        Attributes attributes,
        List<LinkData> parentLinks
    ) {
        // Follow the parent's decision in both directions (consistent sampling):
        // a sampled parent keeps the child, an unsampled parent drops it
        SpanContext parentSpanContext = Span.fromContext(parentContext).getSpanContext();
        if (parentSpanContext.isValid()) {
            return parentSpanContext.isSampled()
                ? SamplingResult.recordAndSample()
                : SamplingResult.drop();
        }

        // Priority-based sampling
        String priority = attributes.get(AttributeKey.stringKey("request.priority"));
        if ("high".equals(priority)) {
            return SamplingResult.recordAndSample();
        }

        // Probabilistic sampling for new traces
        if (shouldProbabilisticSample(traceId)) {
            return SamplingResult.recordAndSample();
        }
        return SamplingResult.drop();
    }

    @Override
    public String getDescription() {
        return "CompositeSampler{baseSampleRate=" + baseSampleRate + "}";
    }

    private boolean shouldProbabilisticSample(String traceId) {
        // Use the trace ID for deterministic sampling:
        // same trace ID = same decision in every service that applies this rule.
        // The low 64 bits (last 16 hex characters) are masked to a non-negative
        // long so they can be compared against a threshold on the same scale.
        long lowBits = Long.parseUnsignedLong(traceId.substring(16), 16) & Long.MAX_VALUE;
        long threshold = (long) (baseSampleRate * Long.MAX_VALUE);
        return lowBits < threshold;
    }
}

// Tail-based sampling collector (conceptual; helper methods omitted)
public class TailBasedSamplingCollector {
    private final Map<String, List<SpanData>> traceBuffer = new ConcurrentHashMap<>();
    private final Duration traceTimeout = Duration.ofSeconds(30);

    public void receiveSpan(SpanData span) {
        traceBuffer.computeIfAbsent(span.getTraceId(), k -> new CopyOnWriteArrayList<>())
            .add(span);
    }

    // Called when trace is complete or times out
    public void evaluateTrace(String traceId) {
        List<SpanData> spans = traceBuffer.remove(traceId);
        if (spans == null) return;

        // Keep interesting traces; anything else is discarded
        if (shouldKeepTrace(spans)) {
            exportToBackend(spans);
        }
    }

    private boolean shouldKeepTrace(List<SpanData> spans) {
        // Keep if any span has an error
        if (spans.stream().anyMatch(s -> s.getStatus().getStatusCode() == StatusCode.ERROR)) {
            return true;
        }

        // Keep if total duration exceeds threshold
        Duration totalDuration = calculateTotalDuration(spans);
        if (totalDuration.compareTo(Duration.ofSeconds(5)) > 0) {
            return true;
        }

        // Keep 1% of remaining traces for baseline
        return Math.random() < 0.01;
    }
}
```

When a trace is sampled, ALL spans in that trace must be sampled. If the parent span is sampled but child spans are dropped, the trace becomes fragmented and useless. Use the trace ID for deterministic sampling decisions, so that the same trace ID produces the same sampling decision in all services.
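The deterministic decision can be isolated and checked without any tracing library. The sketch below is an illustration under stated assumptions: `TraceIdRatioDecision` is a hypothetical class, trace IDs are taken to be the usual 32 lowercase hex characters, and the rule only gives consistent answers if every service applies the same ratio to the same bits. In a real deployment the SDK's built-in ratio-based sampler is the more appropriate choice.

```java
/** Hypothetical helper: a sampling decision derived only from the trace ID. */
final class TraceIdRatioDecision {
    private final long threshold;

    TraceIdRatioDecision(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
        }
        // The cast saturates at Long.MAX_VALUE when ratio is 1.0
        this.threshold = (long) (ratio * Long.MAX_VALUE);
    }

    boolean shouldSample(String traceIdHex) {
        if (threshold == Long.MAX_VALUE) {
            return true; // ratio 1.0: keep everything
        }
        // Take the low 64 bits (last 16 hex characters) and clear the sign bit,
        // giving a value in [0, Long.MAX_VALUE] that is comparable to the threshold
        long lowBits = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return lowBits < threshold;
    }
}
```

Because the decision depends on nothing except the ID, calling `shouldSample` twice with the same trace ID always returns the same answer, and raising the ratio can only add traces to the sampled set, never remove ones that were already kept.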
We've covered the essential concepts and techniques for designing classes that participate effectively in distributed tracing. Let's consolidate the key takeaways:
What's Next:
With tracing mastered, we'll explore Debug-Friendly Design—techniques for designing classes that are easy to debug in development and production, including meaningful toString() implementations, debug endpoints, and diagnostic interfaces.
You now understand how to design classes that participate effectively in distributed tracing. These skills enable debugging complex production issues by following requests across service boundaries—a superpower in microservice architectures.