Every discipline has its fundamental vocabulary—concepts so foundational that misunderstanding them leads to confusion at every subsequent level. In distributed tracing, two concepts form the bedrock: traces and spans.
These aren't arbitrary terminology choices. Traces and spans are a data model—a way of representing the reality of distributed request processing in a structure that's both mathematically rigorous and practically useful. Once you internalize this data model, you'll see it everywhere: in tracing UIs, in instrumentation libraries, in observability vendor documentation, and in distributed systems conversations.
This page will give you deep fluency in traces and spans—not just definitions, but complete mental models.
By the end of this page, you will understand: how a trace represents an end-to-end request journey; how spans model individual units of work with precise timing; how parent-child relationships encode causality; what data spans carry; and how this data model enables the powerful capabilities of modern tracing systems.
A trace represents the complete journey of a single request through a distributed system. It is a directed acyclic graph (DAG) of spans, where each span represents one unit of work and edges represent causality.
The simplest definition: A trace is a collection of spans that share a common trace ID and together describe a single end-to-end operation.
A more precise definition: A trace is a DAG in which each node is a span, each edge points from a parent span to a child span it caused, every span carries the same trace ID, and exactly one span (the root) has no parent.
```
Trace ID: abc123-def456-789xyz
│
├── Span A: "HTTP GET /checkout" (root span)
│   ├── Start: 2024-01-15T10:00:00.000Z
│   ├── End:   2024-01-15T10:00:02.347Z
│   ├── Duration: 2,347ms
│   │
│   ├── Span B: "CartService.getItems" (child of A)
│   │   ├── Start: 2024-01-15T10:00:00.050Z
│   │   ├── End:   2024-01-15T10:00:00.150Z
│   │   ├── Duration: 100ms
│   │   │
│   │   └── Span D: "PostgreSQL SELECT" (child of B)
│   │       ├── Start: 2024-01-15T10:00:00.070Z
│   │       ├── End:   2024-01-15T10:00:00.140Z
│   │       └── Duration: 70ms
│   │
│   ├── Span C: "PaymentService.process" (child of A)
│   │   ├── Start: 2024-01-15T10:00:00.200Z
│   │   ├── End:   2024-01-15T10:00:02.100Z
│   │   ├── Duration: 1,900ms  ← Clearly the bottleneck!
│   │   │
│   │   └── Span E: "StripeAPI.charge" (child of C)
│   │       ├── Start: 2024-01-15T10:00:00.250Z
│   │       ├── End:   2024-01-15T10:00:02.050Z
│   │       └── Duration: 1,800ms  ← External API is slow
│   │
│   └── Span F: "OrderService.create" (child of A)
│       ├── Start: 2024-01-15T10:00:02.110Z
│       ├── End:   2024-01-15T10:00:02.340Z
│       └── Duration: 230ms
```

Key properties of a trace:
1. Single Trace ID: Every span in the trace shares the same trace ID. This ID is generated when the trace begins (typically at the edge of your system, like an API gateway) and propagated to all downstream operations. The trace ID is the correlation key that binds scattered spans into a unified view.

2. One Root Span: Every trace has exactly one root span—the span that has no parent. This represents the outermost operation, typically the initial HTTP request or API call that triggered everything else.

3. Hierarchical Structure: Spans are organized in a tree structure (technically a DAG, allowing for spans with multiple parents in some implementations). Parent spans represent operations that caused child spans. A child span's lifetime is typically contained within its parent's lifetime.

4. Distributed Nature: Unlike a call stack in a single process, a trace spans multiple machines, processes, and networks. Spans in a trace might be generated by completely different services, written in different programming languages, running in different data centers.
The trace ID must flow through every component of your distributed system for tracing to work. If the trace ID is lost at any point—a service that doesn't propagate it, a message queue that drops headers—the trace becomes fragmented. This is why context propagation (covered in a later page) is so critical.
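To make that correlation concrete, here is a minimal, backend-agnostic sketch that groups exported spans by trace ID and rebuilds each trace's tree from parent references. It assumes spans arrive as dicts whose field names follow the JSON span example shown later on this page.

```python
from collections import defaultdict

def build_trace_trees(spans):
    """Group exported spans (dicts with traceId / spanId / parentSpanId)
    into per-trace trees. Returns {trace_id: root_span}; each span gains a
    'children' list, and the edges mirror the causal parent-child links."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["traceId"]].append(span)   # trace ID = correlation key

    roots = {}
    for trace_id, trace_spans in by_trace.items():
        by_id = {s["spanId"]: {**s, "children": []} for s in trace_spans}
        for s in by_id.values():
            parent_id = s.get("parentSpanId")
            if parent_id in by_id:
                by_id[parent_id]["children"].append(s)
            else:
                roots[trace_id] = s               # no parent -> the root span
    return roots
```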
A span represents a single unit of work within a trace. It has a beginning and an end, carries metadata about the work performed, and can have parent and child relationships to other spans.
Think of spans as timed execution scopes. When your code does something worth measuring—handling an HTTP request, making a database query, sending a message to a queue—a span captures that work.
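As a minimal illustration, here is how such a timed execution scope might look with the OpenTelemetry Python API (one instrumentation option among several; service and operation names are illustrative, and a TracerProvider is assumed to be configured at application startup):

```python
from opentelemetry import trace

# Assumes a TracerProvider was configured at startup via the OpenTelemetry
# SDK; without one, a no-op tracer is returned and the code still runs.
tracer = trace.get_tracer("checkout-service")

def load_cart(user_id: str) -> list:
    # Placeholder for real work; spans started here become children of
    # whatever span is current when it runs.
    with tracer.start_as_current_span("CartService.getItems"):
        return []

def handle_checkout(user_id: str) -> None:
    # The context manager records the start time on entry and the end time
    # on exit, yielding one timed span per request.
    with tracer.start_as_current_span("HTTP GET /checkout") as span:
        span.set_attribute("user.id", user_id)
        load_cart(user_id)
```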
Beyond Core Attributes: Enriching Spans
Spans can carry additional data that makes them far more useful for debugging and analysis:
Attributes (Tags): Key-value pairs providing context about the operation. These are indexed and searchable in most tracing systems.
Examples:
- http.method: GET
- http.url: /api/checkout
- http.status_code: 500
- db.statement: SELECT * FROM users WHERE id = ?
- user.id: user-12345
- error.type: ConnectionTimeoutException

Events (Logs): Timestamped annotations within a span. Unlike span start/end, events mark points in time during the span's lifetime.
Examples:

- cache_miss when a Redis lookup comes back empty
- db_query_started just before a database query is issued
- exception recorded when an error is caught mid-operation
Links: References to spans in other traces. This is used for asynchronous relationships where a strict parent-child relationship doesn't apply.
Examples:

- A CONSUMER span linking back to the PRODUCER span of the message it processes
- A batch job's span linking to the spans of each request whose work it aggregates
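A short sketch, again using the OpenTelemetry Python API as one possible implementation, showing attributes, events, and a link being recorded on a single span (the producer_context parameter is a hypothetical stand-in for a span context carried in a message):

```python
from typing import Optional

from opentelemetry import trace

tracer = trace.get_tracer("cart-service")

def get_cart_items(user_id: str,
                   producer_context: Optional[trace.SpanContext] = None):
    # Links point at spans in other traces (e.g. the PRODUCER span of the
    # message being processed) and must be supplied at span creation time.
    links = [trace.Link(producer_context)] if producer_context else []
    with tracer.start_as_current_span("CartService.getItems", links=links) as span:
        # Attributes: indexed key-value context about the operation.
        span.set_attribute("http.method", "GET")
        span.set_attribute("user.id", user_id)
        span.set_attribute("cache.hit", False)
        # Events: timestamped annotations inside the span's lifetime.
        span.add_event("cache_miss", {"cache.type": "redis",
                                      "cache.key": f"cart:{user_id}"})
        return []
```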
{ "traceId": "abc123def456789xyz", "spanId": "span-001-cart", "parentSpanId": "span-000-root", "operationName": "CartService.getItems", "startTime": "2024-01-15T10:00:00.050Z", "endTime": "2024-01-15T10:00:00.150Z", "duration": 100000000, // nanoseconds "spanKind": "SERVER", "status": { "code": "OK" }, "attributes": { "service.name": "cart-service", "service.version": "2.3.1", "http.method": "GET", "http.url": "/internal/cart/user-12345", "http.status_code": 200, "cart.item_count": 3, "cart.total_value": 149.99, "db.queries_executed": 2, "cache.hit": false }, "events": [ { "name": "cache_miss", "timestamp": "2024-01-15T10:00:00.055Z", "attributes": { "cache.type": "redis", "cache.key": "cart:user-12345" } }, { "name": "db_query_started", "timestamp": "2024-01-15T10:00:00.060Z", "attributes": { "db.statement": "SELECT * FROM cart_items WHERE user_id = $1" } } ], "links": [], "resource": { "service.name": "cart-service", "deployment.environment": "production", "host.name": "cart-service-pod-abc123", "k8s.namespace": "ecommerce" }}Not every function call needs a span. Creating too many spans adds overhead and noise. Create spans for: service boundaries, database calls, external API calls, significant business operations, and anywhere you'd benefit from timing or context. Don't create spans for: fast in-memory computations, utility functions, or tight loops.
Spans have a kind that describes their role in the distributed interaction. Understanding span kinds is essential for correctly interpreting traces and for properly instrumenting your systems.
| Span Kind | Description | When to Use | Examples |
|---|---|---|---|
| CLIENT | The span represents a request sent to a remote service | When your code initiates a synchronous call to another service | HTTP client call, gRPC stub invocation, database query |
| SERVER | The span represents handling a request received from a client | When your code handles an incoming request from another service | HTTP server handling request, gRPC service implementation |
| PRODUCER | The span represents publishing a message for asynchronous processing | When your code puts a message on a queue or publishes an event | Kafka producer, RabbitMQ publish, SQS SendMessage |
| CONSUMER | The span represents processing a message received asynchronously | When your code processes a message from a queue or subscription | Kafka consumer, SQS ReceiveMessage handler, pub/sub subscriber |
| INTERNAL | The span represents an internal operation with no remote component | When you want to trace significant internal operations | Business logic execution, internal computations, local caching operations |
Why Span Kinds Matter:
1. Accurate Latency Attribution: With CLIENT and SERVER spans, tracing systems can separate time spent processing inside the remote service (the SERVER span's duration) from time spent on the network (the CLIENT span's duration minus the SERVER span's duration).
2. Topology Discovery: Span kinds help tracing systems automatically construct service dependency graphs. A CLIENT span in Service A with a corresponding SERVER span in Service B indicates that A depends on B.

3. Asynchronous Flow Tracking: PRODUCER and CONSUMER spans model asynchronous message passing where standard parent-child relationships don't apply. The message queue decouples the producer and consumer in time.
The CLIENT ↔ SERVER Relationship:
When Service A calls Service B:
```
Service A (Cart Service)                        Service B (Inventory Service)
─────────────────────────                       ─────────────────────────────

┌──────────────── CLIENT Span ──────────────────────────────┐
│ spanId: "client-span-123"                                  │
│ operation: "InventoryService.checkStock"                   │
│ spanKind: CLIENT                                           │
│ start: 10:00:00.100                                        │
│                                                            │
│   ──── HTTP Request ────────>                              │
│   (headers contain trace/span                              │
│    context for propagation)                                │
│                                                            │
│                          ┌─────── SERVER Span ────────┐    │
│                          │ spanId: "server-span-456"  │    │
│                          │ parentSpanId: "client-123" │    │
│                          │ operation: "HTTP GET"      │    │
│                          │ spanKind: SERVER           │    │
│                          │ start: 10:00:00.105        │    │
│                          │ (processing...)            │    │
│                          │ end: 10:00:00.180          │    │
│                          └────────────────────────────┘    │
│                                                            │
│   <─── HTTP Response ────────                              │
│                                                            │
│ end: 10:00:00.185                                          │
└────────────────────────────────────────────────────────────┘

Client span:      85ms total
Server span:      75ms processing
Network overhead: ~10ms (5ms each way)
```

A common instrumentation mistake is creating spans without proper kinds, or using INTERNAL for everything. This breaks topology discovery and latency attribution. Always use CLIENT for outgoing calls and SERVER for incoming calls. Let your tracing library handle this automatically where possible.
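For illustration, here is a hedged sketch of the client side of this interaction using the OpenTelemetry Python API and the requests library (URLs and attribute values are illustrative; in practice an auto-instrumentation library usually creates the CLIENT span and injects headers for you):

```python
import requests
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("cart-service")

def check_stock(item_id: str) -> bool:
    # Outgoing synchronous call: mark the span as CLIENT so the backend can
    # pair it with the SERVER span recorded by the inventory service and
    # attribute the difference between the two durations to the network.
    with tracer.start_as_current_span(
        "InventoryService.checkStock", kind=SpanKind.CLIENT
    ) as span:
        span.set_attribute("http.method", "GET")
        resp = requests.get(
            "http://inventory-service.internal/api/inventory/check",
            params={"item": item_id},
            timeout=2,
        )
        span.set_attribute("http.status_code", resp.status_code)
        return resp.ok
```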
The parent-child relationship between spans is the mechanism that encodes causality in a trace. The parent span caused the child span to exist—without the parent operation, the child operation would not have occurred.
Rules of Parent-Child Relationships:

- Every span except the root has exactly one parent, referenced by its parent span ID.
- A child span carries the same trace ID as its parent.
- A child's lifetime is typically contained within its parent's lifetime (clock skew can make this appear violated).
- The relationship encodes causality: if the parent operation had not run, the child operation would not exist.
Understanding the Trace Tree:
The parent-child structure creates a tree that mirrors the logical execution flow:
```
Trace ID: xyz-789

Root: HTTP GET /order/123              [0ms ─────────────────────── 500ms]
│
├── AuthService.validateToken          [10ms ──── 50ms]
│   │
│   └── RedisCache.get                 [15ms ── 30ms]
│       │
│       └── (cache hit, no further children)
│
├── OrderService.getOrder              [60ms ─────────────── 350ms]
│   │
│   ├── PostgreSQL.query               [70ms ────── 150ms]
│   │
│   └── PaymentService.getStatus       [160ms ──────── 340ms]
│       │
│       └── StripeAPI.retrieve         [170ms ──────── 330ms]
│
└── ResponseSerializer.serialize       [360ms ──── 400ms]

Visual representation as waterfall:

0ms     100ms   200ms   300ms   400ms   500ms
|-------|-------|-------|-------|-------|
[======== HTTP GET /order/123 ==========]
 [===] Auth
  [=] Redis
     [========= OrderService =====]
      [====] PostgreSQL
             [==== PaymentService =]
              [===== StripeAPI ===]
                                   [==] Serialize
```

Sequential vs. Parallel Children:
Spans can have multiple children that execute:
Sequentially: One child starts after another ends. This indicates serial processing where each step depends on the previous.
In Parallel: Multiple children overlap in time. This indicates concurrent operations, such as fanning out to multiple downstream services simultaneously.
Being able to distinguish sequential from parallel execution is one of the powerful insights tracing provides. If you see three sequential calls that could be parallel, you've identified an optimization opportunity.
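The difference is easy to reproduce. Here is a small sketch, assuming the OpenTelemetry Python API and asyncio (operation names and delays are made up), that produces the two shapes:

```python
import asyncio

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

async def fetch(name: str, delay: float) -> None:
    # Each call produces one child span of whatever span is current.
    with tracer.start_as_current_span(name):
        await asyncio.sleep(delay)

async def sequential() -> None:
    # Children appear one after another in the waterfall (~0.3s total).
    with tracer.start_as_current_span("get-order-sequential"):
        await fetch("auth", 0.1)
        await fetch("cart", 0.1)
        await fetch("pricing", 0.1)

async def parallel() -> None:
    # Children overlap in the waterfall (~0.1s total) -- the optimization
    # opportunity the trace makes visible.
    with tracer.start_as_current_span("get-order-parallel"):
        await asyncio.gather(fetch("auth", 0.1), fetch("cart", 0.1),
                             fetch("pricing", 0.1))

asyncio.run(sequential())
asyncio.run(parallel())
```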
In distributed systems, different machines have different clock times. This can cause apparent violations of temporal containment—a child span appearing to start before its parent. Production tracing systems have heuristics to handle this, but significant clock skew can still cause confusing traces. NTP synchronization across your fleet is important for trace quality.
The most common way to visualize a trace is the waterfall view (also called the timeline or Gantt chart view). This visualization transforms the abstract trace data structure into an immediately comprehensible picture of request flow.
```
┌────────────────────────────────────────────────────────────────────────────┐
│ Trace: abc123  |  Duration: 847ms  |  Spans: 12  |  Services: 5            │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│ Service          Operation                0ms    200ms   400ms  600ms 847ms│
│ ─────────────────────────────────────────────|───────|───────|───────|────│
│                                                                            │
│ api-gateway      GET /api/checkout        ████████████████████████████████ │
│ │                                                                          │
│ │ auth-service   ValidateSession          ██████                           │
│ │ │ redis        GET session:user123       ██                              │
│ │                                                                          │
│ │ cart-service   GetCartItems                   ████████                   │
│ │ │ postgres     SELECT cart_items                ████                     │
│ │                                                                          │
│ │ pricing-srv    CalculateTotal                          ████              │
│ │                                                                          │
│ │ payment-srv    ProcessPayment                              ████████████████
│ │ │ stripe-api   POST /v1/charges                               ████████████
│ │ │ │ (external API)                                                       │
│ │                                                                          │
│ │ order-service  CreateOrder                                           ████│
│ │ │ postgres     INSERT orders                                          ██ │
│ │ │ kafka        Produce order.created                                   ██│
│ │                                                                          │
├────────────────────────────────────────────────────────────────────────────┤
│ ▸ Slowest span: stripe-api POST /v1/charges (523ms, 62% of trace)          │
│ ▸ Critical path: api-gateway → payment-srv → stripe-api                    │
└────────────────────────────────────────────────────────────────────────────┘
```

What the Waterfall Reveals:
1. Time Distribution: Instantly see where time is spent. In the example above, the Stripe API call dominates (62% of total trace time). This immediately focuses optimization efforts.

2. Parallelism and Sequencing: Non-overlapping spans indicate sequential processing. The waterfall makes it obvious when operations that could be parallel are actually running one after another.

3. Gaps and Delays: White space between spans indicates time not captured—network latency, queue wait time, or un-instrumented code. Significant gaps often indicate hidden bottlenecks.

4. Service Boundaries: Color-coding by service shows which teams/services are involved and how much time each contributes. This enables SLO attribution and ownership clarity.

5. Error Propagation: Failed spans are typically highlighted (red). You can see exactly where an error occurred and how it propagated up the call chain.
The Critical Path:
The critical path is the sequence of spans that determines the trace's total duration. In a trace with parallelism, the critical path is the longest path through the DAG. Optimizing spans not on the critical path won't improve overall latency.
When analyzing a slow trace: (1) Find the critical path first, (2) Identify the longest span on that path, (3) Drill into that span's details—attributes, events, and child spans, (4) Look for parallel opportunities in sequential operations, (5) Check for gaps that indicate hidden latency.
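As a rough illustration of step (1), here is a simplified critical-path walk over spans represented as dicts with spanId, start, and end fields (a sketch only; real tracing backends use more robust versions that handle clock skew and overlapping children):

```python
def critical_path(span, children):
    """Return the spans on the critical path under `span`.

    Starting from the span's end time, repeatedly take the latest-finishing
    child that ends before the current cursor, recurse into it, then continue
    from that child's start time. `children` maps spanId -> list of spans.
    """
    path = [span]
    cursor = span["end"]
    for child in sorted(children.get(span["spanId"], []),
                        key=lambda c: c["end"], reverse=True):
        if child["end"] <= cursor:          # child finishes on the open stretch
            path.extend(critical_path(child, children))
            cursor = child["start"]         # earlier siblings must end before this
    return path
```

Spans excluded by this walk overlap with later-finishing siblings, which is exactly why optimizing them does not shorten the trace.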
For traces to work across service boundaries, information must flow with requests. This is accomplished through span context and optionally baggage.
```
# Request from Service A to Service B

GET /api/inventory/check HTTP/1.1
Host: inventory-service.internal
Content-Type: application/json

# W3C Trace Context headers (standard format)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
#            ^^ ^------------------------------- ^---------------- ^^
#            |  |                                |                  |
#           ver trace-id (32 hex)                span-id (16 hex)   flags (sampled=01)

tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7
# ^---- vendor-specific trace state

# Baggage header (user-defined context)
baggage: userId=user-12345,tenantId=acme-corp,featureFlag=new-checkout-enabled
# ^---- application-level context propagated across all services
```

The Flow of Context:

1. Service A starts a CLIENT span and injects the trace ID, its own span ID, and the sampling flags into the outgoing traceparent header (plus any baggage).
2. Service B extracts those headers and starts a SERVER span that reuses the same trace ID and records Service A's span ID as its parent.
3. Service B repeats the injection on its own outgoing calls, so the same trace ID flows through every hop of the request.
Why Context Matters:
Without proper context propagation:

- Downstream services start brand-new traces, so one request appears as several disconnected fragments.
- Parent-child links are lost at the broken hop, and latency can no longer be attributed across that boundary.
- Errors deep in the call chain can't be tied back to the user-facing request that triggered them.
This is why context propagation is such a critical topic—a single service that fails to propagate context breaks tracing for all requests that pass through it.
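Here is a sketch of both sides of that hand-off using the OpenTelemetry Python propagation API (service names and URLs are illustrative; HTTP auto-instrumentation normally performs this injection and extraction for you):

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("cart-service")

# Service A: the caller injects the current span context into outgoing
# headers, producing the `traceparent` header shown above.
def call_inventory(item_id: str):
    with tracer.start_as_current_span("InventoryService.checkStock",
                                      kind=trace.SpanKind.CLIENT):
        headers = {}
        inject(headers)  # writes traceparent (and baggage) into the dict
        return requests.get(
            "http://inventory-service.internal/api/inventory/check",
            params={"item": item_id}, headers=headers, timeout=2)

# Service B: the callee extracts the context and starts its span inside it,
# so the SERVER span gets the same trace ID and the caller's span as parent.
def handle_request(incoming_headers: dict):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("HTTP GET /api/inventory/check",
                                      kind=trace.SpanKind.SERVER,
                                      context=ctx):
        pass  # handle the request
```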
Baggage propagates to ALL downstream services—including third-party APIs if you're not careful. Never put sensitive data in baggage. Keep baggage small; every byte is transmitted with every request. Use baggage for genuinely cross-cutting concerns like tenant-id or correlation-id, not for large data structures.
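A small sketch of setting and reading baggage with the OpenTelemetry Python API (the tenant.id key is just an example of a genuinely cross-cutting value):

```python
from opentelemetry import baggage
from opentelemetry.propagate import inject

# Attach a small, non-sensitive cross-cutting value to a context object.
ctx = baggage.set_baggage("tenant.id", "acme-corp")

headers = {}
inject(headers, context=ctx)  # default propagators also emit a `baggage:` header
# headers now contains something like {"baggage": "tenant.id=acme-corp", ...}

# A downstream service, after extract(), reads it back the same way:
tenant = baggage.get_baggage("tenant.id", context=ctx)
```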
With experience, you'll learn to recognize common patterns in traces that indicate either healthy architectures or problems to address.
| Pattern | Visual Signature | Interpretation | Action |
|---|---|---|---|
| Shallow Tree | Few levels, many siblings | Parallel fan-out to downstream services | Often healthy; check if sequential would be better for dependencies |
| Deep Tree | Many levels, few siblings | Long chain of sequential calls | Look for unnecessary layers; consider circuit-breaking for deep chains |
| Long Pole | One span dominates duration | Single bottleneck controls latency | Optimize the long pole; everything else is noise until this is fixed |
| Retry Storm | Multiple similar child spans | Repeated attempts due to failures | Fix underlying failure; add jitter; check retry configuration |
| Sparse Trace | Large gaps between spans | Missing instrumentation or network delays | Add instrumentation; investigate network if gaps are excessive |
| Explosion | Hundreds of child spans | N+1 query patterns or excessive fan-out | Batch operations; add caching; review architecture |
The N+1 Query Anti-Pattern in Traces:
One of the most valuable discoveries tracing enables is the N+1 query problem—where code that fetches a list of items then makes individual queries for each item.
```
Trace: GetOrderDetails (Duration: 2,100ms)

OrderService.getOrderWithItems             [0ms ────────────────────── 2100ms]
│
├── PostgreSQL SELECT orders               [10ms ── 30ms]   (20ms - single query)
│
├── PostgreSQL SELECT items WHERE id=1     [40ms ── 60ms]     ┐
├── PostgreSQL SELECT items WHERE id=2     [70ms ── 90ms]     │
├── PostgreSQL SELECT items WHERE id=3     [100ms ── 120ms]   │
├── PostgreSQL SELECT items WHERE id=4     [130ms ── 150ms]   │ 100 individual queries!
├── PostgreSQL SELECT items WHERE id=5     [160ms ── 180ms]   │ Each takes ~20ms
│   ... (95 more similar spans) ...                           │
├── PostgreSQL SELECT items WHERE id=100   [2030ms ── 2050ms] ┘

Problem immediately visible: 100 sequential database queries
Solution: Use a single query with WHERE id IN (1,2,3,...,100)
Expected improvement: ~20ms instead of ~2000ms
```

The patterns in your traces reflect your architecture's actual behavior under real load. Learning to 'read' traces is a skill that improves with practice. The more traces you examine, the faster you'll recognize problems and opportunities.
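To make the batching fix above concrete, here is a hedged sketch assuming a psycopg2-style PostgreSQL cursor (function and table names are hypothetical):

```python
# N+1 pattern: one query per item, each showing up as its own ~20ms span.
def load_items_n_plus_one(cursor, item_ids):
    items = []
    for item_id in item_ids:                 # 100 ids -> 100 round trips
        cursor.execute("SELECT * FROM cart_items WHERE id = %s", (item_id,))
        items.append(cursor.fetchone())
    return items

# Batched fix: a single round trip that appears as one span in the trace.
def load_items_batched(cursor, item_ids):
    cursor.execute("SELECT * FROM cart_items WHERE id = ANY(%s)",
                   (list(item_ids),))
    return cursor.fetchall()
```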
We've built a comprehensive understanding of the trace and span data model. Let's consolidate:

- A trace is a DAG of spans that share a single trace ID and has exactly one root span.
- A span is a timed unit of work enriched with attributes, events, and links.
- Parent-child relationships encode causality and give traces their tree and waterfall structure.
- Span kinds (CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL) describe each span's role, enabling latency attribution and topology discovery.
- Span context (and optional baggage) must propagate across every service boundary for traces to stay whole.
What's Next:
Now that we understand the trace and span data model, we need to understand how trace context flows across service boundaries. The next page covers Context Propagation—the mechanisms that carry trace context through HTTP headers, message queues, and async processing to create unified, end-to-end traces.
You now have deep fluency in the fundamental data model of distributed tracing. Traces, spans, parent-child relationships, span kinds, and context are the vocabulary you'll use when instrumenting systems, reading traces, and discussing observability with colleagues.