At 10:00 AM on a typical Monday, your e-commerce platform processes 50,000 requests per second. Each request generates an average of 15 spans across your microservices. That's 750,000 spans per second—or 64.8 billion spans per day.
At approximately 1KB per span with indexing overhead, you're looking at ~65 terabytes per day of trace data. With 14-day retention, that's nearly a petabyte of storage. The storage costs alone would exceed most engineering budgets. The query performance would be abysmal. The operational burden would be crushing.
This is why sampling exists.
Sampling is the practice of selectively recording traces rather than recording every single one. Done correctly, sampling dramatically reduces costs while preserving the visibility you need. Done poorly, it leads to blind spots, missed incidents, and false confidence in system health.
This page will make you an expert in sampling—the strategies, tradeoffs, and best practices.
By the end of this page, you will understand: head-based vs. tail-based sampling; probabilistic, rate-limiting, and adaptive sampling algorithms; priority sampling for important traces; how to configure sampling for different traffic patterns; and the tradeoffs between cost, coverage, and observability.
Sampling isn't just about cost reduction—it's about making tracing sustainable at scale. Without sampling, tracing becomes a victim of its own success.
| Request Rate | Avg Spans/Request | Spans/Second | Daily Storage (1KB/span) | Monthly Cost (at $0.10/GB) |
|---|---|---|---|---|
| 1,000 req/s | 10 | 10,000 | 864 GB | $2,592 |
| 10,000 req/s | 15 | 150,000 | 12.96 TB | $38,880 |
| 100,000 req/s | 15 | 1,500,000 | 129.6 TB | $388,800 |
| 1,000,000 req/s | 20 | 20,000,000 | 1.73 PB | $5.2M |
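The table's figures follow directly from the span rate. A minimal TypeScript sketch of the arithmetic (the function name is illustrative; it assumes the same 1 KB/span, decimal units, 30-day month, and $0.10/GB-month price as the table):

```typescript
// Reproduce the table's arithmetic: spans/s -> daily storage -> monthly cost.
// Uses decimal units (1 GB = 1,000,000 KB) to match the table's figures.
function traceStorageCost(
  requestsPerSecond: number,
  avgSpansPerRequest: number,
  spanSizeKb = 1,          // ~1KB per span, including indexing overhead
  pricePerGbMonth = 0.10   // $0.10/GB-month, as in the table
) {
  const spansPerSecond = requestsPerSecond * avgSpansPerRequest;
  const dailyGb = (spansPerSecond * 86_400 * spanSizeKb) / 1_000_000;
  const monthlyCostUsd = dailyGb * 30 * pricePerGbMonth;
  return { spansPerSecond, dailyGb, monthlyCostUsd };
}

// 100,000 req/s × 15 spans ≈ 1.5M spans/s, ~129,600 GB/day, ~$388,800/month
console.log(traceStorageCost(100_000, 15));
```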
The costs extend beyond storage:
1. Collection Overhead: Every span must be created, serialized, and transmitted. At 100% sampling, this overhead impacts application performance—memory allocation, network bandwidth, CPU for serialization.
2. Backend Ingestion: Collectors, processors, and storage backends must handle the volume. More data means more infrastructure.
3. Query Performance: Searching for a specific trace among billions is slow. Even indexed queries degrade at scale.
4. Retention Limits: With finite storage, higher volume means shorter retention. You may need 30 days of traces for debugging, but can only afford 3 days at 100% sampling.
The Key Insight:

You don't need every trace. In a healthy system, most requests look similar. What you need is:

1. Traces for errors and failures, so incidents can be debugged.
2. Traces for unusually slow requests (latency outliers).
3. A statistically representative sample of normal traffic for baseline analysis.

Sampling strategies aim to achieve all three while minimizing costs.
Many organizations find that 1-10% sampling provides sufficient visibility for debugging while reducing costs by 90-99%. But the right rate depends on your traffic, the cost of missed traces, and your observability requirements. Some systems sample more aggressively during known-good periods and less during incidents.
The fundamental distinction in sampling strategies is when the sampling decision is made: at the head (start) or tail (end) of a trace.
```text
HEAD-BASED SAMPLING
═══════════════════
Request arrives at edge:
│
├─ Sample decision: random(0.0-1.0) < 0.10  (10% rate)
│   │
│   ├─ TRUE (sampled = 1):
│   │    • All downstream services create and export spans
│   │    • Complete trace stored
│   │
│   └─ FALSE (sampled = 0):
│        • All downstream services skip span creation entirely
│        • No trace data generated
│        • Zero overhead beyond decision propagation
│
Risk: If this request later errors, we have no trace!

TAIL-BASED SAMPLING
═══════════════════
Request arrives and progresses:
│
├─ Service A creates span → buffered
├─ Service B creates span → buffered
├─ Service C creates span → buffered
├─ Service D creates span → buffered
│
└─ Request completes, all spans arrive at collector
    │
    └─ Collector evaluates complete trace:
        │
        ├─ Has error spans?      → KEEP (100%)
        ├─ Latency > P99?        → KEEP (100%)
        ├─ Important user tier?  → KEEP (100%)
        ├─ Otherwise             → KEEP (1%)
        │
        └─ Only kept traces written to storage

Benefit: We ALWAYS have traces for errors and slow requests!
Cost:    All spans buffered until the trace completes (memory, processing)
```

Tail-based sampling is intellectually appealing but operationally complex. It requires: (1) a collector that can buffer and reassemble traces (OpenTelemetry Collector, Jaeger streaming, etc.); (2) enough memory to buffer traces during their lifetime; (3) logic to detect "trace complete" (timeout-based or explicit signals); (4) handling for very long traces that exceed buffers. Many organizations start with head-based sampling and add tail-based sampling selectively.
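The "decision propagation" in the head-based diagram rides on the W3C traceparent header: its final trace-flags field carries a "sampled" bit, so every downstream service can honor the edge's decision without re-deciding. A minimal sketch of reading that flag (the parsing helper is illustrative, not taken from any SDK):

```typescript
// W3C trace context: traceparent = version-traceid-spanid-flags
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
interface TraceParent {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

function parseTraceParent(header: string): TraceParent | null {
  const parts = header.trim().split('-');
  if (parts.length !== 4) return null;
  const [, traceId, parentSpanId, flags] = parts;
  return {
    traceId,
    parentSpanId,
    // Bit 0 of the trace-flags field is the "sampled" flag
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}

// Downstream services simply honor the upstream decision:
const incoming = parseTraceParent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
if (incoming?.sampled) {
  // create and export spans for this request
} else {
  // propagate context only; skip span export
}
```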
Probabilistic sampling (also called random sampling) is the simplest and most common sampling strategy. Each trace has a fixed probability of being sampled.
```typescript
// How probabilistic sampling works internally

type SamplingDecision = {
  decision: 'RECORD_AND_SAMPLE' | 'DROP';
  attributes?: Record<string, number>;
};

// Illustrative hash: map the trace ID onto 0.0 - 1.0 deterministically
function hashTraceId(traceId: string): number {
  // Use the low 8 hex digits of the trace ID as a uniform value
  return parseInt(traceId.slice(-8), 16) / 0xffffffff;
}

function shouldSample(traceId: string, samplingRate: number): boolean {
  // Use the trace ID to make a deterministic decision.
  // This ensures all services agree on the decision for the same trace.
  const hash = hashTraceId(traceId); // Returns 0.0 - 1.0
  return hash < samplingRate;
}

// Example: 10% sampling rate
const SAMPLING_RATE = 0.10;

function onTraceStart(traceId: string): SamplingDecision {
  if (shouldSample(traceId, SAMPLING_RATE)) {
    return {
      decision: 'RECORD_AND_SAMPLE', // Record and export
      attributes: {
        'sampling.probability': SAMPLING_RATE,
      },
    };
  }
  return {
    decision: 'DROP', // Don't record
  };
}

// Why hash the trace ID?
// - The trace ID is generated before the sampling decision
// - The hash function produces deterministic output for the same input
// - All services with the same trace ID get the same decision
// - No coordination needed between services
```

Strengths of Probabilistic Sampling:
- ✓ Simple to understand and configure
- ✓ Predictable storage costs
- ✓ Works across distributed services without coordination
- ✓ Statistically valid for aggregate analysis
Weaknesses:
- ✗ Errors may be dropped (if the error occurs in the 90% not sampled)
- ✗ Rare paths may never be sampled
- ✗ No intelligence about trace importance
- ✗ A uniform rate may be wrong for different endpoints
Start with a rate that fits your budget, then adjust based on coverage. Calculate: request_rate (req/s) × avg_spans_per_request × sampling_rate × span_size × 86,400 s/day = daily storage. Work backward from your storage budget. Monitor for blind spots—are you seeing enough error traces? Rare endpoint traces? Adjust rates by service or endpoint as needed.
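A small sketch of working backward from a storage budget (the budget and traffic figures in the example are illustrative):

```typescript
// Solve the daily-storage formula for the sampling rate:
//   dailyStorage = reqPerSec * avgSpans * rate * spanSize * 86_400
//   rate         = dailyBudget / (reqPerSec * avgSpans * spanSize * 86_400)
function maxSamplingRateForBudget(
  dailyBudgetGb: number,
  reqPerSec: number,
  avgSpansPerRequest: number,
  spanSizeKb = 1 // ~1KB per span, as assumed earlier
): number {
  const dailyKbAtFullSampling =
    reqPerSec * avgSpansPerRequest * spanSizeKb * 86_400;
  const budgetKb = dailyBudgetGb * 1_000_000; // decimal GB -> KB
  return Math.min(1, budgetKb / dailyKbAtFullSampling);
}

// Example: 50,000 req/s, 15 spans/request, 2 TB/day budget
// -> roughly a 3% sampling rate
console.log(maxSamplingRateForBudget(2_000, 50_000, 15));
```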
Rate-limiting sampling (also called rate-based or throughput sampling) caps the number of traces sampled per unit time, regardless of traffic volume.
```text
Traffic Level:       |  100 req/s  | 1,000 req/s | 10,000 req/s |
════════════════════════════════════════════════════════════════
Probabilistic 10% sampling:
  Sampled:           |    10/s     |    100/s    |   1,000/s    |
  Problem: Sampling rate is constant, but VOLUME scales with traffic.
           During traffic spikes, you still get 10x more data.

Rate-Limiting at 50 traces/s:
  Sampled:           |    50/s     |    50/s     |     50/s     |
  Sampled %:         |    50%      |     5%      |     0.5%     |
  Benefit: Predictable output regardless of input traffic.
           Storage costs are bounded and predictable.

                 Normal       Spike           Mega-spike
                 Traffic      Traffic         Traffic
                   │            │                │
                   ▼            ▼                ▼
Probabilistic:  ▓▓▓▓▓▓▓▓▓▓  ▓▓▓▓▓▓▓▓▓▓▓▓▓   ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
                            (scales up)     (keeps scaling!)

Rate-Limited:   ▓▓▓▓▓▓▓▓▓▓  ▓▓▓▓▓▓▓▓▓▓      ▓▓▓▓▓▓▓▓▓▓
                (capped)    (still capped)  (still capped!)
```

Rate Limiting Implementation:
```typescript
// Token bucket rate limiter for sampling
class RateLimitingSampler {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRate: number; // tokens per second
  private lastRefill: number;

  constructor(tracesPerSecond: number) {
    this.maxTokens = tracesPerSecond;
    this.tokens = tracesPerSecond;
    this.refillRate = tracesPerSecond;
    this.lastRefill = Date.now();
  }

  shouldSample(traceId: string): boolean {
    this.refillTokens();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // Sample this trace
    }
    return false; // Drop this trace
  }

  private refillTokens(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}

// Usage: Sample at most 100 traces per second
const sampler = new RateLimitingSampler(100);

function onTraceStart(traceId: string): boolean {
  return sampler.shouldSample(traceId);
}
```

Rate limiting is excellent when: (1) you have unpredictable traffic patterns, (2) you need strict storage budget guarantees, (3) you're protecting backend infrastructure from overload. Combine it with probabilistic sampling: use rate limiting at the collector level as a safety cap on top of probabilistic sampling at the SDK level.
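A sketch of that combination, reusing shouldSample and RateLimitingSampler from the earlier examples (the wrapper class itself is illustrative, not from any SDK): the probabilistic check runs first, and the rate limiter acts as the hard cap.

```typescript
// Probabilistic sampling with a rate-limiting safety cap on top.
// Assumes shouldSample(traceId, rate) and RateLimitingSampler from the
// earlier examples are in scope.
class CappedProbabilisticSampler {
  private readonly rate: number;
  private readonly cap: RateLimitingSampler;

  constructor(samplingRate: number, maxTracesPerSecond: number) {
    this.rate = samplingRate;
    this.cap = new RateLimitingSampler(maxTracesPerSecond);
  }

  shouldSample(traceId: string): boolean {
    // 1) The probabilistic decision keeps samples statistically representative
    //    (calls the module-level shouldSample from the probabilistic example)
    if (!shouldSample(traceId, this.rate)) return false;
    // 2) The rate limiter bounds the worst case during traffic spikes
    return this.cap.shouldSample(traceId);
  }
}

// 10% sampling, but never more than 200 traces per second
const capped = new CappedProbabilisticSampler(0.10, 200);
```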
Adaptive sampling dynamically adjusts the sampling rate based on traffic patterns, system load, or the distribution of operations being sampled.
```text
Consider a system with two endpoints:

  /api/search:        10,000 requests/second (high volume, well-tested)
  /api/admin/export:  1 request/minute (rare, complex, often breaks)

With 1% probabilistic sampling:
  /api/search:        100 samples/second ✓ (plenty of visibility)
  /api/admin/export:  ~0.6 samples/hour ✗ (might miss issues entirely!)

Problem: Low-volume operations are under-represented in samples.

With adaptive sampling targeting 10 samples/minute per operation:
  /api/search:        10/min ≈ 0.002% sampling rate
  /api/admin/export:  100% sampling rate (its volume is below the target)

Result: Every operation has representation regardless of volume.
```

Types of Adaptive Sampling:
1. Per-Operation Adaptive Sampling: Maintains a target sample rate for each unique operation (endpoint, method, etc.). High-volume operations are sampled at lower rates; low-volume operations at higher rates (see the sketch after this list).
2. Load-Based Adaptive Sampling: Adjusts the sampling rate based on system load. When the collector is overloaded, reduce sampling. When idle, increase it.
3. Latency-Based Adaptive Sampling: Samples more of the traces that exhibit high latency. If an operation's P99 latency spikes, increase sampling for that operation.
4. Error-Rate Adaptive Sampling: Increases the sampling rate when error rates rise. Normal errors at 0.1% → sample 1%. Errors spike to 5% → sample 50%.
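To make the per-operation variant concrete, here is a minimal sketch (the class, window length, and clamping constants are all illustrative; it reuses hashTraceId from the probabilistic example): track each operation's recent throughput and derive a rate that lands near a target samples-per-second.

```typescript
// Per-operation adaptive sampling (illustrative sketch).
// Each operation's sampling rate is recomputed so that, at its recent
// request rate, roughly `targetSamplesPerSec` traces are kept.
class PerOperationAdaptiveSampler {
  private counts = new Map<string, number>(); // requests seen this window
  private rates = new Map<string, number>();  // current rate per operation
  private readonly windowMs = 60_000;         // recompute every minute

  constructor(
    private readonly targetSamplesPerSec: number,
    private readonly minRate = 0.001,
    private readonly maxRate = 1.0
  ) {
    setInterval(() => this.recomputeRates(), this.windowMs);
  }

  shouldSample(operation: string, traceId: string): boolean {
    this.counts.set(operation, (this.counts.get(operation) ?? 0) + 1);
    const rate = this.rates.get(operation) ?? this.maxRate; // sample new ops fully
    return hashTraceId(traceId) < rate; // hashTraceId from the earlier example
  }

  private recomputeRates(): void {
    for (const [operation, count] of this.counts) {
      const observedPerSec = count / (this.windowMs / 1000);
      // Rate that would yield the target volume at the observed throughput
      const ideal = this.targetSamplesPerSec / Math.max(observedPerSec, 1e-9);
      this.rates.set(
        operation,
        Math.min(this.maxRate, Math.max(this.minRate, ideal))
      );
    }
    this.counts.clear();
  }
}

// Aim for ~10 sampled traces/second per operation, regardless of its volume
const adaptive = new PerOperationAdaptiveSampler(10);
```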
```yaml
# Jaeger adaptive sampling configuration
# This requires jaeger-collector to compute and distribute rates

# Remote sampling endpoint (collector provides rates to SDKs)
sampling:
  strategies_file: /etc/jaeger/sampling_strategies.json
  sampling_refresh_interval: 60s

# Example adaptive sampling strategy
{
  "service_strategies": [
    {
      "service": "order-service",
      "type": "adaptive",
      "options": {
        "sampling_refresh_interval": "60s",
        "sampling_store_type": "cassandra",   # or in-memory
        "initial_sampling_rate": 0.1,
        "target_samples_per_second": 10,
        "min_sampling_rate": 0.001,
        "max_sampling_rate": 1.0
      },
      "operation_strategies": [
        {
          "operation": "GET /api/checkout",
          "target_samples_per_second": 50     # Important operation
        },
        {
          "operation": "GET /api/health",
          "target_samples_per_second": 1      # Noisy, less important
        }
      ]
    }
  ]
}
```

Adaptive sampling needs a central component (such as the Jaeger Collector) to track traffic patterns and distribute updated sampling rates to SDKs. This adds complexity. Ensure your tracing infrastructure supports adaptive sampling before depending on it. OpenTelemetry's native adaptive sampling is still evolving.
Sometimes you need guaranteed sampling for specific traces regardless of probabilistic rates. Priority sampling and debug flags address these scenarios.
```typescript
import { Attributes, Context, createContextKey, Link, SpanKind } from '@opentelemetry/api';
import {
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Context key under which upstream middleware stores the debug flag
const DEBUG_FLAG_KEY = createContextKey('x-trace-debug');

// Priority sampler that always samples certain traces
class PrioritySampler implements Sampler {
  private readonly baseSampler: Sampler;

  constructor(baseSamplingRate: number) {
    // TraceIdRatioBasedSampler provides the probabilistic base rate
    this.baseSampler = new TraceIdRatioBasedSampler(baseSamplingRate);
  }

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    // Priority 1: Always sample if the debug flag is set
    const debugFlag = context.getValue(DEBUG_FLAG_KEY);
    if (debugFlag === 'true') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Priority 2: Always sample high-value users
    const userTier = attributes['user.tier'];
    if (userTier === 'enterprise' || userTier === 'premium') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Priority 3: Always sample critical operations
    const criticalOperations = [
      '/api/checkout',
      '/api/payment/process',
      '/api/refund',
    ];
    if (criticalOperations.some(op => spanName.includes(op))) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Priority 4: Always sample synthetic/canary traffic
    const syntheticHeader = attributes['http.header.x-synthetic-request'];
    if (syntheticHeader === 'true') {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }

    // Default: Fall back to base probabilistic sampling
    return this.baseSampler.shouldSample(
      context, traceId, spanName, spanKind, attributes, links
    );
  }

  toString(): string {
    return 'PrioritySampler';
  }
}
```

Debug flags bypass sampling entirely. If debug is enabled on high-volume traffic (accidentally or maliciously), you can overwhelm your tracing backend. Implement rate limiting specifically for debug-flagged traces. Log and alert when debug usage is high. Consider requiring authentication for debug flag usage.
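One way to build that guardrail, reusing RateLimitingSampler from the rate-limiting example (the 10 traces/second budget and reporting interval are arbitrary illustrations):

```typescript
// Guardrail: honor the debug flag, but only up to a fixed budget.
// Assumes RateLimitingSampler from the rate-limiting example is in scope.
const debugBudget = new RateLimitingSampler(10); // at most 10 debug traces/s

let debugDropsSinceLastReport = 0;

function shouldSampleDebugTrace(traceId: string): boolean {
  if (debugBudget.shouldSample(traceId)) {
    return true; // within budget: keep the debug-flagged trace
  }
  // Over budget: fall back to normal sampling and count the overflow
  debugDropsSinceLastReport += 1;
  return false;
}

// Periodically surface the overflow so heavy debug usage is visible
setInterval(() => {
  if (debugDropsSinceLastReport > 0) {
    console.warn(
      `debug-flagged traces over budget: ${debugDropsSinceLastReport} dropped in the last minute`
    );
    debugDropsSinceLastReport = 0;
  }
}, 60_000);
```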
Tail-based sampling unlocks intelligent policies that consider the complete trace before making sampling decisions. This is the most powerful—and most complex—form of sampling.
```yaml
# OpenTelemetry Collector tail-based sampling configuration

processors:
  tail_sampling:
    # How long to wait for all spans of a trace
    decision_wait: 10s
    # Number of traces to hold in memory
    num_traces: 100000
    # Expected number of new traces per second
    expected_new_traces_per_sec: 10000

    policies:
      # Policy 1: Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Policy 2: Always keep slow traces (> 2 seconds)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Policy 3: Always keep traces with specific attributes
      - name: important-users-policy
        type: string_attribute
        string_attribute:
          key: user.tier
          values: [enterprise, premium, vip]

      # Policy 4: Sample 100% of specific operations
      - name: checkout-policy
        type: string_attribute
        string_attribute:
          key: http.route
          values: [/api/checkout, /api/payment]

      # Policy 5: Rate limit everything else
      - name: default-policy
        type: rate_limiting
        rate_limiting:
          spans_per_second: 500

      # Policy 6: Composite policy example
      - name: composite-policy
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [errors-policy, latency-policy, default-policy]
          rate_allocation:
            - policy: errors-policy
              percent: 50
            - policy: latency-policy
              percent: 30
            - policy: default-policy
              percent: 20

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]  # Tail sampling here
      exporters: [jaeger]
```

Common Tail-Based Policies:
| Policy Type | Description | Use Case |
|---|---|---|
| status_code | Sample based on span status | Keep all error traces |
| latency | Sample based on trace duration | Keep slow traces for analysis |
| string_attribute | Sample based on attribute values | Keep traces with a specific user/tenant |
| numeric_attribute | Sample based on numeric comparisons | Keep traces where order value > $1000 |
| probabilistic | Random sampling at the tail | Base sampling for non-priority traces |
| rate_limiting | Cap traces per second | Protect storage from spikes |
| composite | Combine multiple policies | Complex multi-rule scenarios |
If you implement only one tail-based policy, make it 'always keep error traces.' This single policy eliminates the biggest weakness of head-based probabilistic sampling: dropped error traces. Combined with head-based 10% sampling, you get representative normal traces plus complete error coverage.
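For the head-based half of that combination, a minimal OpenTelemetry Node SDK sketch (the 0.10 ratio mirrors the tip above; exporter wiring is omitted):

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Head-based 10% sampling at the SDK, respecting upstream decisions:
// - root spans: sampled with probability 0.10 based on the trace ID
// - child spans: follow whatever the parent already decided
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.10),
  }),
});

provider.register();
```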
Let's consolidate everything into actionable best practices for production sampling.
```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                      RECOMMENDED SAMPLING ARCHITECTURE                       │
└─────────────────────────────────────────────────────────────────────────────┘

LAYER 1: SDK (Head-Based Sampling)
─────────────────────────────────────────────────────────────────────────
  • Probabilistic 10-20% base sampling
  • Priority: Always sample debug flags and critical operations
  • Benefit: Reduces network traffic between apps and collector

LAYER 2: OpenTelemetry Collector (Tail-Based / Additional Filtering)
─────────────────────────────────────────────────────────────────────────
  • Tail-based policies:
      - 100% errors
      - 100% latency > P99
      - 100% high-value user tiers
  • Rate limiting: Max 1000 traces/second as a safety cap
  • Benefit: Intelligent selection based on complete trace content

LAYER 3: Storage Backend (Retention Policies)
─────────────────────────────────────────────────────────────────────────
  • Error traces: 30-day retention
  • Normal traces: 7-day retention
  • Aggregate data: 90-day retention
  • Benefit: Different retention for different trace importance

      ┌─────────────┐
      │ Application │
      │    SDKs     │
      └──────┬──────┘
             │  10-20% sampled (head-based)
             │  + priority traces
             ▼
      ┌─────────────┐
      │    OTel     │
      │  Collector  │
      └──────┬──────┘
             │  Tail-based refinement
             │  Rate limiting cap
             ▼
      ┌─────────────┐
      │   Storage   │◀─── Tiered retention policies
      │   Backend   │
      └─────────────┘
```

Congratulations! You've completed the Distributed Tracing module. You now understand: why tracing matters in distributed systems; the trace and span data model; how context propagation works; Jaeger and Zipkin architectures; and sampling strategies for production workloads. You have the knowledge to implement, operate, and optimize distributed tracing for any scale of system.