Once you understand traces, spans, and context propagation, you need a system to collect, store, and visualize them. The distributed tracing ecosystem has evolved significantly, with two open-source systems standing out as the foundational options: Jaeger and Zipkin.
These aren't just alternatives—they represent different philosophies, originated from different companies, and have different strengths. Understanding both gives you the knowledge to make informed decisions about your tracing infrastructure, whether you deploy them directly or use managed services built on top of them.
This page will make you an expert in both systems, their architectures, and how to choose between them.
By the end of this page, you will understand: the architecture of Jaeger and its components; Zipkin's architecture and design philosophy; deployment options for both systems; storage backends and scaling considerations; key differences and selection criteria; and how these systems fit into the modern OpenTelemetry ecosystem.
To understand Jaeger and Zipkin, it helps to understand their origins.
2010 — Google Dapper Paper Google published the Dapper paper describing their internal distributed tracing system. This paper became the blueprint for all subsequent tracing systems, introducing concepts like trace trees, span annotations, and sampling.
2012 — Twitter Creates Zipkin Twitter built Zipkin, heavily inspired by Dapper, to solve their microservices observability challenges. They open-sourced it, making it the first widely available distributed tracing system.
2015 — Uber Creates Jaeger As Uber's microservices grew to thousands of services, they built Jaeger (German for 'hunter') to meet their specific needs: high throughput, cloud-native deployment, and Kubernetes-friendly architecture.
2017 — CNCF Adoption Jaeger was donated to the Cloud Native Computing Foundation (CNCF), accelerating its adoption and signaling its importance in the cloud-native ecosystem.
2019 — OpenTelemetry Emerges OpenTelemetry unified tracing, metrics, and logging standards. Both Jaeger and Zipkin adapted to work with OpenTelemetry, becoming backend options for a common instrumentation layer.
| Aspect | Jaeger | Zipkin |
|---|---|---|
| Created by | Uber (2015) | Twitter (2012) |
| Governance | CNCF Graduated Project | Independent open-source |
| Primary language | Go (backend), React (UI) | Java (backend), React (UI) |
| Cloud-native focus | Born cloud-native, Kubernetes-first | Adapted over time, supports Kubernetes |
| OpenTelemetry support | Native OTLP support | Requires collector/adapter |
| Adoption | CNCF ecosystem, Kubernetes users | Established enterprise, Spring ecosystem |
With OpenTelemetry becoming the standard for instrumentation, the differences between Jaeger and Zipkin matter less for instrumentation and more for backend capabilities. You instrument once with OpenTelemetry and can send to either system—or switch between them.
Jaeger is designed as a distributed system itself, with components that can be deployed independently and scaled separately. This architecture supports high-throughput production deployments.
Jaeger's data flow: instrumented services (with a tracing SDK) emit span data over UDP or HTTP to the Jaeger Agent; the agent forwards batched spans over gRPC or HTTP to the Jaeger Collector; the collector writes to the storage backend; and the Jaeger Query service reads from storage to serve the Jaeger UI.

Jaeger Agent (lightweight daemon, runs as a sidecar or DaemonSet):
- Receives spans from applications via UDP/HTTP
- Batches spans for efficiency
- Handles sampling decisions (adaptive sampling)
- Provides service discovery

Jaeger Collector:
- Receives spans from agents (or directly from applications)
- Validates and transforms spans
- Writes to the storage backend
- Horizontally scalable (stateless)

Storage Backend, with several options:
- Elasticsearch / OpenSearch (recommended for production)
- Cassandra (high-volume, limited query flexibility)
- Kafka (as a buffer or for stream processing)
- In-memory (development/testing only)
- gRPC storage plugin (custom backends)

Jaeger Query:
- Serves the Jaeger UI
- REST API for trace queries
- gRPC API for integrations
- Read-only, horizontally scalable

Jaeger UI:
- Trace search and visualization
- Service dependency graph
- Trace comparison (diff view)
- Monitor view (RED metrics from traces)

Jaeger provides an all-in-one binary that bundles the agent, collector, query service, and an in-memory storage backend. This is perfect for local development: `docker run -p 16686:16686 jaegertracing/all-in-one`. For production, always deploy the components separately with persistent storage.
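For a slightly more complete local setup than the single docker run above, here is a minimal docker-compose sketch. The ports are Jaeger's published defaults; the image tag and the COLLECTOR_OTLP_ENABLED flag are assumptions to verify against the Jaeger version you actually run:

```yaml
# Minimal local Jaeger all-in-one (development only - in-memory storage).
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      # Enables the OTLP receiver on recent all-in-one images (assumption: check your version).
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      - "14268:14268"   # Collector HTTP endpoint (jaeger.thrift)
      - "6831:6831/udp" # Agent UDP endpoint (jaeger.thrift compact)
```

Point your OpenTelemetry SDK or agent at localhost:4317 (gRPC) or localhost:4318 (HTTP), then open the UI at http://localhost:16686.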
Zipkin has a simpler, more monolithic architecture compared to Jaeger. It consolidates functionality into fewer components, which can be easier to deploy for smaller-scale systems.
Instrumented services (using a Zipkin client library) send span data via HTTP POST directly to the Zipkin server, a single deployable unit with embedded components:

Collector:
- HTTP /api/v2/spans endpoint
- Kafka consumer (optional)
- gRPC receiver (optional)
- Thrift receiver (legacy)

Storage:
- Elasticsearch / OpenSearch
- Cassandra
- MySQL
- In-memory (testing only)

Query API:
- REST API at /api/v2/traces and /api/v2/services
- Dependency graph API

Web UI:
- Trace search and visualization
- Service dependency graph
- Trace annotation view

An optional variant places a Kafka buffer in front of the server: Services → Kafka → Zipkin Collector → Storage.

Key Architectural Differences from Jaeger:
1. No Agent Component Zipkin doesn't have an agent layer. Applications send spans directly to the Zipkin server (or to Kafka). This simplifies deployment but shifts batching and sampling to the client libraries.
2. Single Server Process While Zipkin can be scaled horizontally, the collector, query, and UI are typically bundled in one deployable unit. This is simpler for small deployments but requires careful sizing for scale.
3. Libraries-First Design Zipkin has a rich ecosystem of client libraries (Brave for Java, Zipkin-js, etc.) that handle sampling and batching. The server assumes well-behaved clients.
4. Simpler Deployment A single docker run gets you a working Zipkin instance (see the sketch after this list). No sidecars, no DaemonSets. This makes Zipkin attractive for simpler environments.
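As a concrete illustration of that simplicity, here is a minimal docker-compose sketch for Zipkin backed by Elasticsearch. The elasticsearch hostname is an assumption (point ES_HOSTS at whatever cluster you actually run); with no storage variables set, Zipkin falls back to in-memory storage:

```yaml
# Minimal Zipkin server; one container serves the collector, query API, and UI.
version: "3.8"
services:
  zipkin:
    image: openzipkin/zipkin:latest
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      # Optional: consume spans from Kafka instead of (or in addition to) HTTP POST.
      # - KAFKA_BOOTSTRAP_SERVERS=kafka:9092
    ports:
      - "9411:9411"   # UI, query API, and the /api/v2/spans collector endpoint
```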
Zipkin has particularly strong integration with the Spring ecosystem via Spring Cloud Sleuth. If you're running a primarily Spring-based architecture, this integration provides near-zero-configuration tracing out of the box.
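A hedged sketch of what that near-zero configuration looks like, assuming a Spring Boot 2.x service with the spring-cloud-starter-sleuth and spring-cloud-sleuth-zipkin dependencies on the classpath (Spring Boot 3 replaces Sleuth with Micrometer Tracing and uses different property names):

```yaml
# application.yml - Sleuth auto-instruments HTTP clients/servers and reports spans to Zipkin.
spring:
  application:
    name: checkout-service        # Becomes the service name shown in Zipkin (hypothetical)
  zipkin:
    base-url: http://zipkin:9411  # Where spans are POSTed (assumed hostname)
  sleuth:
    sampler:
      probability: 0.1            # Sample 10% of requests
```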
Storage is often the most critical decision in your tracing infrastructure. The storage backend determines query capabilities, retention costs, and operational complexity.
| Backend | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Elasticsearch / OpenSearch | Excellent query capabilities; full-text search; tag filtering; aggregations | Resource intensive; requires tuning for write-heavy loads | Production workloads needing rich querying and exploration |
| Cassandra | Extreme write throughput; linear scalability; low latency writes | Limited query flexibility; no text search; requires trace ID for lookup | Very high-volume environments; write-heavy workloads with known trace IDs |
| PostgreSQL / MySQL | Simple operations; familiar technology; ACID transactions | Limited scalability; not designed for append-heavy workloads | Small deployments; development environments |
| Kafka (as intermediary) | Decouples collectors from storage; enables stream processing; replay capability | Not a storage backend itself; adds complexity | Pattern where you need real-time processing or multi-destination routing |
| In-memory | Zero setup; instant queries | Data lost on restart; limited by RAM | Local development only |
Elasticsearch Configuration for Tracing:
Elasticsearch/OpenSearch is the most common production choice. Here are key considerations:
{ "index.mapping.nested_fields.limit": 50, "index.requests.cache.enable": true, // Tuning for trace workloads: "number_of_shards": 5, // More shards for write parallelism "number_of_replicas": 1, // Production needs replicas "refresh_interval": "5s", // Increase from 1s for write performance // Lifecycle policy for retention: "index.lifecycle.name": "jaeger-traces-policy", "index.lifecycle.rollover_alias": "jaeger-span-write", // ILM Policy phases: // - Hot: 0-3 days (SSD, high performance) // - Warm: 3-14 days (HDD, reduced replicas) // - Cold: 14-30 days (frozen, snapshot to S3) // - Delete: >30 days}Tracing at scale generates enormous data volumes. A medium-sized company with 50 services and 10K requests/second can generate 500K+ spans/second. At 1KB per span, that's ~40TB/day with no sampling. Always implement sampling and retention policies. Most teams keep detailed traces for 7-14 days and sampled/aggregated data longer.
How you deploy your tracing infrastructure significantly impacts reliability, performance, and operational complexity.
```yaml
# Jaeger Operator based deployment (recommended approach)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
spec:
  strategy: production  # Separates collector, query, and ingester
  collector:
    replicas: 3
    resources:
      limits:
        cpu: "2"
        memory: "4Gi"
      requests:
        cpu: "500m"
        memory: "1Gi"
    autoscale: true
    minReplicas: 2
    maxReplicas: 10
  query:
    replicas: 2
    resources:
      limits:
        cpu: "1"
        memory: "2Gi"
    # Optionally expose via Ingress
    serviceType: ClusterIP
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        index-prefix: jaeger
        tls:
          ca: /es/certificates/ca.crt
        num-shards: 5
        num-replicas: 1
  # Agent can be deployed as sidecar OR DaemonSet
  agent:
    strategy: DaemonSet  # One agent per node
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
  # Enable sampling configuration
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1  # 10% sampling
```

The OpenTelemetry Collector pattern is increasingly the recommended approach. Instrument with OpenTelemetry, collect with the OTel Collector, and export to your backend of choice. This decouples your applications from your observability backend, making it easy to switch, add backends, or route to multiple destinations.
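A minimal sketch of that pattern, assuming an OpenTelemetry Collector sitting in front of both backends (the endpoint hostnames are placeholders; recent Jaeger versions accept OTLP directly, which is why a plain otlp exporter is used for it):

```yaml
# otel-collector-config.yaml - receive OTLP, batch, and fan out to Jaeger and Zipkin.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:          # Batch spans before export to reduce request overhead

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317          # Assumed in-cluster service name
    tls:
      insecure: true
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, zipkin]
```

Removing or adding an exporter in this file changes where traces go without touching any application code.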
Let's compare Jaeger and Zipkin across the features that matter for production use.
| Feature | Jaeger | Zipkin |
|---|---|---|
| Trace search & filtering | Rich filtering by service, operation, tags, duration | Good filtering, somewhat less flexible |
| Trace visualization | Excellent waterfall with timing, comparison view | Good waterfall, slightly less detailed |
| Dependency graph | Service dependency DAG from traces | Service dependency DAG from traces |
| Adaptive sampling | Built-in adaptive sampling support | Client-side only, no server-side adaptive |
| Native OTLP support | Yes, first-class OTLP receiver | Requires collector translation |
| Kubernetes integration | Jaeger Operator for declarative management | Standard Kubernetes manifests |
| Streaming support | Spark and Flink integration | Limited stream processing |
| Multi-tenancy | Limited (requires workarounds) | Limited (requires workarounds) |
| Trace comparison | Yes, diff two traces side-by-side | No built-in comparison |
| System Architecture view | Monitor tab with RED metrics | Dependencies view only |
Beyond Jaeger and Zipkin, consider: Grafana Tempo (cost-effective, integrates with Grafana), AWS X-Ray (managed, integrates with AWS), Google Cloud Trace (managed, GCP integration), Datadog APM, Honeycomb, Lightstep, and others. Many organizations start with Jaeger/Zipkin and migrate to managed services as they scale.
OpenTelemetry has become the standard for instrumentation, and both Jaeger and Zipkin work within this ecosystem. Understanding how they fit together is crucial for modern tracing deployments.
In the modern tracing architecture, your applications (Node.js, Java, Go, Python, and so on) are instrumented with OpenTelemetry SDKs and all export via the OTLP protocol (gRPC or HTTP) to the OpenTelemetry Collector, a central collection, processing, and routing layer. Inside the Collector, receivers (OTLP) feed processors (batch, sample, filter), which feed exporters that fan out to one or more backends:

- Jaeger: the full Jaeger stack with an OTLP receiver and Elasticsearch storage
- Grafana Tempo: object storage, cost-effective, Grafana integration
- Commercial vendors: Datadog, Honeycomb, etc.

Benefits of this architecture:
- Instrument once with OpenTelemetry, send anywhere
- Change backends without re-instrumenting
- Route to multiple backends simultaneously
- Apply sampling/filtering centrally
- Add new exporters without application changes

Key Integration Points:
1. Instrumentation → Collector Applications instrumented with OpenTelemetry SDKs export via OTLP (OpenTelemetry Protocol). The Collector receives this data (see the environment-variable sketch after this list).
2. Collector → Backend The Collector has exporters for Jaeger, Zipkin, Tempo, and commercial vendors. You configure which backend(s) receive data.
3. Backend → Visualization Jaeger has its own UI. Zipkin has its own UI. Tempo integrates with Grafana. Most commercial vendors have integrated UIs.
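To make point 1 concrete, here is a sketch of the standard OpenTelemetry SDK environment variables, shown as a Kubernetes container env block (the service name and Collector address are hypothetical):

```yaml
# Container spec excerpt - the OTel SDK reads these at startup.
env:
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"             # Hypothetical service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"   # Assumed Collector service address
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"     # Respect the parent span's sampling decision
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                          # Sample 10% of new traces
```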
Why This Matters:
This architecture decouples instrumentation from storage. You can:
- Start with Jaeger and later switch to Zipkin, Tempo, or a managed service
- Route the same traces to multiple backends at once, for example during a migration or a vendor evaluation
- Apply sampling and filtering centrally in the Collector instead of in every service

All without changing your application code.
If you're starting a new tracing implementation today, use OpenTelemetry for instrumentation. Jaeger and Zipkin are becoming 'backends' in the OTel ecosystem rather than complete solutions. OTel provides a vendor-neutral, future-proof foundation that all major observability vendors support.
We've deeply explored the two foundational open-source tracing systems. Let's consolidate:
- Jaeger is a distributed, cloud-native system (agent, collector, query, UI) governed by the CNCF, with native OTLP support and first-class Kubernetes tooling via the Jaeger Operator.
- Zipkin is a simpler, largely monolithic server that is easy to deploy and has strong Spring ecosystem integration.
- Storage is the critical scaling decision: Elasticsearch/OpenSearch for rich querying, Cassandra for extreme write volume, plus sampling and retention policies to keep costs under control.
- With OpenTelemetry instrumentation and the OTel Collector, both systems become interchangeable backends rather than complete solutions.
What's Next:
With infrastructure decisions understood, the final critical topic is Sampling Strategies. At scale, you cannot store every span—the volume and cost are prohibitive. The next page explores head-based vs. tail-based sampling, adaptive sampling, and how to sample intelligently without losing visibility into critical traces.
You now have comprehensive knowledge of Jaeger and Zipkin—their architectures, deployment patterns, storage options, and how they fit into the OpenTelemetry ecosystem. You can make informed decisions about which system to use and how to deploy it for your organization's needs.