Once you understand traces, spans, and context propagation, you need a system to collect, store, and visualize them. The distributed tracing ecosystem has evolved significantly, with two open-source systems standing out as the foundational options: Jaeger and Zipkin.
These aren't just alternatives—they represent different philosophies, originated from different companies, and have different strengths. Understanding both gives you the knowledge to make informed decisions about your tracing infrastructure, whether you deploy them directly or use managed services built on top of them.
This page will make you an expert in both systems, their architectures, and how to choose between them.
By the end of this page, you will understand: the architecture of Jaeger and its components; Zipkin's architecture and design philosophy; deployment options for both systems; storage backends and scaling considerations; key differences and selection criteria; and how these systems fit into the modern OpenTelemetry ecosystem.
To understand Jaeger and Zipkin, it helps to understand their origins.
2010 — Google Dapper Paper Google published the Dapper paper describing their internal distributed tracing system. This paper became the blueprint for all subsequent tracing systems, introducing concepts like trace trees, span annotations, and sampling.
2012 — Twitter Creates Zipkin Twitter built Zipkin, heavily inspired by Dapper, to solve their microservices observability challenges. They open-sourced it, making it the first widely available distributed tracing system.
2015 — Uber Creates Jaeger As Uber's microservices grew to thousands of services, they built Jaeger (German for 'hunter') to meet their specific needs: high throughput, cloud-native deployment, and Kubernetes-friendly architecture.
2017 — CNCF Adoption Jaeger was donated to the Cloud Native Computing Foundation (CNCF), accelerating its adoption and signaling its importance in the cloud-native ecosystem.
2019 — OpenTelemetry Emerges OpenTelemetry unified tracing, metrics, and logging standards. Both Jaeger and Zipkin adapted to work with OpenTelemetry, becoming backend options for a common instrumentation layer.
| Aspect | Jaeger | Zipkin |
|---|---|---|
| Created by | Uber (2015) | Twitter (2012) |
| Governance | CNCF Graduated Project | Independent open-source |
| Primary language | Go (backend), React (UI) | Java (backend), React (UI) |
| Cloud-native focus | Born cloud-native, Kubernetes-first | Adapted over time, supports Kubernetes |
| OpenTelemetry support | Native OTLP support | Requires collector/adapter |
| Adoption | CNCF ecosystem, Kubernetes users | Established enterprise, Spring ecosystem |
With OpenTelemetry becoming the standard for instrumentation, the differences between Jaeger and Zipkin matter less for instrumentation and more for backend capabilities. You instrument once with OpenTelemetry and can send to either system—or switch between them.
Jaeger is designed as a distributed system itself, with components that can be deployed independently and scaled separately. This architecture supports high-throughput production deployments.
Jaeger's data flow: instrumented services (with a tracing SDK) emit span data over UDP or HTTP to the Jaeger Agent; the agent forwards batched spans over gRPC or HTTP to the Jaeger Collector; the collector writes to the storage backend; and the Jaeger Query service reads from storage to serve the Jaeger UI.

Jaeger Agent (lightweight daemon, runs as a sidecar or DaemonSet):
- Receives spans from applications via UDP/HTTP
- Batches spans for efficiency
- Handles sampling decisions (adaptive sampling)
- Provides service discovery

Jaeger Collector:
- Receives spans from agents (or directly from applications)
- Validates and transforms spans
- Writes to the storage backend
- Horizontally scalable (stateless)

Storage Backend, with several options:
- Elasticsearch / OpenSearch (recommended for production)
- Cassandra (high-volume, limited query flexibility)
- Kafka (as a buffer or for stream processing)
- In-memory (development/testing only)
- gRPC storage plugin (custom backends)

Jaeger Query:
- Serves the Jaeger UI
- REST API for trace queries
- gRPC API for integrations
- Read-only, horizontally scalable

Jaeger UI:
- Trace search and visualization
- Service dependency graph
- Trace comparison (diff view)
- Monitor view (RED metrics from traces)

Jaeger provides an all-in-one binary that bundles the agent, collector, query service, and an in-memory storage backend. This is perfect for local development: `docker run -p 16686:16686 jaegertracing/all-in-one`. For production, always deploy the components separately with persistent storage.
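For a slightly more complete local setup than the single docker run above, here is a minimal docker-compose sketch. The ports are Jaeger's published defaults; the image tag and the COLLECTOR_OTLP_ENABLED flag are assumptions to verify against the Jaeger version you actually run:

```yaml
# Minimal local Jaeger all-in-one (development only - in-memory storage).
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      # Enables the OTLP receiver on recent all-in-one images (assumption: check your version).
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
      - "14268:14268"   # Collector HTTP endpoint (jaeger.thrift)
      - "6831:6831/udp" # Agent UDP endpoint (jaeger.thrift compact)
```

Point your OpenTelemetry SDK or agent at localhost:4317 (gRPC) or localhost:4318 (HTTP), then open the UI at http://localhost:16686.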
Zipkin has a simpler, more monolithic architecture compared to Jaeger. It consolidates functionality into fewer components, which can be easier to deploy for smaller-scale systems.
Instrumented services (using a Zipkin client library) send span data via HTTP POST directly to the Zipkin server, a single deployable unit with embedded components:

Collector:
- HTTP /api/v2/spans endpoint
- Kafka consumer (optional)
- gRPC receiver (optional)
- Thrift receiver (legacy)

Storage:
- Elasticsearch / OpenSearch
- Cassandra
- MySQL
- In-memory (testing only)

Query API:
- REST API at /api/v2/traces and /api/v2/services
- Dependency graph API

Web UI:
- Trace search and visualization
- Service dependency graph
- Trace annotation view

An optional variant places a Kafka buffer in front of the server: Services → Kafka → Zipkin Collector → Storage.

Key Architectural Differences from Jaeger:
1. No Agent Component Zipkin doesn't have an agent layer. Applications send spans directly to the Zipkin server (or to Kafka). This simplifies deployment but shifts batching and sampling to the client libraries.
2. Single Server Process While Zipkin can be scaled horizontally, the collector, query, and UI are typically bundled in one deployable unit. This is simpler for small deployments but requires careful sizing for scale.
3. Libraries-First Design Zipkin has a rich ecosystem of client libraries (Brave for Java, Zipkin-js, etc.) that handle sampling and batching. The server assumes well-behaved clients.
4. Simpler Deployment A single docker run gets you a working Zipkin instance (see the sketch after this list). No sidecars, no DaemonSets. This makes Zipkin attractive for simpler environments.
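As a concrete illustration of that simplicity, here is a minimal docker-compose sketch for Zipkin backed by Elasticsearch. The elasticsearch hostname is an assumption (point ES_HOSTS at whatever cluster you actually run); with no storage variables set, Zipkin falls back to in-memory storage:

```yaml
# Minimal Zipkin server; one container serves the collector, query API, and UI.
version: "3.8"
services:
  zipkin:
    image: openzipkin/zipkin:latest
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      # Optional: consume spans from Kafka instead of (or in addition to) HTTP POST.
      # - KAFKA_BOOTSTRAP_SERVERS=kafka:9092
    ports:
      - "9411:9411"   # UI, query API, and the /api/v2/spans collector endpoint
```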
Zipkin has particularly strong integration with the Spring ecosystem via Spring Cloud Sleuth. If you're running a primarily Spring-based architecture, this integration provides near-zero-configuration tracing out of the box.
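A hedged sketch of what that near-zero configuration looks like, assuming a Spring Boot 2.x service with the spring-cloud-starter-sleuth and spring-cloud-sleuth-zipkin dependencies on the classpath (Spring Boot 3 replaces Sleuth with Micrometer Tracing and uses different property names):

```yaml
# application.yml - Sleuth auto-instruments HTTP clients/servers and reports spans to Zipkin.
spring:
  application:
    name: checkout-service        # Becomes the service name shown in Zipkin (hypothetical)
  zipkin:
    base-url: http://zipkin:9411  # Where spans are POSTed (assumed hostname)
  sleuth:
    sampler:
      probability: 0.1            # Sample 10% of requests
```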
Storage is often the most critical decision in your tracing infrastructure. The storage backend determines query capabilities, retention costs, and operational complexity.
| Backend | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Elasticsearch / OpenSearch | Excellent query capabilities; full-text search; tag filtering; aggregations | Resource intensive; requires tuning for write-heavy loads | Production workloads needing rich querying and exploration |
| Cassandra | Extreme write throughput; linear scalability; low latency writes | Limited query flexibility; no text search; requires trace ID for lookup | Very high-volume environments; write-heavy workloads with known trace IDs |
| PostgreSQL / MySQL | Simple operations; familiar technology; ACID transactions | Limited scalability; not designed for append-heavy workloads | Small deployments; development environments |
| Kafka (as intermediary) | Decouples collectors from storage; enables stream processing; replay capability | Not a storage backend itself; adds complexity | Pattern where you need real-time processing or multi-destination routing |
| In-memory | Zero setup; instant queries | Data lost on restart; limited by RAM | Local development only |
Elasticsearch Configuration for Tracing:
Elasticsearch/OpenSearch is the most common production choice. Here are key considerations:
{ "index.mapping.nested_fields.limit": 50, "index.requests.cache.enable": true, // Tuning for trace workloads: "number_of_shards": 5, // More shards for write parallelism "number_of_replicas": 1, // Production needs replicas "refresh_interval": "5s", // Increase from 1s for write performance // Lifecycle policy for retention: "index.lifecycle.name": "jaeger-traces-policy", "index.lifecycle.rollover_alias": "jaeger-span-write", // ILM Policy phases: // - Hot: 0-3 days (SSD, high performance) // - Warm: 3-14 days (HDD, reduced replicas) // - Cold: 14-30 days (frozen, snapshot to S3) // - Delete: >30 days}Tracing at scale generates enormous data volumes. A medium-sized company with 50 services and 10K requests/second can generate 500K+ spans/second. At 1KB per span, that's ~40TB/day with no sampling. Always implement sampling and retention policies. Most teams keep detailed traces for 7-14 days and sampled/aggregated data longer.
How you deploy your tracing infrastructure significantly impacts reliability, performance, and operational complexity.
```yaml
# Jaeger Operator based deployment (recommended approach)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
spec:
  strategy: production  # Separates collector, query, and ingester
  collector:
    replicas: 3
    resources:
      limits:
        cpu: "2"
        memory: "4Gi"
      requests:
        cpu: "500m"
        memory: "1Gi"
    autoscale: true
    minReplicas: 2
    maxReplicas: 10
  query:
    replicas: 2
    resources:
      limits:
        cpu: "1"
        memory: "2Gi"
    # Optionally expose via Ingress
    serviceType: ClusterIP
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        index-prefix: jaeger
        tls:
          ca: /es/certificates/ca.crt
        num-shards: 5
        num-replicas: 1
  # Agent can be deployed as sidecar OR DaemonSet
  agent:
    strategy: DaemonSet  # One agent per node
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
  # Enable sampling configuration
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1  # 10% sampling
```

The OpenTelemetry Collector pattern is increasingly the recommended approach. Instrument with OpenTelemetry, collect with the OTel Collector, and export to your backend of choice. This decouples your applications from your observability backend, making it easy to switch, add backends, or route to multiple destinations.
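A minimal sketch of that pattern, assuming an OpenTelemetry Collector sitting in front of both backends (the endpoint hostnames are placeholders; recent Jaeger versions accept OTLP directly, which is why a plain otlp exporter is used for it):

```yaml
# otel-collector-config.yaml - receive OTLP, batch, and fan out to Jaeger and Zipkin.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:          # Batch spans before export to reduce request overhead

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317          # Assumed in-cluster service name
    tls:
      insecure: true
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, zipkin]
```

Removing or adding an exporter in this file changes where traces go without touching any application code.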
Let's compare Jaeger and Zipkin across the features that matter for production use.
| Feature | Jaeger | Zipkin |
|---|---|---|
| Trace search & filtering | Rich filtering by service, operation, tags, duration | Good filtering, somewhat less flexible |
| Trace visualization | Excellent waterfall with timing, comparison view | Good waterfall, slightly less detailed |
| Dependency graph | Service dependency DAG from traces | Service dependency DAG from traces |
| Adaptive sampling | Built-in adaptive sampling support | Client-side only, no server-side adaptive |
| Native OTLP support | Yes, first-class OTLP receiver | Requires collector translation |
| Kubernetes integration | Jaeger Operator for declarative management | Standard Kubernetes manifests |
| Streaming support | Spark and Flink integration | Limited stream processing |
| Multi-tenancy | Limited (requires workarounds) | Limited (requires workarounds) |
| Trace comparison | Yes, diff two traces side-by-side | No built-in comparison |
| System Architecture view | Monitor tab with RED metrics | Dependencies view only |
Beyond Jaeger and Zipkin, consider: Grafana Tempo (cost-effective, integrates with Grafana), AWS X-Ray (managed, integrates with AWS), Google Cloud Trace (managed, GCP integration), Datadog APM, Honeycomb, Lightstep, and others. Many organizations start with Jaeger/Zipkin and migrate to managed services as they scale.
OpenTelemetry has become the standard for instrumentation, and both Jaeger and Zipkin work within this ecosystem. Understanding how they fit together is crucial for modern tracing deployments.
In the modern tracing architecture, your applications (Node.js, Java, Go, Python, and so on) are instrumented with OpenTelemetry SDKs and all export via the OTLP protocol (gRPC or HTTP) to the OpenTelemetry Collector, a central collection, processing, and routing layer. Inside the Collector, receivers (OTLP) feed processors (batch, sample, filter), which feed exporters that fan out to one or more backends:

- Jaeger: the full Jaeger stack with an OTLP receiver and Elasticsearch storage
- Grafana Tempo: object storage, cost-effective, Grafana integration
- Commercial vendors: Datadog, Honeycomb, etc.

Benefits of this architecture:
- Instrument once with OpenTelemetry, send anywhere
- Change backends without re-instrumenting
- Route to multiple backends simultaneously
- Apply sampling/filtering centrally
- Add new exporters without application changes

Key Integration Points:
1. Instrumentation → Collector Applications instrumented with OpenTelemetry SDKs export via OTLP (OpenTelemetry Protocol). The Collector receives this data (see the environment-variable sketch after this list).
2. Collector → Backend The Collector has exporters for Jaeger, Zipkin, Tempo, and commercial vendors. You configure which backend(s) receive data.
3. Backend → Visualization Jaeger has its own UI. Zipkin has its own UI. Tempo integrates with Grafana. Most commercial vendors have integrated UIs.
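To make point 1 concrete, here is a sketch of the standard OpenTelemetry SDK environment variables, shown as a Kubernetes container env block (the service name and Collector address are hypothetical):

```yaml
# Container spec excerpt - the OTel SDK reads these at startup.
env:
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"             # Hypothetical service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"   # Assumed Collector service address
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"     # Respect the parent span's sampling decision
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                          # Sample 10% of new traces
```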
Why This Matters:
This architecture decouples instrumentation from storage. You can:
- Start with Jaeger and later switch to Zipkin, Tempo, or a managed service
- Route the same traces to multiple backends at once, for example during a migration or a vendor evaluation
- Apply sampling and filtering centrally in the Collector instead of in every service

All without changing your application code.
If you're starting a new tracing implementation today, use OpenTelemetry for instrumentation. Jaeger and Zipkin are becoming 'backends' in the OTel ecosystem rather than complete solutions. OTel provides a vendor-neutral, future-proof foundation that all major observability vendors support.
We've deeply explored the two foundational open-source tracing systems. Let's consolidate:
- Jaeger is a distributed, cloud-native system (agent, collector, query, UI) governed by the CNCF, with native OTLP support and first-class Kubernetes tooling via the Jaeger Operator.
- Zipkin is a simpler, largely monolithic server that is easy to deploy and has strong Spring ecosystem integration.
- Storage is the critical scaling decision: Elasticsearch/OpenSearch for rich querying, Cassandra for extreme write volume, plus sampling and retention policies to keep costs under control.
- With OpenTelemetry instrumentation and the OTel Collector, both systems become interchangeable backends rather than complete solutions.
What's Next:
With infrastructure decisions understood, the final critical topic is Sampling Strategies. At scale, you cannot store every span—the volume and cost are prohibitive. The next page explores head-based vs. tail-based sampling, adaptive sampling, and how to sample intelligently without losing visibility into critical traces.
You now have comprehensive knowledge of Jaeger and Zipkin—their architectures, deployment patterns, storage options, and how they fit into the OpenTelemetry ecosystem. You can make informed decisions about which system to use and how to deploy it for your organization's needs.