At the core of every service discovery system lies a service registry—a distributed database that maintains the inventory of available service instances. The registry must answer a deceptively simple question: "Where are the instances of service X right now?"
But the simplicity is deceiving. The registry must:

- Stay current as instances are added, removed, scaled, and redeployed
- Detect and remove failed instances before clients are routed to them
- Remain available and correct even when some of its own nodes fail
- Serve a high volume of lookups with low latency
Three systems have emerged as the dominant production service registries: Consul, etcd, and Apache Zookeeper. Each has distinct design philosophies, trade-offs, and sweet spots.
By the end of this page, you will deeply understand the architecture of each registry, their consistency models and CAP theorem positioning, operational characteristics and failure modes, when to choose each for your specific requirements, and how to evaluate registries for new projects.
Apache Zookeeper was created at Yahoo in the mid-2000s to solve coordination challenges in their distributed systems. It became a foundational component of the Hadoop ecosystem and remains widely deployed, particularly in organizations using Apache Kafka, HBase, or Solr.
Design Philosophy
Zookeeper is fundamentally a distributed coordination service, not specifically a service registry. It provides low-level primitives from which higher-level coordination patterns (including service discovery) can be built:

- Znodes: a hierarchical namespace of small data nodes, similar to a filesystem
- Ephemeral nodes: znodes that are deleted automatically when the creating client's session ends
- Sequential nodes: znodes with monotonically increasing suffixes, useful for ordering and leader election
- Watches: notifications that fire when a znode or its children change
Service discovery is built on these primitives: services create ephemeral nodes under a path like /services/payment/instances/, and clients watch that path for changes.
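For a concrete feel of the pattern, here is a minimal registration-and-watch sketch in TypeScript, assuming the community node-zookeeper-client package (its exact API, and the pre-created parent znodes, are assumptions). The paths mirror the tree below:

```typescript
// Minimal sketch: register this instance as an ephemeral znode and watch the
// parent path for membership changes. Assumes the community
// "node-zookeeper-client" package and that the persistent parent znodes
// (/services/payment-service/instances) already exist.
import zookeeper from 'node-zookeeper-client';

const client = zookeeper.createClient('zk1:2181,zk2:2181,zk3:2181');

client.once('connected', () => {
  const instancePath = '/services/payment-service/instances/payment-service-001';
  const data = Buffer.from(JSON.stringify({ host: '172.31.1.10', port: 8080 }));

  // EPHEMERAL: the znode is removed automatically if this client's session ends.
  client.create(instancePath, data, zookeeper.CreateMode.EPHEMERAL, (err) => {
    if (err) throw err;
  });

  // Zookeeper watches are one-shot, so the watcher re-arms itself on every event.
  const watchInstances = () => {
    client.getChildren(
      '/services/payment-service/instances',
      () => watchInstances(), // fired on membership change; re-register the watch
      (err, children) => {
        if (!err) console.log('current instances:', children);
      },
    );
  };
  watchInstances();
});

client.connect();
```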
```
/services
  /payment-service
    /instances
      /payment-service-001 (ephemeral)
        {"host": "172.31.1.10", "port": 8080, "version": "2.3.1"}
      /payment-service-002 (ephemeral)
        {"host": "172.31.1.11", "port": 8080, "version": "2.3.1"}
  /inventory-service
    /instances
      /inventory-service-001 (ephemeral)
        {"host": "172.31.2.10", "port": 8081, "version": "1.5.0"}
  /catalog-service
    /instances
      /catalog-service-001 (ephemeral)
        {"host": "172.31.3.10", "port": 8082, "version": "3.1.0"}
      /catalog-service-002 (ephemeral)
        {"host": "172.31.3.11", "port": 8082, "version": "3.1.0"}
```

Architecture
┌─────────────────────────────────────────────────────────┐
│ Zookeeper Ensemble │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Leader │ │ Follower │ │ Follower │ │
│ │ (Writes) │◄─┤ (Reads) │◄─┤ (Reads) │ │
│ │ │ │ │ │ │ │
│ └──────▲──────┘ └──────▲──────┘ └──────▲──────┘ │
│ │ │ │ │
│ │ ZAB Protocol (Consensus) │ │
│ └────────────────┴────────────────┘ │
└─────────────────────────────────────────────────────────┘
▲ ▲ ▲
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Service │ │ Service │ │ Client │
│ Instance │ │ Instance │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
Zookeeper runs as an ensemble of servers (typically 3 or 5) that elect a leader. The leader handles all write operations and broadcasts them to followers using the ZAB consensus protocol. Reads can be served by any server, though by default they may return slightly stale data (for performance). Clients can request linearizable reads from the leader if strict consistency is required.
Key Characteristics:

- Writes are linearizable via the ZAB protocol; reads are served locally and may be slightly stale unless the client issues a sync
- Ephemeral znodes tied to client sessions provide built-in liveness: if an instance's session expires, its registration disappears
- Watches are one-shot and must be re-registered after every notification
- JVM-based server with mature client libraries; most teams build on Apache Curator rather than the raw API
- No native health checking or DNS interface; those capabilities must be built on top
Apache Curator is a high-level library built on Zookeeper that provides recipes for common patterns: service discovery, leader election, distributed locks, etc. If you use Zookeeper for service discovery, you'll likely use Curator's Service Discovery recipe rather than building on raw Zookeeper primitives.
etcd (pronounced "et-see-dee") was created by CoreOS in 2013 as a distributed key-value store for their container infrastructure. When Kubernetes adopted etcd as its data store, etcd became one of the most critical components in the cloud-native ecosystem.
Design Philosophy
etcd is designed as a simple, reliable key-value store with strong consistency. Unlike Zookeeper's hierarchical focus, etcd emphasizes:

- A flat key space with efficient prefix queries instead of a tree of znodes
- A small, well-defined gRPC API (KV, Watch, Lease), plus an HTTP/JSON gateway
- Strong consistency via Raft, with MVCC revisions underpinning reliable watches
- Leases with TTLs as the building block for liveness and ephemeral registrations
- Operational simplicity: a single Go binary with first-class Prometheus metrics
```bash
# Register a service instance with a lease (TTL)

# Grant a lease with 30-second TTL
$ etcdctl lease grant 30
lease 694d7e04c3d10f01 granted with TTL(30s)

# Register service instance under the lease
$ etcdctl put /services/payment-service/instances/i-abc123 \
    '{"host":"172.31.1.10","port":8080,"version":"2.3.1"}' \
    --lease=694d7e04c3d10f01

# Keep the lease alive (run in background)
$ etcdctl lease keep-alive 694d7e04c3d10f01

# Discover all instances of a service (prefix query)
$ etcdctl get /services/payment-service/instances --prefix
/services/payment-service/instances/i-abc123
{"host":"172.31.1.10","port":8080,"version":"2.3.1"}
/services/payment-service/instances/i-def456
{"host":"172.31.1.11","port":8080,"version":"2.3.1"}

# Watch for changes (reactive discovery)
$ etcdctl watch /services/payment-service/instances --prefix
PUT
/services/payment-service/instances/i-ghi789
{"host":"172.31.1.12","port":8080,"version":"2.3.1"}
DELETE
/services/payment-service/instances/i-abc123
```

Architecture
┌─────────────────────────────────────────────────────────┐
│ etcd Cluster │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Leader │ │ Follower │ │ Follower │ │
│ │ │◄─┤ │◄─┤ │ │
│ │ bbolt store │ │ bbolt store │ │ bbolt store │ │
│ └──────▲──────┘ └──────▲──────┘ └──────▲──────┘ │
│ │ │ │ │
│ │ Raft Protocol (Consensus) │ │
│ └────────────────┴────────────────┘ │
└─────────────────────────────────────────────────────────┘
▲ ▲ ▲
gRPC │ gRPC │ gRPC │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Service │ │ Service │ │ Client │
│ Instance │ │ Instance │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
etcd uses the Raft consensus algorithm for leader election and log replication. All writes go through the leader, and a quorum of nodes must acknowledge writes before they're committed. Each node stores data in an embedded bbolt database, providing persistence.
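The same lease-based registration shown with etcdctl above can be driven from application code. A minimal sketch, assuming the community etcd3 Node client (method names are an assumption and may differ by version):

```typescript
// Lease-based registration and prefix discovery against etcd v3.
// Assumes the community "etcd3" npm client; endpoints and keys are examples.
import { Etcd3 } from 'etcd3';

const client = new Etcd3({ hosts: ['http://etcd-1:2379'] });

async function register(): Promise<void> {
  // A 30-second lease; the client library sends keep-alives while the process lives.
  const lease = client.lease(30);
  await lease
    .put('/services/payment-service/instances/i-abc123')
    .value(JSON.stringify({ host: '172.31.1.10', port: 8080, version: '2.3.1' }));
}

async function discover(): Promise<Record<string, string>> {
  // Prefix query returns every registered instance of the service.
  return client.getAll().prefix('/services/payment-service/instances/').strings();
}

async function watchInstances(): Promise<void> {
  // Streaming watch: react to instances joining (put) or expiring (delete).
  const watcher = await client
    .watch()
    .prefix('/services/payment-service/instances/')
    .create();
  watcher.on('put', (kv) => console.log('instance up:', kv.key.toString()));
  watcher.on('delete', (kv) => console.log('instance gone:', kv.key.toString()));
}
```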
Key Characteristics:

- Reads and writes are linearizable by default; serializable reads are available when lower latency matters more than freshness
- Leases replace Zookeeper-style ephemeral nodes: keys attached to an expired lease are deleted automatically
- Watches are streaming and resumable from a specific revision, so clients can catch up after a disconnect
- Distributed as a single static binary, with Prometheus metrics exposed out of the box
- Backed by a storage quota (2 GB by default, configurable), so it is intended for metadata, not bulk data
If you're running on Kubernetes, you're already running etcd (it stores all Kubernetes cluster state). However, using the same etcd cluster for both Kubernetes and application service discovery is risky—etcd issues would impact the Kubernetes control plane. Consider a separate etcd cluster for application use, or use Kubernetes-native discovery (Services).
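For contrast, "Kubernetes-native discovery" often amounts to calling a Service by its cluster DNS name; a minimal sketch (the payment-service Service and payments namespace are hypothetical):

```typescript
// Kubernetes Services give every service a stable DNS name resolved by CoreDNS;
// kube-proxy load-balances across the healthy Pods behind the Service.
// "payment-service" and the "payments" namespace are hypothetical names.
async function chargePayment(orderId: string): Promise<Response> {
  return fetch('http://payment-service.payments.svc.cluster.local:8080/charge', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ orderId }),
  });
}
```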
HashiCorp Consul, released in 2014, took a different approach than Zookeeper and etcd. While those systems provide generic coordination primitives, Consul is a purpose-built service networking platform that includes service discovery as a first-class feature.
Design Philosophy
Consul is designed as a complete solution for service networking:

- A first-class service catalog rather than a generic KV store (though a KV store is included)
- Built-in health checking (HTTP, TCP, gRPC, script, and TTL checks) whose results drive discovery
- Both HTTP and DNS interfaces for querying services
- Multi-datacenter federation as a first-class feature
- Optional service mesh capabilities (Consul Connect) with mutual TLS and intentions
```json
{
  "service": {
    "name": "payment-service",
    "id": "payment-service-i-abc123",
    "address": "172.31.1.10",
    "port": 8080,
    "tags": ["production", "v2.3.1", "us-east-1a"],
    "meta": {
      "version": "2.3.1",
      "protocol": "grpc",
      "owner": "payments-team"
    },
    "checks": [
      {
        "id": "http-check",
        "name": "HTTP Health Check",
        "http": "http://172.31.1.10:8080/health",
        "interval": "10s",
        "timeout": "3s"
      },
      {
        "id": "tcp-check",
        "name": "TCP Port Check",
        "tcp": "172.31.1.10:8080",
        "interval": "5s",
        "timeout": "1s"
      }
    ],
    "weights": {
      "passing": 100,
      "warning": 50
    }
  }
}
```

Architecture
┌─────────────────────── Datacenter 1 ───────────────────────┐
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Consul Server Cluster │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Leader │ │Follower │ │Follower │ │ │
│ │ │ │◄──►│ │◄──►│ │ │ │
│ │ └────▲────┘ └────▲────┘ └────▲────┘ │ │
│ │ │ Raft │ │ │ │
│ └───────┼───────────────┼──────────────┼───────────────┘ │
│ │ │ │ │
│ ┌───────┼───────────────┼──────────────┼───────────────┐ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Consul │ │ Consul │ │ Consul │ │ │
│ │ │ Agent │ │ Agent │ │ Agent │ │ │
│ │ │(Client) │ │(Client) │ │(Client) │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ │ │ │ │ │ │
│ │ ┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐ │ │
│ │ │ Service │ │ Service │ │ Service │ │ │
│ │ │ Instance │ │ Instance │ │ Instance │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
▲
│ WAN Gossip
▼
┌─────────────────────── Datacenter 2 ───────────────────────┐
│ (Similar structure with own server cluster) │
└────────────────────────────────────────────────────────────┘
Consul has a unique two-tier architecture:

- Servers (typically 3-5 per datacenter) form a Raft cluster that stores the service catalog and KV data
- Client agents run on every node, register local services, execute their health checks, and forward queries to the servers

Within a datacenter, agents communicate via a gossip protocol (Serf). Across datacenters, only servers communicate via WAN gossip. This architecture enables massive scale—you can have thousands of client agents without overwhelming the server cluster.
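Because an agent runs on every node, registration is usually a call to the agent on localhost; the agent runs the health checks itself and syncs results to the servers. A minimal sketch against Consul's agent HTTP API (the default port 8500 and the example values are assumptions):

```typescript
// Register this instance with the local Consul agent. The agent executes the
// health check and reports the result to the server cluster.
async function registerWithLocalAgent(): Promise<void> {
  const res = await fetch('http://127.0.0.1:8500/v1/agent/service/register', {
    method: 'PUT',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      Name: 'payment-service',
      ID: 'payment-service-i-abc123',
      Address: '172.31.1.10',
      Port: 8080,
      Tags: ['production', 'v2.3.1'],
      Check: {
        HTTP: 'http://172.31.1.10:8080/health',
        Interval: '10s',
        Timeout: '3s',
        // Remove the instance if it stays critical, avoiding zombie entries.
        DeregisterCriticalServiceAfter: '1m',
      },
    }),
  });
  if (!res.ok) throw new Error(`register failed: ${res.status}`);
}
```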
```bash
# Query services via HTTP API
$ curl http://localhost:8500/v1/catalog/service/payment-service
[
  {
    "ID": "i-abc123",
    "Node": "node-1",
    "Address": "172.31.1.10",
    "ServiceName": "payment-service",
    "ServicePort": 8080,
    "ServiceTags": ["production", "v2.3.1"],
    "ServiceMeta": {"version": "2.3.1"}
  }
]

# Query only healthy instances
$ curl http://localhost:8500/v1/health/service/payment-service?passing=true

# DNS-based discovery (SRV records)
$ dig @localhost -p 8600 payment-service.service.consul SRV
;; ANSWER SECTION:
payment-service.service.consul. 0 IN SRV 1 1 8080 i-abc123.node.dc1.consul.

# DNS-based discovery (A records)
$ dig @localhost -p 8600 payment-service.service.consul
;; ANSWER SECTION:
payment-service.service.consul. 0 IN A 172.31.1.10
payment-service.service.consul. 0 IN A 172.31.1.11
```

Consul excels when you need service discovery beyond Kubernetes (multi-platform, VMs, bare metal, or multi-cloud), when built-in health checking is valuable, when DNS-based discovery simplifies integration, or when you're considering a service mesh but aren't on Kubernetes and don't want Istio's complexity.
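Application code can also consume the DNS interface directly; a minimal sketch using Node's built-in resolver pointed at the agent's DNS port (8600 is Consul's default):

```typescript
// Resolve healthy payment-service instances via Consul DNS (SRV records carry ports).
import { Resolver } from 'node:dns/promises';

const resolver = new Resolver();
resolver.setServers(['127.0.0.1:8600']); // local Consul agent's DNS endpoint

async function lookupPaymentService(): Promise<Array<{ host: string; port: number }>> {
  const records = await resolver.resolveSrv('payment-service.service.consul');
  // Each SRV record names a node; resolve the node name to an address before dialing.
  return Promise.all(
    records.map(async (srv) => {
      const [host] = await resolver.resolve4(srv.name);
      return { host, port: srv.port };
    }),
  );
}
```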
Let's systematically compare these three registries across the dimensions that matter for production deployments.
| Dimension | Zookeeper | etcd | Consul |
|---|---|---|---|
| Primary Purpose | Coordination primitives | Key-value store | Service networking platform |
| Consensus Protocol | ZAB | Raft | Raft |
| Consistency Model | Linearizable writes, sequential reads | Linearizable | Linearizable (default) |
| Data Model | Hierarchical znodes | Flat key-value | Service catalog + KV |
| Native Health Checking | No | No | Yes |
| DNS Interface | No | No | Yes |
| Multi-Datacenter | Limited | Limited | First-class |
| Service Mesh | No | No | Yes (Consul Connect) |
| Typical Cluster Size | 3-5 nodes | 3-5 nodes | 3-5 servers + many agents |
| Client Architecture | Direct connection | Direct connection | Local agent |
| Characteristic | Zookeeper | etcd | Consul |
|---|---|---|---|
| Operational Complexity | High | Medium | Medium-High |
| Resource Footprint | Medium | Low | Medium (with agents) |
| Upgrade Difficulty | Medium | Low | Medium |
| Monitoring/Observability | Good (many metrics) | Excellent (Prometheus) | Excellent (built-in UI) |
| Documentation Quality | Good | Excellent | Excellent |
| Community Activity | Active (Apache) | Very Active (CNCF) | Active (HashiCorp) |
| Commercial Support | Confluent, Cloudera | CNCF ecosystem | HashiCorp Enterprise |
Performance Considerations
Benchmark data varies by workload, but some general characteristics hold:

Read Performance:

- Zookeeper serves reads from any server by default (possibly slightly stale), giving very high read throughput
- etcd reads are linearizable by default, which adds a leader check; serializable reads trade freshness for lower latency
- Consul defaults to consistent reads through the leader but supports stale reads from any server, and client agents can cache results

Write Performance:

- All three commit writes only after a quorum acknowledges them, so write latency is dominated by disk fsync and network round trips
- Adding nodes does not increase write throughput; it only adds replication work

Watch/Notification:

- Zookeeper watches are one-shot and must be re-registered after each event
- etcd watches are streaming and can resume from a specific revision, so clients don't miss changes
- Consul uses blocking queries (long polling) on its HTTP API rather than a push-based watch
Published benchmarks should be viewed skeptically. Performance depends heavily on workload patterns, network topology, hardware, and configuration. The differences between these systems are usually smaller than the impact of proper tuning. Test with YOUR workload before making decisions based on benchmarks.
Selecting a service registry isn't primarily a performance decision—it's about fit with your ecosystem, team expertise, and requirements profile.
The Kubernetes Consideration
If you're running on Kubernetes, the calculus changes significantly:
Kubernetes already provides:

- A registry: the API server (backed by etcd) tracks every Pod and Service
- Service objects with stable cluster IPs and load balancing via kube-proxy
- DNS-based discovery through CoreDNS (service-name.namespace.svc.cluster.local)
- Readiness probes that automatically remove unhealthy Pods from Endpoints/EndpointSlices

When you might still need a registry:

- Workloads outside the cluster (VMs, bare metal, legacy systems) must be discoverable alongside Pods
- You need discovery across multiple clusters or a hybrid/multi-cloud estate
- You want richer health checks, metadata, or a KV store that Kubernetes Services don't provide
- You're adopting a service mesh (such as Consul) that spans platforms
For pure Kubernetes environments, the default answer is increasingly: just use Kubernetes native discovery. External registries add operational burden without proportional benefit.
If you're starting fresh with no existing registry expertise: On Kubernetes, use native Kubernetes Services. Off Kubernetes with service discovery needs, Consul is often the best fit due to its purpose-built features. If you just need a consistent KV store, etcd is simplest. Only choose Zookeeper if you're already in its ecosystem.
Running a service registry in production requires attention to several critical operational concerns.
1. Cluster Sizing
All three systems typically run 3 or 5 node clusters:

- 3 nodes: quorum of 2, tolerates 1 failure; sufficient for most deployments
- 5 nodes: quorum of 3, tolerates 2 simultaneous failures; use when stricter availability is required
- 7 or more: rarely worthwhile, since every additional voter adds consensus overhead
Even numbers don't help—quorum requirements mean 4 nodes and 3 nodes tolerate the same number of failures, but 4 nodes have higher coordination overhead.
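The arithmetic behind that rule, as a quick illustration:

```typescript
// Quorum math: a cluster of n voters needs floor(n/2) + 1 acknowledgments,
// so it tolerates n - quorum failures.
function faultTolerance(n: number): { quorum: number; tolerates: number } {
  const quorum = Math.floor(n / 2) + 1;
  return { quorum, tolerates: n - quorum };
}

for (const n of [3, 4, 5, 7]) {
  const { quorum, tolerates } = faultTolerance(n);
  console.log(`${n} nodes: quorum ${quorum}, tolerates ${tolerates} failure(s)`);
}
// 3 nodes: quorum 2, tolerates 1 failure(s)
// 4 nodes: quorum 3, tolerates 1 failure(s)  <- no better than 3 nodes
// 5 nodes: quorum 3, tolerates 2 failure(s)
// 7 nodes: quorum 4, tolerates 3 failure(s)
```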
2. Hardware Recommendations
Service registries are latency-sensitive: every write must be fsynced to disk and acknowledged by a quorum before it commits, so slow disks or high network latency inflate write latency and can destabilize leader elections. General guidance:
| Component | Zookeeper | etcd | Consul Server |
|---|---|---|---|
| CPU | 2-4 cores | 2-4 cores | 2-4 cores |
| Memory | 4-8 GB | 2-8 GB | 4-8 GB |
| Storage | SSD, 20-50 GB | SSD, 20-50 GB | SSD, 20-50 GB |
| Network | 1 Gbps, low latency | 1 Gbps, low latency | 1 Gbps, low latency |
| IOPS | 500+ | 500+ | 500+ |
3. Monitoring and Alerting
Critical metrics to track:

- Leader status and the rate of leader changes/elections (frequent elections signal instability)
- Quorum health: the number of healthy voting members
- Commit/apply latency and disk sync (WAL fsync) duration
- Failed or pending proposals and request error rates
- Client connections, watch counts, and memory usage
- Data size versus any configured quota (especially for etcd)

Alert on:

- Loss of quorum or sustained leader churn
- Write or disk-sync latency well above its normal baseline
- Any registry member down or unreachable
- Storage approaching its quota or the disk filling up
4. Backup and Recovery
All three systems require backup strategies:

- Zookeeper: copy snapshots and transaction logs from the data directory on a schedule
- etcd: take snapshots with etcdctl snapshot save, verify them, and ship them off-cluster
- Consul: use consul snapshot save, which captures the servers' Raft state (catalog, KV, ACLs)

The etcd workflow looks like this:
```bash
# Create snapshot backup
$ etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

# Verify snapshot
$ etcdctl snapshot status /backup/etcd-snapshot-20240115.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3c9cd0d7 |  152843  |       1250 |     2.1 MB |
+----------+----------+------------+------------+

# Restore from snapshot (on new cluster)
$ etcdctl snapshot restore /backup/etcd-snapshot-20240115.db \
    --name node1 \
    --initial-cluster node1=https://node1:2380,node2=https://node2:2380,node3=https://node3:2380 \
    --initial-cluster-token etcd-cluster-1 \
    --initial-advertise-peer-urls https://node1:2380
```

Service registry failure can cascade into a system-wide outage. Treat your registry with the same care as your database: invest in monitoring, alerting, runbooks, and regular failure drills. When the registry is down, your services can't find each other.
Organizations sometimes need to migrate between registries as requirements evolve. This is a high-risk operation requiring careful planning.
Common Migration Scenarios:

- Zookeeper to etcd or Consul, when a legacy ensemble is kept alive only for service discovery
- A dedicated registry to Kubernetes-native discovery after workloads move into a cluster
- Consolidating multiple registries after acquisitions or platform reorganizations
Migration Strategy: Dual-Write/Dual-Read
The safest migration approach:
Phase 1: Dual-Write
Register every service in both the old and the new registry while clients continue to read only from the old one. Compare the two registries' views until they consistently match.

Phase 2: Dual-Read
Clients query the new registry first and fall back to the old one on errors or empty results. Watch the fallback rate; it should trend toward zero.

Phase 3: Cutover
Switch clients to read exclusively from the new registry while still dual-writing, so rolling back remains a configuration change rather than a migration.

Phase 4: Cleanup
Stop registering to the old registry, decommission it, and delete the dual-write and fallback code paths.
```typescript
class MigrationDiscoveryClient {
  private primaryRegistry: ServiceRegistry;
  private fallbackRegistry: ServiceRegistry;
  private migrationPhase: 'dual-write' | 'dual-read' | 'primary-only';
  // Metrics and logging dependencies, typed structurally from their usage below
  private metrics: { increment(name: string): void };
  private log: { warn(message: string, context?: unknown): void };

  async discoverService(serviceName: string): Promise<ServiceInstance[]> {
    switch (this.migrationPhase) {
      case 'dual-write':
        // Still reading from old (fallback) registry
        return this.fallbackRegistry.discover(serviceName);

      case 'dual-read':
        // Try new registry first, fall back if needed
        try {
          const instances = await this.primaryRegistry.discover(serviceName);
          if (instances.length > 0) {
            return instances;
          }
        } catch (error) {
          this.metrics.increment('discovery.primary.failures');
        }
        // Fallback to old registry
        return this.fallbackRegistry.discover(serviceName);

      case 'primary-only':
        return this.primaryRegistry.discover(serviceName);
    }
  }

  async registerService(service: ServiceDefinition): Promise<void> {
    // Always register to primary
    await this.primaryRegistry.register(service);

    // Also register to fallback during migration
    if (this.migrationPhase !== 'primary-only') {
      try {
        await this.fallbackRegistry.register(service);
      } catch (error) {
        // Don't fail if fallback registration fails
        this.log.warn('Fallback registration failed', { error });
      }
    }
  }
}
```

Registry migrations are high-risk operations. Plan for weeks of dual-running, extensive testing, and rollback capability. Never rush a registry migration—the blast radius of failure is your entire service mesh.
We've deeply examined the three major service registries that power production distributed systems. Let's consolidate the essential insights:

- Zookeeper offers low-level coordination primitives (ephemeral znodes, watches) from which discovery is built, usually via Apache Curator; choose it mainly when you're already in its ecosystem.
- etcd is a simple, strongly consistent key-value store with leases and streaming watches; it backs Kubernetes itself and fits best when you primarily need a reliable KV store.
- Consul is a purpose-built service networking platform: service catalog, native health checks, DNS interface, multi-datacenter federation, and an optional service mesh.
- Choose based on ecosystem fit, team expertise, and operational requirements rather than benchmarks; the performance differences are usually smaller than the impact of tuning.
- Operate the registry like a critical database: size and monitor it carefully, back it up, rehearse failures, and plan any migration as a dual-write/dual-read project.
What's Next:
Now that you understand dedicated service registries, we'll explore a ubiquitous alternative: DNS-based service discovery. DNS is the original distributed naming system, and modern systems use it creatively for service discovery. In the next page, we'll examine how DNS works for discovery, its limitations, and when to use DNS versus registry-based approaches.
You now have comprehensive knowledge of the major service registries—Zookeeper, etcd, and Consul. You understand their architectures, trade-offs, and when to choose each. This knowledge enables you to make informed decisions about service discovery infrastructure.