While DNS provides basic name-to-address resolution, the demands of modern distributed systems require more sophisticated discovery mechanisms. Service registries are specialized distributed systems designed specifically for service discovery—providing real-time updates, health checking, rich metadata, and the consistency guarantees necessary for dynamic environments.
A service registry serves as the authoritative source of truth for which service instances are available at any given moment. Unlike DNS, which evolved from the static world of early internet naming, service registries were designed from the ground up for the dynamics of modern microservices: continuous deployment, auto-scaling, and ephemeral infrastructure.
This page provides a comprehensive exploration of service registries: their architecture, how they solve DNS's limitations, popular implementations, and the operational considerations that determine success in production.
By the end of this page, you will understand service registry architecture, the registration lifecycle (self-registration vs. third-party registration), health checking mechanisms, consistency trade-offs, and popular registry implementations including Consul, etcd, ZooKeeper, and Eureka. You'll gain the knowledge to select and operate a service registry for production environments.
A service registry is a distributed database of service instance locations. It maintains a dynamic mapping from logical service names to network endpoints (IP:port combinations), along with associated metadata. The registry enables services to discover each other without hardcoded knowledge of network topology.
Core Functions of a Service Registry:

- Registration: accept service instances announcing their location and metadata
- Deregistration: remove instances that shut down or are decommissioned
- Health monitoring: track which registered instances can actually serve traffic
- Lookup: answer queries for the current endpoints of a named service
- Change notification: push or stream updates so clients learn about changes quickly
Registry Data Model:
A service registry typically organizes data hierarchically:
namespace/
└── services/
└── order-service/
├── metadata
│ ├── version: "2.3.1"
│ ├── team: "commerce"
│ └── protocol: "grpc"
└── instances/
├── instance-abc123
│ ├── address: "10.0.1.10"
│ ├── port: 8080
│ ├── health: "passing"
│ └── metadata: {...}
└── instance-def456
├── address: "10.0.1.11"
├── port: 8080
├── health: "passing"
└── metadata: {...}
This hierarchical organization enables efficient queries at multiple levels: all instances of a service, all services in a namespace, or specific instance details.
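The hierarchy above can be modeled with a few TypeScript types. The type names and the `healthyInstances` helper below are illustrative (not any particular registry's API), but the fields mirror the tree shown:

```typescript
// Illustrative types mirroring the hierarchical registry model above.
interface InstanceRecord {
  address: string;
  port: number;
  health: 'passing' | 'warning' | 'critical';
  metadata?: Record<string, string>;
}

interface ServiceRecord {
  metadata: Record<string, string>;
  instances: Record<string, InstanceRecord>; // keyed by instance ID
}

type Namespace = Record<string, ServiceRecord>; // keyed by service name

// One common query against the hierarchy: all healthy endpoints of a service.
function healthyInstances(ns: Namespace, service: string): InstanceRecord[] {
  const svc = ns[service];
  if (!svc) return [];
  return Object.values(svc.instances).filter(i => i.health === 'passing');
}

// Sample data matching the tree shown above.
const namespace: Namespace = {
  'order-service': {
    metadata: { version: '2.3.1', team: 'commerce', protocol: 'grpc' },
    instances: {
      'instance-abc123': { address: '10.0.1.10', port: 8080, health: 'passing' },
      'instance-def456': { address: '10.0.1.11', port: 8080, health: 'passing' },
    },
  },
};
```

Queries at the other levels (all services in a namespace, one instance's details) are just lookups at shallower or deeper keys of the same structure.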
Many systems that serve as service registries (Consul, etcd, ZooKeeper) also function as general-purpose distributed key-value stores. This overlap is intentional—service discovery and configuration management have similar requirements for consistency, availability, and change notification.
Service instances must register with the registry to be discoverable. There are two fundamental approaches to registration, each with distinct trade-offs.
In the self-registration pattern, service instances are responsible for registering themselves with the registry during startup and deregistering during shutdown. The service must also maintain its registration by sending periodic heartbeats.
How Self-Registration Works:
```typescript
import Consul from 'consul';

class ServiceRegistry {
  private consul: Consul.Consul;
  private serviceId: string;
  private heartbeatInterval: NodeJS.Timer | null = null;

  constructor(
    private serviceName: string,
    private host: string,
    private port: number
  ) {
    this.consul = new Consul({ host: 'consul.internal', port: 8500 });
    this.serviceId = `${serviceName}-${host}-${port}`;
  }

  async register(): Promise<void> {
    // Register service with Consul
    await this.consul.agent.service.register({
      id: this.serviceId,
      name: this.serviceName,
      address: this.host,
      port: this.port,
      tags: ['v2.3.1', 'production'],
      meta: {
        version: '2.3.1',
        protocol: 'http',
        team: 'commerce'
      },
      check: {
        // Consul will call this endpoint to verify health
        http: `http://${this.host}:${this.port}/health`,
        interval: '10s',
        timeout: '5s',
        deregistercriticalserviceafter: '1m'
      }
    });

    console.log(`Registered ${this.serviceId} with Consul`);

    // Start heartbeat to maintain registration
    this.startHeartbeat();
  }

  private startHeartbeat(): void {
    // While Consul handles health checks, we maintain a heartbeat
    // for additional resilience
    this.heartbeatInterval = setInterval(async () => {
      try {
        await this.consul.agent.check.pass(`service:${this.serviceId}`);
      } catch (error) {
        console.error('Heartbeat failed:', error);
      }
    }, 5000);
  }

  async deregister(): Promise<void> {
    if (this.heartbeatInterval) {
      clearInterval(this.heartbeatInterval);
    }
    await this.consul.agent.service.deregister(this.serviceId);
    console.log(`Deregistered ${this.serviceId} from Consul`);
  }
}

// Usage with graceful shutdown
const registry = new ServiceRegistry('order-service', '10.0.1.10', 8080);

async function main() {
  await registry.register();

  // Handle graceful shutdown
  process.on('SIGTERM', async () => {
    console.log('Received SIGTERM, deregistering...');
    await registry.deregister();
    process.exit(0);
  });
}
```

Modern container orchestration platforms (Kubernetes, ECS, Nomad) increasingly handle registration as a platform concern.
This third-party pattern has become dominant because it eliminates registration logic from application code and ensures consistent behavior across all services regardless of language or framework.
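As a rough sketch of how a third-party registrar works: a small agent watches orchestrator events and registers containers on their behalf. The `ContainerStarted` event shape and the label-to-metadata mapping below are assumptions for illustration; the `PUT /v1/agent/service/register` endpoint and its field names follow Consul's agent HTTP API:

```typescript
// Hypothetical event emitted by an orchestrator when a container starts.
interface ContainerStarted {
  containerId: string;
  serviceName: string;
  hostIp: string;
  hostPort: number;
  labels: Record<string, string>;
}

// Build the JSON body for Consul's PUT /v1/agent/service/register.
// Field names follow Consul's agent API; the label mapping is our choice.
function buildRegistration(ev: ContainerStarted) {
  return {
    ID: `${ev.serviceName}-${ev.containerId.slice(0, 12)}`,
    Name: ev.serviceName,
    Address: ev.hostIp,
    Port: ev.hostPort,
    Meta: ev.labels,
    Check: {
      HTTP: `http://${ev.hostIp}:${ev.hostPort}/health`,
      Interval: '10s',
      Timeout: '5s',
      DeregisterCriticalServiceAfter: '1m',
    },
  };
}

// The registrar process subscribes to orchestrator events and calls
// the local Consul agent; the application never touches the registry.
async function onContainerStarted(ev: ContainerStarted): Promise<void> {
  await fetch('http://localhost:8500/v1/agent/service/register', {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildRegistration(ev)),
  });
}
```

A matching `ContainerStopped` handler would call the deregister endpoint, so lifecycle handling lives entirely outside application code.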
Health checking is critical to service discovery—it ensures that clients only receive endpoints for instances that can actually serve requests. Service registries implement various health checking strategies, each with different characteristics.
Types of Health Checks:
| Type | Mechanism | Pros | Cons |
|---|---|---|---|
| TTL/Heartbeat | Service sends periodic heartbeats; registry evicts if heartbeats stop | Simple, low overhead, service-controlled | Only detects complete failure, not degradation |
| HTTP Check | Registry periodically calls HTTP endpoint on service | Can verify application health, not just process | Requires exposed endpoint, increased traffic |
| TCP Check | Registry opens TCP connection to service port | Simple, verifies port is listening | Doesn't verify application functionality |
| gRPC Check | Registry uses gRPC health checking protocol | First-class gRPC support | Only for gRPC services |
| Script/Command | Registry executes a script that checks health | Flexible, custom logic | Security concerns, complexity |
| DNS Check | Registry verifies DNS resolution of a name | Useful for external dependencies | Limited applicability |
```hcl
# Consul Service Definition with Multiple Health Checks

service {
  name = "order-service"
  id   = "order-service-1"
  port = 8080
  tags = ["v2.3.1", "production"]

  meta = {
    version  = "2.3.1"
    protocol = "http"
  }

  # HTTP health check - verifies application responds correctly
  check {
    id       = "order-http-check"
    name     = "HTTP Health Check"
    http     = "http://localhost:8080/health"
    method   = "GET"
    interval = "10s"
    timeout  = "5s"

    # Tuning parameters
    success_before_passing   = 2  # Must pass 2x before healthy
    failures_before_critical = 3  # Must fail 3x before unhealthy

    header {
      Accept = ["application/json"]
    }
  }

  # TCP check - verifies port is listening
  check {
    id       = "order-tcp-check"
    name     = "TCP Port Check"
    tcp      = "localhost:8080"
    interval = "5s"
    timeout  = "2s"
  }

  # TTL check - service must actively report health
  check {
    id   = "order-ttl-check"
    name = "Service Heartbeat"
    ttl  = "30s"
    # Service must call: consul.agent.check.pass("order-ttl-check")
    # within every 30 seconds or check fails
  }

  # Deregister service if critical for too long
  # Prevents stale entries from accumulating
  check {
    id       = "order-deregister"
    name     = "Deregister Check"
    http     = "http://localhost:8080/health"
    interval = "10s"
    deregister_critical_service_after = "5m"
  }
}
```

Health Check Design Considerations:
1. Check Depth (Shallow vs. Deep)
Shallow checks verify basic responsiveness—the process is running and accepting connections. Deep checks verify that the service can actually process requests, including validating database connectivity, cache availability, and downstream dependencies.
Shallow: Can TCP connect to :8080? → Process is running
Medium: Does GET /health return 200? → HTTP stack works
Deep: Can service execute a test transaction? → Full functionality
Deep checks catch more issues but can cause cascading failures if dependencies become slow—every service fails its health check simultaneously.
2. Check Frequency vs. System Load
Health checks consume resources. At scale, thousands of instances each receiving multiple health check requests per minute can create significant load. Balance check frequency against:

- Detection latency: how quickly a failed instance must be removed from rotation
- Network and CPU overhead on both the checker and the checked service
- Registry write load: every status change is a write that must be replicated across the cluster
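To make the scale concrete, a quick back-of-envelope calculation (the fleet numbers are illustrative):

```typescript
// Aggregate health-check request rate across a fleet:
// every instance receives `checksPerInstance` probes once per interval.
function checksPerSecond(
  instances: number,
  checksPerInstance: number,
  intervalSeconds: number
): number {
  return (instances * checksPerInstance) / intervalSeconds;
}

// 5,000 instances, 3 checks each, probed every 10 seconds:
const load = checksPerSecond(5000, 3, 10); // 1,500 check requests/sec
```

Doubling the interval halves this load but also doubles worst-case failure detection time, which is exactly the trade-off listed above.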
3. Handling Transient Failures
Single failed health checks shouldn't immediately remove instances from service. Use:

- Failure thresholds: require several consecutive failures before marking an instance critical (e.g., Consul's failures_before_critical)
- Recovery thresholds: require several consecutive successes before restoring it (success_before_passing)
- Timeouts tuned above normal latency spikes, so a brief slowdown doesn't register as a failure
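A consecutive-threshold state machine is the common mechanism behind parameters like success_before_passing / failures_before_critical. This is a generic sketch of the idea, not any registry's actual implementation:

```typescript
type CheckStatus = 'passing' | 'critical';

// Dampens flapping: status only changes after N consecutive results
// in the new direction, so a single failed probe is ignored.
class DampenedCheck {
  private status: CheckStatus = 'passing';
  private streak = 0;

  constructor(
    private failuresBeforeCritical = 3,
    private successesBeforePassing = 2
  ) {}

  record(success: boolean): CheckStatus {
    const movingToward: CheckStatus = success ? 'passing' : 'critical';
    if (movingToward === this.status) {
      this.streak = 0; // result agrees with current state; reset opposing streak
      return this.status;
    }
    this.streak++;
    const needed = success ? this.successesBeforePassing : this.failuresBeforeCritical;
    if (this.streak >= needed) {
      this.status = movingToward;
      this.streak = 0;
    }
    return this.status;
  }
}
```

With the defaults above, one or two failed probes leave the instance in rotation; only a sustained failure evicts it, and a single lucky success doesn't bring a broken instance back.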
Deep health checks that verify all dependencies can cause entire clusters to fail simultaneously during a single dependency issue. If your service's health check calls the database, and the database has 100ms latency spike, every instance might fail its health check and be removed from service—causing a complete outage even though the services themselves are fine. Consider separating liveness (am I running?) from readiness (can I serve traffic?) checks.
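The liveness/readiness split can be sketched as two separate evaluations. The dependency names and the `DependencyProbe` shape are illustrative assumptions:

```typescript
interface DependencyProbe {
  name: string;
  critical: boolean;              // does failure make us unable to serve?
  probe: () => Promise<boolean>;  // returns true if the dependency is healthy
}

// Liveness: is the process itself functional? Deliberately never inspects
// dependencies, so a slow database cannot make every instance report "dead".
function liveness(): { status: number } {
  return { status: 200 };
}

// Readiness: can we serve traffic right now? Only *critical* dependencies
// gate the result; optional ones are reported but don't fail the check.
async function readiness(
  deps: DependencyProbe[]
): Promise<{ status: number; failing: string[] }> {
  const results = await Promise.all(
    deps.map(async d => ({ d, ok: await d.probe().catch(() => false) }))
  );
  const failing = results.filter(r => !r.ok).map(r => r.d.name);
  const criticalDown = results.some(r => !r.ok && r.d.critical);
  return { status: criticalDown ? 503 : 200, failing };
}
```

A registry (or orchestrator) pointed at the readiness endpoint removes the instance from rotation during a dependency outage but never restarts it, while the liveness endpoint stays green as long as the process is healthy.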
Service registries are distributed systems and thus subject to the CAP theorem. The choice between consistency and availability directly affects system behavior during network partitions and failures.
Understanding the Trade-off:
During normal operation, a distributed registry can provide both consistency and availability. But when network partitions occur, the system must choose:
CP (Consistent, Partition-Tolerant): the registry refuses to serve reads or accept writes on the minority side of a partition. Clients never see stale data, but discovery can become unavailable exactly when the network is already struggling.

AP (Available, Partition-Tolerant): the registry keeps answering on both sides of a partition and reconciles later. Discovery always works, but clients may receive endpoints that are stale or already gone.
| Registry | Consistency Model | Consensus Protocol | Trade-off Implications |
|---|---|---|---|
| ZooKeeper | Strong (CP) | ZAB (Paxos-like) | Unavailable if quorum lost; strong guarantees |
| etcd | Strong (CP) | Raft | Unavailable if quorum lost; linearizable reads available |
| Consul | Strong (CP) for writes, configurable for reads | Raft | Stale reads option for AP behavior when needed |
| Eureka | Eventually Consistent (AP) | Peer replication | Always available; may return stale data |
| CoreDNS (Kubernetes) | Eventually Consistent | Watches + caching | Based on underlying etcd; watch delays possible |
Consistency Levels in Practice:
Many registries offer configurable consistency levels for read operations, allowing you to choose the trade-off per-query:
```typescript
import Consul from 'consul';

interface Endpoint { address: string; port: number }

const consul = new Consul({ host: 'consul.internal' });

// STRONGLY CONSISTENT READ
// Queries the leader directly, always returns current state
// Higher latency, may fail if leader unavailable
async function getInstancesStrong(serviceName: string) {
  return consul.health.service({
    service: serviceName,
    passing: true,     // Only healthy instances
    consistent: true   // Force strongly consistent read
  });
}

// EVENTUALLY CONSISTENT READ (STALE)
// May return from local cache, potentially stale
// Lower latency, always available
async function getInstancesStale(serviceName: string) {
  return consul.health.service({
    service: serviceName,
    passing: true,
    stale: true  // Allow stale reads from any server
  });
}

// CACHED READ
// Uses locally cached data, refreshed periodically
// Lowest latency, but may be out of date
async function getInstancesCached(serviceName: string) {
  return consul.health.service({
    service: serviceName,
    passing: true,
    cached: true,    // Use agent's local cache
    'max-age': 30    // Accept cached data up to 30 seconds old
  });
}

// PRACTICAL STRATEGY:
// 1. Normal operations: Use cached/stale reads for low latency
// 2. Critical decisions: Use consistent reads
// 3. Have fallbacks: Cache last-known-good endpoints locally

class ServiceDiscovery {
  private cache = new Map<string, Endpoint[]>();

  async getEndpoints(serviceName: string): Promise<Endpoint[]> {
    try {
      // Try stale read first (fast, usually accurate)
      const services = await getInstancesStale(serviceName);
      const endpoints: Endpoint[] = services.map((s: any) => ({
        address: s.Service.Address,
        port: s.Service.Port,
      }));
      this.cache.set(serviceName, endpoints);
      return endpoints;
    } catch (error) {
      // Fall back to cache if registry unavailable
      const cached = this.cache.get(serviceName);
      if (cached && cached.length > 0) {
        console.warn(`Using cached endpoints for ${serviceName}`);
        return cached;
      }
      throw new Error(`No endpoints available for ${serviceName}`);
    }
  }
}
```

For most service discovery use cases, eventual consistency with local caching is appropriate.
The brief window where a client might route to a recently-failed instance is acceptable—connections will fail and trigger retry logic. Reserve strong consistency for operations where routing to the wrong instance would cause data corruption or irrecoverable errors.
Several production-grade service registries are available, each with distinct characteristics. Understanding their trade-offs helps you select the right tool for your environment.
HashiCorp Consul is a full-featured service mesh that includes service discovery, health checking, KV store, and network security. It's designed for multi-datacenter deployments and provides first-class support for service discovery.
```shell
# Discover services via DNS
$ dig @consul.internal order-service.service.consul SRV

# Discover via HTTP API
$ curl 'http://consul.internal:8500/v1/health/service/order-service?passing=true'

# Watch for changes (blocks until change occurs)
$ curl 'http://consul.internal:8500/v1/health/service/order-service?wait=60s&index=123'
```

Choose Consul for full-featured service mesh with multi-DC support. Choose etcd if you're already in the Kubernetes ecosystem or need a lightweight consistent store. Choose Eureka for Spring Cloud environments prioritizing availability. Choose ZooKeeper if you're in the Hadoop/Kafka ecosystem. For new Kubernetes-native projects, platform-integrated discovery often makes dedicated registries unnecessary.
Once you have a service registry, clients need to integrate with it to discover services. Several patterns exist for this integration, each trading off different concerns.
Direct Registry Integration:
The client application directly queries the registry to discover service endpoints.
```typescript
import Consul from 'consul';

interface Endpoint { address: string; port: number; meta?: Record<string, string> }
interface CachedEndpoints { endpoints: Endpoint[]; timestamp: number }

class DirectDiscoveryClient {
  private consul: Consul.Consul;
  private cache = new Map<string, CachedEndpoints>();

  constructor() {
    this.consul = new Consul({ host: 'consul.internal' });
  }

  async callService(serviceName: string, path: string): Promise<Response> {
    // Get endpoints from registry (with caching)
    const endpoints = await this.getEndpoints(serviceName);

    // Client-side load balancing
    const endpoint = this.selectEndpoint(endpoints);

    // Make the actual call
    return fetch(`http://${endpoint.address}:${endpoint.port}${path}`);
  }

  private async getEndpoints(serviceName: string): Promise<Endpoint[]> {
    const cached = this.cache.get(serviceName);
    if (cached && Date.now() - cached.timestamp < 30000) {
      return cached.endpoints;
    }

    const services = await this.consul.health.service({
      service: serviceName,
      passing: true,
    });

    const endpoints = services.map(s => ({
      address: s.Service.Address,
      port: s.Service.Port,
      meta: s.Service.Meta,
    }));

    this.cache.set(serviceName, { endpoints, timestamp: Date.now() });
    return endpoints;
  }

  private selectEndpoint(endpoints: Endpoint[]): Endpoint {
    // Random selection (real implementations use round-robin,
    // least-connections, or weighted algorithms)
    return endpoints[Math.floor(Math.random() * endpoints.length)];
  }
}
```

Sidecar Proxy Pattern:
A local proxy handles all service discovery and load balancing. The application connects to localhost, and the proxy routes to discovered endpoints.
DNS Integration:
The registry exposes a DNS interface, allowing services to be discovered using standard DNS queries. This provides compatibility with any application that can resolve hostnames.
```typescript
// Application uses standard DNS resolution
// Consul provides a DNS interface at port 8600

import { Resolver } from 'dns';

// Configure resolver to use the Consul DNS endpoint.
// Note: dns.Resolver#setServers requires IP addresses, so point it
// at the Consul agent's IP (example address shown).
const resolver = new Resolver();
resolver.setServers(['10.0.0.2:8600']);

async function discoverViaConsulDNS(serviceName: string): Promise<string[]> {
  // Query SRV records for full service info
  return new Promise((resolve, reject) => {
    resolver.resolveSrv(`${serviceName}.service.consul`, (err, addresses) => {
      if (err) reject(err);
      else resolve(addresses.map(a => `${a.name}:${a.port}`));
    });
  });
}

// Or simply use the service name as hostname
// Consul DNS returns round-robin A records
async function callServiceViaDNS() {
  // Application doesn't know about Consul - just uses a hostname
  // Consul DNS resolves 'order-service.service.consul' to an instance IP
  return fetch('http://order-service.service.consul:8080/api/orders');
}
```

| Pattern | Application Change | Load Balancing | Failover | Best For |
|---|---|---|---|---|
| Direct Integration | Registry client library required | Client-side | Client-implemented | Full control, custom logic |
| Sidecar Proxy | None (localhost connect) | Proxy handles | Proxy handles | Service mesh, polyglot |
| DNS Integration | None (use hostname) | DNS round-robin | Limited (TTL wait) | Legacy apps, simplicity |
| Client Library | SDK integration | Library handles | Library handles | Framework ecosystems |
The sidecar proxy pattern has become dominant in modern microservices because it decouples the application from discovery infrastructure. Applications make simple localhost calls; the sidecar handles the complexity of discovery, load balancing, retries, circuit breaking, and observability. This is the foundation of service mesh architectures.
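A toy illustration of the sidecar's core job: map a logical service name to a concrete upstream, round-robin, from a locally synced endpoint table. All names here are hypothetical, and a real sidecar (Envoy, Linkerd-proxy) does far more:

```typescript
interface Upstream { address: string; port: number }

// The sidecar keeps a local table of healthy endpoints per logical service,
// kept fresh by watching the registry; here it is just an in-memory map.
class SidecarRouter {
  private counters = new Map<string, number>();

  constructor(private table: Map<string, Upstream[]>) {}

  // The app sends every request to localhost with a logical service name;
  // the sidecar picks the concrete upstream round-robin.
  route(service: string): Upstream {
    const ups = this.table.get(service);
    if (!ups || ups.length === 0) throw new Error(`no upstreams for ${service}`);
    const n = this.counters.get(service) ?? 0;
    this.counters.set(service, n + 1);
    return ups[n % ups.length];
  }
}
```

Because the table lives in the proxy, the application needs no registry client, no retry logic, and no awareness that instance two just got drained.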
Running a service registry in production requires careful attention to operational concerns. The registry is critical infrastructure—its failure can bring down the entire system.
```yaml
# Prometheus alerting rules for Consul service registry
groups:
  - name: consul-alerts
    rules:
      # Alert if cluster loses quorum (critical)
      - alert: ConsulClusterNoLeader
        expr: consul_raft_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Consul cluster has no leader"
          description: "Consul cluster has been without a leader for more than 1 minute"

      # Alert if peer count drops below threshold
      - alert: ConsulInsufficientPeers
        expr: consul_raft_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consul cluster has fewer than 3 peers"

      # Alert on high health check failure rate
      - alert: ConsulHighHealthCheckFailures
        expr: rate(consul_health_checks_critical[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of health check failures"

      # Alert on high query latency
      - alert: ConsulHighQueryLatency
        expr: histogram_quantile(0.99, consul_http_request_duration_seconds) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consul query latency is high (p99 > 500ms)"

      # Alert on replication lag in multi-DC setup
      - alert: ConsulReplicationLag
        expr: consul_rpc_raft_verify_leader_duration_seconds > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consul replication lag detected"
```

Disaster Recovery Planning:
What happens if the registry is completely unavailable?
A failed registry can prevent all services from discovering each other, causing a complete system outage. This makes client-side caching essential. Clients should always have a fallback to cached endpoints and should never fail immediately just because the registry is unreachable. Design for registry unavailability, not just registry accuracy.
We've comprehensively explored service registries—from fundamental concepts to operational considerations. Let's consolidate the key insights:

- A service registry is the authoritative, dynamic source of truth for service instance locations
- Registration can be done by the service itself or by a third-party registrar; orchestration platforms increasingly handle it for you
- Health checks determine what clients see: separate liveness from readiness, and dampen transient failures with thresholds
- Consistency is a spectrum: CP registries can become unavailable during partitions, AP registries can serve stale data, and many offer per-query consistency levels
- Clients must tolerate registry unavailability through local caching and fallbacks
What's Next:
With both DNS-based and registry-based discovery understood, we'll explore the distinction between client-side and server-side discovery patterns. Where does the load balancing decision happen—in the client or in a dedicated infrastructure component? This architectural choice has profound implications for complexity, flexibility, and operational requirements.
You now have a comprehensive understanding of service registries. You understand their architecture, registration patterns, health checking mechanisms, consistency trade-offs, and popular implementations. This knowledge prepares you to implement and operate service discovery in production environments.