Long before the term "service discovery" existed, the Domain Name System (DNS) was solving a remarkably similar problem: how do you find a resource by name in a distributed network?
DNS, designed in the 1980s, has scaled from the early internet to today's billions of devices. Its fundamental model—querying a hierarchical naming system to resolve names to addresses—applies directly to service discovery. When you want to call the "payment service," you need to resolve that name to an IP address, just as your browser resolves "google.com" to an IP address.
This makes DNS an obvious candidate for service discovery. It's ubiquitous, well-understood, and requires no special client libraries. But DNS was designed for a different era—one of stable hosts and infrequent changes. Understanding both the power and limitations of DNS for service discovery is essential for architectural decisions.
By the end of this page, you will understand how DNS works at a fundamental level, how to leverage DNS for service discovery, the critical limitations of DNS in dynamic environments, modern DNS-based discovery patterns, and when DNS is sufficient versus when you need more sophisticated solutions.
Before exploring DNS for service discovery, let's establish a solid understanding of how DNS operates. This foundation is critical for understanding both its capabilities and limitations.
The DNS Hierarchy
DNS is a hierarchical, distributed database. When you query for api.payments.us-east.company.com, the query traverses:
. (root)
└── com (top-level domain)
└── company (second-level domain)
└── us-east (subdomain)
└── payments (subdomain)
└── api (hostname/service)
Each level can delegate authority to the level below. Your organization controls company.com and can create any subdomains without external coordination. This enables internal service naming schemes like service-name.namespace.svc.cluster.local (Kubernetes) or service-name.service.consul (Consul).
Essential DNS Record Types for Discovery
A Records (Address): map a name directly to an IPv4 address.
payment-service.internal. 300 IN A 172.31.1.10
payment-service.internal. 300 IN A 172.31.1.11
payment-service.internal. 300 IN A 172.31.1.12
Multiple A records for the same name enable round-robin load distribution.
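A client sees all of these addresses in a single lookup and can rotate across them itself. Here is a minimal Go sketch reusing the payment-service.internal name from the records above; it simply resolves the name and picks one address at random.

package main

import (
    "fmt"
    "log"
    "math/rand"
    "net"
)

func main() {
    // One lookup returns every A record published for the name.
    addrs, err := net.LookupHost("payment-service.internal")
    if err != nil {
        log.Fatalf("lookup failed: %v", err)
    }

    // Picking a random address per request gives a simple client-side round-robin.
    target := addrs[rand.Intn(len(addrs))]
    fmt.Printf("all addresses: %v, chosen: %s\n", addrs, target)
}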
AAAA Records: the same as A records, but for IPv6 addresses.
SRV Records (Service): provide richer information: priority, weight, port, and target host.
_payment._tcp.internal. 300 IN SRV 10 50 8080 payment-1.internal.
_payment._tcp.internal. 300 IN SRV 10 50 8080 payment-2.internal.
_payment._tcp.internal. 300 IN SRV 20 100 8080 payment-backup.internal.
Format: priority weight port target
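Standard resolver libraries can consume these records directly. Below is a minimal Go sketch querying the _payment._tcp.internal records shown above, assuming your resolver serves that zone:

package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    // Queries _payment._tcp.internal; results come back sorted by priority
    // and randomized by weight within each priority.
    _, srvs, err := net.LookupSRV("payment", "tcp", "internal")
    if err != nil {
        log.Fatalf("SRV lookup failed: %v", err)
    }

    for _, srv := range srvs {
        // Each record carries its own target host and port, so clients
        // don't need a separate port convention.
        fmt.Printf("target=%s port=%d priority=%d weight=%d\n",
            srv.Target, srv.Port, srv.Priority, srv.Weight)
    }
}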
TXT Records: store arbitrary text, often used for service metadata.
payment-service.internal. 300 IN TXT "version=2.3.1 protocol=grpc"
# Query A records for a service
$ dig payment-service.internal +short
172.31.1.10
172.31.1.11
172.31.1.12

# Query SRV records (includes port information)
$ dig _payment._tcp.internal SRV +short
10 50 8080 payment-1.internal.
10 50 8080 payment-2.internal.

# Query with full details including TTL
$ dig payment-service.internal

;; ANSWER SECTION:
payment-service.internal. 300 IN A 172.31.1.10
payment-service.internal. 300 IN A 172.31.1.11

# Query Consul DNS interface
$ dig @127.0.0.1 -p 8600 payment-service.service.consul SRV

# Query Kubernetes DNS
$ dig payment-service.default.svc.cluster.local +short
10.96.45.123

The Resolution Process
When an application resolves a name, the query typically flows through several steps:
1. The application calls its stub resolver (the OS or language-runtime resolver library).
2. The stub resolver checks its local cache; a hit is returned immediately.
3. On a miss, the query goes to the configured recursive resolver (from /etc/resolv.conf or equivalent).
4. The recursive resolver checks its own cache.
5. On a miss, the recursive resolver walks the hierarchy from the root down to the authoritative name server, then caches the answer.
The caching behavior at steps 2 and 4 is the source of both DNS's efficiency and its challenges for dynamic discovery.
TTL (Time To Live) determines how long DNS responses are cached. Lower TTL means faster discovery updates but higher DNS query load. Higher TTL means better performance but slower propagation of changes. This trade-off is central to DNS-based discovery design.
The simplest form of DNS-based discovery uses standard A records managed by your DNS infrastructure. This approach works well for relatively stable services and requires minimal additional tooling.
Pattern 1: Static DNS Entries
For stable services with infrequent changes:
# Internal DNS zone file
; Database (rarely changes)
db-primary.internal. 600 IN A 10.0.1.10
db-replica-1.internal. 600 IN A 10.0.1.11
db-replica-2.internal. 600 IN A 10.0.1.12
; Cache cluster (moderately stable)
cache.internal. 300 IN A 10.0.2.10
cache.internal. 300 IN A 10.0.2.11
cache.internal. 300 IN A 10.0.2.12
; API gateway (can change with deployments)
api.internal. 60 IN A 10.0.3.10
api.internal. 60 IN A 10.0.3.11
Applications connect using DNS names (db-primary.internal). Changes require updating DNS records and waiting for TTL expiration.
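In code, "connecting by name" is just a dial against the DNS name: resolution happens when the connection is opened, so an updated record takes effect only once caches expire and a new connection is made. A minimal Go sketch (the port 5432 is illustrative):

package main

import (
    "log"
    "net"
    "time"
)

func main() {
    // DNS resolution happens here, at dial time. A changed A record is only
    // picked up when the cached answer expires and a new connection is opened.
    conn, err := net.DialTimeout("tcp", "db-primary.internal:5432", 3*time.Second)
    if err != nil {
        log.Fatalf("dial failed: %v", err)
    }
    defer conn.Close()
    log.Printf("connected to %s", conn.RemoteAddr())
}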
Pattern 2: Dynamic DNS with Registration Scripts
For more dynamic environments, services can register themselves with DNS on startup:
#!/bin/bash
# Service registration script (run on service start)
SERVICE_NAME="payment-service"
INSTANCE_IP=$(hostname -I | awk '{print $1}')
DNS_SERVER="ns1.internal"
ZONE="services.internal"
TTL=30
# Register with dynamic DNS (nsupdate)
nsupdate -k /etc/dns/update.key << EOF
server ${DNS_SERVER}
zone ${ZONE}
update add ${SERVICE_NAME}.${ZONE}. ${TTL} A ${INSTANCE_IP}
send
EOF
echo "Registered ${SERVICE_NAME} at ${INSTANCE_IP}"
# Deregister on shutdown
trap 'nsupdate -k /etc/dns/update.key << EOF
server ${DNS_SERVER}
zone ${ZONE}
update delete ${SERVICE_NAME}.${ZONE}. A ${INSTANCE_IP}
send
EOF' EXIT
# Start the actual service
exec /app/payment-service
This pattern enables dynamic updates but adds complexity: the shared update key must be distributed and protected, instances that die without running the deregistration trap leave stale records behind, and the dynamic DNS server becomes a dependency of every deployment.
Pattern 3: DNS with Load Balancer Integration
Cloud providers often integrate DNS with load balancers:
; AWS Route 53 with ALB
payment.api.company.com. 60 IN ALIAS payment-alb-123.us-east-1.elb.amazonaws.com.
; The ALB handles actual service instance discovery
; DNS just points to the stable load balancer endpoint
This hybrid approach keeps the DNS record stable while the load balancer tracks individual instances, health-checks them, and spreads traffic across them.
This is effectively server-side discovery with DNS as the front door.
Simple DNS-based discovery is often sufficient for small-to-medium deployments (under 50 services), environments with relatively stable service endpoints, developers who want to avoid additional infrastructure, and applications that tolerate short periods of stale discovery data.
DNS works well for its original purpose—resolving relatively stable internet hostnames. But service discovery in dynamic environments exposes fundamental limitations that can cause serious production issues.
Challenge 1: TTL and Caching Problems
DNS responses are cached at multiple levels: the language runtime (the JVM, for example), the OS stub resolver or a local caching daemon (nscd, systemd-resolved), and upstream recursive resolvers. Each cache has its own TTL behavior, and many don't respect the authoritative TTL:
The JVM TTL Problem: With a security manager installed, the JVM caches successful DNS lookups forever; without one, it caches for an implementation-specific period (typically 30 seconds). In the cache-forever case, your Java application will never see DNS updates during its lifetime unless you configure the TTL explicitly.
// Set DNS cache TTL at JVM startup
// In production, set via JVM arguments:
// -Dsun.net.inetaddr.ttl=30 -Dsun.net.inetaddr.negative.ttl=10

// Or programmatically (must run before any DNS lookups)
java.security.Security.setProperty("networkaddress.cache.ttl", "30");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10");

// Or in $JAVA_HOME/lib/security/java.security:
// networkaddress.cache.ttl=30
// networkaddress.cache.negative.ttl=10

Challenge 2: Connection Pool Stickiness
Even with proper DNS TTL settings, connection pools create sticky behavior:
The application doesn't re-resolve DNS for existing connections. HTTP/2 and gRPC with long-lived connections make this worse—connections may persist for hours.
Mitigation strategies include bounding connection lifetimes, periodically recycling idle connections so new connections re-resolve DNS, and spreading requests across more than one connection (see the Go sketch after the table below). Platform DNS cache defaults also vary widely:
| Platform | Default Cache Behavior | Problem | Mitigation |
|---|---|---|---|
| JVM (Java) | Forever with a SecurityManager; ~30s otherwise | Can miss updates entirely | Set networkaddress.cache.ttl |
| Go | None (re-resolves each time) | High DNS load | Connection pooling + TTL-like logic |
| Node.js | Uses OS resolver | Varies by OS | dns.setServers(), custom lookup |
| Python | Uses OS resolver | Varies by OS | socket.setdefaulttimeout() |
| Linux libc | Respects TTL, min 60s common | May over-cache | nscd configuration |
| macOS | Aggressive caching | May over-cache | dscacheutil -flushcache |
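As a sketch of the connection-recycling mitigation for the connection-pool stickiness above, the Go snippet below keeps idle HTTP connections short-lived and drops them periodically so subsequent requests re-resolve DNS. The intervals are illustrative, not recommendations.

package httputil

import (
    "net/http"
    "time"
)

// NewRecyclingClient returns an HTTP client whose pooled connections are
// closed periodically, forcing fresh DNS resolution on later requests.
func NewRecyclingClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     30 * time.Second, // don't keep idle sockets around forever
    }

    // Periodically drop idle connections; the next request dials (and resolves) again.
    // Active connections are untouched, so in-flight requests are not disrupted.
    go func() {
        for range time.Tick(60 * time.Second) {
            transport.CloseIdleConnections()
        }
    }()

    return &http.Client{Transport: transport, Timeout: 5 * time.Second}
}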
Challenge 3: No Health Information
DNS is purely a naming system; it has no concept of service health. A record keeps being returned whether the instance behind it is healthy, overloaded, or gone entirely, so clients keep sending traffic to dead endpoints until the record is removed and caches expire.
Challenge 4: Limited Load Balancing
DNS provides only basic load distribution: round-robin rotation of A records, with no awareness of instance load, capacity, or latency, and caching that can concentrate traffic on whichever addresses a client happened to resolve first.
Challenge 5: No Service Metadata
DNS provides minimal information about services. A and AAAA records carry only addresses, SRV records add port, priority, and weight, and TXT records hold free-form text. There is no structured support for version, protocol, environment, or other routing metadata that dedicated registries expose natively.
In a real incident: An auto-scaling event removed 30% of instances. DNS TTL was 30 seconds, but JVM applications had infinite caching. For 2 hours, 30% of traffic went to non-existent instances, causing timeouts and errors. The fix: Configure JVM DNS TTL and implement connection pool health checking.
Despite its limitations, DNS remains valuable for service discovery when used correctly. Modern systems address DNS limitations through enhanced DNS implementations and careful architecture.
Pattern 1: Consul DNS Interface
Consul exposes its service registry via DNS, combining registry features with DNS simplicity:
# Standard service lookup
$ dig @localhost -p 8600 payment-service.service.consul
172.31.1.10
172.31.1.11
# Only healthy instances (Consul checks health)
$ dig @localhost -p 8600 payment-service.service.consul
# Only returns instances passing health checks
# Tag-based filtering
$ dig @localhost -p 8600 v2.payment-service.service.consul
# Returns only instances tagged with 'v2'
# Datacenter-specific
$ dig @localhost -p 8600 payment-service.service.dc2.consul
# Returns instances in datacenter 'dc2'
Consul DNS provides health-filtered answers (only instances passing their checks are returned), tag- and datacenter-aware lookups, and SRV records that include ports.
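Clients don't need a Consul library to consume this; pointing a resolver at the local agent's DNS port is enough. A minimal Go sketch, assuming an agent serving DNS on 127.0.0.1:8600 as in the dig examples above:

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
)

func main() {
    // Resolver that sends all queries to the local Consul agent's DNS interface.
    resolver := &net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{Timeout: 2 * time.Second}
            return d.DialContext(ctx, network, "127.0.0.1:8600")
        },
    }

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    // Consul only returns instances that are passing their health checks.
    addrs, err := resolver.LookupHost(ctx, "payment-service.service.consul")
    if err != nil {
        log.Fatalf("consul DNS lookup failed: %v", err)
    }
    fmt.Println("healthy instances:", addrs)
}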
Key configuration: TTL
# consul agent configuration
dns_config {
service_ttl {
"*" = "5s" # Default TTL for services
"slow-changing" = "30s" # Higher TTL for stable services
}
node_ttl = "5s"
}
Pattern 2: Kubernetes DNS (CoreDNS)
Kubernetes provides built-in DNS for service discovery via CoreDNS:
# ClusterIP Service (stable virtual IP)
payment-service.default.svc.cluster.local → 10.96.45.123
# Individual pod DNS
10-244-1-5.default.pod.cluster.local → 10.244.1.5
# Headless Service (returns all pod IPs)
payment-service.default.svc.cluster.local → 10.244.1.5, 10.244.1.6, 10.244.1.7
ClusterIP vs Headless Services
ClusterIP Service (default): clients get a single stable virtual IP, and kube-proxy load-balances connections to the ready pods behind it; DNS answers stay constant as pods come and go.
Headless Service (clusterIP: None): no virtual IP is allocated; DNS returns every ready pod IP directly, which suits client-side discovery and stateful workloads that need to address individual pods.
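From inside a pod, the difference shows up in a plain lookup. A minimal Go sketch using the service names from the manifests below:

package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    // ClusterIP Service: a single stable virtual IP; kube-proxy balances behind it.
    vip, err := net.LookupHost("payment-service.default.svc.cluster.local")
    if err != nil {
        log.Fatalf("ClusterIP lookup failed: %v", err)
    }
    fmt.Println("ClusterIP service resolves to:", vip)

    // Headless Service: every ready pod IP, for client-side selection.
    pods, err := net.LookupHost("payment-headless.default.svc.cluster.local")
    if err != nil {
        log.Fatalf("headless lookup failed: %v", err)
    }
    fmt.Println("Headless service resolves to:", pods)
}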
# Standard ClusterIP Service
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment
  ports:
    - port: 80
      targetPort: 8080
# DNS: payment-service.default.svc.cluster.local → 10.96.45.123 (ClusterIP)

---
# Headless Service (for client-side discovery)
apiVersion: v1
kind: Service
metadata:
  name: payment-headless
spec:
  clusterIP: None  # Key difference
  selector:
    app: payment
  ports:
    - port: 80
      targetPort: 8080
# DNS: payment-headless.default.svc.cluster.local → 10.244.1.5, 10.244.1.6, ...

---
# ExternalName Service (CNAME to external service)
apiVersion: v1
kind: Service
metadata:
  name: external-payment
spec:
  type: ExternalName
  externalName: payment-api.external-provider.com
# DNS: external-payment.default.svc.cluster.local → payment-api.external-provider.com

Pattern 3: AWS Route 53 Service Discovery
AWS Cloud Map integrates with Route 53 for service discovery:
# Services registered with Cloud Map become DNS records
payment-srv.services.internal → Auto-updates based on ECS/EC2 instance health
# Features:
# - Automatic registration/deregistration for ECS/EC2
# - Health checking integration
# - Multi-value answer routing
# - Private DNS zones
Cloud Map provides automatic registration and deregistration for ECS and EC2 workloads, health-check integration, multi-value answer routing, and private DNS zones.
Pattern 4: GeoDNS for Global Load Balancing
DNS can route to the closest datacenter:
# Route 53 Geolocation Routing
api.company.com →
(EU users) → eu-west-1-lb.company.com → 52.30.x.x
(US users) → us-east-1-lb.company.com → 54.83.x.x
(Asia users) → ap-southeast-1-lb.company.com → 13.228.x.x
GeoDNS routes each user to the nearest region, reducing latency and giving you a coarse failover lever: pull a region out of DNS and traffic shifts elsewhere, subject as always to TTL and caching.
The most practical DNS pattern: DNS resolves to a stable load balancer (ALB, NLB, NGINX), which handles actual instance discovery and health checking. DNS benefits (universal support, simplicity) without DNS limitations (no health checking, caching issues).
If you choose DNS-based discovery, careful implementation prevents common pitfalls. Here's a comprehensive approach to robust DNS discovery.
1. Configure Client DNS Behavior
Every platform needs explicit DNS TTL configuration:
import { Resolver, RecordWithTtl } from 'node:dns';

// Placeholder for whatever connection type the pool manages (HTTP agent, gRPC channel, ...)
interface Connection {
  close(): void;
}

// Use an explicit resolver so caching behavior stays under our control
const resolver = new Resolver();
resolver.setServers(['10.0.0.53']); // Internal DNS server

// DNS lookup that also returns each record's TTL
async function resolveWithTTL(hostname: string): Promise<RecordWithTtl[]> {
  return new Promise((resolve, reject) => {
    resolver.resolve4(hostname, { ttl: true }, (err, addresses) => {
      if (err) {
        reject(err);
        return;
      }
      // addresses includes TTL information.
      // Cache accordingly, but don't exceed the returned TTL.
      resolve(addresses);
    });
  });
}

// Connection pool with periodic DNS refresh
class DNSAwareConnectionPool {
  private connections: Map<string, Connection[]> = new Map();
  private refreshInterval: NodeJS.Timeout;

  constructor(private serviceName: string) {
    // Periodically refresh DNS and reconcile connections
    this.refreshInterval = setInterval(() => this.refresh(), 30000);
  }

  private async refresh(): Promise<void> {
    const addresses = await resolveWithTTL(this.serviceName);
    const newIPs = new Set(addresses.map(a => a.address));

    // Close connections to IPs no longer in DNS
    for (const [ip, conns] of this.connections) {
      if (!newIPs.has(ip)) {
        conns.forEach(c => c.close());
        this.connections.delete(ip);
      }
    }

    // Open connections to new IPs
    for (const ip of newIPs) {
      if (!this.connections.has(ip)) {
        this.connections.set(ip, await this.openConnections(ip, 5));
      }
    }
  }

  // Opening connections is application-specific (HTTP agent, gRPC channel, ...)
  private async openConnections(ip: string, count: number): Promise<Connection[]> {
    return []; // placeholder
  }
}

2. Implement Client-Side Health Checking
Since DNS doesn't include health information, clients must check health themselves:
package discovery

import (
    "context"
    "errors"
    "net"
    "sync"
    "time"
)

type HealthAwareDNSClient struct {
    serviceName   string
    resolver      *net.Resolver
    healthyHosts  map[string]bool
    mu            sync.RWMutex
    refreshTicker *time.Ticker
    healthChecker *time.Ticker
}

func NewHealthAwareDNSClient(serviceName string) *HealthAwareDNSClient {
    client := &HealthAwareDNSClient{
        serviceName:  serviceName,
        resolver:     &net.Resolver{PreferGo: true},
        healthyHosts: make(map[string]bool),
    }

    // Refresh DNS every 30 seconds
    client.refreshTicker = time.NewTicker(30 * time.Second)
    go client.refreshLoop()

    // Check health every 10 seconds
    client.healthChecker = time.NewTicker(10 * time.Second)
    go client.healthCheckLoop()

    // Initial resolution
    client.refreshDNS()
    return client
}

func (c *HealthAwareDNSClient) refreshLoop() {
    for range c.refreshTicker.C {
        c.refreshDNS()
    }
}

func (c *HealthAwareDNSClient) healthCheckLoop() {
    for range c.healthChecker.C {
        // Application-specific probe goes here (e.g. HTTP GET /health or a TCP
        // dial per host); update c.healthyHosts under c.mu with the results.
    }
}

func (c *HealthAwareDNSClient) refreshDNS() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    ips, err := c.resolver.LookupHost(ctx, c.serviceName)
    if err != nil {
        return
    }

    c.mu.Lock()
    defer c.mu.Unlock()

    // Add new hosts (assume healthy until proven otherwise)
    for _, ip := range ips {
        if _, exists := c.healthyHosts[ip]; !exists {
            c.healthyHosts[ip] = true
        }
    }

    // Remove hosts no longer in DNS
    dnsSet := make(map[string]bool)
    for _, ip := range ips {
        dnsSet[ip] = true
    }
    for ip := range c.healthyHosts {
        if !dnsSet[ip] {
            delete(c.healthyHosts, ip)
        }
    }
}

func (c *HealthAwareDNSClient) GetHealthyHost() (string, error) {
    c.mu.RLock()
    defer c.mu.RUnlock()

    for host, healthy := range c.healthyHosts {
        if healthy {
            return host, nil
        }
    }
    return "", errors.New("no healthy hosts available")
}

3. Tune DNS Infrastructure
Low TTL Trade-offs: shorter TTLs propagate changes faster, but they multiply query load on your DNS servers and add resolution latency on every cache miss. Pick the lowest TTL your DNS infrastructure can actually sustain.
DNS Server Capacity Planning:
Worst case (no caching): Queries/sec ≈ Number of clients × Requests/sec per client
With pooling and caching: Queries/sec ≈ Number of clients / TTL (roughly one resolution per client per TTL)
Example:
- 1000 client instances
- Each makes 100 HTTP requests/second
- If every request triggered a DNS query (worst case): 100,000 queries/sec
- With connection pooling and a 30s TTL: about 1000 / 30 ≈ 33 queries/sec in aggregate
- Plan DNS infrastructure for the realistic case, with headroom for cache-expiry spikes
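The same estimate as a small Go helper, handy for sanity-checking your own numbers (the pooled-connections assumption means roughly one resolution per client per TTL):

package main

import "fmt"

// estimatedDNSQPS returns the aggregate DNS query rate when each client
// re-resolves a name roughly once per TTL (i.e. connections are pooled).
func estimatedDNSQPS(clients int, ttlSeconds float64) float64 {
    return float64(clients) / ttlSeconds
}

func main() {
    // 1000 clients with a 30s TTL → about 33 queries/sec in aggregate.
    fmt.Printf("~%.0f queries/sec\n", estimatedDNSQPS(1000, 30))
}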
4. Implement Fallbacks
DNS infrastructure can fail. Plan for it: cache the last known good addresses in the client, keep a static emergency endpoint list for critical dependencies, and alert on sustained resolution failures instead of letting them surface only as downstream timeouts. A minimal sketch of the client side follows.
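This Go sketch wraps resolution in a hypothetical FallbackResolver that remembers the last good answer and falls back to a static list when DNS is unavailable; the type and its behavior are illustrative, not a standard library feature.

package discovery

import (
    "context"
    "net"
    "sync"
    "time"
)

// FallbackResolver remembers the last good answer and serves it when DNS fails.
type FallbackResolver struct {
    resolver *net.Resolver
    mu       sync.RWMutex
    lastGood map[string][]string // name -> last successfully resolved addresses
    static   map[string][]string // optional hard-coded emergency endpoints
}

func NewFallbackResolver(static map[string][]string) *FallbackResolver {
    return &FallbackResolver{
        resolver: net.DefaultResolver,
        lastGood: make(map[string][]string),
        static:   static,
    }
}

func (f *FallbackResolver) Resolve(name string) ([]string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    addrs, err := f.resolver.LookupHost(ctx, name)
    if err == nil && len(addrs) > 0 {
        f.mu.Lock()
        f.lastGood[name] = addrs // remember the last good answer
        f.mu.Unlock()
        return addrs, nil
    }

    // DNS failed: serve the last known good answer, then the static fallback.
    f.mu.RLock()
    defer f.mu.RUnlock()
    if cached, ok := f.lastGood[name]; ok {
        return cached, nil
    }
    if fixed, ok := f.static[name]; ok {
        return fixed, nil
    }
    return nil, err
}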
Setting TTL to 0 doesn't guarantee no caching—many resolvers have minimum cache times (often 30-60 seconds). And even TTL 0 may overwhelm your DNS infrastructure at scale. Always plan for some caching and implement client-side health checking.
When should you use DNS-based discovery versus a dedicated service registry? This decision depends on your system's characteristics and requirements.
| Requirement | Pure DNS | Consul DNS | Registry-Based |
|---|---|---|---|
| Universal client support | ✓ Excellent | ✓ Good | ✗ Requires library |
| Health-aware routing | ✗ No | ✓ Yes | ✓ Yes |
| Fast update propagation | ✗ TTL-limited | ○ Short TTL + blocking | ✓ Push-based |
| Rich metadata | ✗ Limited (TXT) | ✓ Via HTTP API | ✓ Full support |
| Multi-datacenter | ○ GeoDNS | ✓ Native | ✓ Native |
| Operational complexity | Low | Medium | Medium-High |
| Infrastructure cost | Low | Medium | Medium |
The Hybrid Answer
In practice, many production systems use layered approaches:
External traffic: DNS → Cloud Load Balancer → Services
Internal service-to-service: Kubernetes DNS OR Service Mesh
Cross-environment: Consul or similar registry
The key insight: DNS is almost always part of the solution, but it's often not the complete solution. Use DNS where its simplicity helps, and supplement with registries where its limitations hurt.
Begin with DNS → Load Balancer patterns. Add a registry only when you hit DNS limitations. Many successful systems never need more than Kubernetes Services. Don't adopt complex discovery infrastructure to solve problems you don't have yet.
Whether using DNS as your primary discovery mechanism or as a layer in a larger system, these best practices prevent common problems.
# CoreDNS ConfigMap for production
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready

        # Kubernetes DNS
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30   # Control cache TTL
        }

        # Forward external queries
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }

        # Caching
        cache 30

        # Rate limiting
        ratelimit 100

        # Metrics for monitoring
        prometheus :9153

        loop
        reload
        loadbalance
    }

DNS issues are notoriously difficult to debug. Implement monitoring for resolution latency, failure rates, query volumes, cache hit rates, and unexpected NXDOMAIN responses. When DNS fails, it often fails silently—applications just can't reach other services.
We've explored DNS as a service discovery mechanism from end to end: its fundamentals, its capabilities, its limitations, and the practices that keep it reliable. The essential insights: DNS gives you universal, zero-dependency name resolution; TTLs and client-side caches make it slow to reflect change; it carries no health information or metadata; and the most robust deployments pair DNS with load balancers, health-aware DNS servers (Consul, CoreDNS), or a full registry once those limits start to hurt.
What's Next:
We've now covered the foundational discovery mechanisms. In our final page, we'll focus on Kubernetes service discovery — the platform that's become the de facto standard for container orchestration. You'll learn how Kubernetes Services, DNS, and kube-proxy work together to provide discovery, and how to extend Kubernetes discovery for multi-cluster and hybrid scenarios.
You now have deep knowledge of DNS-based service discovery—when to use it, its limitations, and how to implement it robustly. You can make informed decisions about when DNS is sufficient and when you need more sophisticated discovery mechanisms.