Before sophisticated service registries, before container orchestration, before the term 'microservices' existed—there was DNS. The Domain Name System, conceived in 1983 and standardized in 1987, is the internet's original service discovery mechanism. When you type google.com into your browser, DNS translates that human-readable name into an IP address your computer can route to.
DNS is so fundamental to internet infrastructure that it's easy to overlook as a service discovery mechanism. Yet DNS-based discovery remains relevant today, serving as the backbone for many cloud platforms and the conceptual foundation for more sophisticated approaches. Understanding DNS-based service discovery isn't just historical curiosity—it's practical knowledge you'll apply whether using Kubernetes, cloud load balancers, or designing your own discovery systems.
This page provides a comprehensive exploration of DNS-based service discovery: how it works, its capabilities and limitations, and when it's the right choice for your architecture.
By the end of this page, you will understand DNS fundamentals relevant to service discovery, standard DNS record types (A, AAAA, SRV, CNAME), caching mechanisms, TTL management, DNS-based load balancing strategies, and the inherent limitations of DNS for dynamic environments. You'll learn when DNS is sufficient and when you need more sophisticated approaches.
DNS is a hierarchical, distributed naming system that translates human-readable domain names into IP addresses. To understand how DNS enables service discovery, we must first understand its core components and resolution process.
The DNS Hierarchy:
DNS operates as a distributed database organized in a tree structure. At the top is the root zone (represented by .), followed by top-level domains (TLDs like .com, .org, .io), then second-level domains (example.com), and finally subdomains (api.example.com).
DNS Resolution Process:
When a service needs to resolve a hostname, it follows this general process:

1. The client checks its local caches (application-level and OS-level) for an unexpired answer.
2. On a cache miss, the client queries its configured recursive resolver.
3. The resolver checks its own cache; if the record is missing or expired, it walks the hierarchy: the root servers, then the TLD servers, then the domain's authoritative servers.
4. The authoritative server returns the records (with their TTLs), and each layer caches the answer on the way back to the client.
5. The client uses the returned IP address(es) until the cached entry expires, then repeats the process.
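To make the resolution path concrete, here is a minimal sketch using Node.js's built-in dns module (not part of the original examples; the service name is hypothetical). It contrasts resolving through the OS resolver, which most HTTP clients use under the hood, with querying the configured DNS servers directly.

```typescript
// Minimal sketch: resolving a service name from Node.js.
// The hostname below is a hypothetical internal service name.
import { lookup, resolve4 } from 'node:dns/promises';

async function discover(hostname: string) {
  // lookup() goes through the OS resolver (and its cache),
  // which is what most HTTP clients use under the hood.
  const viaOs = await lookup(hostname);
  console.log('OS resolver answer:', viaOs.address);

  // resolve4() queries the configured DNS servers directly and
  // can return every A record plus the TTL the server sent.
  const viaDns = await resolve4(hostname, { ttl: true });
  for (const record of viaDns) {
    console.log(`A ${record.address} (TTL ${record.ttl}s)`);
  }
}

discover('user-service.internal.example.com').catch(console.error);
```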
Key DNS Concepts for Service Discovery:
| Concept | Definition | Service Discovery Relevance |
|---|---|---|
| Authoritative Server | DNS server that holds the master records for a domain | Where you configure service endpoints |
| Recursive Resolver | DNS server that performs full resolution on client's behalf | Introduces caching, affects propagation time |
| TTL (Time-To-Live) | Duration in seconds a record can be cached | Controls staleness vs. query load trade-off |
| Round-Robin | Returning multiple IPs in rotating order | Basic load distribution mechanism |
| Zone | Administrative namespace within the DNS hierarchy | Organizational boundary for service records |
| NXDOMAIN | Response indicating the domain doesn't exist | Indicates service is not registered/available |
For service-to-service discovery, organizations typically run internal DNS servers that are reachable only from the private network, often combined with 'split-horizon DNS' (serving different answers to internal and external clients). These internal zones (e.g., '.internal', '.local', '.svc.cluster.local') are not resolvable from the public internet, providing both security and flexibility.
DNS supports multiple record types, each serving different purposes. For service discovery, several record types are particularly important. Understanding when to use each type is crucial for effective DNS-based discovery.
A Records (IPv4):

```
hostname IN A IPv4-address

api.example.com. 300 IN A 10.0.1.100
```

```
; DNS Zone file for service discovery example
; Each service has multiple A records for different instances

; User Service - 3 instances for high availability
user-service.internal.example.com.    60  IN A 10.0.1.10
user-service.internal.example.com.    60  IN A 10.0.1.11
user-service.internal.example.com.    60  IN A 10.0.1.12

; Order Service - 2 instances with geographic distribution
order-service.internal.example.com.   60  IN A 10.0.2.20
order-service.internal.example.com.   60  IN A 10.0.2.21

; Payment Service - single instance (less critical path)
payment-service.internal.example.com. 300 IN A 10.0.3.30

; Note the TTL values:
; - 60 seconds for services that change frequently
; - 300 seconds for more stable services
```

AAAA Records (IPv6):

```
hostname IN AAAA IPv6-address

api.example.com. 300 IN AAAA 2001:db8::1
```

SRV Records (service location):

```
_service._protocol.name TTL class SRV priority weight port target

_http._tcp.api.example.com. 60 IN SRV 10 50 8080 api1.example.com.
```

```
; SRV Record Format: priority weight port target
; Lower priority = preferred; weight = load distribution within same priority

; Primary user service (priority 10) with weighted load balancing
_grpc._tcp.user-service.internal.example.com. 60 IN SRV 10 60 9090 user-1.internal.example.com.
_grpc._tcp.user-service.internal.example.com. 60 IN SRV 10 40 9090 user-2.internal.example.com.

; Backup/DR user service (priority 20) - only used if primary fails
_grpc._tcp.user-service.internal.example.com. 60 IN SRV 20 100 9090 user-dr.failover.example.com.

; Order service with different port for different protocol
_http._tcp.order-service.internal.example.com. 60 IN SRV 10 50 8080 order-1.internal.example.com.
_http._tcp.order-service.internal.example.com. 60 IN SRV 10 50 8080 order-2.internal.example.com.

; gRPC endpoint on different port
_grpc._tcp.order-service.internal.example.com. 60 IN SRV 10 50 9090 order-1.internal.example.com.
_grpc._tcp.order-service.internal.example.com. 60 IN SRV 10 50 9090 order-2.internal.example.com.
```

CNAME Records (aliases):

```
alias-name IN CNAME canonical-name

api.example.com. 60 IN CNAME api-blue.example.com.
```

Choosing a record type:

| Scenario | Recommended Record Type | Reasoning |
|---|---|---|
| Simple IP-based discovery | A / AAAA | Direct mapping, widely supported |
| Non-standard ports | SRV | Only SRV includes port information |
| Weighted load balancing | SRV | SRV weight field enables proportional routing |
| Failover scenarios | SRV | SRV priority field enables primary/backup |
| Blue-green deployments | CNAME | Single record change switches all traffic |
| Geographic routing | Multiple A + GeoDNS | Different records per geographic region |
| Cloud load balancer integration | CNAME or ALIAS | Points to cloud-managed endpoints |
Not all applications and libraries support SRV record resolution. Many HTTP clients and standard libraries only support A/AAAA records. Before planning an SRV-based discovery strategy, verify that your technology stack supports SRV lookups or be prepared to implement custom resolver logic.
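If your stack lacks SRV support, a small resolver shim is often enough. The following is a minimal sketch in Node.js, using the hypothetical gRPC SRV records from the zone-file example above: it selects the lowest-priority group (primary before backup/DR) and makes a weight-proportional choice within it.

```typescript
// Minimal sketch of custom SRV resolution in Node.js.
// The service name below matches the hypothetical zone-file example.
import { resolveSrv } from 'node:dns/promises';

async function resolveEndpoint(serviceName: string) {
  // Each SRV answer carries priority, weight, port, and a target host.
  const records = await resolveSrv(serviceName);

  // Prefer the lowest-priority group (primary before backup/DR).
  const lowestPriority = Math.min(...records.map((r) => r.priority));
  const candidates = records.filter((r) => r.priority === lowestPriority);

  // Weighted random pick within that priority group.
  const totalWeight = candidates.reduce((sum, r) => sum + r.weight, 0);
  let roll = Math.random() * totalWeight;
  for (const r of candidates) {
    roll -= r.weight;
    if (roll <= 0) return { host: r.name, port: r.port };
  }
  return { host: candidates[0].name, port: candidates[0].port };
}

resolveEndpoint('_grpc._tcp.user-service.internal.example.com')
  .then((endpoint) => console.log('Selected endpoint:', endpoint))
  .catch(console.error);
```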
Time-To-Live (TTL) is the most critical configuration parameter for DNS-based service discovery. TTL determines how long a DNS response can be cached before requiring a fresh lookup. This single value creates a fundamental trade-off that shapes the behavior of your discovery system.
The Core Trade-off:

A short TTL means clients re-resolve frequently, so record changes propagate quickly, but DNS query volume (and your dependence on DNS availability) increases. A long TTL reduces query load and tolerates brief DNS outages, but clients may keep using stale addresses long after instances have moved or been removed.
Understanding Propagation Delay:
When you update a DNS record, the change doesn't take effect immediately across all clients. The propagation delay is determined by the previous TTL value (not the new one). Consider this timeline:
Time 0:00 - Client A queries DNS, receives IP 10.0.1.1 (TTL: 300 seconds)
Time 1:00 - You update DNS record to point to 10.0.2.2
Time 1:30 - Client B queries DNS, receives new IP 10.0.2.2
Time 5:00 - Client A's cache expires, next query gets 10.0.2.2
Result: Client A used stale IP for 4 minutes after the change
This propagation delay is inherent to DNS caching. In the worst case, a client that cached the record just before a change will continue using the old value for the full TTL duration.
The Multi-Layer Caching Problem:
DNS caching occurs at multiple layers, each with potentially different behavior:
| Layer | Typical TTL Behavior | Impact on Service Discovery |
|---|---|---|
| Application Cache | May ignore TTL, cache indefinitely | Most dangerous—may never refresh |
| OS DNS Cache | Generally respects TTL | Usually well-behaved |
| Local DNS Resolver | Respects TTL, may have minimums | 5-30 second minimum TTL common |
| Corporate DNS | May override TTL for performance | Can extend staleness unexpectedly |
| ISP Recursive Resolver | Generally respects TTL | Usually well-behaved |
| CDN/Edge DNS | May have their own caching logic | Varies by provider |
```typescript
// DANGER: Connection pooling can defeat DNS TTL
// This is a common anti-pattern that causes service discovery failures

import { Pool } from 'pg';

// ❌ BAD: Connection pool caches initial DNS resolution
const pool = new Pool({
  host: 'database.internal.example.com',
  port: 5432,
  max: 20,
  // Connection pool maintains persistent connections
  // DNS is only resolved when NEW connections are created
  // If all 20 connections are healthy, DNS is NEVER re-resolved
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});

// After a database failover:
// - DNS updates to point to new primary
// - But pool still has connections to OLD primary (now a replica)
// - All writes fail until connections are manually recycled

// ✅ BETTER: Implement connection recycling
const poolWithRecycling = new Pool({
  host: 'database.internal.example.com',
  port: 5432,
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
  // Force connections to be recycled periodically
  maxLifetimeSeconds: 300, // Recycle connections every 5 minutes
});

// ✅ BEST: Use a discovery-aware connection manager
// Many modern database drivers support automatic failover
// Example: Using a connection string with multiple endpoints
const connectionString =
  'postgresql://user:pass@primary.db.example.com:5432,' +
  'replica1.db.example.com:5432,replica2.db.example.com:5432' +
  '/mydb?target_session_attrs=read-write';
```

For internal service discovery, use TTLs between 30-60 seconds as a starting point. This balances reasonable propagation delay with acceptable DNS load. For services that change very frequently (auto-scaled, frequently deployed), consider TTLs as low as 5-10 seconds, but ensure your DNS infrastructure can handle the increased query volume.
DNS can provide basic load balancing by returning multiple IP addresses for a single hostname. Understanding the different DNS load balancing strategies helps you choose the right approach for your requirements.
Round-Robin DNS:
The simplest DNS load balancing strategy is round-robin: the DNS server returns all available IP addresses in a rotating order. Each subsequent query receives the same IPs in a different sequence, distributing clients across instances.
```
# First query - returns IPs in order A, B, C
$ dig +short api.example.com A
10.0.1.1
10.0.1.2
10.0.1.3

# Second query - rotated to B, C, A
$ dig +short api.example.com A
10.0.1.2
10.0.1.3
10.0.1.1

# Third query - rotated to C, A, B
$ dig +short api.example.com A
10.0.1.3
10.0.1.1
10.0.1.2

# Most clients use the first IP returned
# Round-robin distributes first-choice across instances
```

Limitations of Round-Robin DNS:
Uneven Distribution: Clients typically use only the first IP returned. Caching means the same client repeatedly uses the same server until TTL expires.
No Health Awareness: DNS has no knowledge of backend health. Unhealthy instances remain in the rotation until manually removed.
Session Affinity: No built-in session affinity. Different requests from the same client may hit different backends.
Caching Defeats Balancing: Once a client caches a response, all its requests go to the same server until cache expires.
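One common mitigation for the first and last limitations is to resolve all A records client-side and rotate across them, rather than relying on the order the resolver returns. The following is a minimal Node.js sketch; the hostname and refresh interval are illustrative assumptions.

```typescript
// Minimal sketch: client-side rotation across all A records,
// re-resolving periodically so the list roughly tracks the TTL.
// The hostname and refresh interval are illustrative assumptions.
import { resolve4 } from 'node:dns/promises';

class RotatingEndpointList {
  private addresses: string[] = [];
  private index = 0;

  constructor(private hostname: string, private refreshMs = 30_000) {}

  async start() {
    await this.refresh();
    setInterval(() => this.refresh().catch(console.error), this.refreshMs);
  }

  private async refresh() {
    const records = await resolve4(this.hostname);
    if (records.length > 0) this.addresses = records;
  }

  // Round-robin over every address, not just the first one returned.
  next(): string {
    const address = this.addresses[this.index % this.addresses.length];
    this.index += 1;
    return address;
  }
}

// Usage: pick a fresh backend for each outgoing request.
const backends = new RotatingEndpointList('api.example.com');
backends.start().then(() => console.log('next backend:', backends.next()));
```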
Weighted DNS Distribution:
For more control over traffic distribution, you can use weighted DNS. This is commonly achieved through SRV records or specialized DNS servers that support weights:
```
# SRV records with weights for proportional distribution
# Weights are relative within the same priority level

; 70% of traffic to primary data center
_http._tcp.api.example.com. 60 IN SRV 10 70 8080 api-primary.example.com.

; 30% of traffic to secondary data center
_http._tcp.api.example.com. 60 IN SRV 10 30 8080 api-secondary.example.com.

; DR site only if both primary and secondary fail (priority 20)
_http._tcp.api.example.com. 60 IN SRV 20 100 8080 api-dr.example.com.

# Alternative: Some DNS providers support weighted A records
# This is provider-specific and not standard DNS
api.example.com. 60 IN A 10.0.1.1 ; weight: 70
api.example.com. 60 IN A 10.0.2.1 ; weight: 30
```

Geographic DNS (GeoDNS):
GeoDNS returns different IP addresses based on the geographic location of the client. This enables routing users to the nearest data center, reducing latency and enabling geographic redundancy.
Major cloud providers offer DNS-based load balancing services (AWS Route 53, Google Cloud DNS, Azure Traffic Manager) with health checking, geographic routing, weighted distribution, and latency-based routing. These extend standard DNS with active health probing and automated failover.
Despite its ubiquity and simplicity, DNS has fundamental limitations that make it insufficient for highly dynamic environments. Understanding these limitations helps you recognize when DNS is appropriate and when you need more sophisticated discovery mechanisms.
The Staleness Problem Quantified:
Let's quantify the staleness problem. Assume you're running a service with roughly 100 client instances, a 60-second TTL on the service's A records, and client caches whose expiration times are spread across the TTL window.
When you remove an instance from DNS:
| Time After Change | Clients with Stale Data | Requests to Dead Instance |
|---|---|---|
| Immediately | ~50 clients (50%) | 50% of requests fail |
| 15 seconds | ~37 clients (37%) | 37% of requests fail |
| 30 seconds | ~25 clients (25%) | 25% of requests fail |
| 45 seconds | ~12 clients (12%) | 12% of requests fail |
| 60 seconds | ~0 clients | Normal operation resumes |
Even with a 60-second TTL, you experience up to 60 seconds of degraded operation after removing an unhealthy instance. In high-traffic systems, this translates to thousands of failed requests.
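As a back-of-the-envelope sketch (assuming, as in the table above, that the stale-client fraction decays roughly linearly to zero over one TTL, and using an illustrative request rate), the failed-request count can be estimated like this:

```typescript
// Back-of-the-envelope estimate of failed requests after removing an
// instance from DNS. All inputs are illustrative assumptions.
function estimateFailedRequests(
  requestsPerSecond: number, // total traffic to the service
  initialStaleFraction: number, // fraction of clients still holding the old record
  ttlSeconds: number, // TTL on the service's A records
): number {
  // If the stale fraction decays roughly linearly from its initial value
  // to zero over one TTL, the average stale fraction is half the initial
  // value, so failed requests ≈ rps * (initialFraction / 2) * TTL.
  return requestsPerSecond * (initialStaleFraction / 2) * ttlSeconds;
}

// Example: 1,000 req/s, 50% of clients initially stale, 60-second TTL
// → roughly 15,000 requests routed to the dead instance.
console.log(estimateFailedRequests(1000, 0.5, 60));
```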
The Health Check Gap:
Consider this timeline showing the gap between instance failure and DNS propagation:
Time 0:00 - Instance becomes unhealthy (process crash, OOM, etc.)
Time 0:01 - External monitoring detects failure
Time 0:05 - Alert fired, on-call responds
Time 0:10 - Operator removes instance from DNS
Time 1:10 - Last client's cache expires (60s TTL)
Total exposure: 70 seconds of routing traffic to a dead instance
Automated health checking (available in cloud DNS services) reduces this gap but doesn't eliminate it—the health check interval plus propagation delay still creates a window of vulnerability.
Many DNS resolvers enforce minimum TTLs (commonly 30 seconds) regardless of what the authoritative server returns. Some corporate DNS servers may enforce multi-minute minimums for performance. Your configured 5-second TTL might become 60+ seconds in practice, dramatically extending your staleness window.
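One way to check what your resolvers actually do is to compare the TTL you configured with the TTL your recursive resolver hands back. A minimal sketch, assuming Node.js (the hostname and expected TTL are placeholders):

```typescript
// Minimal sketch: detect whether a recursive resolver is enforcing a
// minimum TTL larger than the one configured on the authoritative side.
// Hostname and expected TTL are placeholder assumptions.
import { resolve4 } from 'node:dns/promises';

async function checkEffectiveTtl(hostname: string, configuredTtl: number) {
  // With { ttl: true }, each answer includes the remaining TTL as seen
  // by the resolver that answered the query.
  const answers = await resolve4(hostname, { ttl: true });
  for (const { address, ttl } of answers) {
    const note =
      ttl > configuredTtl
        ? `resolver reports ${ttl}s, likely enforcing a minimum TTL`
        : `resolver reports ${ttl}s, within the configured ${configuredTtl}s`;
    console.log(`${hostname} -> ${address}: ${note}`);
  }
}

checkEffectiveTtl('user-service.internal.example.com', 5).catch(console.error);
```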
Despite its limitations, DNS remains valuable for service discovery when augmented with modern techniques and tooling. Let's explore how contemporary systems use DNS effectively.
Cloud-Based DNS with Health Checks:
Cloud DNS services (AWS Route 53, Google Cloud DNS, Azure Traffic Manager) extend standard DNS with active health checking. The DNS service periodically probes each endpoint and automatically removes unhealthy instances from responses.
```yaml
# AWS Route 53 Health Check and DNS Configuration
# Health checks enable automatic failover

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  # Health check for API instance 1
  ApiInstance1HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTP
        ResourcePath: /health
        FullyQualifiedDomainName: api-1.example.com
        Port: 443
        RequestInterval: 10   # Check every 10 seconds
        FailureThreshold: 3   # 3 failures = unhealthy
        MeasureLatency: true

  # DNS Record Set with health check
  ApiDNSRecord:
    Type: AWS::Route53::RecordSetGroup
    Properties:
      HostedZoneName: example.com.
      RecordSets:
        - Name: api.example.com.
          Type: A
          SetIdentifier: api-instance-1
          Weight: 50
          TTL: 60
          ResourceRecords:
            - 10.0.1.1
          HealthCheckId: !Ref ApiInstance1HealthCheck
        - Name: api.example.com.
          Type: A
          SetIdentifier: api-instance-2
          Weight: 50
          TTL: 60
          ResourceRecords:
            - 10.0.1.2
          HealthCheckId: !Ref ApiInstance2HealthCheck

# Route 53 automatically removes unhealthy instances from responses
# Health check failure → Instance removed from DNS → No TTL wait
```

Internal DNS with Service Meshes:
Modern service meshes (Istio, Linkerd) often use DNS for initial service naming while implementing their own discovery and routing layer. DNS provides the entry point; the mesh handles dynamic routing.
Headless Services for Client-Side Discovery:
In Kubernetes, 'headless services' return all pod IPs instead of a single cluster IP. This enables clients to implement their own discovery and load balancing logic while still using DNS for name resolution.
```yaml
# Kubernetes Headless Service
# DNS returns all pod IPs instead of a cluster IP

apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
spec:
  clusterIP: None   # This makes it a headless service
  selector:
    app: user-service
  ports:
    - port: 8080
      targetPort: 8080
      name: http
    - port: 9090
      targetPort: 9090
      name: grpc

---
# DNS resolution for headless service returns all pod IPs:
#
# $ nslookup user-service.production.svc.cluster.local
#
# Name: user-service.production.svc.cluster.local
# Address: 10.244.1.15 (pod 1)
# Address: 10.244.2.23 (pod 2)
# Address: 10.244.3.31 (pod 3)
#
# Client libraries like gRPC can use this for client-side load balancing
```

Modern systems often use DNS as the foundation for service naming while building more sophisticated discovery on top. DNS provides stable, well-understood naming; additional layers provide health checking, real-time updates, and advanced routing. This layered approach combines DNS's ubiquity with dynamic discovery's flexibility.
DNS-based discovery isn't obsolete—it's appropriate for many scenarios. Understanding when DNS is sufficient helps you avoid over-engineering while recognizing when more sophisticated approaches are warranted.
| Factor | DNS Sufficient | Registry Recommended | Service Mesh Recommended |
|---|---|---|---|
| Instance change frequency | < 1/hour | 1-10/hour | > 10/hour |
| Deployment frequency | Weekly | Daily | Continuous |
| Health check latency required | > 60 seconds | < 30 seconds | < 5 seconds |
| Number of services | < 10 | 10-50 | > 50 |
| Load balancing needs | Round-robin | Weighted, priority | L7, traffic shaping |
| Metadata requirements | None | Version, tags | Rich observability |
| Team expertise | Basic ops | Platform team | Dedicated platform |
You don't have to choose one approach for all services. Many organizations use DNS for stable external dependencies and database connections while using service registries for microservice-to-microservice communication. Match the discovery mechanism to the dynamism of the service.
We've comprehensively explored DNS-based service discovery—from fundamental concepts to practical limitations. Let's consolidate the key insights:

- DNS is the internet's original, universally supported discovery mechanism: A/AAAA records map names to addresses, SRV records add ports, weights, and priorities, and CNAMEs enable aliasing patterns such as blue-green cutovers.
- TTL is the central tuning knob. It trades staleness against query load, and propagation after a change is governed by the old TTL, not the new one.
- Caching happens at many layers (application, OS, local and recursive resolvers, corporate DNS), and long-lived connection pools can defeat TTLs entirely.
- DNS load balancing (round-robin, weighted SRV, GeoDNS) is coarse and health-unaware unless augmented by provider features such as Route 53 health checks.
- Modern systems layer dynamism on top of DNS naming: cloud DNS with health checks, service meshes, and Kubernetes headless services.
- DNS is sufficient for small, stable, slowly changing environments; highly dynamic microservice fleets need registries or a service mesh.
What's Next:
While DNS provides a foundation, dynamic environments require more sophisticated discovery mechanisms. In the next page, we'll explore Service Registries—dedicated systems designed specifically for service discovery. You'll learn about popular registry implementations (Consul, etcd, Eureka), registration patterns, health checking, and how registries overcome DNS's limitations.
You now have a comprehensive understanding of DNS-based service discovery. You understand how DNS works for discovery, its record types and caching mechanisms, and its fundamental limitations. This foundation prepares you to appreciate why service registries were developed and when each approach is appropriate.