Before a single byte of application data travels across the internet, before any TCP handshake begins, before your load balancer even knows a request is coming—a critical routing decision has already been made. That decision happens in the Domain Name System (DNS), the internet's hierarchical naming infrastructure that translates human-readable hostnames into IP addresses.
DNS-based load balancing leverages this universal resolution step to influence where traffic flows. By controlling which IP addresses the DNS system returns for your hostname, you can distribute users across servers, data centers, or even continents—all before the first packet leaves the user's device.
This approach is foundational to GSLB, CDN routing, and virtually every globally distributed internet service. Understanding DNS-based load balancing mechanics is essential for architecting resilient, performant systems.
By the end of this page, you will have mastered: how DNS resolution enables traffic distribution; the role of authoritative servers, recursive resolvers, and caching; TTL management and its impact on failover speed; EDNS Client Subnet (ECS) for improved routing accuracy; Round-Robin DNS and its limitations; and advanced DNS patterns including weighted records and failover configurations.
Before diving into DNS-based load balancing, we must understand how DNS resolution works. The journey from hostname to IP address involves multiple parties, each with different responsibilities and caching behaviors.
The Resolution Chain:
When a user's browser requests api.example.com, the following sequence occurs:
1. Client Stub Resolver: The operating system's built-in resolver checks its local cache. If the hostname was recently resolved, the cached IP is returned immediately.
2. Recursive Resolver: If not cached locally, the query goes to a recursive resolver (typically operated by the ISP, or a public resolver like Google's 8.8.8.8 or Cloudflare's 1.1.1.1). The recursive resolver does the heavy lifting of traversing the DNS hierarchy.
3. Root Servers: The recursive resolver queries root servers to find the authoritative servers for the .com TLD.
4. TLD Servers: The .com TLD servers direct to the authoritative nameservers for example.com.
5. Authoritative Servers: Your authoritative nameservers (where DNS-based load balancing logic lives) return the final IP address(es) for api.example.com.
6. Response Propagation: The answer flows back through the chain, with each layer caching according to the record's TTL (Time To Live).
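The chain above can be sketched as a toy recursive resolver with a TTL-aware cache. This is a deliberate simplification: real resolvers handle NS delegation, negative caching, and much more, and the `authoritative` function here is a hypothetical stand-in for the full root-to-TLD-to-authoritative walk.

```python
import time

class CachingResolver:
    """Toy recursive resolver illustrating TTL-based caching."""

    def __init__(self, authoritative_lookup):
        # authoritative_lookup stands in for the root -> TLD -> authoritative walk
        self._lookup = authoritative_lookup
        self._cache = {}  # hostname -> (ips, expiry_timestamp)

    def resolve(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        cached = self._cache.get(hostname)
        if cached and cached[1] > now:      # answer still within its TTL
            return cached[0]
        ips, ttl = self._lookup(hostname)   # cache miss: full hierarchy traversal
        self._cache[hostname] = (ips, now + ttl)
        return ips

# Hypothetical authoritative data: two A records with a 60-second TTL
lookups = {"count": 0}

def authoritative(hostname):
    lookups["count"] += 1
    return (["203.0.113.10", "203.0.113.11"], 60)

resolver = CachingResolver(authoritative)
resolver.resolve("api.example.com", now=0.0)   # miss: hits authoritative
resolver.resolve("api.example.com", now=30.0)  # hit: served from cache
resolver.resolve("api.example.com", now=61.0)  # TTL expired: hits authoritative again
```

Note that the authoritative server was only consulted twice for three resolutions: the middle query never left the resolver's cache, which is exactly why your load balancing decisions at step 5 only take effect as caches expire.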
The Critical Insight for Load Balancing:
DNS-based load balancing operates at step 5—the authoritative nameserver response. By returning different IP addresses based on policy (geographic, health, weighted), you control where users' subsequent connections are directed.
However, the caching layers introduce critical constraints:
Your authoritative DNS server typically sees the IP address of the recursive resolver, not the end user. A user in Tokyo using Google's 8.8.8.8 might appear to be coming from a Google data center in the US. This is why EDNS Client Subnet (covered later) is critical for accurate geographic routing.
The simplest form of DNS-based load balancing is Round-Robin DNS (RRDNS). When multiple A records exist for a hostname, DNS servers rotate the order in which IP addresses are returned. Clients typically connect to the first IP in the list, resulting in traffic distribution across servers.
Configuration Example:
```dns
; Round-Robin DNS Configuration for api.example.com
; Each A record points to a different server
; DNS servers rotate the order for each query

$ORIGIN example.com.
$TTL 300

api  IN  A  203.0.113.10  ; Server 1
api  IN  A  203.0.113.11  ; Server 2
api  IN  A  203.0.113.12  ; Server 3
api  IN  A  203.0.113.13  ; Server 4

; Queries receive all IPs, but in rotating order:
; Query 1: 203.0.113.10, 203.0.113.11, 203.0.113.12, 203.0.113.13
; Query 2: 203.0.113.11, 203.0.113.12, 203.0.113.13, 203.0.113.10
; Query 3: 203.0.113.12, 203.0.113.13, 203.0.113.10, 203.0.113.11
; etc.
```

How Round-Robin Works:

The authoritative server returns all four A records in every response but rotates their order per query. Since most clients connect to the first address in the list, successive resolutions spread new connections across the servers.
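The rotation can be simulated to see why traffic spreads evenly. This is a sketch of the behavior, not of any real DNS server's internals:

```python
from collections import Counter

SERVERS = ["203.0.113.10", "203.0.113.11", "203.0.113.12", "203.0.113.13"]

def rrdns_response(query_number: int) -> list:
    """Return all A records, rotated one position per query."""
    offset = query_number % len(SERVERS)
    return SERVERS[offset:] + SERVERS[:offset]

# Most clients connect to the first IP in the answer, so over many
# queries new connections spread evenly across the four servers.
first_ip_counts = Counter(rrdns_response(q)[0] for q in range(1000))
```

Over 1000 queries each server is "first" exactly 250 times; the caching layers discussed below are what break this ideal distribution in practice.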
Advantages of Round-Robin DNS:

- Simplicity: just publish multiple A records; no special hardware or software is required.
- Universal compatibility: works with any standards-compliant DNS server.
- No added infrastructure: there is no load balancer appliance to operate or to fail.
- Essentially free: the cost is a few extra zone file lines.
Critical Limitations:

Despite its simplicity, Round-Robin DNS has severe limitations that make it unsuitable as a sole load balancing mechanism for production systems:

- No health awareness: a failed server's IP continues to be handed out until an operator removes the record and downstream caches expire.
- Cache-driven skew: a large recursive resolver caches one ordering and serves it to thousands of users, concentrating their traffic on a single server.
- No load or capacity awareness: distribution is by query count, not by actual connection count, request cost, or server capacity.
- Unpredictable client behavior: resolvers and operating systems may re-sort or pin addresses, defeating the intended rotation.
Round-Robin DNS remains useful in specific scenarios: internal services with short TTLs and health checks at other layers; as a supplement to application-layer load balancing (distributing traffic across multiple load balancer VIPs); or for non-critical services where brief outages are acceptable. Never rely on RRDNS alone for production traffic serving end users.
Time To Live (TTL) is the critical parameter that determines how long DNS responses are cached before expiration. TTL management directly impacts failover speed, DNS query volume, and the accuracy of traffic steering.
The TTL Spectrum:
| TTL Range | Use Case | Advantages | Disadvantages |
|---|---|---|---|
| 1-30 seconds | Real-time traffic steering, active failover | Near-instant failover; highly responsive to changes | Massive DNS query volume; latency exposure from resolution |
| 60-300 seconds | Standard GSLB deployments | Balanced failover speed and query volume | Minutes to propagate changes globally |
| 300-3600 seconds | Stable services, cost optimization | Lower DNS costs; reduced resolution latency | Slow failover; changes take 1+ hours to propagate |
| 3600+ seconds | Static resources, legacy systems | Minimal DNS load; heavily cached | Impractical for any dynamic routing |
Failover Timing Analysis:
Let's analyze what happens during a data center failure with different TTL configurations:
Scenario: Data center A serving 50% of global traffic fails at T=0. You update DNS to remove A's IP.
With 60-second TTL:

- For up to roughly 60 seconds after the DNS update, resolvers still holding the cached record keep directing users (up to 50% of global traffic) to the failed data center.
- Once caches expire, traffic converges on the healthy data center; total user-visible impact is approximately detection time plus update time plus one TTL.
With 3600-second (1 hour) TTL:

- Cached records can persist for up to a full hour after the update; affected users see failures throughout that window.
- In practice the tail is longer: resolvers that stretch TTLs, plus OS and application caches, can keep some users pinned to the dead IP well past the hour.
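A back-of-envelope model makes the difference concrete. Assuming resolver caches were populated uniformly over the TTL window before the failure (a simplifying assumption; real expiry is lumpier), the fraction of caches still holding the stale record at time t after the update is roughly max(0, 1 - t/TTL):

```python
def stale_fraction(t_seconds: float, ttl_seconds: float) -> float:
    """Fraction of resolver caches still holding the old record
    t seconds after a DNS update, assuming cache-fill times are
    uniformly distributed over the TTL window."""
    if ttl_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - t_seconds / ttl_seconds)

# Failed DC served 50% of traffic; estimate failing traffic over time
for ttl in (60, 3600):
    print(f"TTL={ttl}s:")
    for t in (30, 60, 300, 1800, 3600):
        failing = 0.5 * stale_fraction(t, ttl)
        print(f"  t={t:>4}s  ~{failing:.1%} of global traffic still failing")
```

With a 60-second TTL the failing fraction hits zero within a minute of the update; with a one-hour TTL, a quarter of global traffic is still failing 30 minutes in.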
```yaml
# Production TTL Strategy Configuration
dns_ttl_policy:
  # Primary application endpoints - balance failover speed and query volume
  production:
    default_ttl: 60
    health_aware: true
    on_degraded:
      ttl: 30  # Reduce TTL when health is uncertain to accelerate updates

  # CDN and static assets - longer TTL acceptable
  static_assets:
    default_ttl: 3600
    on_origin_change:
      purge: true  # Cache purge instead of TTL expiration

  # Critical failover records - shortest practical TTL
  disaster_recovery:
    default_ttl: 30
    failover_ttl: 10  # Extremely aggressive during active incident

  # Internal services - can tolerate short TTL overhead
  internal:
    default_ttl: 30
    health_check_interval: 10

# Common anti-patterns to avoid
anti_patterns:
  - description: "TTL of 0"
    issue: "Many resolvers treat 0 as 'cache indefinitely' or reject"
  - description: "TTL > 86400 (24 hours)"
    issue: "Changes become operationally dangerous; any update takes a day+"
  - description: "Different TTLs for A and AAAA records"
    issue: "IPv4/IPv6 traffic routes differently during changes"
```

TTL Honoring Reality:
While TTL defines the intended caching duration, not all resolvers honor it:

- Some resolvers enforce a minimum TTL, silently raising very low values to something in the tens of seconds.
- Some ISP resolvers extend TTLs to reduce upstream query load, delaying propagation of your changes.
- Caching also happens at layers you don't control: operating systems, browsers, and language runtimes keep their own DNS caches with independent lifetimes.
Mitigation Strategies:

- Treat the TTL as a lower bound on propagation time, not a guarantee; plan failover runbooks around the slowest realistic cache behavior.
- Drain rather than drop: keep a removed endpoint serving (or redirecting) until residual traffic from stale caches decays.
- Measure actual traffic shift after every DNS change instead of assuming TTL-based timing.
By default, the JVM caches successful DNS lookups forever when a security manager is installed, and for an implementation-defined interval (typically 30 seconds) otherwise. Either way, long-running Java services may not pick up DNS-based failover unless you bound the cache by setting networkaddress.cache.ttl in the java.security file or programmatically. Many production outages have traced back to this default behavior.
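As a sketch, the cache lifetime can be bounded in the JVM's security properties file (the 60- and 10-second values below are illustrative choices, not recommendations):

```properties
# $JAVA_HOME/conf/security/java.security
# Cache successful lookups for 60 seconds instead of indefinitely
networkaddress.cache.ttl=60
# Cache failed lookups only briefly so transient errors don't stick
networkaddress.cache.negative.ttl=10
```

The same properties can be set programmatically via java.security.Security.setProperty before any name resolution occurs.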
EDNS Client Subnet (ECS) is an extension to DNS that helps authoritative servers make better routing decisions by revealing information about the client's network location. It addresses a fundamental limitation of traditional DNS: your authoritative server sees the recursive resolver's IP, not the user's.
The Problem ECS Solves:
Consider a user in Mumbai using Google's public DNS resolver (8.8.8.8). When their query reaches your authoritative server, you see an IP address belonging to Google's infrastructure—potentially in a different country. Without additional information, you might route this user to a suboptimal data center.
How ECS Works:
With ECS enabled:

1. The user's client queries its recursive resolver as usual.
2. The resolver forwards the query to your authoritative server, attaching a truncated portion of the client's address (e.g., 203.0.113.0/24).
3. Your authoritative server uses this subnet, rather than the resolver's IP, to select the best endpoint for the user's actual location.
4. The response carries a scope prefix telling the resolver which client subnets may share the cached answer.

ECS in Practice:
```
# Standard DNS Query (No ECS) - Limited Information
; Query from recursive resolver to authoritative

;; QUESTION SECTION:
;api.example.com.    IN    A

;; Authoritative sees:
;; Source IP: 8.8.8.8 (Google resolver)
;; No client info - must route for Google's IP location

# ========================================

# DNS Query WITH ECS - Enhanced Location Info
; Query from ECS-enabled resolver to authoritative

;; QUESTION SECTION:
;api.example.com.    IN    A

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; CLIENT-SUBNET: 203.0.113.0/24/0

;; Authoritative sees:
;; Source IP: 8.8.8.8 (Google resolver)
;; Client Subnet: 203.0.113.0/24 (User's network)
;; Can route based on user location!

# Response includes scope for caching:
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 203.0.113.0/24/24
;; This tells resolver to cache per /24
```

ECS Deployment Considerations:
- Prefix granularity: a /24 prefix (256 addresses) balances location accuracy with cache efficiency. Longer prefixes (e.g., /32) fragment resolver caches; shorter prefixes (e.g., /16) reduce location accuracy.
- Privacy trade-off: ECS reveals part of the user's address to authoritative servers; RFC 7871 recommends truncated prefixes and client opt-out support.

ECS provides the largest benefit when: your users heavily use public DNS resolvers (Google, Cloudflare, OpenDNS); you have data centers in many regions where nearest-DC routing significantly impacts latency; and you're using latency-based routing where resolver location is a poor proxy for user location. For simpler deployments with region-level routing, resolver IP may be sufficient.
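The routing decision an ECS-aware authoritative server makes can be sketched with the standard library's ipaddress module. The prefix-to-datacenter table here is hypothetical; a real deployment would consult a GeoIP or latency database:

```python
import ipaddress

# Hypothetical mapping of client prefixes to the nearest data center
PREFIX_TO_DC = {
    ipaddress.ip_network("203.0.113.0/24"): "ap-south (Mumbai)",
    ipaddress.ip_network("198.51.100.0/24"): "us-east (Virginia)",
}
DEFAULT_DC = "eu-west (Dublin)"

def route(resolver_ip, ecs_subnet=None):
    """Pick a data center using the ECS client subnet when present,
    falling back to the recursive resolver's IP otherwise."""
    if ecs_subnet:
        source = ipaddress.ip_network(ecs_subnet)
    else:
        source = ipaddress.ip_network(f"{resolver_ip}/32")
    for prefix, dc in PREFIX_TO_DC.items():
        if source.subnet_of(prefix):
            return dc
    return DEFAULT_DC

# Without ECS, the Mumbai user behind 8.8.8.8 looks like Google and
# gets the default DC; with ECS, they are routed correctly.
print(route("8.8.8.8"))                    # resolver IP only
print(route("8.8.8.8", "203.0.113.0/24"))  # ECS subnet attached
```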
Beyond simple round-robin and ECS-enhanced geographic routing, modern DNS infrastructure supports sophisticated traffic management patterns. These patterns enable fine-grained control over traffic distribution, failover behavior, and capacity management.
Weighted DNS Records:
Weighted routing allows you to specify the proportion of traffic each endpoint receives. This is essential for capacity-proportional distribution, gradual migrations, and canary deployments.
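Weight-proportional selection itself is simple; this sketch mimics one common way an authoritative server could pick an answer per query (a sampling approach, not any specific vendor's implementation):

```python
import random

# Hypothetical endpoints with a 70/30 weight split
ENDPOINTS = [("203.0.113.10", 70), ("203.0.113.20", 30)]

def pick_endpoint(rng: random.Random) -> str:
    """Choose an endpoint with probability proportional to its weight."""
    ips = [ip for ip, _ in ENDPOINTS]
    weights = [w for _, w in ENDPOINTS]
    return rng.choices(ips, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
picks = [pick_endpoint(rng) for _ in range(10_000)]
share = picks.count("203.0.113.10") / len(picks)  # converges toward 0.70
```

Note that each endpoint receives weight / (sum of weights) of queries, not of traffic: resolver caching means one query can represent many users, which is why weighted DNS is approximate rather than exact.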
```terraform
# AWS Route 53 Weighted Routing Policy
resource "aws_route53_record" "api_weighted_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  weighted_routing_policy {
    weight = 70 # 70% of traffic
  }

  set_identifier  = "primary"
  records         = ["203.0.113.10"]
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "api_weighted_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  weighted_routing_policy {
    weight = 30 # 30% of traffic
  }

  set_identifier  = "secondary"
  records         = ["203.0.113.20"]
  health_check_id = aws_route53_health_check.secondary.id
}

# Use case examples:
# - Canary deployment: weight=99 (stable) vs weight=1 (canary)
# - Blue/green: weight=100 (blue) vs weight=0 (green), swap during deploy
# - Capacity: weight=70 (large DC) vs weight=30 (smaller DC)
```

Failover DNS Configuration:
Failover routing defines explicit primary/secondary relationships between endpoints. Traffic flows to the primary until health checks determine it's unavailable, at which point traffic automatically shifts to secondary endpoints.
```terraform
# AWS Route 53 Failover Routing Policy
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary-dc.internal.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "30"
  regions           = ["us-east-1", "eu-west-1", "ap-northeast-1"]
}

resource "aws_route53_record" "api_failover_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  records         = ["203.0.113.10"]
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "api_failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
  records        = ["203.0.113.20"]
  # Secondary typically always healthy (DR site)
}
```

Multi-Value Answer Routing:
This pattern returns multiple IP addresses in each DNS response (like round-robin) but with health checking. Unhealthy endpoints are automatically removed from responses. Clients receive only healthy IPs and can implement client-side failover.
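A client consuming a multi-value answer can implement that failover by trying each returned IP in turn. A sketch, where connect_fn is a hypothetical stand-in for opening a TCP/TLS connection:

```python
def connect_with_failover(ips, connect_fn):
    """Try each IP from a multi-value DNS answer until one succeeds."""
    errors = {}
    for ip in ips:
        try:
            return connect_fn(ip)   # e.g., open a TCP/TLS connection
        except ConnectionError as exc:
            errors[ip] = str(exc)   # remember the failure, try next IP
    raise ConnectionError(f"all endpoints failed: {errors}")

# Hypothetical answer where one endpoint died after health checks passed
answer = ["203.0.113.10", "203.0.113.11"]

def fake_connect(ip):
    if ip == "203.0.113.10":
        raise ConnectionError("connection refused")
    return f"connected to {ip}"

result = connect_with_failover(answer, fake_connect)  # falls through to .11
```

This client-side retry is what makes multi-value routing resilient to the gap between a server failing and health checks removing it from DNS responses.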
| Pattern | Traffic Distribution | Health Aware | Best For |
|---|---|---|---|
| Round-Robin | Equal, rotating | No | Non-critical internal services |
| Weighted | Proportional to weights | Yes (with health checks) | Capacity management, canary deploys |
| Failover | All to primary, then secondary | Yes (drives failover) | Active-passive DR |
| Latency-Based | Lowest measured RTT | Yes | Performance optimization |
| Geolocation | By user location | Yes | Data residency, regional services |
| Multi-Value | Multiple healthy IPs | Yes | Client-side resilience |
Combining Patterns with Policy Trees:
Advanced DNS platforms support policy trees that evaluate multiple routing rules in sequence. A typical production configuration might:

- First apply geolocation rules (e.g., keep EU users in EU regions for data residency).
- Within the selected region, distribute by latency or by capacity weights across data centers.
- Apply health checks throughout, falling back to a failover rule when preferred endpoints are down.
This layered approach enables sophisticated traffic management while remaining operationally manageable.
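As an illustration, a policy tree can be modeled as an ordered chain of rules, each narrowing the candidate set. This is a simplification of what managed DNS platforms evaluate per query; the rule names and endpoint fields are hypothetical:

```python
# Each rule takes (query_context, candidates) and returns surviving candidates.
def healthy_only(ctx, candidates):
    alive = [c for c in candidates if c["healthy"]]
    return alive or candidates  # never return an empty answer

def geo_filter(ctx, candidates):
    in_region = [c for c in candidates if c["region"] == ctx["user_region"]]
    return in_region or candidates  # fall back to all if no regional match

def lowest_latency(ctx, candidates):
    return [min(candidates, key=lambda c: c["rtt_ms"])]

def evaluate(policy_tree, ctx, candidates):
    for rule in policy_tree:
        candidates = rule(ctx, candidates)
    return candidates[0]

endpoints = [
    {"ip": "203.0.113.10", "region": "eu", "healthy": True,  "rtt_ms": 20},
    {"ip": "203.0.113.20", "region": "us", "healthy": True,  "rtt_ms": 90},
    {"ip": "203.0.113.30", "region": "eu", "healthy": False, "rtt_ms": 15},
]
choice = evaluate([healthy_only, geo_filter, lowest_latency],
                  {"user_region": "eu"}, endpoints)
```

The unhealthy EU endpoint is filtered first, so the healthy EU endpoint wins despite its higher RTT; ordering the rules differently would change the outcome, which is why rule order is part of the policy.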
Always test DNS routing policies from multiple locations before production deployment. Tools like dig, kdig (with ECS support), and global DNS testing services (e.g., WhatsMyDNS, DNSChecker) help verify that routing behaves as expected from different geographic regions and resolver contexts.
DNS-based load balancing makes DNS a critical path component. If DNS fails or is compromised, your entire service becomes unreachable regardless of backend health. Security and reliability engineering for DNS is paramount.
High Availability for DNS:
Authoritative DNS must be highly available and globally distributed:

- Anycast nameservers: announce the same nameserver IPs from many locations so queries reach the nearest healthy site and attack traffic is absorbed across the fleet.
- Multiple NS records on independent networks: the protocol expects several nameservers; place them in separate failure domains.
- Provider redundancy: for mission-critical domains, consider a secondary managed DNS provider to survive a provider-wide outage.
DNS Security Threats:
| Attack | Description | Mitigation |
|---|---|---|
| DNS Cache Poisoning | Attacker injects false records into resolver caches | DNSSEC, Response Rate Limiting (RRL) |
| DNS Amplification DDoS | Attacker uses DNS servers to amplify traffic toward victim | RRL, disable open resolvers, anycast |
| DNS Hijacking | Attacker modifies DNS responses via MITM | DNSSEC, DNS over HTTPS (DoH), DNS over TLS (DoT) |
| Domain Hijacking | Attacker gains control of domain registration | Registrar lock, DNSSEC, monitoring |
| Subdomain Takeover | Orphaned DNS records point to attacker-controlled resources | DNS hygiene, automated scanning |
DNSSEC:
DNS Security Extensions (DNSSEC) add cryptographic signatures to DNS responses, enabling clients to verify that responses haven't been tampered with. While DNSSEC adoption has been slow, it's increasingly important for high-security environments.
Monitoring DNS Health:
Given DNS's critical role, comprehensive monitoring is essential:

- Resolution success rate and latency measured from multiple geographic vantage points and major public resolvers.
- Verification that answers match intended records, catching hijacking, stale propagation, or misconfigured policies.
- DNSSEC signature validity and expiry, if deployed (expired signatures cause hard failures for validating resolvers).
- Alerting on unexpected changes to NS records, A/AAAA records, and registrar settings.
Your domain registrar controls your NS records. If an attacker gains access to your registrar account, they can redirect your entire domain. Use strong authentication (hardware keys), registrar lock features, and monitor for unauthorized changes. Consider premium registrars with enhanced security controls for mission-critical domains.
DNS-based load balancing is a foundational technique that turns the universal DNS resolution step into an intelligent routing decision point. Let's consolidate what we've learned:

- Routing decisions happen at the authoritative nameserver, but take effect through a chain of caching resolvers you do not control.
- Round-Robin DNS is simple but lacks health awareness; use it only alongside other mechanisms, never alone for production traffic.
- TTL is the lever trading failover speed against query volume, and not every resolver honors it.
- EDNS Client Subnet restores the client-location visibility that recursive resolvers otherwise hide.
- Weighted, failover, latency-based, geolocation, and multi-value policies enable production-grade traffic management, especially when combined in policy trees.
- DNS is a critical path component: invest in availability (anycast, provider redundancy), security (DNSSEC, registrar locks), and monitoring.
What's Next:
With DNS-based load balancing understood, we'll explore Anycast Routing—a network-layer technique that complements DNS by routing packets to the topologically nearest server sharing the same IP address. Anycast is the backbone of CDN performance and global service distribution.
You've mastered DNS-based load balancing—from fundamental resolution mechanics through advanced routing patterns and security considerations. You understand how DNS serves as the invisible traffic controller enabling global service distribution. Next, we'll explore Anycast routing for network-layer geographic distribution.