DNS Caching Mechanisms

Level: Intermediate

Duration: 60 mins

Topic: DNS Caching

DNS Cache Management: Operational Excellence

The Art of Cache Management

Effective DNS cache management is the operational discipline that ensures DNS caching achieves its intended benefits—reduced latency, lower query volumes, and improved resilience—while minimizing the drawbacks of stale data, security vulnerabilities, and resource exhaustion.

DNS cache management encompasses:

Cache sizing and capacity planning — Ensuring caches have adequate capacity for workload
TTL policy decisions — Balancing freshness against efficiency for different record types
Cache flushing and invalidation — Clearing stale or incorrect data when needed
Monitoring and observability — Understanding cache health and performance
Change management integration — Coordinating DNS changes with cache behavior
Troubleshooting procedures — Diagnosing and resolving cache-related issues

This page covers the operational practices that network administrators use to manage DNS caches across enterprise, ISP, and service provider environments.

What You Will Learn

By the end of this page, you will understand cache sizing and capacity planning, TTL strategy development, cache flushing techniques across platforms, monitoring and alerting for cache health, integration with change management processes, and advanced cache optimization techniques.

Cache Sizing and Capacity Planning

DNS cache capacity directly impacts performance and hit rates. Undersized caches evict entries prematurely, reducing hit rates and increasing upstream query load. Oversized caches waste memory with minimal marginal benefit.

Cache Size Determinants:

Client population — More clients generate more unique queries requiring cache entries
Query diversity — Organizations accessing many unique domains need larger caches
TTL distribution — Lower average TTLs mean entries expire faster, affecting working set size
Peak query rate — High query rates during business hours drive cache requirements
Memory availability — Physical memory constraints limit practical cache size

Sizing Guidelines:

DNS Cache Sizing Recommendations
Environment	Client Count	Recommended Cache Size	Memory Estimate
Home network	5-20 devices	500-1000 entries	1-2 MB
Small office	50-100 devices	5,000-10,000 entries	10-20 MB
Medium enterprise	1,000-5,000 devices	50,000-200,000 entries	50-200 MB
Large enterprise	10,000+ devices	500,000-2,000,000 entries	500 MB-2 GB
ISP resolver	100,000+ subscribers	10,000,000+ entries	10+ GB

BIND Cache Configuration:

# named.conf options for cache management
options {
    # Maximum cache size (memory limit)
    max-cache-size 256M;   # Limit cache to 256 MB
    
    # Alternative: percentage of physical memory. Use this instead of the
    # line above; named.conf accepts only one max-cache-size statement.
    # max-cache-size 50%;  # Use up to 50% of system memory
    
    # Maximum cache TTL (cap very high TTLs)
    max-cache-ttl 86400;   # 24 hours maximum
    
    # Maximum negative cache TTL
    max-ncache-ttl 3600;   # 1 hour for NXDOMAIN caching
    
    # Minimum cache TTL (floor for very low TTLs)
    min-cache-ttl 60;      # Don't cache less than 60 seconds
};

Unbound Cache Configuration:

# unbound.conf cache settings
server:
    # Cache memory size
    msg-cache-size: 128m
    rrset-cache-size: 256m
    
    # Number of cache slabs (performance tuning)
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    
    # Cache TTL limits
    cache-min-ttl: 60
    cache-max-ttl: 86400
    cache-max-negative-ttl: 3600
    
    # Prefetch popular entries shortly before they expire
    prefetch: yes
    # Fetch DNSKEYs early during DNSSEC validation
    prefetch-key: yes
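
The TTL limits in both configurations act as a clamp on each record's TTL. The following is a minimal sketch of that arithmetic, assuming the 60-second floor and 86400-second ceiling used above. `effective_ttl` is a hypothetical helper name for illustration, not a resolver API.

```python
def effective_ttl(record_ttl: int, min_ttl: int = 60, max_ttl: int = 86400) -> int:
    """Return the TTL a resolver would cache under min/max TTL limits."""
    return max(min_ttl, min(record_ttl, max_ttl))


# A 5-second TTL is raised to the floor, a 7-day TTL is cut to the ceiling.
for ttl in (5, 300, 604800):
    print(ttl, "->", effective_ttl(ttl))
```

Raising the floor improves hit rates but delays changes to records that publish very low TTLs, so the floor is usually kept small.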

Monitor Before Sizing

Before setting cache sizes, monitor actual cache usage. Most DNS servers expose cache statistics showing current entries, hit rates, and memory usage. Size based on observed peak usage plus 20-50% headroom for growth and traffic spikes.
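
The headroom rule can be turned into a quick estimate. This sketch assumes roughly 1 KB of memory per cached entry, which is in line with the sizing table above but is only an approximation; real per-entry cost depends on the server and the record mix. `size_cache` is a hypothetical helper.

```python
def size_cache(peak_entries: int, headroom_pct: int = 30,
               bytes_per_entry: int = 1024) -> tuple[int, int]:
    """Return (entries, bytes) for a cache sized at observed peak plus headroom."""
    # Integer ceiling division avoids floating-point rounding surprises.
    entries = -(-peak_entries * (100 + headroom_pct) // 100)
    return entries, entries * bytes_per_entry


entries, size_bytes = size_cache(peak_entries=150_000, headroom_pct=30)
print(f"{entries} entries, about {size_bytes / 2**20:.0f} MiB")
```

The result is a starting point for `max-cache-size` or `rrset-cache-size`; re-check it against monitoring data after deployment.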

TTL Policy Development

Organizations should develop explicit TTL policies rather than accepting default values. Well-designed TTL policies balance operational needs (failover speed, change agility) against infrastructure efficiency (cache hit rates, upstream load).

TTL Policy Framework:

TTL Policy Considerations

•Record type — NS records can have high TTLs; A/AAAA records for dynamic infrastructure need lower TTLs
•Change frequency — Frequently updated records need lower TTLs; stable infrastructure can use higher TTLs
•Failover requirements — RTO (Recovery Time Objective) requirements determine maximum acceptable TTL
•Geographic distribution — CDN-backed services may need very low TTLs for geographic steering
•Cost implications — Low TTLs on high-traffic domains increase managed DNS costs

Example TTL Policy Matrix
Record Category	Recommended TTL	Rationale	Change Procedure
NS records (delegation)	86400-172800s (24-48h)	Rarely change; high cache value	Planned change window
MX records (mail)	3600-86400s (1-24h)	Mail server changes uncommon	Coordinated with mail ops
A/AAAA (stable services)	3600s (1h)	Balance of efficiency/agility	Standard change process
A/AAAA (cloud/dynamic)	300s (5min)	IPs may change frequently	Automated/frequent changes
A/AAAA (active failover)	60-120s (1-2min)	Fast failover required	Health-check driven
CNAME records	Match target TTL	Should align with target	Per target policy
TXT (verification)	300-3600s	May need updates for validation	As needed
SOA MINIMUM (neg cache)	300-600s (5-10min)	New subdomain responsiveness	Rarely changed
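
The link between RTO and maximum TTL can be worked out with a simple worst-case model. The sketch below assumes a resolver cached the old record just before the DNS update, so clients may keep the failed address for the detection time, plus the update time, plus one full TTL. The function name and the example numbers are illustrative only.

```python
def max_ttl_for_rto(rto_s: int, detection_s: int, update_s: int) -> int:
    """Largest TTL (seconds) that still meets the RTO in the worst case.

    Worst case: a resolver cached the old record just before the DNS update,
    so clients see the failure for detection + update + one full TTL.
    """
    return max(0, rto_s - detection_s - update_s)


# 5-minute RTO, 90 s to detect the failure, 30 s to publish the new record.
print(max_ttl_for_rto(rto_s=300, detection_s=90, update_s=30))  # 180
```

A result of 0 means DNS-based failover alone cannot meet the RTO under this model.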

Documenting TTL Policies:

Enterprise DNS documentation should include:

Standard TTL values by record type
Exceptions process for non-standard TTL requests
Pre-change procedures (TTL lowering before changes)
Post-change procedures (TTL restoration after verification)
Emergency procedures for rapid changes
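
The pre-change and post-change procedures imply a simple timeline: the TTL must be lowered at least one full old TTL before the change, and verification is meaningful one lowered TTL after it. A minimal sketch of that schedule follows; `ttl_change_plan` and the example date are hypothetical.

```python
from datetime import datetime, timedelta


def ttl_change_plan(change_at: datetime, current_ttl_s: int,
                    lowered_ttl_s: int) -> dict[str, datetime]:
    """Key timestamps for a DNS change that uses a temporary low TTL."""
    return {
        # Lower the TTL at least one full old TTL before the change, so every
        # cache has picked up the short TTL by the time the record changes.
        "lower_ttl_by": change_at - timedelta(seconds=current_ttl_s),
        "apply_change": change_at,
        # After one lowered TTL, caches that honor TTLs hold the new record.
        "verify_after": change_at + timedelta(seconds=lowered_ttl_s),
    }


plan = ttl_change_plan(datetime(2024, 1, 17, 22, 0),
                       current_ttl_s=3600, lowered_ttl_s=300)
for step, when in plan.items():
    print(f"{step}: {when:%Y-%m-%d %H:%M}")
```

Restore the standard TTL only after verification succeeds, since a long TTL on a wrong record is slow to correct.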

Example Zone File with Policy-Driven TTLs:

$TTL 3600                           ; Default 1-hour TTL for most records

; SOA with 5-minute negative cache TTL
@       IN SOA  ns1.example.com. admin.example.com. (
                2024011701 7200 3600 1209600 300 )

; NS records: 24-hour TTL (stable infrastructure)
@       86400   IN NS   ns1.example.com.
@       86400   IN NS   ns2.example.com.

; MX records: 1-hour TTL (default)
@               IN MX   10 mail1.example.com.
@               IN MX   20 mail2.example.com.

; Primary website: 1-hour TTL (default)
www             IN A    93.184.216.34

; API with active health checks: 2-minute TTL
api     120     IN A    10.0.1.100
api     120     IN A    10.0.1.101

; CDN edge: 1-minute TTL (geographic steering)
cdn     60      IN CNAME example.com.cdn.cloudflare.net.

Test TTL Behavior Before Production

Before implementing TTL policies, test resolver behavior. Some ISP resolvers enforce TTL minimums; some browsers override TTLs. Monitor actual propagation times during test changes to validate that TTL policies achieve intended behavior.

Cache Flushing Techniques

Cache flushing removes entries from DNS caches before TTL expiry. Flushing is required when:

DNS records change and immediate propagation is needed
Poisoned or incorrect entries must be cleared
Troubleshooting requires fresh resolution
Testing requires predictable cache state

Flushing Scope Considerations:

Scope	When to Use	Impact
Specific record	Single record correction	Minimal—only affects one entry
Zone flush	Zone-wide changes	Moderate—clears all zone entries
Full cache flush	Major incident response	Significant—temporary latency increase

Platform-Specific Flushing Commands:

cache-flush-commands.sh
# ===== BIND DNS Server =====
 
# Flush entire cache
rndc flush
 
# Flush a single name (that exact name only)
rndc flushname example.com
 
# Flush a name and everything below it
rndc flushtree example.com
 
# View cache statistics before/after
rndc stats    # Writes to named.stats
 
# ===== Unbound DNS Server =====
 
# Flush entire cache
unbound-control flush_zone .
 
# Flush specific zone
unbound-control flush_zone example.com
 
# Flush specific name
unbound-control flush example.com
 
# Flush one record type for a name
unbound-control flush_type example.com A
 
# View cache statistics
unbound-control stats_noreset
 
# ===== Windows DNS Server =====
 
# PowerShell: Clear entire server cache
Clear-DnsServerCache
 
# Note: Clear-DnsServerCache does not accept a record name;
# it clears the server cache as a whole
 
# View cached records
Show-DnsServerCache
 
# ===== Client-Side Flushing =====
 
# Windows client
ipconfig /flushdns
 
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
 
# Linux (systemd-resolved)
resolvectl flush-caches
 
# Linux (nscd)
sudo nscd -i hosts

Flushing Public Resolvers:

Public DNS providers don't expose administrative flush commands to outside operators, but some offer web-based purge tools:

Google Public DNS:

Visit: https://dns.google/cache
Enter the domain and record type to request a flush
Affects Google's resolvers only; other caches still wait for TTL expiry

Cloudflare DNS:

Visit: https://1.1.1.1/purge-cache/
Enter domain to request cache purge
May take a few minutes to propagate

Automated Flushing Scripts:

#!/bin/bash
# Flush DNS across infrastructure after change

DOMAIN="$1"

if [ -z "$DOMAIN" ]; then
    echo "Usage: $0 <domain>"
    exit 1
fi

echo "Flushing DNS caches for: $DOMAIN"

# Flush internal resolvers (BIND servers)
for server in dns1.corp.example.com dns2.corp.example.com; do
    echo "Flushing $server..."
    ssh "$server" "rndc flushtree '$DOMAIN'" 2>/dev/null || echo "  Failed: $server"
done

# Notify team
echo "Cache flush complete. TTL-based propagation continues."
echo "Verify: dig @8.8.8.8 $DOMAIN"

Don't Over-Flush

Frequent full cache flushing defeats the purpose of caching—it increases upstream query load and latency. Use targeted flushes (specific names or zones) whenever possible. Reserve full cache flushes for genuine incidents, not routine operations.

Cache Monitoring and Observability

Effective cache management requires visibility into cache behavior. Monitoring DNS cache metrics enables capacity planning, performance optimization, and anomaly detection.

Key Metrics to Monitor:

DNS Cache Metrics
Metric	What It Measures	Healthy Range	Alert Threshold
Cache Hit Rate	% queries served from cache	80-95%	< 70%
Cache Size (entries)	Number of cached records	Within max capacity	90% capacity
Cache Memory Usage	Memory consumed by cache	Within allocation	85% allocation
Query Rate	Queries per second	Within capacity	Sustained > 90% capacity
Upstream Query Rate	Cache misses forwarded	~5-20% of total	30% of total
SERVFAIL Rate	Failed resolutions	< 0.1%	1%
Average Latency	Response time	< 50ms cached	100ms sustained

Extracting BIND Statistics:

# Generate statistics dump
rndc stats

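# Parse cache statistics from named.stats (path is set by statistics-file)
grep -A5 'Cache Statistics' /var/named/data/named.stats

# Illustrative output (counters and layout vary by BIND version):
# ++ Cache Statistics ++
# [View: default]
#              1523847 cache hits
#               234521 cache misses
#
# The hit rate is not printed; compute it from the counters:
# 1523847 / (1523847 + 234521) = about 86.7%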

# Enable query logging (verbose, use sparingly)
rndc querylog on

Extracting Unbound Statistics:

# Get current statistics
# (stats resets the counters by default; use stats_noreset to leave them intact)
unbound-control stats

# Key metrics in output:
# total.num.queries=1284753
# total.num.cachehits=1089428
# total.num.cachemiss=195325
# total.num.recursivereplies=195325
# mem.cache.rrset=98745632
# msg.cache.count=45231

# Calculate hit rate:
# hit_rate = cachehits / queries = 1089428/1284753 = 84.8%
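
The hit-rate calculation above is easy to automate. This sketch parses the `key=value` lines that `unbound-control` prints and reproduces the 84.8% figure from the sample counters; the function names are hypothetical, and the sample text is pasted in rather than read from a live server.

```python
def parse_unbound_stats(text: str) -> dict[str, float]:
    """Parse `key=value` lines as printed by `unbound-control stats_noreset`."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank or malformed lines
            stats[key] = float(value)
    return stats


def hit_rate(stats: dict[str, float]) -> float:
    """Cache hit rate as a fraction of all queries (0.0 when idle)."""
    queries = stats.get("total.num.queries", 0.0)
    return stats.get("total.num.cachehits", 0.0) / queries if queries else 0.0


sample = """\
total.num.queries=1284753
total.num.cachehits=1089428
total.num.cachemiss=195325
"""
print(f"hit rate: {hit_rate(parse_unbound_stats(sample)):.1%}")
```

In practice the text would come from running the command on a schedule, with the result pushed to the monitoring system.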

Integration with Monitoring Systems:

Modern DNS servers export metrics in formats compatible with monitoring systems:

Prometheus — BIND exporter, Unbound exporter available
Grafana — Dashboard templates for DNS metrics
Datadog/New Relic — DNS integration plugins
SNMP — Traditional monitoring via SNMP MIBs

prometheus-dns-query.yaml
# Example Prometheus alerting rules for DNS cache
# Metric names are placeholders; substitute the names your exporter exposes
groups:
  - name: dns_cache_alerts
    rules:
      # Alert on low cache hit rate
      - alert: DNSCacheHitRateLow
        expr: dns_cache_hit_rate < 0.70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "DNS cache hit rate below 70%"
          description: "Cache hit rate at {{ $value | humanizePercentage }}"
      
      # Alert on high cache memory usage
      - alert: DNSCacheMemoryHigh
        expr: dns_cache_memory_bytes / dns_cache_memory_max_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS cache memory above 90%"
          
      # Alert on elevated SERVFAIL rate
      - alert: DNSServfailRateHigh
        expr: rate(dns_queries_servfail[5m]) / rate(dns_queries_total[5m]) > 0.01
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "DNS SERVFAIL rate above 1%"

Baseline Before Alerting

Establish baseline metrics for your environment before setting alert thresholds. Normal cache hit rates, query volumes, and latencies vary significantly between environments. A 75% hit rate might be critical for an ISP but normal for a small office with diverse browsing.
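
One way to derive a threshold from a baseline is to alert when the hit rate falls several standard deviations below its normal mean. The sketch below shows that approach with hypothetical hourly samples; the choice of three standard deviations is an assumption to tune, not a recommendation from any DNS server's documentation.

```python
from statistics import fmean, pstdev


def hit_rate_alert_threshold(samples: list[float], k: float = 3.0) -> float:
    """Alert threshold set k standard deviations below the baseline mean."""
    return fmean(samples) - k * pstdev(samples)


baseline = [0.90, 0.92, 0.88, 0.91, 0.89]  # hypothetical hourly hit rates
print(f"alert below {hit_rate_alert_threshold(baseline):.3f}")
```

A baseline collected over at least a full week captures weekday and weekend patterns, which keeps the threshold from firing on normal quiet periods.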

Advanced Cache Optimization

Beyond basic configuration, advanced techniques can further optimize DNS cache performance for demanding environments.

Prefetching (Cache Warming):

Prefetching proactively refreshes cache entries before they expire, eliminating the latency spike that occurs when a popular entry expires and must be re-resolved:

# Unbound prefetch configuration
server:
    # Enable prefetching of expiring entries
    prefetch: yes
    
    # Unbound prefetches when a cached entry is queried
    # with 10% or less of its TTL remaining

    # Also fetch DNSKEYs early during DNSSEC validation
    prefetch-key: yes

How Prefetching Works:

Entry cached with TTL=3600 seconds
At T=3240 (90% of TTL elapsed, 10% remaining)
Next query for that entry triggers background refresh
Client gets immediate cached response
Cache entry updated with fresh data before expiry
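
The steps above reduce to a single check on each cache hit. This is a simplified model of the decision, assuming the 10%-remaining trigger described here; it is not Unbound's actual implementation, and `should_prefetch` is a hypothetical name.

```python
def should_prefetch(ttl_s: int, age_s: int, remaining_pct: int = 10) -> bool:
    """True when a cache hit should also trigger a background refresh."""
    remaining = ttl_s - age_s
    if remaining <= 0:
        return False  # already expired: this is a normal cache miss
    # Integer comparison: remaining / ttl <= remaining_pct / 100
    return remaining * 100 <= ttl_s * remaining_pct


for age in (1800, 3239, 3240, 3599):
    print(age, should_prefetch(3600, age))
```

Because the refresh is triggered by a query, only entries that are still being requested get prefetched; unpopular entries simply expire.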

Serve-Stale (RFC 8767):

Serve-stale allows resolvers to return expired cache entries when authoritative servers are unreachable, improving availability:

# BIND serve-stale configuration
options {
    stale-answer-enable yes;
    stale-answer-ttl 30;           # Return stale with 30s TTL
    max-stale-ttl 86400;           # Serve stale up to 24 hours old
    stale-answer-client-timeout 1800;  # Wait 1.8s before serving stale
    stale-refresh-time 30;         # After a failed refresh, answer from stale data for 30s before retrying
};
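
The interaction of these settings can be sketched as a small decision function. This is a simplified model of serve-stale behavior, not BIND's algorithm: it ignores the client timeout and refresh window and only shows how `max-stale-ttl` and `stale-answer-ttl` bound what is served. All names are hypothetical.

```python
from typing import Optional


def answer_source(seconds_past_expiry: int, upstream_ok: bool,
                  max_stale_ttl: int = 86400,
                  stale_answer_ttl: int = 30) -> tuple[str, Optional[int]]:
    """Classify how a serve-stale resolver would answer one query."""
    if seconds_past_expiry < 0:
        return "cache", -seconds_past_expiry   # still fresh: remaining TTL
    if upstream_ok:
        return "upstream", None                # refreshed: TTL comes from the new answer
    if seconds_past_expiry <= max_stale_ttl:
        return "stale", stale_answer_ttl       # outage: expired data with a short TTL
    return "servfail", None                    # too old to serve


print(answer_source(600, upstream_ok=False))    # ('stale', 30)
print(answer_source(90000, upstream_ok=False))  # ('servfail', None)
```

The short TTL on stale answers keeps clients from holding outdated data long after the authoritative servers recover.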

Advanced Optimization Techniques

•QNAME minimization — Reduces information leaked in queries while maintaining cache efficiency; enabled by default in modern resolvers
•Aggressive NSEC caching — Caches DNSSEC denial-of-existence records to answer negative queries without upstream resolution
•Cache partitioning — Separate cache instances for different security zones (internal vs. external queries)
•Anycast caching — Deploy resolver instances geographically close to clients for reduced latency
•Negative cache tuning — Adjust SOA MINIMUM values to balance new subdomain responsiveness against NXDOMAIN attack protection

Cache Partitioning for Security:

Enterprise environments may benefit from separate caches for different purposes:

Cache Instance	Purpose	Client Population
Internal resolver	Corporate clients	Employee workstations
DMZ resolver	Externally-facing servers	Web servers, mail servers
Guest resolver	Guest network	Visitor devices
IoT resolver	IoT devices	Sensors, cameras, devices

Separation prevents cache poisoning in one zone from affecting others and allows zone-specific policies (e.g., IoT resolver might have stricter filtering).
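
Partitioning depends on a clear mapping from client networks to resolver instances. The sketch below shows one way to express that mapping; the subnets are hypothetical placeholders, and in production the same assignment is typically enforced through DHCP options or resolver access-control lists rather than application code.

```python
import ipaddress

# Hypothetical subnets; substitute the real address plan.
RESOLVER_BY_SUBNET = {
    "10.10.0.0/16": "internal",
    "10.20.0.0/24": "dmz",
    "10.30.0.0/22": "guest",
    "10.40.0.0/22": "iot",
}


def resolver_for(client_ip: str) -> str:
    """Return the cache instance that should serve a client address."""
    addr = ipaddress.ip_address(client_ip)
    for subnet, resolver in RESOLVER_BY_SUBNET.items():
        if addr in ipaddress.ip_network(subnet):
            return resolver
    raise LookupError(f"no resolver assigned for {client_ip}")


print(resolver_for("10.40.1.25"))  # iot
```

Failing loudly for unmapped addresses is deliberate: a client silently falling through to the wrong cache would undo the isolation.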

Performance Tuning:

# Unbound performance tuning for high-volume resolver
server:
    # Thread and socket settings
    num-threads: 4                 # Match CPU cores
    so-reuseport: yes              # Better multi-thread performance
    
    # Buffer sizes
    so-rcvbuf: 4m
    so-sndbuf: 4m
    
    # Cache slab count (power of 2, >= num-threads)
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    infra-cache-slabs: 4
    key-cache-slabs: 4
    
    # Outgoing query optimization
    outgoing-range: 8192           # Concurrent outgoing queries
    num-queries-per-thread: 4096

Measure Before Optimizing

Advanced optimizations provide marginal gains in most environments. Measure current performance, identify bottlenecks, and optimize specifically for observed issues. Default configurations are suitable for most deployments—optimize only when monitoring reveals genuine performance needs.

Troubleshooting Cache Issues

DNS cache issues manifest as resolution failures, stale data, or inconsistent behavior. Systematic troubleshooting isolates problems to specific cache layers.

Common Cache-Related Problems:

Cache Issue Diagnosis
Symptom	Possible Cause	Diagnostic Steps
'Works for some users, not others'	Different cache state per resolver	Query from multiple resolvers; compare results
'Was working, suddenly stopped'	Stale negative cache (NXDOMAIN)	Check for negative caching; new record may be masked
'Works after flush, fails again'	Authoritative returns incorrect data	Query authoritative directly; check zone configuration
'External works, internal fails'	Split-horizon DNS misconfiguration	Compare internal vs. external resolver responses
'Slow after DNS change'	Cache misses during propagation	Normal behavior; wait for cache rebuild

Systematic Troubleshooting Workflow:

# Step 1: Verify authoritative response (ground truth)
dig @ns1.example.com problematic.example.com A +short

# Step 2: Check each cache layer
# ISP/Public resolver
dig @8.8.8.8 problematic.example.com A +short
dig @1.1.1.1 problematic.example.com A +short

# Internal resolver (replace with your resolver IP)
dig @192.168.1.1 problematic.example.com A +short

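# Local resolver path (nslookup queries the configured DNS server directly,
# bypassing OS-level caches such as the Windows DNS Client cache and nscd)
nslookup problematic.example.com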

# Step 3: Compare TTL values at each layer
dig @ns1.example.com problematic.example.com A +noall +answer
dig @8.8.8.8 problematic.example.com A +noall +answer
# Authoritative shows original TTL; caches show remaining TTL

# Step 4: If stale, identify which layer is stale
# The layer with incorrect data AND high remaining TTL is the culprit

# Step 5: Flush specific layer if needed
# (Use appropriate flush command for identified layer)
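
Step 4 of the workflow can be expressed as a comparison over the collected answers. This sketch takes results already gathered with `dig` (the addresses and TTLs shown are hypothetical, from documentation ranges) and lists the layers that disagree with the authoritative answer, longest remaining TTL first.

```python
def stale_layers(authoritative: set[str],
                 layers: dict[str, tuple[set[str], int]]) -> list[tuple[str, int]]:
    """Layers whose answer differs from the authoritative one.

    `layers` maps a layer name to (answer set, remaining TTL in seconds).
    Results are sorted by remaining TTL, longest first.
    """
    stale = [(name, ttl) for name, (answers, ttl) in layers.items()
             if answers != authoritative]
    return sorted(stale, key=lambda item: item[1], reverse=True)


observed = {  # hypothetical dig results
    "8.8.8.8": ({"203.0.113.10"}, 120),
    "1.1.1.1": ({"198.51.100.7"}, 2950),
    "internal": ({"198.51.100.7"}, 310),
}
print(stale_layers({"203.0.113.10"}, observed))
# [('1.1.1.1', 2950), ('internal', 310)]
```

The first entry in the result is the layer that will stay wrong longest, which makes it the first candidate for a targeted flush.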

The 'Negative Cache' Trap:

A common issue: creating a new subdomain that was previously queried. If new.example.com was queried before the record existed, resolvers cache NXDOMAIN. After creating the record, resolvers continue returning NXDOMAIN until negative cache expires.

Diagnosis:

# Query a public resolver; on NXDOMAIN, the SOA in the authority section
# shows the remaining negative-cache TTL
dig @8.8.8.8 new.example.com A

# If NXDOMAIN returned, check authoritative
dig @ns1.example.com new.example.com A

# If authoritative has the record but public doesn't → negative cache
# Solution: wait for the negative TTL to expire
# (the lesser of the SOA record's TTL and its MINIMUM field)
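
How long the wait lasts follows from RFC 2308: the negative-cache TTL is the lesser of the SOA record's own TTL and its MINIMUM field, and a resolver may cap it further (for example with `max-ncache-ttl`). A minimal sketch of that calculation, with a hypothetical function name and a 3600-second resolver cap assumed as the default:

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int,
                       max_ncache_ttl: int = 3600) -> int:
    """Seconds a resolver may cache NXDOMAIN (RFC 2308), with a resolver cap."""
    return min(soa_ttl, soa_minimum, max_ncache_ttl)


# Zone with a 1-hour SOA TTL and a 300-second MINIMUM field.
print(negative_cache_ttl(soa_ttl=3600, soa_minimum=300))   # 300
# A low SOA TTL wins even when MINIMUM is large.
print(negative_cache_ttl(soa_ttl=120, soa_minimum=86400))  # 120
```

Checking this value before creating a new name tells you the longest a previously cached NXDOMAIN can mask the new record.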

Keep Troubleshooting Tools Ready

Maintain a list of resolver IPs (internal, ISP, public), authoritative server hostnames, and flush procedures for quick reference during incidents. DNS troubleshooting under pressure is easier with prepared reference materials.

Summary: DNS Cache Management Excellence

Effective DNS cache management ensures that caching achieves its intended benefits while minimizing operational issues. Proactive monitoring, clear policies, and systematic procedures enable reliable DNS operations.

Key takeaways from this page:

Key Takeaways

•Size caches appropriately — Base cache sizing on client population, query diversity, and observed usage patterns with headroom for growth.
•Develop explicit TTL policies — Document standard TTLs for each record type based on operational requirements and change frequency.
•Use targeted cache flushing — Flush specific records or zones when needed; avoid routine full cache flushes that defeat caching benefits.
•Monitor cache health metrics — Track hit rates, memory usage, and SERVFAIL rates; alert on anomalies indicating problems.
•Integrate DNS with change management — Pre-change TTL reduction, proper waiting periods, and post-change verification prevent propagation issues.
•Consider advanced optimizations — Prefetching, serve-stale, and cache partitioning provide additional performance and availability gains for demanding environments.
•Follow systematic troubleshooting — Compare responses across cache layers to isolate issues; understand negative caching behavior.
•Document procedures — Maintain runbooks with flush commands, resolver addresses, and escalation procedures for rapid incident response.

Module Complete:

You've now completed the comprehensive study of DNS Caching. You understand:

Why caching is essential — The scale and performance requirements that make caching mandatory
How TTL governs cache behavior — The freshness vs. efficiency trade-off administrators must balance
The complete cache hierarchy — From browser to ISP, each layer's role and characteristics
Cache poisoning threats — Attack mechanisms and the defenses that protect modern infrastructure
Operational cache management — Sizing, monitoring, change management, and troubleshooting

This knowledge enables you to design, operate, and troubleshoot DNS caching infrastructure at enterprise scale.

Module Complete

Congratulations! You've mastered DNS Caching—from understanding why caching is architecturally essential to operating enterprise cache infrastructure. You can now effectively manage TTL policies, monitor cache health, respond to poisoning threats, and troubleshoot cache-related issues in production environments.
