Effective DNS cache management is the operational discipline that ensures DNS caching achieves its intended benefits—reduced latency, lower query volumes, and improved resilience—while minimizing the drawbacks of stale data, security vulnerabilities, and resource exhaustion.
DNS cache management encompasses:
This page covers the operational practices that network administrators use to manage DNS caches across enterprise, ISP, and service provider environments.
By the end of this page, you will understand cache sizing and capacity planning, TTL strategy development, cache flushing techniques across platforms, monitoring and alerting for cache health, integration with change management processes, and advanced cache optimization techniques.
DNS cache capacity directly impacts performance and hit rates. Undersized caches evict entries prematurely, reducing hit rates and increasing upstream query load. Oversized caches waste memory with minimal marginal benefit.
Cache Size Determinants:
Sizing Guidelines:
| Environment | Client Count | Recommended Cache Size | Memory Estimate |
|---|---|---|---|
| Home network | 5-20 devices | 500-1000 entries | 1-2 MB |
| Small office | 50-100 devices | 5,000-10,000 entries | 10-20 MB |
| Medium enterprise | 1,000-5,000 devices | 50,000-200,000 entries | 50-200 MB |
| Large enterprise | 10,000+ devices | 500,000-2,000,000 entries | 500 MB-2 GB |
| ISP resolver | 100,000+ subscribers | 10,000,000+ entries | 10+ GB |
BIND Cache Configuration:
# named.conf options for cache management
options {
    # Maximum cache size (memory limit)
    max-cache-size 256M;      # Limit cache to 256 MB
    # Alternative: percentage of physical memory.
    # Set max-cache-size only once; use this line instead of the one above.
    # max-cache-size 50%;     # Use up to 50% of system memory
    # Maximum cache TTL (cap very high TTLs)
    max-cache-ttl 86400;      # 24 hours maximum
    # Maximum negative cache TTL
    max-ncache-ttl 3600;      # 1 hour for NXDOMAIN caching
    # Minimum cache TTL (floor for very low TTLs)
    min-cache-ttl 60;         # Don't cache less than 60 seconds
};
Unbound Cache Configuration:
# unbound.conf cache settings
server:
    # Cache memory size
    msg-cache-size: 128m
    rrset-cache-size: 256m
    # Number of cache slabs (performance tuning)
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    # Cache TTL limits
    cache-min-ttl: 60
    cache-max-ttl: 86400
    cache-max-negative-ttl: 3600
    # Prefetch expiring entries
    prefetch: yes
    prefetch-key: yes
Before setting cache sizes, monitor actual cache usage. Most DNS servers expose cache statistics showing current entries, hit rates, and memory usage. Size based on observed peak usage plus 20-50% headroom for growth and traffic spikes.
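The headroom rule above is simple arithmetic, and a minimal sketch may make it concrete. The function name `recommended_cache_bytes` and the 180 MiB peak are illustrative assumptions, not values from any DNS server; the 20-50% bounds come from the guideline above.

```python
MIB = 1024 * 1024

def recommended_cache_bytes(peak_bytes: int, headroom_percent: int = 35) -> int:
    """Return observed peak usage plus headroom, rounded up to a whole byte.

    headroom_percent is restricted to the 20-50% guideline from the text.
    """
    if peak_bytes < 0:
        raise ValueError("peak_bytes must be non-negative")
    if not 20 <= headroom_percent <= 50:
        raise ValueError("headroom_percent should be between 20 and 50")
    total = peak_bytes * (100 + headroom_percent)
    return (total + 99) // 100  # integer ceiling division

# Hypothetical observation: cache memory peaked at 180 MiB.
observed_peak = 180 * MIB
for percent in (20, 50):
    size = recommended_cache_bytes(observed_peak, percent)
    print(f"{percent}% headroom: {size // MIB} MiB")
```

The result can then be rounded to a convenient value for `max-cache-size` or `rrset-cache-size`.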
Organizations should develop explicit TTL policies rather than accepting default values. Well-designed TTL policies balance operational needs (failover speed, change agility) against infrastructure efficiency (cache hit rates, upstream load).
TTL Policy Framework:
| Record Category | Recommended TTL | Rationale | Change Procedure |
|---|---|---|---|
| NS records (delegation) | 86400-172800s (24-48h) | Rarely change; high cache value | Planned change window |
| MX records (mail) | 3600-86400s (1-24h) | Mail server changes uncommon | Coordinated with mail ops |
| A/AAAA (stable services) | 3600s (1h) | Balance of efficiency/agility | Standard change process |
| A/AAAA (cloud/dynamic) | 300s (5min) | IPs may change frequently | Automated/frequent changes |
| A/AAAA (active failover) | 60-120s (1-2min) | Fast failover required | Health-check driven |
| CNAME records | Match target TTL | Should align with target | Per target policy |
| TXT (verification) | 300-3600s | May need updates for validation | As needed |
| SOA MINIMUM (neg cache) | 300-600s (5-10min) | New subdomain responsiveness | Rarely changed |
Documenting TTL Policies:
Enterprise DNS documentation should include:
Example Zone File with Policy-Driven TTLs:
$TTL 3600 ; Default 1-hour TTL for most records
; SOA with 5-minute negative cache TTL
@ IN SOA ns1.example.com. admin.example.com. (
2024011701 7200 3600 1209600 300 )
; NS records: 24-hour TTL (stable infrastructure)
@ 86400 IN NS ns1.example.com.
@ 86400 IN NS ns2.example.com.
; MX records: 1-hour TTL (default)
@ IN MX 10 mail1.example.com.
@ IN MX 20 mail2.example.com.
; Primary website: 1-hour TTL (default)
www IN A 93.184.216.34
; API with active health checks: 2-minute TTL
api 120 IN A 10.0.1.100
api 120 IN A 10.0.1.101
; CDN edge: 1-minute TTL (geographic steering)
cdn 60 IN CNAME example.com.cdn.cloudflare.net.
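One way to keep a zone aligned with the policy table is to check record TTLs automatically. The following is a rough sketch under stated assumptions: the `POLICY` ranges are taken from the table above (the A/AAAA range merges the stable, cloud, and failover rows), the record tuples mirror the zone file, and the `legacy` record is a hypothetical violation added for illustration.

```python
# Allowed TTL ranges in seconds, per record type (from the policy table).
POLICY = {
    "NS": (86400, 172800),
    "MX": (3600, 86400),
    "A": (60, 3600),      # failover (60s) through stable services (3600s)
    "AAAA": (60, 3600),
}

def check_ttl(rtype, ttl):
    """Return None if the TTL fits the policy, else a description of the problem."""
    bounds = POLICY.get(rtype)
    if bounds is None:
        return None  # no policy defined for this record type
    low, high = bounds
    if low <= ttl <= high:
        return None
    return f"{rtype} TTL {ttl}s is outside the policy range {low}-{high}s"

# (name, type, effective TTL) as they appear in the example zone.
records = [
    ("@", "NS", 86400),
    ("@", "MX", 3600),
    ("www", "A", 3600),
    ("api", "A", 120),
    ("legacy", "A", 604800),  # hypothetical record with a one-week TTL
]

violations = [(name, msg) for name, rtype, ttl in records
              if (msg := check_ttl(rtype, ttl)) is not None]
for name, msg in violations:
    print(f"{name}: {msg}")
```

A check like this could run before a zone is published, so that TTLs outside policy are caught in review rather than in production.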
Before implementing TTL policies, test resolver behavior. Some ISP resolvers enforce TTL minimums; some browsers override TTLs. Monitor actual propagation times during test changes to validate that TTL policies achieve intended behavior.
Cache flushing removes entries from DNS caches before TTL expiry. Flushing is required when:
Flushing Scope Considerations:
| Scope | When to Use | Impact |
|---|---|---|
| Specific record | Single record correction | Minimal—only affects one entry |
| Zone flush | Zone-wide changes | Moderate—clears all zone entries |
| Full cache flush | Major incident response | Significant—temporary latency increase |
Platform-Specific Flushing Commands:
# ===== BIND DNS Server =====
# Flush entire cache
rndc flush
# Flush a single name (the name only, not its subdomains)
rndc flushname example.com
# Flush a name and everything beneath it
rndc flushtree example.com
# View cache statistics before/after
rndc stats   # Writes to named.stats

# ===== Unbound DNS Server =====
# Flush entire cache (everything at or below the root)
unbound-control flush_zone .
# Flush specific zone
unbound-control flush_zone example.com
# Flush specific name
unbound-control flush example.com
# Flush one record type for a name
unbound-control flush_type example.com A
# View cache statistics
unbound-control stats_noreset

# ===== Windows DNS Server =====
# PowerShell: Clear entire server cache
Clear-DnsServerCache
# Note: Clear-DnsServerCache operates on the whole cache (or a cache scope);
# it does not take a per-name parameter.
# View cached records
Show-DnsServerCache

# ===== Client-Side Flushing =====
# Windows client
ipconfig /flushdns
# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux (systemd-resolved)
resolvectl flush-caches
# Linux (nscd)
sudo nscd -i hosts

Flushing Public Resolvers:
Public DNS providers do not accept flush commands from outside operators, but some publish web-based tools that let anyone request a cache purge for a specific name and record type:
Google Public DNS:
Cloudflare DNS:
Automated Flushing Scripts:
#!/bin/bash
# Flush DNS caches on internal resolvers after a change
DOMAIN="$1"
if [ -z "$DOMAIN" ]; then
    echo "Usage: $0 <domain>" >&2
    exit 1
fi
echo "Flushing DNS caches for: $DOMAIN"
# Flush internal resolvers (BIND servers)
for server in dns1.corp.example.com dns2.corp.example.com; do
    echo "Flushing $server..."
    # flushtree clears the name and everything beneath it
    ssh "$server" "rndc flushtree $DOMAIN" 2>/dev/null || echo "  Failed: $server"
done
# Notify team
echo "Internal cache flush complete. External caches still expire by TTL."
echo "Verify: dig @8.8.8.8 $DOMAIN"
Frequent full cache flushing defeats the purpose of caching—it increases upstream query load and latency. Use targeted flushes (specific names or zones) whenever possible. Reserve full cache flushes for genuine incidents, not routine operations.
Effective cache management requires visibility into cache behavior. Monitoring DNS cache metrics enables capacity planning, performance optimization, and anomaly detection.
Key Metrics to Monitor:
| Metric | What It Measures | Healthy Range | Alert Threshold |
|---|---|---|---|
| Cache Hit Rate | % queries served from cache | 80-95% | < 70% |
| Cache Size (entries) | Number of cached records | Within max capacity | > 90% of capacity |
| Cache Memory Usage | Memory consumed by cache | Within allocation | > 85% of allocation |
| Query Rate | Queries per second | Within capacity | Sustained > 90% of capacity |
| Upstream Query Rate | Cache misses forwarded | ~5-20% of total | > 30% of total |
| SERVFAIL Rate | Failed resolutions | < 0.1% | > 1% |
| Average Latency | Response time | < 50 ms cached | > 100 ms sustained |
Extracting BIND Statistics:
# Generate statistics dump
rndc stats
# Parse cache statistics from named.stats
grep -A5 'Cache Statistics' /var/named/data/named.stats
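# Sample output (simplified illustration; real named.stats lists each counter
# as a value followed by its description, and the hit rate must be derived):
# ++ Cache Statistics ++
# cache hits = 1523847
# cache misses = 234521
# cache hit rate 86.66%   (hits / (hits + misses))
# CacheSize 45123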
# Enable query logging (verbose, use sparingly)
rndc querylog on
Extracting Unbound Statistics:
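# Get current statistics (this resets the counters unless statistics-cumulative
# is enabled; use stats_noreset to read them without resetting)
unbound-control stats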
# Key metrics in output:
# total.num.queries=1284753
# total.num.cachehits=1089428
# total.num.cachemiss=195325
# total.num.recursivereplies=195325
# mem.cache.rrset=98745632
# msg.cache.count=45231
# Calculate hit rate:
# hit_rate = cachehits / queries = 1089428/1284753 = 84.8%
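The hit-rate calculation above can be automated by parsing the `key=value` lines that `unbound-control stats` prints. This is a small sketch, not a full collector: the sample text reuses the figures shown above, and in practice the text would come from running the command (for example through `subprocess`), which is left out here.

```python
def parse_stats(text):
    """Parse key=value lines into a dict of floats, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            stats[key] = float(value)
        except ValueError:
            continue  # ignore non-numeric values
    return stats

def hit_rate(stats):
    """Return cache hits divided by total queries, or None if there were no queries."""
    queries = stats.get("total.num.queries", 0)
    if queries == 0:
        return None
    return stats.get("total.num.cachehits", 0) / queries

sample = """
total.num.queries=1284753
total.num.cachehits=1089428
total.num.cachemiss=195325
"""

rate = hit_rate(parse_stats(sample))
print(f"cache hit rate: {rate:.1%}")
```

Because `unbound-control stats` resets counters by default, a periodic collector would typically use `stats_noreset` and compute rates from the difference between successive samples.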
Integration with Monitoring Systems:
Modern DNS servers export metrics in formats compatible with monitoring systems:
# Example Prometheus alerting rules for DNS cache
groups:
  - name: dns_cache_alerts
    rules:
      # Alert on low cache hit rate
      - alert: DNSCacheHitRateLow
        expr: dns_cache_hit_rate < 0.70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "DNS cache hit rate below 70%"
          description: "Cache hit rate at {{ $value | humanizePercentage }}"
      # Alert on high cache memory usage
      - alert: DNSCacheMemoryHigh
        expr: dns_cache_memory_bytes / dns_cache_memory_max_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS cache memory above 90%"
      # Alert on elevated SERVFAIL rate
      - alert: DNSServfailRateHigh
        expr: rate(dns_queries_servfail[5m]) / rate(dns_queries_total[5m]) > 0.01
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "DNS SERVFAIL rate above 1%"

Establish baseline metrics for your environment before setting alert thresholds. Normal cache hit rates, query volumes, and latencies vary significantly between environments. A 75% hit rate might be critical for an ISP but normal for a small office with diverse browsing.
DNS changes must account for cache behavior. Poor change management leads to extended outages as stale caches direct traffic incorrectly. Integrating DNS cache considerations into change management prevents these issues.
DNS Change Lifecycle:
Change Management Checklist:
Pre-Change (48-72 hours before):
During Change:
Post-Change (after verification):
Rollback Considerations:
If a DNS change causes problems:
After lowering TTL, you must wait for the PREVIOUS TTL duration before the new TTL takes full effect. If original TTL was 86400 (24 hours), some resolvers may have cached with 24 hours remaining just before you lowered it. They won't query again until their cached entry expires.
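The waiting rule above translates into two timestamps that are worth computing before scheduling a change. The sketch below assumes resolvers honor the published TTLs (some enforce their own minimums, so real convergence can take longer); the function names and the dates are illustrative only.

```python
from datetime import datetime, timedelta

def earliest_safe_change(ttl_lowered_at, previous_ttl_seconds):
    """Time after which no resolver should still hold a record cached with the old TTL."""
    return ttl_lowered_at + timedelta(seconds=previous_ttl_seconds)

def worst_case_convergence(change_at, lowered_ttl_seconds):
    """Time by which caches that honor the lowered TTL should have picked up the change."""
    return change_at + timedelta(seconds=lowered_ttl_seconds)

# Hypothetical schedule: TTL lowered from 86400s to 300s on 15 Jan at 09:00.
lowered_at = datetime(2024, 1, 15, 9, 0)
change_at = earliest_safe_change(lowered_at, previous_ttl_seconds=86400)
done_at = worst_case_convergence(change_at, lowered_ttl_seconds=300)

print("Make the record change no earlier than:", change_at)
print("Expect caches to converge by:", done_at)
```

The same arithmetic applies to rollback: a rollback made while the lowered TTL is still in place should converge within that lowered TTL.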
Beyond basic configuration, advanced techniques can further optimize DNS cache performance for demanding environments.
Prefetching (Cache Warming):
Prefetching proactively refreshes cache entries before they expire, eliminating the latency spike that occurs when a popular entry expires and must be re-resolved:
# Unbound prefetch configuration
server:
    # Refresh popular entries shortly before they expire
    # (triggered by a query that arrives when about 10% of the TTL remains)
    prefetch: yes
    # Fetch DNSKEY records earlier during DNSSEC validation
    prefetch-key: yes
How Prefetching Works:
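As a rough illustration of the trigger condition described above, the sketch below models only the decision a resolver makes when a query hits a cached entry. It is a simplification under the stated 10% assumption, not Unbound's actual implementation, and `should_prefetch` is a hypothetical name.

```python
def should_prefetch(original_ttl, remaining_ttl, remaining_percent=10):
    """Return True when a cache hit should also trigger a background refresh.

    The entry must still be valid (remaining_ttl > 0) and have no more than
    remaining_percent of its original TTL left.
    """
    if remaining_ttl <= 0:
        return False  # already expired: this is a normal cache miss
    return remaining_ttl * 100 <= original_ttl * remaining_percent

# A record cached with a 300-second TTL:
for remaining in (200, 31, 30, 5, 0):
    action = "answer from cache and prefetch" if should_prefetch(300, remaining) \
        else "answer from cache only" if remaining > 0 else "expired, resolve normally"
    print(f"{remaining:>3}s remaining: {action}")
```

Because the refresh is triggered by a client query, only names that are actually being requested near expiry are refreshed; unpopular entries simply expire.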
Serve-Stale (RFC 8767):
Serve-stale allows resolvers to return expired cache entries when authoritative servers are unreachable, improving availability:
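# BIND serve-stale configuration
options {
    stale-cache-enable yes;              # Retain expired records (defaults to no on some BIND versions)
    stale-answer-enable yes;             # Allow stale answers to be returned
    stale-answer-ttl 30;                 # Return stale with 30s TTL
    max-stale-ttl 86400;                 # Serve stale up to 24 hours old
    stale-answer-client-timeout 1800;    # Milliseconds: wait 1.8s before serving stale
                                         # (accepted values differ across BIND releases)
    stale-refresh-time 30;               # After a failed refresh, serve stale for 30s before retrying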
};
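The decision a serve-stale resolver makes for each cached entry can be sketched as follows. This is a simplified model of the idea in RFC 8767, not BIND's implementation: it ignores the client timeout and refresh window, and the class, function, and status names are assumptions made for illustration. The two constants mirror `stale-answer-ttl` and `max-stale-ttl` from the configuration above.

```python
from dataclasses import dataclass

STALE_ANSWER_TTL = 30    # TTL placed on stale answers, in seconds
MAX_STALE_TTL = 86400    # how long past expiry an entry may still be served

@dataclass
class CachedAnswer:
    data: str
    expires_at: float  # epoch seconds at which the original TTL runs out

def resolve(entry, now, upstream_ok):
    """Return (status, data, ttl) for a lookup against one cached entry."""
    remaining = entry.expires_at - now
    if remaining > 0:
        return ("fresh", entry.data, int(remaining))
    if upstream_ok:
        return ("refresh", None, 0)  # expired: fetch a new answer upstream
    if -remaining <= MAX_STALE_TTL:
        return ("stale", entry.data, STALE_ANSWER_TTL)
    return ("fail", None, 0)  # too old to serve and upstream is unreachable

entry = CachedAnswer(data="192.0.2.10", expires_at=1000.0)
print(resolve(entry, now=900.0, upstream_ok=True))
print(resolve(entry, now=1060.0, upstream_ok=False))
print(resolve(entry, now=1000.0 + 86401, upstream_ok=False))
```

The short TTL on stale answers keeps downstream caches from holding expired data for long once the authoritative servers recover.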
Cache Partitioning for Security:
Enterprise environments may benefit from separate caches for different purposes:
| Cache Instance | Purpose | Client Population |
|---|---|---|
| Internal resolver | Corporate clients | Employee workstations |
| DMZ resolver | Externally-facing servers | Web servers, mail servers |
| Guest resolver | Guest network | Visitor devices |
| IoT resolver | IoT devices | Sensors, cameras, devices |
Separation prevents cache poisoning in one zone from affecting others and allows zone-specific policies (e.g., IoT resolver might have stricter filtering).
Performance Tuning:
# Unbound performance tuning for high-volume resolver
server:
    # Thread and socket settings
    num-threads: 4                # Match CPU cores
    so-reuseport: yes             # Better multi-thread performance
    # Buffer sizes
    so-rcvbuf: 4m
    so-sndbuf: 4m
    # Cache slab count (power of 2, >= num-threads)
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    infra-cache-slabs: 4
    key-cache-slabs: 4
    # Outgoing query optimization
    outgoing-range: 8192          # Concurrent outgoing queries
    num-queries-per-thread: 4096
Advanced optimizations provide marginal gains in most environments. Measure current performance, identify bottlenecks, and optimize specifically for observed issues. Default configurations are suitable for most deployments—optimize only when monitoring reveals genuine performance needs.
DNS cache issues manifest as resolution failures, stale data, or inconsistent behavior. Systematic troubleshooting isolates problems to specific cache layers.
Common Cache-Related Problems:
| Symptom | Possible Cause | Diagnostic Steps |
|---|---|---|
| 'Works for some users, not others' | Different cache state per resolver | Query from multiple resolvers; compare results |
| 'Was working, suddenly stopped' | Stale negative cache (NXDOMAIN) | Check for negative caching; new record may be masked |
| 'Works after flush, fails again' | Authoritative returns incorrect data | Query authoritative directly; check zone configuration |
| 'External works, internal fails' | Split-horizon DNS misconfiguration | Compare internal vs. external resolver responses |
| 'Slow after DNS change' | Cache misses during propagation | Normal behavior; wait for cache rebuild |
Systematic Troubleshooting Workflow:
# Step 1: Verify authoritative response (ground truth)
dig @ns1.example.com problematic.example.com A +short
# Step 2: Check each cache layer
# ISP/Public resolver
dig @8.8.8.8 problematic.example.com A +short
dig @1.1.1.1 problematic.example.com A +short
# Internal resolver (replace with your resolver IP)
dig @192.168.1.1 problematic.example.com A +short
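# System-configured resolver (nslookup queries that server directly and
# does not consult the local OS cache or hosts file)
nslookup problematic.example.com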
# Step 3: Compare TTL values at each layer
dig @ns1.example.com problematic.example.com A +noall +answer
dig @8.8.8.8 problematic.example.com A +noall +answer
# Authoritative shows original TTL; caches show remaining TTL
# Step 4: If stale, identify which layer is stale
# The layer with incorrect data AND high remaining TTL is the culprit
# Step 5: Flush specific layer if needed
# (Use appropriate flush command for identified layer)
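Step 4 of the workflow can be expressed as a small comparison once the answers and remaining TTLs from each layer have been collected. The sketch below works on data already gathered (for instance by reading `dig` output by hand); the layer names, addresses, and TTLs are made-up illustration values, and `find_stale_layers` is a hypothetical helper.

```python
def find_stale_layers(authoritative_answers, layers):
    """Return (layer, remaining_ttl) pairs whose answers differ from the authoritative set.

    layers maps a layer name to (set of answers, remaining TTL in seconds).
    Results are sorted by remaining TTL, longest first, since those layers
    will stay wrong the longest without a flush.
    """
    truth = frozenset(authoritative_answers)
    stale = [
        (name, remaining_ttl)
        for name, (answers, remaining_ttl) in layers.items()
        if frozenset(answers) != truth
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)

# Hypothetical observations after moving a service to 203.0.113.10.
observed = {
    "public 8.8.8.8": ({"203.0.113.10"}, 41),
    "public 1.1.1.1": ({"192.0.2.10"}, 3100),
    "internal 192.168.1.1": ({"192.0.2.10"}, 120),
}

for name, ttl in find_stale_layers({"203.0.113.10"}, observed):
    print(f"{name}: stale answer, {ttl}s until natural expiry")
```

Layers with only a short remaining TTL may be left to expire on their own, while a layer holding wrong data for many more minutes is the candidate for a targeted flush.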
The 'Negative Cache' Trap:
A common issue: creating a new subdomain that was previously queried. If new.example.com was queried before the record existed, resolvers cache NXDOMAIN. After creating the record, resolvers continue returning NXDOMAIN until negative cache expires.
Diagnosis:
# Query a public resolver; for a cached NXDOMAIN, the SOA record in the
# authority section shows the remaining negative-cache TTL
dig @8.8.8.8 new.example.com A
# If NXDOMAIN returned, check authoritative
dig @ns1.example.com new.example.com A
# If authoritative has the record but public doesn't → negative cache
# Solution: wait for SOA MINIMUM TTL to expire
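How long that wait lasts can be estimated in advance. Under RFC 2308, the negative-cache TTL is the lesser of the SOA record's own TTL and its MINIMUM field, and a resolver may cap it further (for example with `max-ncache-ttl`). The function below is a small sketch of that rule; the name `negative_cache_ttl` is an assumption for illustration.

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum, resolver_cap=None):
    """Return the longest time, in seconds, a resolver should cache an NXDOMAIN.

    soa_record_ttl: TTL of the SOA record itself
    soa_minimum:    the MINIMUM field (last value) in the SOA record
    resolver_cap:   optional resolver-side limit such as max-ncache-ttl
    """
    ttl = min(soa_record_ttl, soa_minimum)
    if resolver_cap is not None:
        ttl = min(ttl, resolver_cap)
    return ttl

# Values from the example zone: $TTL 3600 applies to the SOA, MINIMUM is 300.
print(negative_cache_ttl(3600, 300))                      # zone example
print(negative_cache_ttl(3600, 7200, resolver_cap=3600))  # large MINIMUM, capped
```

A low MINIMUM therefore shortens the window in which a newly created name can be masked by a cached NXDOMAIN.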
Maintain a list of resolver IPs (internal, ISP, public), authoritative server hostnames, and flush procedures for quick reference during incidents. DNS troubleshooting under pressure is easier with prepared reference materials.
Effective DNS cache management ensures that caching achieves its intended benefits while minimizing operational issues. Proactive monitoring, clear policies, and systematic procedures enable reliable DNS operations.
Key takeaways from this page:
Module Complete:
You've now completed the comprehensive study of DNS Caching. You understand:
This knowledge enables you to design, operate, and troubleshoot DNS caching infrastructure at enterprise scale.
Congratulations! You've mastered DNS Caching—from understanding why caching is architecturally essential to operating enterprise cache infrastructure. You can now effectively manage TTL policies, monitor cache health, respond to poisoning threats, and troubleshoot cache-related issues in production environments.