The network is the connective tissue of every distributed system. It's also the component most likely to behave unexpectedly in production. Unlike local function calls—which either succeed quickly or fail immediately with a clear exception—network calls exist in a probabilistic realm where packets can be delayed, duplicated, reordered, corrupted, or dropped entirely. Connections can hang indefinitely. DNS can return stale data. Routers can silently blackhole traffic.
Peter Deutsch's famous observation encapsulates this reality: "The network is reliable" is the first and perhaps most dangerous of the fallacies of distributed computing. Engineers who design on the assumption of a reliable network build systems that fail catastrophically when that assumption breaks, as it inevitably does.
Network failure injection forces your system to confront the harsh reality of network behavior. By deliberately introducing the very conditions that production networks will eventually impose, you discover weaknesses before they become outages.
By the end of this page, you will understand how to inject five major categories of network failures: latency, packet loss, network partitions, DNS failures, and bandwidth constraints. For each, you'll learn the injection mechanism, observable system effects, what the experiment reveals, and practical implementation approaches using industry-standard tools.
Latency injection adds artificial delay to network communications. This is arguably the most important network failure to test because latency is more dangerous than complete failure. A dead service fails fast—your retry logic kicks in, your circuit breaker triggers, your fallback activates. A slow service ties up resources for extended periods, consuming threads, connections, memory, and patience.
Why Latency Is Insidious:
Consider a service that normally responds in 50ms. If it dies completely, your timeout fires after (say) 5 seconds, and you fail the request or try a fallback. But if that service responds in 4.9 seconds instead—just under your timeout—you've waited 100x longer and still gotten a response. Your system spends 100x more resources on each request, and the experience for users is terrible.
Now multiply this across many concurrent requests. Threads block. Connection pools exhaust. Request queues back up. Memory fills with pending requests. The slow dependency doesn't just hurt itself—it propagates slowness upstream to every service that depends on it.
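The resource cost of slowness follows directly from Little's law: average in-flight requests equal arrival rate times average latency. A quick sketch of the arithmetic, using illustrative traffic numbers (not measurements from any real system):

```python
# Little's law: average in-flight requests = arrival rate * average latency.
# Numbers below are illustrative -- substitute your own traffic figures.

def in_flight(rate_per_sec: float, latency_sec: float) -> float:
    """Average number of concurrent requests held open at steady state."""
    return rate_per_sec * latency_sec

rate = 100.0                        # requests per second (assumed workload)
healthy = in_flight(rate, 0.050)    # 50 ms responses
degraded = in_flight(rate, 4.9)     # 4.9 s responses, just under a 5 s timeout

print(f"healthy: {healthy:.0f} in flight; degraded: {degraded:.0f} in flight")
# A thread pool comfortable holding ~5 concurrent requests must suddenly
# hold ~490 -- which is how latency exhausts pools long before timeouts fire.
```

The same arithmetic predicts which pools exhaust first: any resource sized for healthy latency is undersized by the same multiple that latency grows.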
```bash
# Using Linux tc (traffic control) to add 200ms latency with 50ms jitter
# This affects all traffic on the specified network interface

# Add latency to outgoing traffic
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal

# Verify the rule is applied
tc qdisc show dev eth0

# Remove the latency rule
sudo tc qdisc del dev eth0 root

# Add latency to specific traffic (e.g., traffic to port 5432 for PostgreSQL)
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 5432 0xffff flowid 1:3
```

| Observation | What It Indicates | Healthy Response |
|---|---|---|
| Request latency increases proportionally | Latency propagates through system | Expected baseline behavior |
| Thread pools exhaust | Not enough threads for slow requests | Increase pool size or add request shedding |
| Connection pools exhaust | Connections held too long | Implement connection timeouts |
| Memory usage increases | Pending requests accumulate | Implement backpressure mechanisms |
| Circuit breaker opens | Latency exceeded thresholds | Good—protection mechanism working |
| User requests timeout | Latency exceeds client timeout | Adjust timeouts or add deadline propagation |
| Cascading slowness to upstream services | Caller waits for callee | Implement timeout budgets |
Latency injection often reveals that services don't coordinate timeouts. Service A waits 30 seconds for Service B, which waits 30 seconds for Service C. A slow Service C causes 30-second waits everywhere. The solution is deadline propagation—passing a 'deadline' through the call chain so each service knows how much time remains and can fail fast when the deadline is impossible to meet.
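Deadline propagation can be sketched in a few lines: callers pass an absolute deadline down the chain, and each hop computes the remaining budget and fails fast when it is too small. All names here are illustrative, not the API of any particular framework:

```python
import time

# Hypothetical sketch of deadline propagation: instead of each hop applying
# its own full 30 s timeout, the caller's absolute deadline travels with the
# request, so downstream services know how much time actually remains.

class DeadlineExceeded(Exception):
    pass

def remaining(deadline: float) -> float:
    """Seconds left before the absolute deadline (monotonic clock)."""
    return deadline - time.monotonic()

def call_downstream(deadline: float, min_needed: float = 0.01) -> str:
    budget = remaining(deadline)
    if budget < min_needed:
        # Fail fast: don't start work that cannot finish in time.
        raise DeadlineExceeded(f"only {budget:.3f}s of budget left")
    # A real client would use `budget` as the request timeout and forward
    # the deadline, e.g. in a header or RPC metadata field.
    return f"ok (budget {budget:.2f}s)"

deadline = time.monotonic() + 2.0   # caller allows 2 s end to end
print(call_downstream(deadline))    # ample budget, so the call proceeds
```

In practice the deadline is serialized into request metadata (gRPC does this natively; HTTP services often use a custom header), so every hop in the chain shares one budget instead of stacking independent timeouts.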
Packet loss occurs when network data fails to reach its destination. This happens constantly in real networks—router buffers overflow, cables experience interference, wireless signals degrade, network equipment fails. TCP handles most packet loss transparently through retransmission, but this has costs: increased latency, reduced throughput, and eventually, connection failures if loss is severe enough.
How Packet Loss Manifests:
At low rates (1-5%), packet loss typically manifests as increased latency because TCP retransmit timers fire and data must be resent. At moderate rates (5-20%), connections become sluggish as TCP's congestion control algorithm reduces throughput. At high rates (20%+), connections may fail entirely as retransmission limits are exceeded.
Packet loss also affects different protocols differently. TCP handles packet loss (eventually), but UDP doesn't—lost UDP packets are simply gone. Application-level protocols may have their own retry mechanisms that interact with network-level behavior in complex ways.
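The throughput cost of sustained random loss can be roughly estimated with the Mathis approximation for TCP, throughput ≈ (MSS/RTT) × (C/√p) with C ≈ 1.22. This is a back-of-envelope model under idealized assumptions, not a substitute for measuring your own links:

```python
import math

# Mathis et al. approximation of steady-state TCP throughput under random
# loss: throughput <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ~ 1.22.
# A coarse model: it ignores timeouts, small windows, and modern congestion
# control variants, but captures the 1/sqrt(p) collapse.

def tcp_throughput_mbps(mss_bytes: int, rtt_sec: float, loss: float) -> float:
    c = math.sqrt(3.0 / 2.0)
    bytes_per_sec = (mss_bytes / rtt_sec) * (c / math.sqrt(loss))
    return bytes_per_sec * 8 / 1_000_000

# 1460-byte MSS over a 50 ms RTT path:
for p in (0.0001, 0.01, 0.05):
    print(f"loss {p:>7.2%}: ~{tcp_throughput_mbps(1460, 0.050, p):.1f} Mbit/s")
```

The 1/√p term explains why the 5% loss injected in the example below feels so much worse than "5% of requests retransmit": going from 0.01% to 1% loss cuts the achievable throughput of a single connection by a factor of ten.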
```bash
# Using Linux tc netem to inject packet loss

# Random 5% packet loss
sudo tc qdisc add dev eth0 root netem loss 5%

# Correlated packet loss (25% correlation with previous loss)
sudo tc qdisc add dev eth0 root netem loss 5% 25%

# Bursty packet loss using loss state model (Gilbert-Elliott)
# Good state: 0.1% loss, Bad state: 50% loss
# Probability of transitioning Good->Bad: 20%, Bad->Good: 80%
sudo tc qdisc add dev eth0 root netem loss state 0.1% 50% 20% 80%

# Combine with latency for more realistic simulation
sudo tc qdisc add dev eth0 root netem delay 50ms 10ms loss 2%

# Verify active rules
tc qdisc show dev eth0

# Remove rules
sudo tc qdisc del dev eth0 root
```

Observable Effects and Analysis:
During packet loss experiments, monitor TCP retransmission counts, request latency distributions, connection error rates, effective throughput, and application-level retry counts.
What Packet Loss Reveals:
| Finding | Implication | Remediation |
|---|---|---|
| Requests succeed but latency spikes | TCP retransmits working | Consider if latency is acceptable |
| Connections timeout entirely | Loss rate exceeds TCP tolerance | Improve network quality or add redundant paths |
| Some services fail, others don't | Different timeout configurations | Standardize timeout settings |
| Application retries cause duplicates | Non-idempotent operations | Implement idempotency keys |
| Uneven degradation across service instances | Some network paths worse than others | Review network topology |
When packet loss causes requests to fail, application-level retry logic may fire. If retries are too aggressive, they can amplify load during the very conditions where the network is already struggling. Monitor for retry storms during packet loss experiments: a sign that retry backoff is missing or too short.
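A common remediation is capped exponential backoff with jitter, so retries from many clients spread out instead of arriving in synchronized waves. A minimal sketch of the "full jitter" variant (parameter values are illustrative):

```python
import random

# Sketch of capped exponential backoff with "full jitter": each retry
# sleeps a random duration drawn from [0, min(cap, base * 2**attempt)].
# The randomness prevents clients that failed together from retrying
# together -- the synchronized waves that make a retry storm.

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Example schedule for one client (random, so chosen values vary per run):
for attempt in range(5):
    ceiling = min(10.0, 0.1 * (2 ** attempt))
    print(f"attempt {attempt}: sleep up to {ceiling:.1f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

Pair this with a retry budget (a cap on total retries per unit time) so that even well-jittered retries cannot multiply load indefinitely while the network is degraded.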
A network partition occurs when network failures divide a system into isolated segments that cannot communicate with each other but may still be individually functional. Partitions are the most challenging network failure to handle correctly because they create situations where different parts of your system make conflicting decisions based on incomplete information.
The Fundamental Challenge:
During a partition, isolated segments cannot know whether other segments are down or simply unreachable. This ambiguity creates impossible choices:
If a primary database becomes unreachable, should a replica promote itself to primary? If yes, you risk split-brain with two primaries accepting conflicting writes. If no, you risk unavailability while the primary might be perfectly healthy but network-isolated.
If a service instance loses contact with its load balancer, should it continue serving requests from clients that can reach it directly? Or should it shut down to avoid inconsistent behavior?
The CAP theorem formalizes this dilemma: during partitions, you must choose between consistency (all nodes see the same data) and availability (all requests receive a response). Different systems make different tradeoffs, and partition injection reveals which tradeoff your system actually implements versus which tradeoff you think it implements.
```bash
# Using iptables to create network partitions

# Block all traffic from a specific IP (simulating partition from that host)
sudo iptables -A INPUT -s 192.168.1.100 -j DROP
sudo iptables -A OUTPUT -d 192.168.1.100 -j DROP

# Create partition between two network segments
# (run on nodes in segment A to isolate from segment B)
sudo iptables -A INPUT -s 192.168.2.0/24 -j DROP
sudo iptables -A OUTPUT -d 192.168.2.0/24 -j DROP

# Create asymmetric partition (block incoming but allow outgoing)
sudo iptables -A INPUT -s 192.168.1.100 -j DROP
# Note: outgoing traffic is still allowed

# Block traffic to specific port (partition service-level, not host-level)
sudo iptables -A OUTPUT -p tcp --dport 5432 -j DROP

# List current rules
sudo iptables -L -n -v

# Remove all rules (heal the partition)
sudo iptables -F

# Using timeout to auto-heal partition after 30 seconds
timeout 30s bash -c '
  iptables -A INPUT -s 192.168.1.100 -j DROP
  iptables -A OUTPUT -d 192.168.1.100 -j DROP
  sleep 30'
iptables -D INPUT -s 192.168.1.100 -j DROP
iptables -D OUTPUT -d 192.168.1.100 -j DROP
```

| Behavior During Partition | Interpretation | Risk Level |
|---|---|---|
| System becomes completely unavailable | Chose consistency over availability (CP) | Acceptable if consistency is critical |
| System continues serving requests with stale data | Chose availability over consistency (AP) | Acceptable if staleness is tolerable |
| Split-brain: conflicting writes accepted by different nodes | Consistency violation—data corruption risk | Critical: must fix before production |
| Some operations work, others don't | Partial consistency based on operation type | Evaluate which operations are affected |
| System heals cleanly when partition resolves | Good recovery behavior | Expected for well-designed systems |
| Data loss or corruption after partition heals | Conflict resolution is broken | Critical: reconciliation logic needed |
If your partition experiment reveals split-brain behavior—where multiple nodes believe they are the leader or multiple copies of data can be independently modified—this is almost always a critical bug. Split-brain leads to data corruption that can be extremely difficult to detect and repair. Prioritize fixing split-brain scenarios before any other chaos engineering work.
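The standard defense against split-brain is majority quorum: a node may act as leader only while it can reach a strict majority of the cluster, so at most one side of any partition can lead. A minimal sketch of the rule (real systems implement this with heartbeats and a consensus protocol such as Raft; peer reachability is just a number here):

```python
# Minimal quorum sketch: a node keeps leadership only while it can reach
# a strict majority of the cluster (counting itself). During a partition,
# at most one side can hold a majority, which rules out two leaders.

def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """reachable_peers counts *other* nodes this node can currently reach."""
    votes = reachable_peers + 1           # this node counts itself
    return votes > cluster_size // 2      # strict majority required

# 5-node cluster partitioned 3 / 2:
assert has_quorum(reachable_peers=2, cluster_size=5)       # majority side leads
assert not has_quorum(reachable_peers=1, cluster_size=5)   # minority side steps down

# 4-node cluster split 2 / 2: neither side has quorum, so no leader at all --
# one reason odd cluster sizes are preferred.
assert not has_quorum(reachable_peers=1, cluster_size=4)
```

Note the deliberate tradeoff: the minority side chooses unavailability (it stops accepting writes) to preserve consistency, which is exactly the CP behavior described above.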
DNS (Domain Name System) is the first step in almost every network connection—your services must resolve hostnames to IP addresses before they can connect. DNS failures are particularly dangerous because they affect all new connections while often being invisible to monitoring that tracks individual services.
The Hidden DNS Dependency:
Every time your code opens a connection to database.internal, api.partner.com, or s3.amazonaws.com, DNS resolution happens first. This resolution typically uses cached results, but caches expire. When they do, fresh DNS queries must succeed, or connections fail.
DNS failures can manifest in multiple ways: resolution that fails outright, resolution that is extremely slow, answers that point to wrong or stale IP addresses, and records that disappear entirely (NXDOMAIN).
```bash
# Block DNS traffic (port 53) to simulate DNS server unavailability
sudo iptables -A OUTPUT -p udp --dport 53 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 53 -j DROP

# Add latency to DNS queries specifically
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 5000ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 53 0xffff flowid 1:3

# Using dnsmasq to return controlled DNS responses
# Install dnsmasq and configure /etc/dnsmasq.conf:
# address=/api.example.com/127.0.0.1   # Wrong IP
# address=/database.internal/          # NXDOMAIN (empty)

# Flush DNS cache to force re-resolution (Linux systemd)
sudo systemd-resolve --flush-caches

# Force DNS cache expiration by manipulating TTL
# This requires a DNS proxy that can modify TTL values

# Verify DNS behavior
dig @8.8.8.8 api.example.com
nslookup api.example.com
host api.example.com
```

Observable Effects and Their Meaning:
| Observation | What It Indicates | Remediation |
|---|---|---|
| Existing connections work, new connections fail | DNS affects only new connections | Extend connection keep-alive and pooling |
| All traffic fails immediately | No DNS caching, or very short TTLs | Implement local DNS caching |
| Gradual failure as cache expires | DNS caching working but has limits | Consider longer cache TTL or fallback DNS |
| Connections to IP addresses still work | Services can work with IPs | Consider IP-based fallbacks for critical paths |
| Service discovery fails | DNS-based service discovery affected | Evaluate alternative service discovery |
| SSL/TLS failures after DNS change | Resolved host presents a certificate for the wrong name | Verify certificates cover every hostname and IP in use |
DNS failure experiments often reveal that DNS caching strategy wasn't deliberately designed. Some services cache aggressively (resilient but slow to pick up changes), others don't cache at all (responsive but fragile). The right strategy depends on your tradeoff between resilience and responsiveness—but it should be an explicit choice, not an accident.
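A deliberate caching strategy can be as simple as a TTL-bounded cache that serves the last known-good answer when fresh resolution fails. A sketch of that policy (the resolver callable is injected, e.g. a wrapper around `socket.getaddrinfo`; class and parameter names are illustrative):

```python
import time

# Sketch of a deliberate DNS caching policy: honor the TTL while the
# entry is fresh, but fall back to the last known-good answer if
# re-resolution fails. `resolve` is an injected callable that returns
# an IP string or raises OSError (socket.gaierror subclasses OSError).

class DnsCache:
    def __init__(self, resolve, ttl: float = 30.0, clock=time.monotonic):
        self._resolve, self._ttl, self._clock = resolve, ttl, clock
        self._entries = {}   # hostname -> (ip, fetched_at)

    def lookup(self, host: str) -> str:
        entry = self._entries.get(host)
        if entry and self._clock() - entry[1] < self._ttl:
            return entry[0]              # fresh: serve from cache
        try:
            ip = self._resolve(host)
            self._entries[host] = (ip, self._clock())
            return ip
        except OSError:
            if entry:                    # resolution failed: serve stale
                return entry[0]
            raise                        # nothing cached to fall back to
```

Serving stale answers trades correctness for availability during DNS outages; whether that is acceptable depends on how often the addresses behind a hostname actually change, which is exactly the explicit choice this section argues for.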
Bandwidth throttling limits the rate at which data can flow between systems. While modern data centers typically have abundant bandwidth, throttling experiments are valuable for testing constrained paths: cross-region links, connections to third-party services, mobile and cellular clients, and bulk transfers such as backups or replication.
```bash
# Using tc to limit bandwidth

# Limit outgoing bandwidth to 1 Mbit/s with 32kbit queue
sudo tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms

# Limit bandwidth with realistic latency for cross-region simulation
# (10 Mbit/s with 50ms latency simulates distant datacenter)
sudo tc qdisc add dev eth0 root handle 1: netem delay 50ms
sudo tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 10mbit burst 32kbit latency 400ms

# Limit bandwidth to specific destination
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: tbf rate 1mbit burst 32kbit latency 400ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dst 192.168.1.100 flowid 1:3

# Simulate cellular network (1 Mbit/s, variable latency, some loss)
sudo tc qdisc add dev eth0 root netem delay 100ms 50ms loss 1% rate 1mbit

# View current throttling rules
tc qdisc show dev eth0
tc class show dev eth0
tc filter show dev eth0

# Remove throttling
sudo tc qdisc del dev eth0 root
```

Most bandwidth issues in modern systems involve cross-region communication. The link between your Kubernetes cluster and a third-party API, the connection between primary and replica databases in different regions, the traffic between your CDN and origin servers—these paths often have lower bandwidth than internal datacenter links and deserve specific testing.
Real-world network degradation rarely involves just one type of failure. A congested network path typically exhibits increased latency, higher packet loss, and reduced effective bandwidth simultaneously. A failing network switch might partition some traffic while introducing latency to other traffic. Realistic chaos experiments combine multiple network failure types to model these compound scenarios.
| Scenario | Failure Combination | Real-World Cause |
|---|---|---|
| Congested link | Latency + packet loss + reduced bandwidth | Router buffer overflow during traffic spike |
| Flaky connection | Intermittent partitions + high latency | Failing network cable or port |
| DNS storm | DNS latency + service partition | DNS server overload during incident |
| WAN degradation | High latency + bandwidth limits + low packet loss | Cross-region link saturation |
| Cloud provider issue | Random subset of IPs partitioned | Partial availability zone failure |
| Cellular network | Variable latency + packet loss + bandwidth limits | Mobile user experience |
```bash
# Simulate degraded WAN link to another datacenter
# 50ms latency, 25ms jitter, 1% packet loss, 10Mbit bandwidth limit

sudo tc qdisc add dev eth0 root handle 1: netem \
  delay 50ms 25ms distribution normal \
  loss 1%

sudo tc qdisc add dev eth0 parent 1:1 handle 10: tbf \
  rate 10mbit burst 32kbit latency 400ms

# Simulate intermittent connectivity (flaky link)
# Alternate between working (3s) and partitioned (1s)
while true; do
  iptables -D OUTPUT -d 192.168.1.100 -j DROP 2>/dev/null
  sleep 3
  iptables -A OUTPUT -d 192.168.1.100 -j DROP
  sleep 1
done
```

Best Practices for Combined Failure Experiments:
Start simple, add complexity — Begin with single failure types to establish baseline behavior, then combine them.
Model real incidents — Base combined scenarios on past incidents you've experienced or read about in postmortems.
Vary intensity — Test combinations at different severity levels (e.g., 1% loss + 50ms latency vs. 10% loss + 500ms latency).
Test recovery — Combined failures may heal in different orders. Test how systems behave as individual components recover.
Monitor holistically — Combined failures may cause unexpected interactions. Monitor all system components, not just those directly affected.
Document precisely — Record exact parameters so experiments can be reproduced and compared over time.
While tc and iptables provide low-level control for Linux-based network failure injection, several higher-level tools offer more convenient interfaces and additional capabilities:
| Tool | Scope | Key Capabilities | Best For |
|---|---|---|---|
| tc/netem | Linux host | Latency, loss, bandwidth, reordering, duplication | Low-level control, specific scenarios |
| iptables/nftables | Linux host | Packet filtering, blocking, marking | Partitions, firewall-level failures |
| Chaos Mesh | Kubernetes | Pod-level network chaos, DNS chaos, HTTP chaos | K8s environments, automated experiments |
| Gremlin | Multi-platform | Full network failure suite with UI and automation | Enterprise, regulated environments |
| Toxiproxy | Application proxy | TCP-level manipulation, client library integration | Application-layer testing, development |
| Comcast | Go-based tool | Simple CLI for network failures | Quick testing, CI/CD integration |
| Pumba | Docker | Container-level network chaos | Docker environments without K8s |
```bash
# Toxiproxy example - proxy that can inject failures
# Useful for testing against services you can route through it

# Create a proxy for a database connection
toxiproxy-cli create postgres_proxy -l localhost:25432 -u database.internal:5432

# Add 200ms latency toxic
toxiproxy-cli toxic add postgres_proxy -t latency -a latency=200

# Add 10% packet loss (timeout toxic approximates this)
toxiproxy-cli toxic add postgres_proxy -t timeout -a timeout=10000

# List active toxics
toxiproxy-cli toxic list postgres_proxy

# Remove a toxic by name (toxics are auto-named <type>_<stream>)
toxiproxy-cli toxic remove postgres_proxy -n latency_downstream
```

In Kubernetes environments, Chaos Mesh or LitmusChaos provide native integration with pod networking. For testing during development, Toxiproxy is lightweight and easy to integrate. For enterprise environments requiring audit trails and approval workflows, Gremlin provides the governance features needed for production chaos engineering.
Network failure injection is foundational to chaos engineering. The network is simultaneously the most critical and least reliable component of distributed systems, making network resilience testing essential for any system that aims to operate reliably at scale.
What's Next:
With network failures covered, we'll now examine Service Failures—the application-layer failures that occur when the services your system depends on crash, return errors, or behave unexpectedly. Service failure injection tests your error handling, retry logic, circuit breakers, and graceful degradation capabilities.
You now understand how to inject and analyze network failures—latency, packet loss, partitions, DNS failures, and bandwidth throttling. These techniques will reveal weaknesses in your distributed system's network resilience. Next, we'll explore service-level failure injection.