The network is the connective tissue of every distributed system. It's also the component most likely to behave unexpectedly in production. Unlike local function calls—which either succeed quickly or fail immediately with a clear exception—network calls exist in a probabilistic realm where packets can be delayed, duplicated, reordered, corrupted, or dropped entirely. Connections can hang indefinitely. DNS can return stale data. Routers can silently blackhole traffic.
Peter Deutsch's famous observation encapsulates this reality: "The network is reliable" is the first and perhaps most dangerous of the fallacies of distributed computing. Engineers who design on the assumption of a reliable network build systems that fail catastrophically when that assumption breaks, as it inevitably does.
Network failure injection forces your system to confront the harsh reality of network behavior. By deliberately introducing the very conditions that production networks will eventually impose, you discover weaknesses before they become outages.
By the end of this page, you will understand how to inject five major categories of network failures: latency, packet loss, network partitions, DNS failures, and bandwidth constraints. For each, you'll learn the injection mechanism, observable system effects, what the experiment reveals, and practical implementation approaches using industry-standard tools.
Latency injection adds artificial delay to network communications. This is arguably the most important network failure to test because latency is more dangerous than complete failure. A dead service fails fast—your retry logic kicks in, your circuit breaker triggers, your fallback activates. A slow service ties up resources for extended periods, consuming threads, connections, memory, and patience.
Why Latency Is Insidious:
Consider a service that normally responds in 50ms. If it dies completely, your timeout fires after (say) 5 seconds, and you fail the request or try a fallback. But if that service responds in 4.9 seconds instead—just under your timeout—you've waited 100x longer and still gotten a response. Your system spends 100x more resources on each request, and the experience for users is terrible.
Now multiply this across many concurrent requests. Threads block. Connection pools exhaust. Request queues back up. Memory fills with pending requests. The slow dependency doesn't just hurt itself—it propagates slowness upstream to every service that depends on it.
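The resource cost of slowness follows directly from Little's law: average in-flight requests equal arrival rate times average latency. A quick sketch of the arithmetic, using illustrative traffic numbers (not measurements from any real system):

```python
# Little's law: average in-flight requests = arrival rate * average latency.
# Numbers below are illustrative -- substitute your own traffic figures.

def in_flight(rate_per_sec: float, latency_sec: float) -> float:
    """Average number of concurrent requests held open at steady state."""
    return rate_per_sec * latency_sec

rate = 100.0                        # requests per second (assumed workload)
healthy = in_flight(rate, 0.050)    # 50 ms responses
degraded = in_flight(rate, 4.9)     # 4.9 s responses, just under a 5 s timeout

print(f"healthy: {healthy:.0f} in flight; degraded: {degraded:.0f} in flight")
# A thread pool comfortable holding ~5 concurrent requests must suddenly
# hold ~490 -- which is how latency exhausts pools long before timeouts fire.
```

The same arithmetic predicts which pools exhaust first: any resource sized for healthy latency is undersized by the same multiple that latency grows.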
```bash
# Using Linux tc (traffic control) to add 200ms latency with 50ms jitter
# This affects all traffic on the specified network interface

# Add latency to outgoing traffic
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal

# Verify the rule is applied
tc qdisc show dev eth0

# Remove the latency rule
sudo tc qdisc del dev eth0 root

# Add latency to specific traffic (e.g., traffic to port 5432 for PostgreSQL)
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 500ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 5432 0xffff flowid 1:3
```

| Observation | What It Indicates | Healthy Response |
|---|---|---|
| Request latency increases proportionally | Latency propagates through system | Expected baseline behavior |
| Thread pools exhaust | Not enough threads for slow requests | Increase pool size or add request shedding |
| Connection pools exhaust | Connections held too long | Implement connection timeouts |
| Memory usage increases | Pending requests accumulate | Implement backpressure mechanisms |
| Circuit breaker opens | Latency exceeded thresholds | Good—protection mechanism working |
| User requests timeout | Latency exceeds client timeout | Adjust timeouts or add deadline propagation |
| Cascading slowness to upstream services | Caller waits for callee | Implement timeout budgets |
Latency injection often reveals that services don't coordinate timeouts. Service A waits 30 seconds for Service B, which waits 30 seconds for Service C. A slow Service C causes 30-second waits everywhere. The solution is deadline propagation—passing a 'deadline' through the call chain so each service knows how much time remains and can fail fast when the deadline is impossible to meet.
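Deadline propagation can be sketched in a few lines: callers pass an absolute deadline down the chain, and each hop computes the remaining budget and fails fast when it is too small. All names here are illustrative, not the API of any particular framework:

```python
import time

# Hypothetical sketch of deadline propagation: instead of each hop applying
# its own full 30 s timeout, the caller's absolute deadline travels with the
# request, so downstream services know how much time actually remains.

class DeadlineExceeded(Exception):
    pass

def remaining(deadline: float) -> float:
    """Seconds left before the absolute deadline (monotonic clock)."""
    return deadline - time.monotonic()

def call_downstream(deadline: float, min_needed: float = 0.01) -> str:
    budget = remaining(deadline)
    if budget < min_needed:
        # Fail fast: don't start work that cannot finish in time.
        raise DeadlineExceeded(f"only {budget:.3f}s of budget left")
    # A real client would use `budget` as the request timeout and forward
    # the deadline, e.g. in a header or RPC metadata field.
    return f"ok (budget {budget:.2f}s)"

deadline = time.monotonic() + 2.0   # caller allows 2 s end to end
print(call_downstream(deadline))    # ample budget, so the call proceeds
```

In practice the deadline is serialized into request metadata (gRPC does this natively; HTTP services often use a custom header), so every hop in the chain shares one budget instead of stacking independent timeouts.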
Packet loss occurs when network data fails to reach its destination. This happens constantly in real networks—router buffers overflow, cables experience interference, wireless signals degrade, network equipment fails. TCP handles most packet loss transparently through retransmission, but this has costs: increased latency, reduced throughput, and eventually, connection failures if loss is severe enough.
How Packet Loss Manifests:
At low rates (1-5%), packet loss typically manifests as increased latency because TCP retransmit timers fire and data must be resent. At moderate rates (5-20%), connections become sluggish as TCP's congestion control algorithm reduces throughput. At high rates (20%+), connections may fail entirely as retransmission limits are exceeded.
Packet loss also affects different protocols differently. TCP handles packet loss (eventually), but UDP doesn't—lost UDP packets are simply gone. Application-level protocols may have their own retry mechanisms that interact with network-level behavior in complex ways.
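The throughput cost of sustained random loss can be roughly estimated with the Mathis approximation for TCP, throughput ≈ (MSS/RTT) × (C/√p) with C ≈ 1.22. This is a back-of-envelope model under idealized assumptions, not a substitute for measuring your own links:

```python
import math

# Mathis et al. approximation of steady-state TCP throughput under random
# loss: throughput <= (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2) ~ 1.22.
# A coarse model: it ignores timeouts, small windows, and modern congestion
# control variants, but captures the 1/sqrt(p) collapse.

def tcp_throughput_mbps(mss_bytes: int, rtt_sec: float, loss: float) -> float:
    c = math.sqrt(3.0 / 2.0)
    bytes_per_sec = (mss_bytes / rtt_sec) * (c / math.sqrt(loss))
    return bytes_per_sec * 8 / 1_000_000

# 1460-byte MSS over a 50 ms RTT path:
for p in (0.0001, 0.01, 0.05):
    print(f"loss {p:>7.2%}: ~{tcp_throughput_mbps(1460, 0.050, p):.1f} Mbit/s")
```

The 1/√p term explains why the 5% loss injected in the example below feels so much worse than "5% of requests retransmit": going from 0.01% to 1% loss cuts the achievable throughput of a single connection by a factor of ten.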
```bash
# Using Linux tc netem to inject packet loss

# Random 5% packet loss
sudo tc qdisc add dev eth0 root netem loss 5%

# Correlated packet loss (25% correlation with previous loss)
sudo tc qdisc add dev eth0 root netem loss 5% 25%

# Bursty packet loss using loss state model (Gilbert-Elliott)
# Good state: 0.1% loss, Bad state: 50% loss
# Probability of transitioning Good->Bad: 20%, Bad->Good: 80%
sudo tc qdisc add dev eth0 root netem loss state 0.1% 50% 20% 80%

# Combine with latency for more realistic simulation
sudo tc qdisc add dev eth0 root netem delay 50ms 10ms loss 2%

# Verify active rules
tc qdisc show dev eth0

# Remove rules
sudo tc qdisc del dev eth0 root
```

Observable Effects and Analysis:
During packet loss experiments, monitor TCP retransmission counts, request latency distributions, connection error rates, effective throughput, and application-level retry counts.
What Packet Loss Reveals:
| Finding | Implication | Remediation |
|---|---|---|
| Requests succeed but latency spikes | TCP retransmits working | Consider if latency is acceptable |
| Connections timeout entirely | Loss rate exceeds TCP tolerance | Improve network quality or add redundant paths |
| Some services fail, others don't | Different timeout configurations | Standardize timeout settings |
| Application retries cause duplicates | Non-idempotent operations | Implement idempotency keys |
| Uneven degradation across service instances | Some network paths worse than others | Review network topology |
When packet loss causes requests to fail, application-level retry logic may fire. If retries are too aggressive, they can amplify load during the very conditions where the network is already struggling. Monitor for retry storms during packet loss experiments: a sign that retry backoff is missing or too short.
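A common remediation is capped exponential backoff with jitter, so retries from many clients spread out instead of arriving in synchronized waves. A minimal sketch of the "full jitter" variant (parameter values are illustrative):

```python
import random

# Sketch of capped exponential backoff with "full jitter": each retry
# sleeps a random duration drawn from [0, min(cap, base * 2**attempt)].
# The randomness prevents clients that failed together from retrying
# together -- the synchronized waves that make a retry storm.

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Example schedule for one client (random, so chosen values vary per run):
for attempt in range(5):
    ceiling = min(10.0, 0.1 * (2 ** attempt))
    print(f"attempt {attempt}: sleep up to {ceiling:.1f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

Pair this with a retry budget (a cap on total retries per unit time) so that even well-jittered retries cannot multiply load indefinitely while the network is degraded.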
A network partition occurs when network failures divide a system into isolated segments that cannot communicate with each other but may still be individually functional. Partitions are the most challenging network failure to handle correctly because they create situations where different parts of your system make conflicting decisions based on incomplete information.
The Fundamental Challenge:
During a partition, isolated segments cannot know whether other segments are down or simply unreachable. This ambiguity creates impossible choices:
If a primary database becomes unreachable, should a replica promote itself to primary? If yes, you risk split-brain with two primaries accepting conflicting writes. If no, you risk unavailability while the primary might be perfectly healthy but network-isolated.
If a service instance loses contact with its load balancer, should it continue serving requests from clients that can reach it directly? Or should it shut down to avoid inconsistent behavior?
The CAP theorem formalizes this dilemma: during partitions, you must choose between consistency (all nodes see the same data) and availability (all requests receive a response). Different systems make different tradeoffs, and partition injection reveals which tradeoff your system actually implements versus which tradeoff you think it implements.
```bash
# Using iptables to create network partitions

# Block all traffic from a specific IP (simulating partition from that host)
sudo iptables -A INPUT -s 192.168.1.100 -j DROP
sudo iptables -A OUTPUT -d 192.168.1.100 -j DROP

# Create partition between two network segments
# (run on nodes in segment A to isolate from segment B)
sudo iptables -A INPUT -s 192.168.2.0/24 -j DROP
sudo iptables -A OUTPUT -d 192.168.2.0/24 -j DROP

# Create asymmetric partition (block incoming but allow outgoing)
sudo iptables -A INPUT -s 192.168.1.100 -j DROP
# Note: outgoing traffic is still allowed

# Block traffic to specific port (partition service-level, not host-level)
sudo iptables -A OUTPUT -p tcp --dport 5432 -j DROP

# List current rules
sudo iptables -L -n -v

# Remove all rules (heal the partition)
sudo iptables -F

# Using timeout to auto-heal partition after 30 seconds
timeout 30s bash -c '
  iptables -A INPUT -s 192.168.1.100 -j DROP
  iptables -A OUTPUT -d 192.168.1.100 -j DROP
  sleep 30'
iptables -D INPUT -s 192.168.1.100 -j DROP
iptables -D OUTPUT -d 192.168.1.100 -j DROP
```

| Behavior During Partition | Interpretation | Risk Level |
|---|---|---|
| System becomes completely unavailable | Chose consistency over availability (CP) | Acceptable if consistency is critical |
| System continues serving requests with stale data | Chose availability over consistency (AP) | Acceptable if staleness is tolerable |
| Split-brain: conflicting writes accepted by different nodes | Consistency violation—data corruption risk | Critical: must fix before production |
| Some operations work, others don't | Partial consistency based on operation type | Evaluate which operations are affected |
| System heals cleanly when partition resolves | Good recovery behavior | Expected for well-designed systems |
| Data loss or corruption after partition heals | Conflict resolution is broken | Critical: reconciliation logic needed |
If your partition experiment reveals split-brain behavior—where multiple nodes believe they are the leader or multiple copies of data can be independently modified—this is almost always a critical bug. Split-brain leads to data corruption that can be extremely difficult to detect and repair. Prioritize fixing split-brain scenarios before any other chaos engineering work.
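The standard defense against split-brain is majority quorum: a node may act as leader only while it can reach a strict majority of the cluster, so at most one side of any partition can lead. A minimal sketch of the rule (real systems implement this with heartbeats and a consensus protocol such as Raft; peer reachability is just a number here):

```python
# Minimal quorum sketch: a node keeps leadership only while it can reach
# a strict majority of the cluster (counting itself). During a partition,
# at most one side can hold a majority, which rules out two leaders.

def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """reachable_peers counts *other* nodes this node can currently reach."""
    votes = reachable_peers + 1           # this node counts itself
    return votes > cluster_size // 2      # strict majority required

# 5-node cluster partitioned 3 / 2:
assert has_quorum(reachable_peers=2, cluster_size=5)       # majority side leads
assert not has_quorum(reachable_peers=1, cluster_size=5)   # minority side steps down

# 4-node cluster split 2 / 2: neither side has quorum, so no leader at all --
# one reason odd cluster sizes are preferred.
assert not has_quorum(reachable_peers=1, cluster_size=4)
```

Note the deliberate tradeoff: the minority side chooses unavailability (it stops accepting writes) to preserve consistency, which is exactly the CP behavior described above.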
DNS (Domain Name System) is the first step in almost every network connection—your services must resolve hostnames to IP addresses before they can connect. DNS failures are particularly dangerous because they affect all new connections while often being invisible to monitoring that tracks individual services.
The Hidden DNS Dependency:
Every time your code opens a connection to database.internal, api.partner.com, or s3.amazonaws.com, DNS resolution happens first. This resolution typically uses cached results, but caches expire. When they do, fresh DNS queries must succeed, or connections fail.
DNS failures can manifest in multiple ways: resolution that fails outright, resolution that is extremely slow, answers that point to wrong or stale IP addresses, and records that disappear entirely (NXDOMAIN).
```bash
# Block DNS traffic (port 53) to simulate DNS server unavailability
sudo iptables -A OUTPUT -p udp --dport 53 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 53 -j DROP

# Add latency to DNS queries specifically
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 5000ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 53 0xffff flowid 1:3

# Using dnsmasq to return controlled DNS responses
# Install dnsmasq and configure /etc/dnsmasq.conf:
# address=/api.example.com/127.0.0.1   # Wrong IP
# address=/database.internal/          # NXDOMAIN (empty)

# Flush DNS cache to force re-resolution (Linux systemd)
sudo systemd-resolve --flush-caches

# Force DNS cache expiration by manipulating TTL
# This requires a DNS proxy that can modify TTL values

# Verify DNS behavior
dig @8.8.8.8 api.example.com
nslookup api.example.com
host api.example.com
```

Observable Effects and Their Meaning:
| Observation | What It Indicates | Remediation |
|---|---|---|
| Existing connections work, new connections fail | DNS affects only new connections | Extend connection keep-alive and pooling |
| All traffic fails immediately | No DNS caching, or very short TTLs | Implement local DNS caching |
| Gradual failure as cache expires | DNS caching working but has limits | Consider longer cache TTL or fallback DNS |
| Connections to IP addresses still work | Services can work with IPs | Consider IP-based fallbacks for critical paths |
| Service discovery fails | DNS-based service discovery affected | Evaluate alternative service discovery |
| SSL/TLS failures after DNS change | Resolved host presents a certificate for the wrong name | Verify certificates cover every hostname and IP in use |
DNS failure experiments often reveal that DNS caching strategy wasn't deliberately designed. Some services cache aggressively (resilient but slow to pick up changes), others don't cache at all (responsive but fragile). The right strategy depends on your tradeoff between resilience and responsiveness—but it should be an explicit choice, not an accident.
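A deliberate caching strategy can be as simple as a TTL-bounded cache that serves the last known-good answer when fresh resolution fails. A sketch of that policy (the resolver callable is injected, e.g. a wrapper around `socket.getaddrinfo`; class and parameter names are illustrative):

```python
import time

# Sketch of a deliberate DNS caching policy: honor the TTL while the
# entry is fresh, but fall back to the last known-good answer if
# re-resolution fails. `resolve` is an injected callable that returns
# an IP string or raises OSError (socket.gaierror subclasses OSError).

class DnsCache:
    def __init__(self, resolve, ttl: float = 30.0, clock=time.monotonic):
        self._resolve, self._ttl, self._clock = resolve, ttl, clock
        self._entries = {}   # hostname -> (ip, fetched_at)

    def lookup(self, host: str) -> str:
        entry = self._entries.get(host)
        if entry and self._clock() - entry[1] < self._ttl:
            return entry[0]              # fresh: serve from cache
        try:
            ip = self._resolve(host)
            self._entries[host] = (ip, self._clock())
            return ip
        except OSError:
            if entry:                    # resolution failed: serve stale
                return entry[0]
            raise                        # nothing cached to fall back to
```

Serving stale answers trades correctness for availability during DNS outages; whether that is acceptable depends on how often the addresses behind a hostname actually change, which is exactly the explicit choice this section argues for.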
Bandwidth throttling limits the rate at which data can flow between systems. While modern data centers typically have abundant bandwidth, throttling experiments are valuable for testing constrained paths: cross-region links, connections to third-party services, mobile and cellular clients, and bulk transfers such as backups or replication.
```bash
# Using tc to limit bandwidth

# Limit outgoing bandwidth to 1 Mbit/s with 32kbit queue
sudo tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms

# Limit bandwidth with realistic latency for cross-region simulation
# (10 Mbit/s with 50ms latency simulates distant datacenter)
sudo tc qdisc add dev eth0 root handle 1: netem delay 50ms
sudo tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 10mbit burst 32kbit latency 400ms

# Limit bandwidth to specific destination
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: tbf rate 1mbit burst 32kbit latency 400ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dst 192.168.1.100 flowid 1:3

# Simulate cellular network (1 Mbit/s, variable latency, some loss)
sudo tc qdisc add dev eth0 root netem delay 100ms 50ms loss 1% rate 1mbit

# View current throttling rules
tc qdisc show dev eth0
tc class show dev eth0
tc filter show dev eth0

# Remove throttling
sudo tc qdisc del dev eth0 root
```

Most bandwidth issues in modern systems involve cross-region communication. The link between your Kubernetes cluster and a third-party API, the connection between primary and replica databases in different regions, the traffic between your CDN and origin servers—these paths often have lower bandwidth than internal datacenter links and deserve specific testing.
Real-world network degradation rarely involves just one type of failure. A congested network path typically exhibits increased latency, higher packet loss, and reduced effective bandwidth simultaneously. A failing network switch might partition some traffic while introducing latency to other traffic. Realistic chaos experiments combine multiple network failure types to model these compound scenarios.
| Scenario | Failure Combination | Real-World Cause |
|---|---|---|
| Congested link | Latency + packet loss + reduced bandwidth | Router buffer overflow during traffic spike |
| Flaky connection | Intermittent partitions + high latency | Failing network cable or port |
| DNS storm | DNS latency + service partition | DNS server overload during incident |
| WAN degradation | High latency + bandwidth limits + low packet loss | Cross-region link saturation |
| Cloud provider issue | Random subset of IPs partitioned | Partial availability zone failure |
| Cellular network | Variable latency + packet loss + bandwidth limits | Mobile user experience |
```bash
# Simulate degraded WAN link to another datacenter
# 50ms latency, 25ms jitter, 1% packet loss, 10Mbit bandwidth limit

sudo tc qdisc add dev eth0 root handle 1: netem \
  delay 50ms 25ms distribution normal \
  loss 1%

sudo tc qdisc add dev eth0 parent 1:1 handle 10: tbf \
  rate 10mbit burst 32kbit latency 400ms

# Simulate intermittent connectivity (flaky link)
# Alternate between working (3s) and partitioned (1s)
while true; do
  iptables -D OUTPUT -d 192.168.1.100 -j DROP 2>/dev/null
  sleep 3
  iptables -A OUTPUT -d 192.168.1.100 -j DROP
  sleep 1
done
```

Best Practices for Combined Failure Experiments:
Start simple, add complexity — Begin with single failure types to establish baseline behavior, then combine them.
Model real incidents — Base combined scenarios on past incidents you've experienced or read about in postmortems.
Vary intensity — Test combinations at different severity levels (e.g., 1% loss + 50ms latency vs. 10% loss + 500ms latency).
Test recovery — Combined failures may heal in different orders. Test how systems behave as individual components recover.
Monitor holistically — Combined failures may cause unexpected interactions. Monitor all system components, not just those directly affected.
Document precisely — Record exact parameters so experiments can be reproduced and compared over time.
While tc and iptables provide low-level control for Linux-based network failure injection, several higher-level tools offer more convenient interfaces and additional capabilities:
| Tool | Scope | Key Capabilities | Best For |
|---|---|---|---|
| tc/netem | Linux host | Latency, loss, bandwidth, reordering, duplication | Low-level control, specific scenarios |
| iptables/nftables | Linux host | Packet filtering, blocking, marking | Partitions, firewall-level failures |
| Chaos Mesh | Kubernetes | Pod-level network chaos, DNS chaos, HTTP chaos | K8s environments, automated experiments |
| Gremlin | Multi-platform | Full network failure suite with UI and automation | Enterprise, regulated environments |
| Toxiproxy | Application proxy | TCP-level manipulation, client library integration | Application-layer testing, development |
| Comcast | Go-based tool | Simple CLI for network failures | Quick testing, CI/CD integration |
| Pumba | Docker | Container-level network chaos | Docker environments without K8s |
```bash
# Toxiproxy example - proxy that can inject failures
# Useful for testing against services you can route through it

# Create a proxy for a database connection
toxiproxy-cli create postgres_proxy -l localhost:25432 -u database.internal:5432

# Add 200ms latency toxic
toxiproxy-cli toxic add postgres_proxy -t latency -a latency=200

# Add 10% packet loss (timeout toxic approximates this)
toxiproxy-cli toxic add postgres_proxy -t timeout -a timeout=10000

# List active toxics
toxiproxy-cli toxic list postgres_proxy

# Remove a toxic by name (toxics are auto-named <type>_<stream>)
toxiproxy-cli toxic remove postgres_proxy -n latency_downstream
```

In Kubernetes environments, Chaos Mesh or LitmusChaos provide native integration with pod networking. For testing during development, Toxiproxy is lightweight and easy to integrate. For enterprise environments requiring audit trails and approval workflows, Gremlin provides the governance features needed for production chaos engineering.
Network failure injection is foundational to chaos engineering. The network is simultaneously the most critical and least reliable component of distributed systems, making network resilience testing essential for any system that aims to operate reliably at scale.
What's Next:
With network failures covered, we'll now examine Service Failures—the application-layer failures that occur when the services your system depends on crash, return errors, or behave unexpectedly. Service failure injection tests your error handling, retry logic, circuit breakers, and graceful degradation capabilities.
You now understand how to inject and analyze network failures—latency, packet loss, partitions, DNS failures, and bandwidth throttling. These techniques will reveal weaknesses in your distributed system's network resilience. Next, we'll explore service-level failure injection.