In 1994, L. Peter Deutsch at Sun Microsystems identified a set of dangerous assumptions that programmers new to distributed computing invariably make. Together with an eighth added later by James Gosling, these became known as the Fallacies of Distributed Computing—beliefs that seem reasonable but lead to software that fails spectacularly in production.
The first and most treacherous fallacy is this: The network is reliable.
It's a seductive assumption. When you send a request from your laptop to a server sitting next to you, it works. Every time. The request goes out, the response comes back. Networking feels invisible, magical, perfectly dependable. And so developers write code that assumes this reliability extends to production environments with hundreds of services communicating across data centers spanning continents.
This single assumption has caused more distributed system outages than perhaps any other design flaw. Companies have lost billions in revenue, engineers have spent countless sleepless nights debugging phantom failures, and users have experienced data loss—all because someone assumed the network would always be there when needed.
On this page, we will dissect exactly why networks are unreliable, understand the myriad ways they fail, and develop architectural patterns that embrace—rather than ignore—this fundamental reality of distributed computing.
To understand network unreliability, we must first understand what a network actually is: a complex web of physical hardware, software protocols, and human-managed configurations connecting your application's process to another process that might be across the room or across the planet.
The physical reality:
Every network packet you send travels through a remarkable journey. It begins as electrical signals or photons, traverses cables, routers, switches, and sometimes satellites. Each component in this chain can fail:
| Failure Type | Description | Typical Duration | Frequency |
|---|---|---|---|
| Packet Loss | Packets dropped by congested routers or faulty hardware | Milliseconds (per packet) | Continuous (0.1-1% typical) |
| Network Partition | Two groups of nodes cannot communicate with each other | Seconds to hours | Rare but catastrophic |
| Timeout | Request sent but no response received within expected time | Seconds | Regular occurrence |
| DNS Failure | Domain name resolution fails or returns incorrect IP | Minutes to hours | Occasional |
| BGP Misconfiguration | Internet routing tables corrupted, traffic misrouted | Minutes to hours | Rare but widespread |
| Cable Cut | Physical infrastructure damage (construction, sharks, etc.) | Hours to days | Uncommon but severe |
| Hardware Failure | NIC, switch, or router fails completely | Minutes to hours | Regular in large deployments |
| Congestion Collapse | Network saturated, most packets dropped | Minutes | Under peak load |
Even networks that seem 'local'—machines in the same rack, same data center, same availability zone—can and do fail. A misconfigured switch can partition a rack. A firmware bug can cause a network card to silently drop specific packet types. Published studies of production data centers, including Google's, show that even within a single facility, network failures happen far more often than most engineers expect.
Network failures come in many forms, and understanding their characteristics is essential for building resilient systems. The most important distinction is between detectable and undetectable failures.
Detectable failures include:
- Connection refused: the host is reachable, but nothing is listening on the target port.
- DNS resolution errors: the hostname cannot be resolved to an address at all.
- Connection reset: the remote side actively tears down the connection.
- Explicit error responses: the service answers, but with an error status.
These are actually the good failures—your code can catch them immediately and respond appropriately.
Undetectable failures are far more dangerous:
- Timeouts: the request was sent, but no response arrives—was it processed or not?
- Silently dropped packets: the request or the response simply vanishes in transit.
- A service that accepts the request, begins work, and crashes before replying.
- Asymmetric partitions: your packets reach the service, but its replies never reach you.

In every one of these cases, your system is left in an unknown state.
The unknown state is the central challenge of distributed systems.
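To make the distinction concrete, here is a minimal sketch—the `classifyRequest` helper and its return values are hypothetical, not from the original text—of how a client might separate failures it can act on immediately from timeouts whose outcome it simply cannot know:

```typescript
// Hypothetical sketch: classify failures by what they tell us about the outcome.
// 'failed' means we know the operation did not complete; 'unknown' means we cannot tell.
type Outcome = 'succeeded' | 'failed' | 'unknown';

async function classifyRequest(url: string, timeoutMs = 5000): Promise<Outcome> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    // An explicit error status is at least a definite answer from the service.
    return res.ok ? 'succeeded' : 'failed';
  } catch (err) {
    // A timeout tells us nothing about whether the server processed the request.
    if ((err as Error).name === 'AbortError') return 'unknown';
    // Connection refused or DNS failure usually surfaces before the request
    // ever reaches the service, so treating it as 'failed' is reasonable.
    return 'failed';
  } finally {
    clearTimeout(timer);
  }
}
```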
Consider this scenario: you send a request to a payment service to charge a customer $100. Your request times out after 30 seconds. What happened? At least three outcomes are possible:
- The request never reached the payment service, and the customer was not charged.
- The request arrived and the charge succeeded, but the response was lost on the way back.
- The request arrived and is still being processed; the charge may complete after your timeout fires.
Without additional mechanisms like idempotency keys or exactly-once semantics, you literally cannot know which scenario occurred. This fundamental uncertainty drives much of the complexity in distributed systems design.
This uncertainty isn't a solvable engineering problem—it's a fundamental property of asynchronous communication. The 'Two Generals Problem' proves that no protocol can guarantee consensus between two parties communicating over an unreliable channel. We can only mitigate, never eliminate, this uncertainty.
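As a concrete illustration of the mitigation just mentioned, here is a hedged sketch of a client that generates one idempotency key per logical charge and reuses it on every retry. The endpoint, payload, and the assumption that the payment API deduplicates on an `Idempotency-Key` header are illustrative, not taken from the original text:

```typescript
import { randomUUID } from 'crypto';

// Sketch: one idempotency key per logical charge, reused across retries, so a
// duplicate delivery cannot turn into a duplicate charge (assuming the server
// deduplicates requests carrying the same Idempotency-Key header).
async function chargeWithRetry(customerId: string, amountCents: number): Promise<Response> {
  const idempotencyKey = randomUUID(); // generated once, before the first attempt

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fetch('https://payments.example.com/charges', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // same key on every retry
        },
        body: JSON.stringify({ customerId, amountCents }),
        signal: AbortSignal.timeout(5_000),
      });
    } catch {
      // Timeout or network error: the outcome is unknown, but retrying with the
      // same key is safe because the server will not execute the charge twice.
    }
  }
  throw new Error('Charge outcome unknown after retries - flag for reconciliation');
}
```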
Abstract concepts become concrete through real-world examples: network unreliability has caused major outages at companies that should—and did—know better.
Major tech companies publish detailed post-mortems of their outages. Reading these is one of the most effective ways to internalize the reality of network failures. Every post-mortem teaches lessons that no textbook can match.
Since we cannot make networks reliable, we must architect systems that function correctly despite unreliable networks. This requires specific design patterns applied at multiple layers.
```typescript
/**
 * Example: Resilient HTTP client with timeout, retry, and circuit breaker
 * This illustrates the patterns needed for network unreliability
 */

interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

interface CircuitBreakerConfig {
  failureThreshold: number; // Opens after this many failures
  recoveryTimeMs: number;   // Try again after this interval
}

class ResilientHttpClient {
  private failureCount = 0;
  private lastFailureTime = 0;
  private circuitOpen = false;

  constructor(
    private retryConfig: RetryConfig = { maxRetries: 3, baseDelayMs: 100, maxDelayMs: 10000 },
    private circuitConfig: CircuitBreakerConfig = { failureThreshold: 5, recoveryTimeMs: 30000 },
    private timeoutMs: number = 5000
  ) {}

  async request<T>(url: string, options: RequestInit = {}): Promise<T> {
    // Check circuit breaker state
    if (this.isCircuitOpen()) {
      throw new Error('Circuit breaker open - failing fast');
    }

    let lastError: Error | null = null;

    for (let attempt = 0; attempt <= this.retryConfig.maxRetries; attempt++) {
      try {
        const response = await this.executeWithTimeout(url, options);
        // Success - reset failure count
        this.onSuccess();
        return response as T;
      } catch (error) {
        lastError = error as Error;
        this.onFailure();

        // Don't retry if circuit opened
        if (this.isCircuitOpen()) {
          throw new Error('Circuit breaker opened during retries');
        }

        // Don't retry on client errors (4xx) - these won't resolve
        if (error instanceof HttpError && error.status >= 400 && error.status < 500) {
          throw error;
        }

        // Wait before retry with exponential backoff + jitter
        if (attempt < this.retryConfig.maxRetries) {
          const delay = this.calculateBackoff(attempt);
          await this.sleep(delay);
        }
      }
    }

    throw lastError || new Error('Request failed after retries');
  }

  private async executeWithTimeout(url: string, options: RequestInit): Promise<unknown> {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), this.timeoutMs);

    try {
      const response = await fetch(url, { ...options, signal: controller.signal });
      if (!response.ok) {
        throw new HttpError(response.status, await response.text());
      }
      return await response.json();
    } finally {
      clearTimeout(timeoutId);
    }
  }

  private calculateBackoff(attempt: number): number {
    // Exponential backoff: baseDelay * 2^attempt
    const exponentialDelay = this.retryConfig.baseDelayMs * Math.pow(2, attempt);
    // Cap at maximum delay
    const cappedDelay = Math.min(exponentialDelay, this.retryConfig.maxDelayMs);
    // Add jitter (±25%) to prevent thundering herd
    const jitter = cappedDelay * 0.25 * (Math.random() * 2 - 1);
    return Math.floor(cappedDelay + jitter);
  }

  private isCircuitOpen(): boolean {
    if (!this.circuitOpen) return false;

    // Check if recovery period has passed
    const timeSinceFailure = Date.now() - this.lastFailureTime;
    if (timeSinceFailure >= this.circuitConfig.recoveryTimeMs) {
      // Half-open state: allow one request to test recovery
      return false;
    }
    return true;
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.circuitOpen = false;
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.circuitConfig.failureThreshold) {
      this.circuitOpen = true;
      console.warn(`Circuit breaker opened after ${this.failureCount} failures`);
    }
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

class HttpError extends Error {
  constructor(public status: number, message: string) {
    super(`HTTP ${status}: ${message}`);
    this.name = 'HttpError';
  }
}
```

These patterns work best in combination. Timeouts prevent indefinite waits, retries handle transient failures, idempotency makes retries safe, circuit breakers prevent cascade failures, and fallbacks maintain user experience. Missing any layer leaves gaps in your resilience.
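To show how these layers compose in practice, here is a short usage sketch of the client above; the recommendations URL and the `getCachedRecommendations` fallback are hypothetical placeholders, not part of the original example:

```typescript
// Compose timeout + retry + circuit breaker (via ResilientHttpClient) with a fallback,
// so users still get a response when the recommendations service is unreachable.
const client = new ResilientHttpClient(
  { maxRetries: 2, baseDelayMs: 200, maxDelayMs: 5000 },
  { failureThreshold: 5, recoveryTimeMs: 30000 },
  3000 // per-attempt timeout in milliseconds
);

async function getRecommendations(userId: string): Promise<string[]> {
  try {
    return await client.request<string[]>(`https://recs.example.com/users/${userId}`);
  } catch {
    // Timeout, exhausted retries, or an open circuit: degrade gracefully
    // instead of failing the whole page.
    return getCachedRecommendations(userId) ?? [];
  }
}

function getCachedRecommendations(userId: string): string[] | null {
  return null; // placeholder: e.g. the last known good response from a local cache
}
```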
When a request times out, you face a dilemma: Do you retry (risking duplicate execution) or give up (risking a failed operation the user believes succeeded)? Idempotency resolves this dilemma by making duplicate execution safe.
An operation is idempotent if executing it multiple times produces the same result as executing it once.
Naturally idempotent operations:
- Reading data (an HTTP GET)
- Setting a field to a specific value ("set order status to 'shipped'")
- Deleting a resource by ID (the second delete simply finds nothing to delete)
- Upserting a record keyed by a unique identifier
Non-idempotent operations that must be made idempotent:
- Creating a new resource (each call creates another one)
- Incrementing a counter or adjusting an account balance
- Appending to a list or log
- Charging a payment

The standard technique is to attach a unique idempotency key to each logical operation, as the sketch after this list and the server-side implementation below make concrete.
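As a minimal illustration of the definition above—using a hypothetical `Account` type—setting a value is idempotent while incrementing is not:

```typescript
interface Account { balance: number; status: string }

const setStatusShipped = (a: Account): Account => ({ ...a, status: 'shipped' });
const addTen = (a: Account): Account => ({ ...a, balance: a.balance + 10 });

const acct: Account = { balance: 100, status: 'pending' };

// Idempotent: applying the operation twice gives the same result as applying it once.
console.log(setStatusShipped(setStatusShipped(acct)).status); // 'shipped'

// Not idempotent: every duplicate execution changes the result again.
console.log(addTen(addTen(acct)).balance); // 120, not 110
```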
```typescript
/**
 * Server-side idempotency implementation
 * Tracks idempotency keys to prevent duplicate processing
 */

interface IdempotencyRecord {
  key: string;
  response: unknown;
  createdAt: Date;
  expiresAt: Date;
}

class IdempotencyService {
  // In production: use Redis or database with TTL
  private records = new Map<string, IdempotencyRecord>();
  private readonly TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

  /**
   * Execute an operation idempotently
   * Returns cached result for duplicate keys
   */
  async executeOnce<T>(
    idempotencyKey: string,
    operation: () => Promise<T>
  ): Promise<T> {
    // Check for existing record
    const existing = this.records.get(idempotencyKey);
    if (existing) {
      if (existing.expiresAt > new Date()) {
        console.log(`Returning cached result for key: ${idempotencyKey}`);
        return existing.response as T;
      } else {
        // Expired - remove and allow re-execution
        // Be careful: in production, consider if re-execution is safe
        this.records.delete(idempotencyKey);
      }
    }

    // Execute the operation
    // NOTE: In production, use distributed locking to prevent
    // concurrent executions with the same idempotency key
    const result = await operation();

    // Cache the result
    const now = new Date();
    this.records.set(idempotencyKey, {
      key: idempotencyKey,
      response: result,
      createdAt: now,
      expiresAt: new Date(now.getTime() + this.TTL_MS),
    });

    return result;
  }
}

// Example API endpoint using idempotency
class PaymentController {
  constructor(private idempotencyService: IdempotencyService) {}

  async chargeCustomer(request: {
    idempotencyKey: string;
    customerId: string;
    amount: number;
  }): Promise<PaymentResult> {
    // Same idempotency key = same result, even if called multiple times
    return this.idempotencyService.executeOnce(
      request.idempotencyKey,
      async () => {
        // Actual payment processing happens here
        // This will only execute once per unique idempotency key
        const charge = await this.paymentGateway.charge(
          request.customerId,
          request.amount
        );
        return {
          success: true,
          chargeId: charge.id,
          amount: charge.amount,
        };
      }
    );
  }

  private paymentGateway = {
    charge: async (customerId: string, amount: number) => ({
      id: `ch_${Date.now()}`,
      customerId,
      amount,
    })
  };
}

interface PaymentResult {
  success: boolean;
  chargeId: string;
  amount: number;
}
```

Idempotency key storage must be durable and consistent. If your idempotency keys are stored in-memory or on a separate database that can diverge from your main data, you can still get duplicates. In distributed systems, idempotency storage becomes a consensus problem itself.
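One way to address that storage problem—sketched here under the assumption of a store with an atomic "set if absent" primitive (for example Redis `SET ... NX PX` or a unique-constraint insert in SQL); the `DurableStore` interface itself is hypothetical—is to reserve the key atomically before executing the operation:

```typescript
// Hypothetical durable store with an atomic "set if absent" primitive.
interface DurableStore {
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>; // true if we won the key
  get(key: string): Promise<string | null>;
}

async function reserveIdempotencyKey(
  store: DurableStore,
  key: string
): Promise<'proceed' | 'duplicate'> {
  // The atomic reservation closes the race between "check" and "execute":
  // two concurrent requests carrying the same key cannot both win it.
  const won = await store.setIfAbsent(`idem:${key}`, 'in-progress', 24 * 60 * 60 * 1000);
  return won ? 'proceed' : 'duplicate';
}
```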
You cannot build resilient systems without testing failure scenarios. Unfortunately, network failures are inherently non-deterministic and hard to reproduce. Sophisticated teams use chaos engineering and fault injection to systematically test their assumptions.
Approaches to testing network unreliability:
| Technique | Description | Tools | When to Use |
|---|---|---|---|
| Fault Injection | Programmatically inject failures (delays, errors, drops) | Toxiproxy, Envoy fault injection | Development, integration tests |
| Chaos Engineering | Randomly inject failures in production | Chaos Monkey, Gremlin, Litmus | Production (with monitoring) |
| Network Simulation | Simulate specific network conditions | tc (Linux), Clumsy (Windows) | Development, CI/CD |
| Partition Testing | Physically or logically isolate network segments | iptables, network namespaces | Pre-production validation |
| Dependency Mocking | Mock external services with controllable failures | WireMock, LocalStack | Unit and integration tests |
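As a small, self-contained example of fault injection—purely hypothetical, and in-process rather than at the TCP level where tools like Toxiproxy operate—a test can wrap `fetch` so that a configurable fraction of calls is delayed or fails, forcing the resilience code paths to actually run:

```typescript
// Fault-injecting wrapper for integration tests: delay every call and fail a
// configurable fraction of them, so timeouts, retries, and fallbacks get exercised.
type Fetch = typeof fetch;

function withFaults(realFetch: Fetch, errorRate = 0.2, maxExtraLatencyMs = 500): Fetch {
  return (async (input, init) => {
    // Injected latency: 0 to maxExtraLatencyMs of extra delay per call.
    await new Promise(resolve => setTimeout(resolve, Math.random() * maxExtraLatencyMs));
    if (Math.random() < errorRate) {
      throw new TypeError('injected network failure'); // mimic a dropped connection
    }
    return realFetch(input, init);
  }) as Fetch;
}

// Usage in a test: verify the system still returns a fallback under 20% failures.
// const flakyFetch = withFaults(fetch);
```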
Netflix pioneered chaos engineering with their 'Simian Army'—a collection of tools that continuously attack their own production systems. Chaos Monkey kills random instances. Chaos Kong simulates entire AWS region failures. Latency Monkey adds delays. By constantly testing failure modes, Netflix builds systems that handle real outages gracefully.
Accepting network unreliability as a fundamental constraint has profound implications for how we design systems. These aren't just code patterns—they're architectural philosophies.
Design for failure, not for success:
The mindset shift required is profound. Instead of asking, 'What happens when this succeeds?' defensive architects ask, 'What happens when this fails?' and 'How do we know it failed?' and 'What do we do about partial failure?'
This leads to patterns like:
- Graceful degradation: serve a reduced experience (cached or default data) rather than an error page.
- Bulkheads: isolate resources per dependency so one slow service cannot exhaust every thread or connection (see the sketch after this list).
- Redundancy and failover: no single network path, instance, or zone is a single point of failure.
- Health checks and load shedding: detect trouble early and stop routing work to struggling nodes.
- Asynchronous, queue-based communication: decouple producers from consumers so a temporary outage becomes a delay rather than an error.
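Here is a minimal bulkhead sketch—a hypothetical in-process semaphore, not a specific library's API—that caps concurrent calls per dependency so a slow downstream service cannot tie up every request handler:

```typescript
// Bulkhead: at most maxConcurrent tasks run at once; the rest wait their turn.
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      await new Promise<void>(resolve => this.queue.push(resolve)); // wait for a free slot
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // release the next waiter, if any
    }
  }
}

// Each dependency gets its own bulkhead, so a slow recommendations service
// cannot starve calls to the payment service.
const recommendationsBulkhead = new Bulkhead(10);
const paymentsBulkhead = new Bulkhead(50);
```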
The best distributed systems engineers don't try to prevent all failures—they accept failures as inevitable and design systems that handle them gracefully. A system that works perfectly 99% of the time and fails catastrophically the other 1% is worse than one that degrades gracefully and keeps serving its users even while parts of it are failing.
We've explored the first and most fundamental fallacy of distributed computing: the assumption that the network is reliable. Let's consolidate the key insights:
- Networks fail constantly: packet loss, partitions, timeouts, DNS and routing problems, and hardware faults are a matter of when, not if.
- The hardest failures are the undetectable ones—a timeout leaves you in an unknown state, and the Two Generals Problem shows this uncertainty can only be mitigated, never eliminated.
- Resilience comes from layered patterns: timeouts, retries with exponential backoff and jitter, idempotency, circuit breakers, and fallbacks.
- You can only trust resilience you have tested—fault injection and chaos engineering turn assumptions into evidence.
- Design for failure: ask 'What happens when this fails?' before asking 'What happens when this succeeds?'
What's next:
The first fallacy taught us that networks fail. But even when networks work, they don't work instantly. The next fallacy—Latency Is Zero—explores why ignoring the time it takes for data to travel leads to equally devastating system failures.
You now understand why assuming network reliability is dangerous and how to architect systems that embrace unreliability. The patterns you've learned—timeouts, retries, idempotency, circuit breakers, and chaos engineering—are the foundation of resilient distributed systems.