In 1994, L. Peter Deutsch at Sun Microsystems identified a set of dangerous assumptions that programmers new to distributed computing invariably make. Together with an eighth added later by James Gosling, these became known as the Fallacies of Distributed Computing—beliefs that seem reasonable but lead to software that fails spectacularly in production.
The first and most treacherous fallacy is this: The network is reliable.
It's a seductive assumption. When you send a request from your laptop to a server sitting next to you, it works. Every time. The request goes out, the response comes back. Networking feels invisible, magical, perfectly dependable. And so developers write code that assumes this reliability extends to production environments with hundreds of services communicating across data centers spanning continents.
This single assumption has caused more distributed system outages than perhaps any other design flaw. Companies have lost billions in revenue, engineers have spent countless sleepless nights debugging phantom failures, and users have experienced data loss—all because someone assumed the network would always be there when needed.
On this page, we will dissect exactly why networks are unreliable, understand the myriad ways they fail, and develop architectural patterns that embrace—rather than ignore—this fundamental reality of distributed computing.
To understand network unreliability, we must first understand what a network actually is: a complex web of physical hardware, software protocols, and human-managed configurations connecting your application's process to another process that might be across the room or across the planet.
The physical reality:
Every network packet you send travels through a remarkable journey. It begins as electrical signals or photons, traverses cables, routers, switches, and sometimes satellites. Each component in this chain can fail:
| Failure Type | Description | Typical Duration | Frequency |
|---|---|---|---|
| Packet Loss | Packets dropped by congested routers or faulty hardware | Milliseconds (per packet) | Continuous (0.1-1% typical) |
| Network Partition | Two groups of nodes cannot communicate with each other | Seconds to hours | Rare but catastrophic |
| Timeout | Request sent but no response received within expected time | Seconds | Regular occurrence |
| DNS Failure | Domain name resolution fails or returns incorrect IP | Minutes to hours | Occasional |
| BGP Misconfiguration | Internet routing tables corrupted, traffic misrouted | Minutes to hours | Rare but widespread |
| Cable Cut | Physical infrastructure damage (construction, sharks, etc.) | Hours to days | Uncommon but severe |
| Hardware Failure | NIC, switch, or router fails completely | Minutes to hours | Regular in large deployments |
| Congestion Collapse | Network saturated, most packets dropped | Minutes | Under peak load |
Even networks that seem 'local'—machines in the same rack, same data center, same availability zone—can and do fail. A misconfigured switch can partition a rack. A firmware bug can cause a network card to silently drop specific packet types. Published studies of production data centers, including Google's, show that even within a single facility, network failures happen far more often than most engineers expect.
Network failures come in many forms, and understanding their characteristics is essential for building resilient systems. The most important distinction is between detectable and undetectable failures.
Detectable failures include:
- Connection refused: the host is reachable, but nothing is listening on the target port.
- DNS resolution errors: the hostname cannot be resolved to an address at all.
- Connection reset: the remote side actively tears down the connection.
- Explicit error responses: the service answers, but with an error status.
These are actually the good failures—your code can catch them immediately and respond appropriately.
Undetectable failures are far more dangerous:
- Timeouts: the request was sent, but no response arrives—was it processed or not?
- Silently dropped packets: the request or the response simply vanishes in transit.
- A service that accepts the request, begins work, and crashes before replying.
- Asymmetric partitions: your packets reach the service, but its replies never reach you.

In every one of these cases, your system is left in an unknown state.
The unknown state is the central challenge of distributed systems.
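To make the distinction concrete, here is a minimal sketch—the `classifyRequest` helper and its return values are hypothetical, not from the original text—of how a client might separate failures it can act on immediately from timeouts whose outcome it simply cannot know:

```typescript
// Hypothetical sketch: classify failures by what they tell us about the outcome.
// 'failed' means we know the operation did not complete; 'unknown' means we cannot tell.
type Outcome = 'succeeded' | 'failed' | 'unknown';

async function classifyRequest(url: string, timeoutMs = 5000): Promise<Outcome> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    // An explicit error status is at least a definite answer from the service.
    return res.ok ? 'succeeded' : 'failed';
  } catch (err) {
    // A timeout tells us nothing about whether the server processed the request.
    if ((err as Error).name === 'AbortError') return 'unknown';
    // Connection refused or DNS failure usually surfaces before the request
    // ever reaches the service, so treating it as 'failed' is reasonable.
    return 'failed';
  } finally {
    clearTimeout(timer);
  }
}
```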
Consider this scenario: you send a request to a payment service to charge a customer $100. Your request times out after 30 seconds. What happened? At least three outcomes are possible:
- The request never reached the payment service, and the customer was not charged.
- The request arrived and the charge succeeded, but the response was lost on the way back.
- The request arrived and is still being processed; the charge may complete after your timeout fires.
Without additional mechanisms like idempotency keys or exactly-once semantics, you literally cannot know which scenario occurred. This fundamental uncertainty drives much of the complexity in distributed systems design.
This uncertainty isn't a solvable engineering problem—it's a fundamental property of asynchronous communication. The 'Two Generals Problem' proves that no protocol can guarantee consensus between two parties communicating over an unreliable channel. We can only mitigate, never eliminate, this uncertainty.
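As a concrete illustration of the mitigation just mentioned, here is a hedged sketch of a client that generates one idempotency key per logical charge and reuses it on every retry. The endpoint, payload, and the assumption that the payment API deduplicates on an `Idempotency-Key` header are illustrative, not taken from the original text:

```typescript
import { randomUUID } from 'crypto';

// Sketch: one idempotency key per logical charge, reused across retries, so a
// duplicate delivery cannot turn into a duplicate charge (assuming the server
// deduplicates requests carrying the same Idempotency-Key header).
async function chargeWithRetry(customerId: string, amountCents: number): Promise<Response> {
  const idempotencyKey = randomUUID(); // generated once, before the first attempt

  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fetch('https://payments.example.com/charges', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // same key on every retry
        },
        body: JSON.stringify({ customerId, amountCents }),
        signal: AbortSignal.timeout(5_000),
      });
    } catch {
      // Timeout or network error: the outcome is unknown, but retrying with the
      // same key is safe because the server will not execute the charge twice.
    }
  }
  throw new Error('Charge outcome unknown after retries - flag for reconciliation');
}
```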
Abstract concepts become concrete through real-world examples: network unreliability has caused major outages at companies that should—and did—know better.
Major tech companies publish detailed post-mortems of their outages. Reading these is one of the most effective ways to internalize the reality of network failures. Every post-mortem teaches lessons that no textbook can match.
Since we cannot make networks reliable, we must architect systems that function correctly despite unreliable networks. This requires specific design patterns applied at multiple layers.
```typescript
/**
 * Example: Resilient HTTP client with timeout, retry, and circuit breaker
 * This illustrates the patterns needed for network unreliability
 */

interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

interface CircuitBreakerConfig {
  failureThreshold: number; // Opens after this many failures
  recoveryTimeMs: number;   // Try again after this interval
}

class ResilientHttpClient {
  private failureCount = 0;
  private lastFailureTime = 0;
  private circuitOpen = false;

  constructor(
    private retryConfig: RetryConfig = { maxRetries: 3, baseDelayMs: 100, maxDelayMs: 10000 },
    private circuitConfig: CircuitBreakerConfig = { failureThreshold: 5, recoveryTimeMs: 30000 },
    private timeoutMs: number = 5000
  ) {}

  async request<T>(url: string, options: RequestInit = {}): Promise<T> {
    // Check circuit breaker state
    if (this.isCircuitOpen()) {
      throw new Error('Circuit breaker open - failing fast');
    }

    let lastError: Error | null = null;

    for (let attempt = 0; attempt <= this.retryConfig.maxRetries; attempt++) {
      try {
        const response = await this.executeWithTimeout(url, options);
        // Success - reset failure count
        this.onSuccess();
        return response as T;
      } catch (error) {
        lastError = error as Error;
        this.onFailure();

        // Don't retry if circuit opened
        if (this.isCircuitOpen()) {
          throw new Error('Circuit breaker opened during retries');
        }

        // Don't retry on client errors (4xx) - these won't resolve
        if (error instanceof HttpError && error.status >= 400 && error.status < 500) {
          throw error;
        }

        // Wait before retry with exponential backoff + jitter
        if (attempt < this.retryConfig.maxRetries) {
          const delay = this.calculateBackoff(attempt);
          await this.sleep(delay);
        }
      }
    }

    throw lastError || new Error('Request failed after retries');
  }

  private async executeWithTimeout(url: string, options: RequestInit): Promise<unknown> {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), this.timeoutMs);

    try {
      const response = await fetch(url, { ...options, signal: controller.signal });
      if (!response.ok) {
        throw new HttpError(response.status, await response.text());
      }
      return await response.json();
    } finally {
      clearTimeout(timeoutId);
    }
  }

  private calculateBackoff(attempt: number): number {
    // Exponential backoff: baseDelay * 2^attempt
    const exponentialDelay = this.retryConfig.baseDelayMs * Math.pow(2, attempt);
    // Cap at maximum delay
    const cappedDelay = Math.min(exponentialDelay, this.retryConfig.maxDelayMs);
    // Add jitter (±25%) to prevent thundering herd
    const jitter = cappedDelay * 0.25 * (Math.random() * 2 - 1);
    return Math.floor(cappedDelay + jitter);
  }

  private isCircuitOpen(): boolean {
    if (!this.circuitOpen) return false;

    // Check if recovery period has passed
    const timeSinceFailure = Date.now() - this.lastFailureTime;
    if (timeSinceFailure >= this.circuitConfig.recoveryTimeMs) {
      // Half-open state: allow one request to test recovery
      return false;
    }
    return true;
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.circuitOpen = false;
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.circuitConfig.failureThreshold) {
      this.circuitOpen = true;
      console.warn(`Circuit breaker opened after ${this.failureCount} failures`);
    }
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

class HttpError extends Error {
  constructor(public status: number, message: string) {
    super(`HTTP ${status}: ${message}`);
    this.name = 'HttpError';
  }
}
```

These patterns work best in combination. Timeouts prevent indefinite waits, retries handle transient failures, idempotency makes retries safe, circuit breakers prevent cascade failures, and fallbacks maintain user experience. Missing any layer leaves gaps in your resilience.
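To show how these layers compose in practice, here is a short usage sketch of the client above; the recommendations URL and the `getCachedRecommendations` fallback are hypothetical placeholders, not part of the original example:

```typescript
// Compose timeout + retry + circuit breaker (via ResilientHttpClient) with a fallback,
// so users still get a response when the recommendations service is unreachable.
const client = new ResilientHttpClient(
  { maxRetries: 2, baseDelayMs: 200, maxDelayMs: 5000 },
  { failureThreshold: 5, recoveryTimeMs: 30000 },
  3000 // per-attempt timeout in milliseconds
);

async function getRecommendations(userId: string): Promise<string[]> {
  try {
    return await client.request<string[]>(`https://recs.example.com/users/${userId}`);
  } catch {
    // Timeout, exhausted retries, or an open circuit: degrade gracefully
    // instead of failing the whole page.
    return getCachedRecommendations(userId) ?? [];
  }
}

function getCachedRecommendations(userId: string): string[] | null {
  return null; // placeholder: e.g. the last known good response from a local cache
}
```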
When a request times out, you face a dilemma: Do you retry (risking duplicate execution) or give up (risking a failed operation the user believes succeeded)? Idempotency resolves this dilemma by making duplicate execution safe.
An operation is idempotent if executing it multiple times produces the same result as executing it once.
Naturally idempotent operations:
- Reading data (an HTTP GET)
- Setting a field to a specific value ("set order status to 'shipped'")
- Deleting a resource by ID (the second delete simply finds nothing to delete)
- Upserting a record keyed by a unique identifier
Non-idempotent operations that must be made idempotent:
- Creating a new resource (each call creates another one)
- Incrementing a counter or adjusting an account balance
- Appending to a list or log
- Charging a payment

The standard technique is to attach a unique idempotency key to each logical operation, as the sketch after this list and the server-side implementation below make concrete.
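As a minimal illustration of the definition above—using a hypothetical `Account` type—setting a value is idempotent while incrementing is not:

```typescript
interface Account { balance: number; status: string }

const setStatusShipped = (a: Account): Account => ({ ...a, status: 'shipped' });
const addTen = (a: Account): Account => ({ ...a, balance: a.balance + 10 });

const acct: Account = { balance: 100, status: 'pending' };

// Idempotent: applying the operation twice gives the same result as applying it once.
console.log(setStatusShipped(setStatusShipped(acct)).status); // 'shipped'

// Not idempotent: every duplicate execution changes the result again.
console.log(addTen(addTen(acct)).balance); // 120, not 110
```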
```typescript
/**
 * Server-side idempotency implementation
 * Tracks idempotency keys to prevent duplicate processing
 */

interface IdempotencyRecord {
  key: string;
  response: unknown;
  createdAt: Date;
  expiresAt: Date;
}

class IdempotencyService {
  // In production: use Redis or database with TTL
  private records = new Map<string, IdempotencyRecord>();
  private readonly TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

  /**
   * Execute an operation idempotently
   * Returns cached result for duplicate keys
   */
  async executeOnce<T>(
    idempotencyKey: string,
    operation: () => Promise<T>
  ): Promise<T> {
    // Check for existing record
    const existing = this.records.get(idempotencyKey);
    if (existing) {
      if (existing.expiresAt > new Date()) {
        console.log(`Returning cached result for key: ${idempotencyKey}`);
        return existing.response as T;
      } else {
        // Expired - remove and allow re-execution
        // Be careful: in production, consider if re-execution is safe
        this.records.delete(idempotencyKey);
      }
    }

    // Execute the operation
    // NOTE: In production, use distributed locking to prevent
    // concurrent executions with the same idempotency key
    const result = await operation();

    // Cache the result
    const now = new Date();
    this.records.set(idempotencyKey, {
      key: idempotencyKey,
      response: result,
      createdAt: now,
      expiresAt: new Date(now.getTime() + this.TTL_MS),
    });

    return result;
  }
}

// Example API endpoint using idempotency
class PaymentController {
  constructor(private idempotencyService: IdempotencyService) {}

  async chargeCustomer(request: {
    idempotencyKey: string;
    customerId: string;
    amount: number;
  }): Promise<PaymentResult> {
    // Same idempotency key = same result, even if called multiple times
    return this.idempotencyService.executeOnce(
      request.idempotencyKey,
      async () => {
        // Actual payment processing happens here
        // This will only execute once per unique idempotency key
        const charge = await this.paymentGateway.charge(
          request.customerId,
          request.amount
        );
        return {
          success: true,
          chargeId: charge.id,
          amount: charge.amount,
        };
      }
    );
  }

  private paymentGateway = {
    charge: async (customerId: string, amount: number) => ({
      id: `ch_${Date.now()}`,
      customerId,
      amount,
    })
  };
}

interface PaymentResult {
  success: boolean;
  chargeId: string;
  amount: number;
}
```

Idempotency key storage must be durable and consistent. If your idempotency keys are stored in-memory or on a separate database that can diverge from your main data, you can still get duplicates. In distributed systems, idempotency storage becomes a consensus problem itself.
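One way to address that storage problem—sketched here under the assumption of a store with an atomic "set if absent" primitive (for example Redis `SET ... NX PX` or a unique-constraint insert in SQL); the `DurableStore` interface itself is hypothetical—is to reserve the key atomically before executing the operation:

```typescript
// Hypothetical durable store with an atomic "set if absent" primitive.
interface DurableStore {
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>; // true if we won the key
  get(key: string): Promise<string | null>;
}

async function reserveIdempotencyKey(
  store: DurableStore,
  key: string
): Promise<'proceed' | 'duplicate'> {
  // The atomic reservation closes the race between "check" and "execute":
  // two concurrent requests carrying the same key cannot both win it.
  const won = await store.setIfAbsent(`idem:${key}`, 'in-progress', 24 * 60 * 60 * 1000);
  return won ? 'proceed' : 'duplicate';
}
```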
You cannot build resilient systems without testing failure scenarios. Unfortunately, network failures are inherently non-deterministic and hard to reproduce. Sophisticated teams use chaos engineering and fault injection to systematically test their assumptions.
Approaches to testing network unreliability:
| Technique | Description | Tools | When to Use |
|---|---|---|---|
| Fault Injection | Programmatically inject failures (delays, errors, drops) | Toxiproxy, Envoy fault injection | Development, integration tests |
| Chaos Engineering | Randomly inject failures in production | Chaos Monkey, Gremlin, Litmus | Production (with monitoring) |
| Network Simulation | Simulate specific network conditions | tc (Linux), Clumsy (Windows) | Development, CI/CD |
| Partition Testing | Physically or logically isolate network segments | iptables, network namespaces | Pre-production validation |
| Dependency Mocking | Mock external services with controllable failures | WireMock, LocalStack | Unit and integration tests |
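As a small, self-contained example of fault injection—purely hypothetical, and in-process rather than at the TCP level where tools like Toxiproxy operate—a test can wrap `fetch` so that a configurable fraction of calls is delayed or fails, forcing the resilience code paths to actually run:

```typescript
// Fault-injecting wrapper for integration tests: delay every call and fail a
// configurable fraction of them, so timeouts, retries, and fallbacks get exercised.
type Fetch = typeof fetch;

function withFaults(realFetch: Fetch, errorRate = 0.2, maxExtraLatencyMs = 500): Fetch {
  return (async (input, init) => {
    // Injected latency: 0 to maxExtraLatencyMs of extra delay per call.
    await new Promise(resolve => setTimeout(resolve, Math.random() * maxExtraLatencyMs));
    if (Math.random() < errorRate) {
      throw new TypeError('injected network failure'); // mimic a dropped connection
    }
    return realFetch(input, init);
  }) as Fetch;
}

// Usage in a test: verify the system still returns a fallback under 20% failures.
// const flakyFetch = withFaults(fetch);
```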
Netflix pioneered chaos engineering with their 'Simian Army'—a collection of tools that continuously attack their own production systems. Chaos Monkey kills random instances. Chaos Kong simulates entire AWS region failures. Latency Monkey adds delays. By constantly testing failure modes, Netflix builds systems that handle real outages gracefully.
Accepting network unreliability as a fundamental constraint has profound implications for how we design systems. These aren't just code patterns—they're architectural philosophies.
Design for failure, not for success:
The mindset shift required is profound. Instead of asking, 'What happens when this succeeds?' defensive architects ask, 'What happens when this fails?' and 'How do we know it failed?' and 'What do we do about partial failure?'
This leads to patterns like:
- Graceful degradation: serve a reduced experience (cached or default data) rather than an error page.
- Bulkheads: isolate resources per dependency so one slow service cannot exhaust every thread or connection (see the sketch after this list).
- Redundancy and failover: no single network path, instance, or zone is a single point of failure.
- Health checks and load shedding: detect trouble early and stop routing work to struggling nodes.
- Asynchronous, queue-based communication: decouple producers from consumers so a temporary outage becomes a delay rather than an error.
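Here is a minimal bulkhead sketch—a hypothetical in-process semaphore, not a specific library's API—that caps concurrent calls per dependency so a slow downstream service cannot tie up every request handler:

```typescript
// Bulkhead: at most maxConcurrent tasks run at once; the rest wait their turn.
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      await new Promise<void>(resolve => this.queue.push(resolve)); // wait for a free slot
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // release the next waiter, if any
    }
  }
}

// Each dependency gets its own bulkhead, so a slow recommendations service
// cannot starve calls to the payment service.
const recommendationsBulkhead = new Bulkhead(10);
const paymentsBulkhead = new Bulkhead(50);
```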
The best distributed systems engineers don't try to prevent all failures—they accept failures as inevitable and design systems that handle them gracefully. A system that works perfectly 99% of the time and fails catastrophically the other 1% is worse than one that degrades gracefully and keeps serving its users even while parts of it are failing.
We've explored the first and most fundamental fallacy of distributed computing: the assumption that the network is reliable. Let's consolidate the key insights:
- Networks fail constantly: packet loss, partitions, timeouts, DNS and routing problems, and hardware faults are a matter of when, not if.
- The hardest failures are the undetectable ones—a timeout leaves you in an unknown state, and the Two Generals Problem shows this uncertainty can only be mitigated, never eliminated.
- Resilience comes from layered patterns: timeouts, retries with exponential backoff and jitter, idempotency, circuit breakers, and fallbacks.
- You can only trust resilience you have tested—fault injection and chaos engineering turn assumptions into evidence.
- Design for failure: ask 'What happens when this fails?' before asking 'What happens when this succeeds?'
What's next:
The first fallacy taught us that networks fail. But even when networks work, they don't work instantly. The next fallacy—Latency Is Zero—explores why ignoring the time it takes for data to travel leads to equally devastating system failures.
You now understand why assuming network reliability is dangerous and how to architect systems that embrace unreliability. The patterns you've learned—timeouts, retries, idempotency, circuit breakers, and chaos engineering—are the foundation of resilient distributed systems.