The previous page taught us that networks are unreliable—they fail in countless ways. But even when networks work perfectly, there's another insidious assumption that corrupts distributed system designs: Latency is zero.
When you call a function in your code, it executes in nanoseconds. When you read from memory, the data arrives in under 100 nanoseconds. These operations feel instantaneous because, at human time scales, they are.
But network communication operates on entirely different time scales. A packet traveling from New York to London—even at the speed of light in fiber—takes at least 28 milliseconds one way. Add routing, processing, and protocol overhead, and real-world latencies regularly exceed 100ms. That's roughly a million times slower than a memory access.
The speed of light in fiber optic cable is approximately 200,000 km/s (about 2/3 the speed of light in vacuum). This means it takes roughly 5 microseconds for a signal to travel 1 kilometer. For a round-trip across the Atlantic (~11,500 km), you're looking at a minimum of 57ms just for the physical signal propagation—and that's before any switches, routers, or processing.
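This propagation floor can be estimated directly from distance. A minimal sketch, assuming the ~5μs/km fiber figure above (the function and constant names are illustrative, not from any library):

```typescript
// Propagation-delay floor in fiber: ~5 microseconds per kilometer.
const FIBER_DELAY_US_PER_KM = 5;

/** Minimum one-way propagation delay in milliseconds for a fiber path. */
function minOneWayMs(distanceKm: number): number {
  return (distanceKm * FIBER_DELAY_US_PER_KM) / 1000;
}

/** Minimum round-trip time in milliseconds, ignoring routing and processing. */
function minRttMs(distanceKm: number): number {
  return 2 * minOneWayMs(distanceKm);
}

// Example: New York -> London (~5,570 km one way)
console.log(minOneWayMs(5570)); // ≈ 27.85 ms one way (matches the ~28ms figure above)
console.log(minRttMs(5570));    // ≈ 55.7 ms round trip, before any routers or processing
```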
Understanding latency isn't just academic—it fundamentally shapes how distributed systems must be designed. Every remote call has a time cost, and those costs compound in ways that can render systems unusable.
Network latency isn't a single number—it's composed of multiple additive delays at every layer of the networking stack. Understanding these components is essential for diagnosing and optimizing system performance.
| Component | Description | Typical Range | Controllable? |
|---|---|---|---|
| Propagation Delay | Time for signal to physically travel through medium | ~5μs per km | No (physics) |
| Transmission Delay | Time to push all packet bits onto the wire | Depends on bandwidth | Upgrade bandwidth |
| Processing Delay | Time for routers/switches to process packet headers | 1-100μs per hop | Limited |
| Queueing Delay | Time waiting in router/switch buffers | 0ms to seconds | Reduce congestion |
| Serialization Delay | Time to convert data to/from wire format | 10-100μs | Use efficient formats |
| Protocol Overhead | TCP handshakes, TLS negotiation, etc. | 1-3 RTTs | Connection pooling |
| Application Delay | Time for application to process request | Varies widely | Optimize code |
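To get a feel for how these components combine, here is a back-of-the-envelope sketch; the constants and interface are illustrative assumptions, not measurements:

```typescript
// Rough, illustrative estimate of one-way delay for a single packet.
interface HopEstimate {
  distanceKm: number;        // physical path length
  bandwidthMbps: number;     // link bandwidth
  packetBytes: number;       // packet size
  hops: number;              // routers/switches along the path
  perHopProcessingUs: number;
}

function estimateOneWayMs(e: HopEstimate): number {
  const propagationMs = (e.distanceKm * 5) / 1000;                      // ~5us per km in fiber
  const transmissionMs = (e.packetBytes * 8) / (e.bandwidthMbps * 1000); // bits ÷ (bits per ms)
  const processingMs = (e.hops * e.perHopProcessingUs) / 1000;
  return propagationMs + transmissionMs + processingMs;                  // queueing excluded: it varies
}

// Example: a 1,500-byte packet over 1 Gbps, 1,000 km, 10 hops at 50us each
console.log(
  estimateOneWayMs({
    distanceKm: 1000,
    bandwidthMbps: 1000,
    packetBytes: 1500,
    hops: 10,
    perHopProcessingUs: 50,
  })
); // ≈ 5.51 ms, dominated by propagation
```

Notice that for any non-trivial distance, propagation dominates: upgrading bandwidth helps transmission delay but does nothing for the physics.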
The multiplicative effect of protocol overhead:
Establishing a new HTTPS connection involves:

- TCP three-way handshake: 1 RTT
- TLS negotiation (TLS 1.2): 2 RTTs
- HTTP request and first byte of the response: 1 RTT
That's 4 RTTs before you receive the first byte of actual data. With a 100ms RTT to a distant server, you've already spent 400ms just on handshakes. This is why connection reuse and HTTP/2 multiplexing are so critical for performance.
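Connection reuse is usually a configuration change rather than a rewrite. A minimal Node.js sketch, assuming a hypothetical endpoint URL: a keep-alive `https.Agent` lets subsequent requests skip the TCP and TLS handshakes entirely.

```typescript
import https from 'node:https';

// One shared agent: sockets are kept open and reused across requests,
// so only the first request to a host pays the TCP + TLS handshake RTTs.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // reuse sockets instead of closing after each response
  maxSockets: 20,  // cap concurrent connections per host
});

function getJson(url: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body)));
      })
      .on('error', reject);
  });
}

async function demo(): Promise<void> {
  await getJson('https://api.example.com/orders/1'); // hypothetical endpoint: pays handshakes once
  await getJson('https://api.example.com/orders/2'); // reuses the warm connection: no handshake RTTs
}
demo().catch(console.error);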
Same-rack latency: ~0.5ms. Same-datacenter latency: ~1-2ms. Cross-region (US East to West): ~60-80ms. Cross-continent (US to Europe): ~100-150ms. These numbers should be memorized—they inform every architectural decision.
Jeff Dean's famous "Latency Numbers Every Programmer Should Know" provides essential mental models for understanding system performance. While specific numbers evolve with hardware, the relative orders of magnitude remain stable.
| Operation | Time | Relative Scale |
|---|---|---|
| L1 cache reference | 1 ns | Baseline |
| L2 cache reference | 4 ns | 4x L1 |
| Main memory reference | 100 ns | 100x L1 |
| SSD random read | 16 μs | 16,000x L1 |
| Read 1 MB sequentially from SSD | 50 μs | 50,000x L1 |
| Read 1 MB sequentially from disk | 2 ms | 2,000,000x L1 |
| Same-datacenter round-trip | 500 μs | 500,000x L1 |
| Cross-region network (same continent) | 50 ms | 50,000,000x L1 |
| Cross-continent network | 150 ms | 150,000,000x L1 |
Interpreting these numbers:
The key insight is the eight orders of magnitude between a CPU cache reference (1 nanosecond) and a cross-continent network call (150 milliseconds). This massive gap means:

- A single remote call costs more than millions of local operations, so algorithms must be judged by how many round trips they make, not by how much local work they do.
- Code that is perfectly reasonable locally (say, a remote call per item in a loop) becomes unusable once the calls cross a network.
- Reducing round trips through batching, caching, and parallelism buys far more than micro-optimizing local code.
Example calculation:
Suppose your code has a loop that, for each of 1,000 items, makes a remote call. With 100ms latency per call:

- Sequential remote calls: 1,000 × 100ms = 100 seconds of pure waiting.
- One batched call that fetches all 1,000 items: roughly 100ms.
The same logical operation takes 1,000x longer without batching. This is why the assumption that "latency is zero" leads to such dramatic performance failures.
If Service A calls Service B, which calls Service C, which calls Service D, the latencies of every hop add up: the caller waits for the entire chain. A request that looks simple on a diagram might traverse five services, each adding 50ms, resulting in a 250ms baseline—before any actual processing happens.
Average latency is a misleading metric. What users actually experience—and what breaks SLAs—is tail latency: the worst-case response times that affect a small but significant percentage of requests.
Why tail latency matters more than average:
Consider a service with 10ms average latency but 500ms 99th percentile (p99) latency. For a single request, 99% of users see great performance. But what happens when a single user action triggers 100 parallel backend requests?
```text
The Tail at Scale Problem
=========================

Given:
- Single request p99 latency: 500ms (1% of requests)
- User action fans out to: 100 parallel backend requests

Probability that ALL 100 requests complete within p99:
  = (0.99)^100 = 36.6%

Probability that at least ONE request exceeds p99:
  = 1 - (0.99)^100 = 63.4%

Result: For 63% of user requests, at least one backend call will be slow,
making the ENTIRE user-visible request slow.

With 1000 parallel requests:
  Probability of at least one slow request = 1 - (0.99)^1000 = 99.996%

This is why at Google/Facebook/Amazon, p99 latency optimization
is often more important than average latency optimization.
```

Google's 'hedged requests' pattern: if a request hasn't returned within the p95 time, send a duplicate request to another server. When either responds, cancel the other. This dramatically reduces tail latency at the cost of slightly increased overall load.
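The hedging idea translates to a few lines of code. A minimal sketch, assuming a hypothetical `sendRequest(replica, signal)` function that honors an AbortSignal and a known p95 threshold; this illustrates the pattern, not Google's implementation:

```typescript
/**
 * Hedged request sketch: fire the primary request; if it hasn't finished
 * within the p95 latency, fire a duplicate to another replica and take
 * whichever answers first, cancelling the loser.
 */
async function hedgedGet<T>(
  sendRequest: (replica: string, signal: AbortSignal) => Promise<T>, // hypothetical request function
  replicas: [string, string],
  p95Ms: number
): Promise<T> {
  const primary = new AbortController();
  const backup = new AbortController();

  const primaryCall = sendRequest(replicas[0], primary.signal);

  // If the primary hasn't answered within the p95 latency, hedge with a duplicate.
  const backupCall = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      sendRequest(replicas[1], backup.signal).then(resolve, reject);
    }, p95Ms);
    // A fast primary cancels the pending hedge before it is ever sent.
    primaryCall.then(() => clearTimeout(timer)).catch(() => {});
  });

  try {
    // Whichever request answers first wins the race.
    return await Promise.race([primaryCall, backupCall]);
  } finally {
    // Cancel whichever request is still in flight.
    primary.abort();
    backup.abort();
  }
}
```

Because only the slowest few percent of requests ever trigger a hedge, the extra load stays small while the worst-case latency drops sharply.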
Developers who assume zero latency naturally produce code that works locally but fails spectacularly in distributed environments. These anti-patterns are distressingly common.
```typescript
// ❌ N+1 Query Anti-Pattern
// Each iteration makes a remote call
async function getOrdersWithCustomers(
  orderIds: string[]
): Promise<OrderWithCustomer[]> {
  const result: OrderWithCustomer[] = [];

  for (const orderId of orderIds) {
    // Remote call 1: fetch order
    const order = await orderService.get(orderId);

    // Remote call 2: fetch customer
    const customer = await customerService
      .get(order.customerId);

    result.push({ order, customer });
  }

  return result;
}

// With 100 orders and 50ms per call:
// Time = 100 × (50ms + 50ms) = 10 seconds!
```
```typescript
// ✅ Batched Pattern
// Two batched calls replace 2N individual ones
async function getOrdersWithCustomers(
  orderIds: string[]
): Promise<OrderWithCustomer[]> {
  // Batch fetch all orders (1 call)
  const orders = await orderService
    .getMany(orderIds);

  // Extract unique customer IDs
  const customerIds = [...new Set(
    orders.map(o => o.customerId)
  )];

  // Batch fetch all customers (1 call)
  const customers = await customerService
    .getMany(customerIds);

  const customerMap = new Map(
    customers.map(c => [c.id, c])
  );

  return orders.map(order => ({
    order,
    customer: customerMap.get(order.customerId)!
  }));
}

// With 100 orders and 50ms per call:
// Time = 50ms + 50ms = 100ms total!
```

The batched version is 100x faster than the N+1 version for 100 items. This isn't a micro-optimization—it's the difference between usable and unusable software. Learn to recognize and eliminate N+1 patterns instinctively.
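Batch endpoints like `getMany` solve the problem when you control the call site, but N+1 patterns often hide behind layers of abstraction. One common remedy is a small request coalescer in the spirit of Facebook's DataLoader: callers still ask for one item at a time, and the loader transparently merges all requests made in the same tick into a single batched call. A minimal sketch (the generic `Batcher` class and its `batchFn` parameter are illustrative, not a specific library API):

```typescript
/** Coalesces single-key loads issued in the same tick into one batch call. */
class Batcher<K, V> {
  private pending = new Map<K, { resolve: (v: V) => void; reject: (e: unknown) => void }[]>();
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<Map<K, V>>) {}

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      const waiters = this.pending.get(key) ?? [];
      waiters.push({ resolve, reject });
      this.pending.set(key, waiters);

      if (!this.scheduled) {
        this.scheduled = true;
        // Flush on the next microtask: everything queued in this tick
        // becomes one remote call instead of N.
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.pending;
    this.pending = new Map();
    this.scheduled = false;

    try {
      const results = await this.batchFn([...batch.keys()]);
      for (const [key, waiters] of batch) {
        const value = results.get(key);
        for (const w of waiters) {
          if (value !== undefined) w.resolve(value);
          else w.reject(new Error(`Missing result for ${String(key)}`));
        }
      }
    } catch (err) {
      for (const waiters of batch.values()) {
        for (const w of waiters) w.reject(err);
      }
    }
  }
}

// Usage (assuming the customerService.getMany call from the example above):
// const customerLoader = new Batcher<string, Customer>(async (ids) => {
//   const customers = await customerService.getMany(ids);
//   return new Map(customers.map((c) => [c.id, c]));
// });
// await Promise.all(orderIds.map((id) => customerLoader.load(id))); // one remote call
```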
Once you accept that latency is non-zero and significant, you can apply specific design principles to minimize its impact on user experience and system throughput.
```typescript
/**
 * Parallel vs Sequential Remote Calls
 * Demonstrates the dramatic latency difference
 */

interface DashboardData {
  user: User;
  notifications: Notification[];
  recentOrders: Order[];
  recommendations: Product[];
}

// ❌ Sequential: Total latency = sum of all call latencies
async function loadDashboardSequential(): Promise<DashboardData> {
  const user = await userService.getCurrentUser();               // 50ms
  const notifications = await notificationService.getAll();      // 50ms
  const recentOrders = await orderService.getRecent();           // 50ms
  const recommendations = await productService.getRecommended(); // 50ms

  return { user, notifications, recentOrders, recommendations };
  // Total: ~200ms
}

// ✅ Parallel: Total latency = max of all call latencies
async function loadDashboardParallel(): Promise<DashboardData> {
  const [user, notifications, recentOrders, recommendations] =
    await Promise.all([
      userService.getCurrentUser(),     // 50ms ─┐
      notificationService.getAll(),     // 50ms ─┼─ Run simultaneously
      orderService.getRecent(),         // 50ms ─┤
      productService.getRecommended(),  // 50ms ─┘
    ]);

  return { user, notifications, recentOrders, recommendations };
  // Total: ~50ms (4x faster!)
}

// ✅✅ Parallel with Timeouts and Fallbacks: Production-grade
async function loadDashboardResilient(): Promise<DashboardData> {
  // Critical data - must succeed
  const userPromise = userService.getCurrentUser();

  // Non-critical data - use fallbacks on timeout
  const [user, notifications, recentOrders, recommendations] =
    await Promise.all([
      userPromise,
      withTimeoutFallback(notificationService.getAll(), 100, []),
      withTimeoutFallback(orderService.getRecent(), 100, []),
      withTimeoutFallback(productService.getRecommended(), 100, []),
    ]);

  return { user, notifications, recentOrders, recommendations };
}

async function withTimeoutFallback<T>(
  promise: Promise<T>,
  timeoutMs: number,
  fallback: T
): Promise<T> {
  try {
    return await Promise.race([
      promise,
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), timeoutMs)
      ),
    ]);
  } catch {
    return fallback;
  }
}
```

Set explicit latency budgets for user-facing operations. If your target is 200ms end-to-end, allocate portions to each component: 50ms for network, 50ms for database, 50ms for processing, 50ms buffer. When any component exceeds its budget, you know exactly where to focus optimization.
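A latency budget can be made explicit in code rather than living only in a design document. A minimal sketch, assuming the 200ms target and the illustrative component names from the note above:

```typescript
// Hypothetical latency budget for a 200ms end-to-end target.
const LATENCY_BUDGET_MS = {
  network: 50,
  database: 50,
  processing: 50,
  buffer: 50, // headroom for variance
} as const;

type BudgetComponent = keyof typeof LATENCY_BUDGET_MS;

/** Logs a warning whenever a measured component exceeds its allocation. */
function checkBudget(component: BudgetComponent, measuredMs: number): void {
  const allowed = LATENCY_BUDGET_MS[component];
  if (measuredMs > allowed) {
    console.warn(
      `[latency-budget] ${component} took ${measuredMs.toFixed(1)}ms ` +
        `(budget ${allowed}ms, over by ${(measuredMs - allowed).toFixed(1)}ms)`
    );
  }
}

/** Wraps a call and attributes its cost to a budget line item. */
async function timed<T>(component: BudgetComponent, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    checkBudget(component, performance.now() - start);
  }
}

// Usage: const orders = await timed('database', () => orderService.getRecent());
```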
Geographic distribution introduces latency challenges that cannot be optimized away with clever code—the speed of light is a hard limit. Global systems must be architected with physics in mind.
| Route | Distance | Theoretical Min RTT (light in vacuum) | Real-World RTT |
|---|---|---|---|
| NYC → London | 5,570 km | 37 ms | 70-90 ms |
| NYC → Tokyo | 10,850 km | 72 ms | 180-220 ms |
| NYC → Sydney | 16,000 km | 107 ms | 250-300 ms |
| London → Singapore | 10,880 km | 73 ms | 160-200 ms |
| San Francisco → Singapore | 13,600 km | 91 ms | 170-210 ms |
Architectural implications:
When your users span the globe but your servers are in one region, some users will always experience high latency. Consider a user in Tokyo accessing a server in Virginia:

- The round trip alone costs on the order of 150-200ms (comparable to the NYC → Tokyo figures above), before the server does any work.
- A page that needs five sequential round trips spends close to a second just waiting on the network.
- No amount of server-side tuning changes this; the distance is fixed.
This is why global applications require global architectures:

- Serve static and cacheable content from CDN edge locations near the user.
- Deploy application servers in multiple regions and route each user to the nearest one (a minimal region-selection sketch follows this list).
- Replicate or partition data geographically so reads are served locally, accepting that cross-region writes are asynchronous.
- Keep synchronous cross-region calls out of the request path wherever possible.
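Routing is usually handled by geo-DNS or an edge network, but the idea can be illustrated client-side: measure the round trip to each region and pick the fastest. A minimal sketch, assuming hypothetical per-region health-check endpoints:

```typescript
// Hypothetical per-region health-check endpoints.
const REGION_ENDPOINTS: Record<string, string> = {
  'us-east': 'https://us-east.api.example.com/ping',
  'eu-west': 'https://eu-west.api.example.com/ping',
  'ap-northeast': 'https://ap-northeast.api.example.com/ping',
};

/** Measures one round trip to an endpoint; returns Infinity on failure. */
async function measureRttMs(url: string): Promise<number> {
  const start = performance.now();
  try {
    await fetch(url, { method: 'HEAD', cache: 'no-store' });
    return performance.now() - start;
  } catch {
    return Infinity;
  }
}

/** Picks the region that answered fastest; probes run in parallel. */
async function pickNearestRegion(): Promise<string> {
  const probes = Object.entries(REGION_ENDPOINTS).map(async ([region, url]) => ({
    region,
    rtt: await measureRttMs(url),
  }));
  const results = await Promise.all(probes);
  results.sort((a, b) => a.rtt - b.rtt);
  return results[0].region;
}
```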
No amount of optimization can make a Sydney-to-Virginia round-trip faster than physics allows. If your system requires synchronous cross-region communication, you're building in latency that no code change can remove. This must be acknowledged in requirements, not discovered in production.
You can't improve what you don't measure, and latency is notoriously tricky to measure correctly. Common mistakes lead to metrics that look good on dashboards while users experience poor performance.
```typescript
/**
 * Example: Comprehensive latency tracking
 * Captures the full request lifecycle
 */

interface RequestTiming {
  requestId: string;

  // Client-side timing (if available)
  clientStart?: number;

  // Server-side timing
  receivedAt: number;
  queuedDuration: number; // Time waiting for processing
  processingStart: number;

  // Downstream calls
  downstreamCalls: {
    service: string;
    startedAt: number;
    completedAt: number;
    success: boolean;
  }[];

  // Response
  processingEnd: number;
  sentAt: number;
}

class LatencyTracker {
  private histograms: Map<string, number[]> = new Map();

  recordTiming(timing: RequestTiming): void {
    // Calculate component latencies
    const queueLatency = timing.processingStart - timing.receivedAt;
    const processingLatency = timing.processingEnd - timing.processingStart;
    const totalServerLatency = timing.sentAt - timing.receivedAt;

    // Downstream breakdown
    for (const call of timing.downstreamCalls) {
      const duration = call.completedAt - call.startedAt;
      this.addToHistogram(`downstream.${call.service}`, duration);
    }

    // Record all components
    this.addToHistogram('queue', queueLatency);
    this.addToHistogram('processing', processingLatency);
    this.addToHistogram('total', totalServerLatency);
  }

  getPercentiles(metric: string): { p50: number; p95: number; p99: number } {
    // Copy before sorting so the stored histogram keeps insertion order
    // (the bounded-size eviction below relies on it).
    const values = [...(this.histograms.get(metric) || [])];
    if (values.length === 0) return { p50: 0, p95: 0, p99: 0 };

    values.sort((a, b) => a - b);
    return {
      p50: values[Math.floor(values.length * 0.50)],
      p95: values[Math.floor(values.length * 0.95)],
      p99: values[Math.floor(values.length * 0.99)],
    };
  }

  private addToHistogram(name: string, value: number): void {
    if (!this.histograms.has(name)) {
      this.histograms.set(name, []);
    }
    this.histograms.get(name)!.push(value);

    // Keep bounded size (drop the oldest samples)
    const arr = this.histograms.get(name)!;
    if (arr.length > 100000) {
      arr.splice(0, arr.length - 100000);
    }
  }
}
```

Tools like Jaeger, Zipkin, and AWS X-Ray provide distributed tracing—tracking a single request as it flows through multiple services. This is invaluable for identifying which service or network hop is contributing most to latency.
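Before reaching for a full tracing system, the in-process tracker above can be exercised directly. A quick usage sketch with hypothetical, relative millisecond timestamps for a single request that made one downstream call:

```typescript
const tracker = new LatencyTracker();

tracker.recordTiming({
  requestId: 'req-123',
  receivedAt: 0,
  queuedDuration: 2,
  processingStart: 2,
  downstreamCalls: [
    { service: 'customer', startedAt: 5, completedAt: 55, success: true }, // 50ms call
  ],
  processingEnd: 60,
  sentAt: 61,
});

console.log(tracker.getPercentiles('total'));               // { p50: 61, p95: 61, p99: 61 }
console.log(tracker.getPercentiles('downstream.customer')); // { p50: 50, p95: 50, p99: 50 }
```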
We've explored the second fallacy of distributed computing: the assumption that latency is zero. Let's consolidate the key insights:

- Latency is the sum of many components (propagation, transmission, processing, queueing, serialization, protocol overhead), and only some of them are under your control.
- The speed of light sets a hard floor: cross-continent round trips cost 100ms or more no matter how good your code is.
- Remote calls are millions of times slower than local operations, so design around round trips: batch, parallelize, cache, and reuse connections.
- Tail latency, not average latency, determines user experience once a request fans out across many backends.
- You can only manage what you measure: track percentiles (p50/p95/p99) end to end, ideally with distributed tracing.
What's next:
We've established that networks fail (Fallacy 1) and that even when they work, data takes time to travel (Fallacy 2). The next fallacy—Bandwidth Is Infinite—explores what happens when we ignore the limits on how much data networks can carry.
You now understand why assuming zero latency leads to systems that work in development but fail in production. The patterns you've learned—batching, parallelization, caching, connection pooling, and latency budgeting—are essential for building performant distributed systems.