The previous page taught us that networks are unreliable—they fail in countless ways. But even when networks work perfectly, there's another insidious assumption that corrupts distributed system designs: Latency is zero.
When you call a function in your code, it executes in nanoseconds. When you read from memory, the data arrives in under 100 nanoseconds. These operations feel instantaneous because, at human time scales, they are.
But network communication operates on entirely different time scales. A packet traveling from New York to London—even at the speed of light in fiber—takes at least 28 milliseconds one way. Add routing, processing, and protocol overhead, and real-world latencies regularly exceed 100ms. That's roughly a million times slower than a memory access.
The speed of light in fiber optic cable is approximately 200,000 km/s (about 2/3 the speed of light in vacuum). This means it takes roughly 5 microseconds for a signal to travel 1 kilometer. For a round-trip across the Atlantic (~11,500 km), you're looking at a minimum of 57ms just for the physical signal propagation—and that's before any switches, routers, or processing.
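This propagation floor can be estimated directly from distance. A minimal sketch, assuming the ~5μs/km fiber figure above (the function and constant names are illustrative, not from any library):

```typescript
// Propagation-delay floor in fiber: ~5 microseconds per kilometer.
const FIBER_DELAY_US_PER_KM = 5;

/** Minimum one-way propagation delay in milliseconds for a fiber path. */
function minOneWayMs(distanceKm: number): number {
  return (distanceKm * FIBER_DELAY_US_PER_KM) / 1000;
}

/** Minimum round-trip time in milliseconds, ignoring routing and processing. */
function minRttMs(distanceKm: number): number {
  return 2 * minOneWayMs(distanceKm);
}

// Example: New York -> London (~5,570 km one way)
console.log(minOneWayMs(5570)); // ≈ 27.85 ms one way (matches the ~28ms figure above)
console.log(minRttMs(5570));    // ≈ 55.7 ms round trip, before any routers or processing
```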
Understanding latency isn't just academic—it fundamentally shapes how distributed systems must be designed. Every remote call has a time cost, and those costs compound in ways that can render systems unusable.
Network latency isn't a single number—it's composed of multiple additive delays at every layer of the networking stack. Understanding these components is essential for diagnosing and optimizing system performance.
| Component | Description | Typical Range | Controllable? |
|---|---|---|---|
| Propagation Delay | Time for signal to physically travel through medium | ~5μs per km | No (physics) |
| Transmission Delay | Time to push all packet bits onto the wire | Depends on bandwidth | Upgrade bandwidth |
| Processing Delay | Time for routers/switches to process packet headers | 1-100μs per hop | Limited |
| Queueing Delay | Time waiting in router/switch buffers | 0ms to seconds | Reduce congestion |
| Serialization Delay | Time to convert data to/from wire format | 10-100μs | Use efficient formats |
| Protocol Overhead | TCP handshakes, TLS negotiation, etc. | 1-3 RTTs | Connection pooling |
| Application Delay | Time for application to process request | Varies widely | Optimize code |
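To get a feel for how these components combine, here is a back-of-the-envelope sketch; the constants and interface are illustrative assumptions, not measurements:

```typescript
// Rough, illustrative estimate of one-way delay for a single packet.
interface HopEstimate {
  distanceKm: number;        // physical path length
  bandwidthMbps: number;     // link bandwidth
  packetBytes: number;       // packet size
  hops: number;              // routers/switches along the path
  perHopProcessingUs: number;
}

function estimateOneWayMs(e: HopEstimate): number {
  const propagationMs = (e.distanceKm * 5) / 1000;                      // ~5us per km in fiber
  const transmissionMs = (e.packetBytes * 8) / (e.bandwidthMbps * 1000); // bits ÷ (bits per ms)
  const processingMs = (e.hops * e.perHopProcessingUs) / 1000;
  return propagationMs + transmissionMs + processingMs;                  // queueing excluded: it varies
}

// Example: a 1,500-byte packet over 1 Gbps, 1,000 km, 10 hops at 50us each
console.log(
  estimateOneWayMs({
    distanceKm: 1000,
    bandwidthMbps: 1000,
    packetBytes: 1500,
    hops: 10,
    perHopProcessingUs: 50,
  })
); // ≈ 5.51 ms, dominated by propagation
```

Notice that for any non-trivial distance, propagation dominates: upgrading bandwidth helps transmission delay but does nothing for the physics.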
The multiplicative effect of protocol overhead:
Establishing a new HTTPS connection involves:

- TCP three-way handshake: 1 RTT
- TLS negotiation (TLS 1.2): 2 RTTs
- HTTP request and first byte of the response: 1 RTT
That's 4 RTTs before you receive the first byte of actual data. With a 100ms RTT to a distant server, you've already spent 400ms just on handshakes. This is why connection reuse and HTTP/2 multiplexing are so critical for performance.
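Connection reuse is usually a configuration change rather than a rewrite. A minimal Node.js sketch, assuming a hypothetical endpoint URL: a keep-alive `https.Agent` lets subsequent requests skip the TCP and TLS handshakes entirely.

```typescript
import https from 'node:https';

// One shared agent: sockets are kept open and reused across requests,
// so only the first request to a host pays the TCP + TLS handshake RTTs.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // reuse sockets instead of closing after each response
  maxSockets: 20,  // cap concurrent connections per host
});

function getJson(url: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body)));
      })
      .on('error', reject);
  });
}

async function demo(): Promise<void> {
  await getJson('https://api.example.com/orders/1'); // hypothetical endpoint: pays handshakes once
  await getJson('https://api.example.com/orders/2'); // reuses the warm connection: no handshake RTTs
}
demo().catch(console.error);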
Same-rack latency: ~0.5ms. Same-datacenter latency: ~1-2ms. Cross-region (US East to West): ~60-80ms. Cross-continent (US to Europe): ~100-150ms. These numbers should be memorized—they inform every architectural decision.
Jeff Dean's famous "Latency Numbers Every Programmer Should Know" provides essential mental models for understanding system performance. While specific numbers evolve with hardware, the relative orders of magnitude remain stable.
| Operation | Time | Relative Scale |
|---|---|---|
| L1 cache reference | 1 ns | Baseline |
| L2 cache reference | 4 ns | 4x L1 |
| Main memory reference | 100 ns | 100x L1 |
| SSD random read | 16 μs | 16,000x L1 |
| Read 1 MB sequentially from SSD | 50 μs | 50,000x L1 |
| Read 1 MB sequentially from disk | 2 ms | 2,000,000x L1 |
| Same-datacenter round-trip | 500 μs | 500,000x L1 |
| Cross-region network (same continent) | 50 ms | 50,000,000x L1 |
| Cross-continent network | 150 ms | 150,000,000x L1 |
Interpreting these numbers:
The key insight is the eight orders of magnitude between a CPU cache reference (1 nanosecond) and a cross-continent network call (150 milliseconds). This massive gap means:

- A single remote call costs more than millions of local operations, so algorithms must be judged by how many round trips they make, not by how much local work they do.
- Code that is perfectly reasonable locally (say, a remote call per item in a loop) becomes unusable once the calls cross a network.
- Reducing round trips through batching, caching, and parallelism buys far more than micro-optimizing local code.
Example calculation:
Suppose your code has a loop that, for each of 1,000 items, makes a remote call. With 100ms latency per call:

- Sequential remote calls: 1,000 × 100ms = 100 seconds of pure waiting.
- One batched call that fetches all 1,000 items: roughly 100ms.
The same logical operation takes 1,000x longer without batching. This is why the assumption that "latency is zero" leads to such dramatic performance failures.
If Service A calls Service B, which calls Service C, which calls Service D, the latencies of every hop add up: the caller waits for the entire chain. A request that looks simple on a diagram might traverse five services, each adding 50ms, resulting in a 250ms baseline—before any actual processing happens.
Average latency is a misleading metric. What users actually experience—and what breaks SLAs—is tail latency: the worst-case response times that affect a small but significant percentage of requests.
Why tail latency matters more than average:
Consider a service with 10ms average latency but 500ms 99th percentile (p99) latency. For a single request, 99% of users see great performance. But what happens when a single user action triggers 100 parallel backend requests?
```text
The Tail at Scale Problem
=========================

Given:
- Single request p99 latency: 500ms (1% of requests)
- User action fans out to: 100 parallel backend requests

Probability that ALL 100 requests complete within p99:
  = (0.99)^100 = 36.6%

Probability that at least ONE request exceeds p99:
  = 1 - (0.99)^100 = 63.4%

Result: For 63% of user requests, at least one backend call will be slow,
making the ENTIRE user-visible request slow.

With 1000 parallel requests:
  Probability of at least one slow request = 1 - (0.99)^1000 = 99.996%

This is why at Google/Facebook/Amazon, p99 latency optimization
is often more important than average latency optimization.
```

Google's 'hedged requests' pattern: if a request hasn't returned within the p95 time, send a duplicate request to another server. When either responds, cancel the other. This dramatically reduces tail latency at the cost of slightly increased overall load.
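The hedging idea translates to a few lines of code. A minimal sketch, assuming a hypothetical `sendRequest(replica, signal)` function that honors an AbortSignal and a known p95 threshold; this illustrates the pattern, not Google's implementation:

```typescript
/**
 * Hedged request sketch: fire the primary request; if it hasn't finished
 * within the p95 latency, fire a duplicate to another replica and take
 * whichever answers first, cancelling the loser.
 */
async function hedgedGet<T>(
  sendRequest: (replica: string, signal: AbortSignal) => Promise<T>, // hypothetical request function
  replicas: [string, string],
  p95Ms: number
): Promise<T> {
  const primary = new AbortController();
  const backup = new AbortController();

  const primaryCall = sendRequest(replicas[0], primary.signal);

  // If the primary hasn't answered within the p95 latency, hedge with a duplicate.
  const backupCall = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      sendRequest(replicas[1], backup.signal).then(resolve, reject);
    }, p95Ms);
    // A fast primary cancels the pending hedge before it is ever sent.
    primaryCall.then(() => clearTimeout(timer)).catch(() => {});
  });

  try {
    // Whichever request answers first wins the race.
    return await Promise.race([primaryCall, backupCall]);
  } finally {
    // Cancel whichever request is still in flight.
    primary.abort();
    backup.abort();
  }
}
```

Because only the slowest few percent of requests ever trigger a hedge, the extra load stays small while the worst-case latency drops sharply.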
Developers who assume zero latency naturally produce code that works locally but fails spectacularly in distributed environments. These anti-patterns are distressingly common.
```typescript
// ❌ N+1 Query Anti-Pattern
// Each iteration makes a remote call
async function getOrdersWithCustomers(
  orderIds: string[]
): Promise<OrderWithCustomer[]> {
  const result: OrderWithCustomer[] = [];

  for (const orderId of orderIds) {
    // Remote call 1: fetch order
    const order = await orderService.get(orderId);

    // Remote call 2: fetch customer
    const customer = await customerService
      .get(order.customerId);

    result.push({ order, customer });
  }

  return result;
}

// With 100 orders and 50ms per call:
// Time = 100 × (50ms + 50ms) = 10 seconds!
```
```typescript
// ✅ Batched Pattern
// Two batched calls replace 2N individual ones
async function getOrdersWithCustomers(
  orderIds: string[]
): Promise<OrderWithCustomer[]> {
  // Batch fetch all orders (1 call)
  const orders = await orderService
    .getMany(orderIds);

  // Extract unique customer IDs
  const customerIds = [...new Set(
    orders.map(o => o.customerId)
  )];

  // Batch fetch all customers (1 call)
  const customers = await customerService
    .getMany(customerIds);

  const customerMap = new Map(
    customers.map(c => [c.id, c])
  );

  return orders.map(order => ({
    order,
    customer: customerMap.get(order.customerId)!
  }));
}

// With 100 orders and 50ms per call:
// Time = 50ms + 50ms = 100ms total!
```

The batched version is 100x faster than the N+1 version for 100 items. This isn't a micro-optimization—it's the difference between usable and unusable software. Learn to recognize and eliminate N+1 patterns instinctively.
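Batch endpoints like `getMany` solve the problem when you control the call site, but N+1 patterns often hide behind layers of abstraction. One common remedy is a small request coalescer in the spirit of Facebook's DataLoader: callers still ask for one item at a time, and the loader transparently merges all requests made in the same tick into a single batched call. A minimal sketch (the generic `Batcher` class and its `batchFn` parameter are illustrative, not a specific library API):

```typescript
/** Coalesces single-key loads issued in the same tick into one batch call. */
class Batcher<K, V> {
  private pending = new Map<K, { resolve: (v: V) => void; reject: (e: unknown) => void }[]>();
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<Map<K, V>>) {}

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      const waiters = this.pending.get(key) ?? [];
      waiters.push({ resolve, reject });
      this.pending.set(key, waiters);

      if (!this.scheduled) {
        this.scheduled = true;
        // Flush on the next microtask: everything queued in this tick
        // becomes one remote call instead of N.
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.pending;
    this.pending = new Map();
    this.scheduled = false;

    try {
      const results = await this.batchFn([...batch.keys()]);
      for (const [key, waiters] of batch) {
        const value = results.get(key);
        for (const w of waiters) {
          if (value !== undefined) w.resolve(value);
          else w.reject(new Error(`Missing result for ${String(key)}`));
        }
      }
    } catch (err) {
      for (const waiters of batch.values()) {
        for (const w of waiters) w.reject(err);
      }
    }
  }
}

// Usage (assuming the customerService.getMany call from the example above):
// const customerLoader = new Batcher<string, Customer>(async (ids) => {
//   const customers = await customerService.getMany(ids);
//   return new Map(customers.map((c) => [c.id, c]));
// });
// await Promise.all(orderIds.map((id) => customerLoader.load(id))); // one remote call
```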
Once you accept that latency is non-zero and significant, you can apply specific design principles to minimize its impact on user experience and system throughput.
```typescript
/**
 * Parallel vs Sequential Remote Calls
 * Demonstrates the dramatic latency difference
 */

interface DashboardData {
  user: User;
  notifications: Notification[];
  recentOrders: Order[];
  recommendations: Product[];
}

// ❌ Sequential: Total latency = sum of all call latencies
async function loadDashboardSequential(): Promise<DashboardData> {
  const user = await userService.getCurrentUser();               // 50ms
  const notifications = await notificationService.getAll();      // 50ms
  const recentOrders = await orderService.getRecent();           // 50ms
  const recommendations = await productService.getRecommended(); // 50ms

  return { user, notifications, recentOrders, recommendations };
  // Total: ~200ms
}

// ✅ Parallel: Total latency = max of all call latencies
async function loadDashboardParallel(): Promise<DashboardData> {
  const [user, notifications, recentOrders, recommendations] =
    await Promise.all([
      userService.getCurrentUser(),     // 50ms ─┐
      notificationService.getAll(),     // 50ms ─┼─ Run simultaneously
      orderService.getRecent(),         // 50ms ─┤
      productService.getRecommended(),  // 50ms ─┘
    ]);

  return { user, notifications, recentOrders, recommendations };
  // Total: ~50ms (4x faster!)
}

// ✅✅ Parallel with Timeouts and Fallbacks: Production-grade
async function loadDashboardResilient(): Promise<DashboardData> {
  // Critical data - must succeed
  const userPromise = userService.getCurrentUser();

  // Non-critical data - use fallbacks on timeout
  const [user, notifications, recentOrders, recommendations] =
    await Promise.all([
      userPromise,
      withTimeoutFallback(notificationService.getAll(), 100, []),
      withTimeoutFallback(orderService.getRecent(), 100, []),
      withTimeoutFallback(productService.getRecommended(), 100, []),
    ]);

  return { user, notifications, recentOrders, recommendations };
}

async function withTimeoutFallback<T>(
  promise: Promise<T>,
  timeoutMs: number,
  fallback: T
): Promise<T> {
  try {
    return await Promise.race([
      promise,
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), timeoutMs)
      ),
    ]);
  } catch {
    return fallback;
  }
}
```

Set explicit latency budgets for user-facing operations. If your target is 200ms end-to-end, allocate portions to each component: 50ms for network, 50ms for database, 50ms for processing, 50ms buffer. When any component exceeds its budget, you know exactly where to focus optimization.
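A latency budget can be made explicit in code rather than living only in a design document. A minimal sketch, assuming the 200ms target and the illustrative component names from the note above:

```typescript
// Hypothetical latency budget for a 200ms end-to-end target.
const LATENCY_BUDGET_MS = {
  network: 50,
  database: 50,
  processing: 50,
  buffer: 50, // headroom for variance
} as const;

type BudgetComponent = keyof typeof LATENCY_BUDGET_MS;

/** Logs a warning whenever a measured component exceeds its allocation. */
function checkBudget(component: BudgetComponent, measuredMs: number): void {
  const allowed = LATENCY_BUDGET_MS[component];
  if (measuredMs > allowed) {
    console.warn(
      `[latency-budget] ${component} took ${measuredMs.toFixed(1)}ms ` +
        `(budget ${allowed}ms, over by ${(measuredMs - allowed).toFixed(1)}ms)`
    );
  }
}

/** Wraps a call and attributes its cost to a budget line item. */
async function timed<T>(component: BudgetComponent, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    checkBudget(component, performance.now() - start);
  }
}

// Usage: const orders = await timed('database', () => orderService.getRecent());
```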
Geographic distribution introduces latency challenges that cannot be optimized away with clever code—the speed of light is a hard limit. Global systems must be architected with physics in mind.
| Route | Distance | Theoretical Min RTT (light in vacuum) | Real-World RTT |
|---|---|---|---|
| NYC → London | 5,570 km | 37 ms | 70-90 ms |
| NYC → Tokyo | 10,850 km | 72 ms | 180-220 ms |
| NYC → Sydney | 16,000 km | 107 ms | 250-300 ms |
| London → Singapore | 10,880 km | 73 ms | 160-200 ms |
| San Francisco → Singapore | 13,600 km | 91 ms | 170-210 ms |
Architectural implications:
When your users span the globe but your servers are in one region, some users will always experience high latency. Consider a user in Tokyo accessing a server in Virginia:

- The round trip alone costs on the order of 150-200ms (comparable to the NYC → Tokyo figures above), before the server does any work.
- A page that needs five sequential round trips spends close to a second just waiting on the network.
- No amount of server-side tuning changes this; the distance is fixed.
This is why global applications require global architectures:

- Serve static and cacheable content from CDN edge locations near the user.
- Deploy application servers in multiple regions and route each user to the nearest one (a minimal region-selection sketch follows this list).
- Replicate or partition data geographically so reads are served locally, accepting that cross-region writes are asynchronous.
- Keep synchronous cross-region calls out of the request path wherever possible.
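Routing is usually handled by geo-DNS or an edge network, but the idea can be illustrated client-side: measure the round trip to each region and pick the fastest. A minimal sketch, assuming hypothetical per-region health-check endpoints:

```typescript
// Hypothetical per-region health-check endpoints.
const REGION_ENDPOINTS: Record<string, string> = {
  'us-east': 'https://us-east.api.example.com/ping',
  'eu-west': 'https://eu-west.api.example.com/ping',
  'ap-northeast': 'https://ap-northeast.api.example.com/ping',
};

/** Measures one round trip to an endpoint; returns Infinity on failure. */
async function measureRttMs(url: string): Promise<number> {
  const start = performance.now();
  try {
    await fetch(url, { method: 'HEAD', cache: 'no-store' });
    return performance.now() - start;
  } catch {
    return Infinity;
  }
}

/** Picks the region that answered fastest; probes run in parallel. */
async function pickNearestRegion(): Promise<string> {
  const probes = Object.entries(REGION_ENDPOINTS).map(async ([region, url]) => ({
    region,
    rtt: await measureRttMs(url),
  }));
  const results = await Promise.all(probes);
  results.sort((a, b) => a.rtt - b.rtt);
  return results[0].region;
}
```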
No amount of optimization can make a Sydney-to-Virginia round-trip faster than physics allows. If your system requires synchronous cross-region communication, you're building in latency that no code change can remove. This must be acknowledged in requirements, not discovered in production.
You can't improve what you don't measure, and latency is notoriously tricky to measure correctly. Common mistakes lead to metrics that look good on dashboards while users experience poor performance.
```typescript
/**
 * Example: Comprehensive latency tracking
 * Captures the full request lifecycle
 */

interface RequestTiming {
  requestId: string;

  // Client-side timing (if available)
  clientStart?: number;

  // Server-side timing
  receivedAt: number;
  queuedDuration: number; // Time waiting for processing
  processingStart: number;

  // Downstream calls
  downstreamCalls: {
    service: string;
    startedAt: number;
    completedAt: number;
    success: boolean;
  }[];

  // Response
  processingEnd: number;
  sentAt: number;
}

class LatencyTracker {
  private histograms: Map<string, number[]> = new Map();

  recordTiming(timing: RequestTiming): void {
    // Calculate component latencies
    const queueLatency = timing.processingStart - timing.receivedAt;
    const processingLatency = timing.processingEnd - timing.processingStart;
    const totalServerLatency = timing.sentAt - timing.receivedAt;

    // Downstream breakdown
    for (const call of timing.downstreamCalls) {
      const duration = call.completedAt - call.startedAt;
      this.addToHistogram(`downstream.${call.service}`, duration);
    }

    // Record all components
    this.addToHistogram('queue', queueLatency);
    this.addToHistogram('processing', processingLatency);
    this.addToHistogram('total', totalServerLatency);
  }

  getPercentiles(metric: string): { p50: number; p95: number; p99: number } {
    // Copy before sorting so the stored histogram keeps insertion order
    // (the bounded-size eviction below relies on it).
    const values = [...(this.histograms.get(metric) || [])];
    if (values.length === 0) return { p50: 0, p95: 0, p99: 0 };

    values.sort((a, b) => a - b);
    return {
      p50: values[Math.floor(values.length * 0.50)],
      p95: values[Math.floor(values.length * 0.95)],
      p99: values[Math.floor(values.length * 0.99)],
    };
  }

  private addToHistogram(name: string, value: number): void {
    if (!this.histograms.has(name)) {
      this.histograms.set(name, []);
    }
    this.histograms.get(name)!.push(value);

    // Keep bounded size (drop the oldest samples)
    const arr = this.histograms.get(name)!;
    if (arr.length > 100000) {
      arr.splice(0, arr.length - 100000);
    }
  }
}
```

Tools like Jaeger, Zipkin, and AWS X-Ray provide distributed tracing—tracking a single request as it flows through multiple services. This is invaluable for identifying which service or network hop is contributing most to latency.
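Before reaching for a full tracing system, the in-process tracker above can be exercised directly. A quick usage sketch with hypothetical, relative millisecond timestamps for a single request that made one downstream call:

```typescript
const tracker = new LatencyTracker();

tracker.recordTiming({
  requestId: 'req-123',
  receivedAt: 0,
  queuedDuration: 2,
  processingStart: 2,
  downstreamCalls: [
    { service: 'customer', startedAt: 5, completedAt: 55, success: true }, // 50ms call
  ],
  processingEnd: 60,
  sentAt: 61,
});

console.log(tracker.getPercentiles('total'));               // { p50: 61, p95: 61, p99: 61 }
console.log(tracker.getPercentiles('downstream.customer')); // { p50: 50, p95: 50, p99: 50 }
```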
We've explored the second fallacy of distributed computing: the assumption that latency is zero. Let's consolidate the key insights:

- Latency is the sum of many components (propagation, transmission, processing, queueing, serialization, protocol overhead), and only some of them are under your control.
- The speed of light sets a hard floor: cross-continent round trips cost 100ms or more no matter how good your code is.
- Remote calls are millions of times slower than local operations, so design around round trips: batch, parallelize, cache, and reuse connections.
- Tail latency, not average latency, determines user experience once a request fans out across many backends.
- You can only manage what you measure: track percentiles (p50/p95/p99) end to end, ideally with distributed tracing.
What's next:
We've established that networks fail (Fallacy 1) and that even when they work, data takes time to travel (Fallacy 2). The next fallacy—Bandwidth Is Infinite—explores what happens when we ignore the limits on how much data networks can carry.
You now understand why assuming zero latency leads to systems that work in development but fail in production. The patterns you've learned—batching, parallelization, caching, connection pooling, and latency budgeting—are essential for building performant distributed systems.