In the world of distributed systems, latency is physics. No amount of engineering brilliance can make data travel faster than light, and in fiber optic cables, light travels at roughly 200,000 kilometers per second (about two-thirds of its vacuum speed due to the refractive index of glass). This fundamental constraint shapes every decision we make about cloud geography.
Consider: A roundtrip from New York to London traverses approximately 11,000 kilometers of fiber—that's a minimum of 55 milliseconds just for light to make the journey, with no processing at either end. Add network equipment, TCP handshakes, TLS negotiation, and application processing, and you're easily looking at 100-150ms for a single API call.
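The arithmetic behind these figures is worth internalizing. A two-line sketch, using the ~200,000 km/s fiber speed from above:

```javascript
// Propagation delay through fiber: light covers ~200,000 km/s,
// i.e. 200 km per millisecond, so each km costs ~0.005 ms one-way.
const FIBER_KM_PER_MS = 200;

function propagationMs(distanceKm) {
  return distanceKm / FIBER_KM_PER_MS;
}

console.log(propagationMs(11000)); // NYC→London round trip: 55 ms minimum
```

This is a floor, not an estimate: every real request adds equipment, protocol, and processing delays on top.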
For real-time applications—voice calls, video games, financial trading—these milliseconds determine success or failure. For web applications, they determine whether users perceive your service as "snappy" or "sluggish." Understanding latency isn't optional for engineers building global systems; it's fundamental.
By the end of this page, you will understand the physics and engineering of network latency, how to measure and decompose latency in distributed systems, strategies for reducing latency through architecture and positioning, and how to reason about latency trade-offs when designing global systems.
Latency is not a single metric but a composition of multiple delays. To optimize latency, you must understand where time is spent.
The Anatomy of a Network Request:
When a client makes a request to a server, time is consumed at multiple stages:
┌──────────────────────────────────────────────────────────────────┐
│ Total Request Latency │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DNS │→ │ TCP │→ │ TLS │→ │ Request │→ │Response │ │
│ │ Lookup │ │Handshake│ │Handshake│ │ Transit │ │ Transit │ │
│ │ │ │ │ │ │ │+ Process│ │ │ │
│ │ 0-100ms │ │ 1 RTT │ │ 1-2 RTT │ │ 1+ RTT │ │ 1+ RTT │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└──────────────────────────────────────────────────────────────────┘
| Component | Description | Typical Impact | Optimization |
|---|---|---|---|
| Propagation Delay | Speed of light through fiber | ~5ms per 1,000 km | Deploy closer to users |
| Transmission Delay | Time to push bits onto wire | <1ms typically | Increase bandwidth |
| Queuing Delay | Time waiting in router/switch buffers | Variable (0-100s ms) | Reduce congestion, QoS |
| Processing Delay | Time for routers to process packets | <1ms per hop | Fewer hops, better hardware |
| DNS Lookup | Translating domain to IP | 0-100ms | DNS caching, longer TTLs |
| TCP Handshake | Establishing connection (3-way) | 1 RTT | Connection pooling, keep-alive |
| TLS Handshake | Establishing secure connection | 1-2 RTT | TLS 1.3, session resumption |
| Server Processing | Application logic execution | Variable (1-1000s ms) | Optimize code, caching |
Round Trip Time (RTT):
RTT is the time for a packet to travel from client to server and back. It's the fundamental unit of latency measurement because most protocols require acknowledgments:
First Request vs. Subsequent Requests:
| Request Type | Components | RTTs | Notes |
|---|---|---|---|
| First request (HTTP/1.1) | DNS + TCP + TLS + HTTP | 5+ RTT | Full connection setup |
| First request (HTTP/3) | DNS + QUIC + HTTP | 2-3 RTT | 0-RTT possible with cache |
| Subsequent (keep-alive) | HTTP only | 1 RTT | Connection already open |
| Subsequent (HTTP/2 or 3) | Multiplexed HTTP | 1 RTT | Parallel requests share one connection |
For distant users, these RTTs accumulate dramatically. A user 150ms away (RTT) making a first HTTP/1.1 request could wait 750ms before any application logic even runs.
A rough rule of thumb: latency increases by approximately 1ms for every 100km of distance (accounting for non-straight paths, network equipment, and practical overhead). A user 3,000km from your server starts with a ~30ms baseline before any processing begins.
You cannot optimize what you don't measure. Effective latency measurement requires understanding what to measure, how to measure it, and how to interpret the results.
What to Measure:
1. End-to-End Latency (Most Important)
The time from when a user initiates an action to when they see the result. This is what users actually experience.
User clicks button
│
▼
┌──── Total End-to-End Latency ────┐
│ │
│ Client → Network → Server → │
│ Processing → Network → Client │
│ │
└──────────────────────────────────┘
│
▼
User sees result
2. Server-Side Latency
Time from request received to response sent. What your observability stack typically measures.
3. Network Latency
Time spent in transit, excluding processing. End-to-end minus server-side.
4. Per-Component Latency
Database queries, cache lookups, external API calls—each component's contribution.
Measurement Techniques:
Real User Monitoring (RUM):
// Browser-based timing via the Navigation Timing Level 2 API
// (the older performance.timing is deprecated)
const [nav] = performance.getEntriesByType('navigation');
const metrics = {
  dns: nav.domainLookupEnd - nav.domainLookupStart,
  tcp: nav.connectEnd - nav.connectStart,
  ttfb: nav.responseStart - nav.requestStart, // Time to First Byte
  download: nav.responseEnd - nav.responseStart,
  domReady: nav.domContentLoadedEventEnd - nav.startTime,
  load: nav.loadEventEnd - nav.startTime
};
// Send metrics to analytics backend
RUM captures real user experience from actual browsers/devices worldwide.
Synthetic Monitoring:
Scripted probes run on a schedule from fixed vantage points worldwide, measuring critical paths under consistent conditions even when no real users are active. Synthetic checks give you stable baselines and catch regressions; RUM tells you what users actually experienced.
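A minimal synthetic probe can be sketched with the Fetch API and `performance.now()`; the health-check URL and scheduling interval below are placeholders:

```javascript
// Minimal synthetic latency probe (sketch): time one HTTP request,
// draining the body so the measurement covers the full download.
async function probe(url) {
  const start = performance.now();
  const response = await fetch(url, { method: 'GET' });
  await response.arrayBuffer();
  return {
    url,
    status: response.status,
    latencyMs: performance.now() - start,
    timestamp: new Date().toISOString()
  };
}

// Run on a schedule from each vantage point, e.g.:
// setInterval(() => probe('https://example.com/health').then(report), 60_000);
```

A real synthetic monitor would also separate DNS/TCP/TLS phases and report to a metrics backend; this sketch captures only total request time.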
Distributed Tracing:
Trace ID: abc123
│
├── Frontend (50ms)
│ └── API Gateway (5ms)
│ └── Auth Service (15ms)
│ └── User Service (120ms)
│ └── Database (80ms)
│ └── Cache Miss (35ms)
│ └── Response serialization (5ms)
Tracing shows exactly where latency accumulates, essential for debugging slow requests.
Statistical Analysis of Latency:
Latency is not normally distributed—it has a long tail. Mean and median can be misleading.
Key Percentiles:
| Percentile | Meaning | Importance |
|---|---|---|
| P50 (Median) | 50% of requests faster | Typical user experience |
| P90 | 90% of requests faster | Starting to see slow users |
| P95 | 95% of requests faster | Many users; often used for SLOs |
| P99 | 99% of requests faster | Worst-case typical operations |
| P99.9 | 99.9% of requests faster | Edge cases, debugging |
Why P99 Matters:
If P99 is 500ms and median is 50ms, that 1-in-100 slow request affects real users: in a session of 100 page views, roughly 63% of users (1 - 0.99^100) encounter at least one slow page.
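To make the percentile definitions concrete, here is a small sketch using the nearest-rank method, plus the session arithmetic above:

```javascript
// Nearest-rank percentile over a sample of latencies (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// Probability of hitting at least one request slower than P99
// in n requests: 1 - 0.99^n. For n = 100, roughly 63%.
const pSlowSession = 1 - Math.pow(0.99, 100);
console.log(pSlowSession.toFixed(2)); // 0.63
```

Production systems usually compute percentiles from streaming sketches (t-digest, HDR histograms) rather than sorting raw samples, but the definition is the same.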
Example Latency Distribution:
Requests
│
│ ████████████
│ ██████████████████
│ ████████████████████████
│██████████████████████████████████ ▪▪▪ tail
└────────────────────────────────────────────────────────────────
10ms 50ms 100ms 200ms 500ms 1s 5s
↑ ↑ ↑
P50 P95 P99.9
The tail contains your worst user experiences and often reveals architectural problems.
Service Level Objectives based on averages hide poor tail latency. Set SLOs on percentiles: 'P95 latency < 200ms' ensures 95% of users have that experience. Use P99 for critical user flows where even rare slow requests impact business outcomes.
To reason about latency in global systems, you need to understand the immutable physical constraints. No amount of optimization can violate physics.
Speed of Light in Fiber:
Light in fiber covers roughly 200,000 km/s (two-thirds of its vacuum speed), which works out to about 1ms of round-trip time per 100km of cable.
Global Distance Reference:
| Route | Distance (km) | Theoretical Min RTT | Typical Real RTT |
|---|---|---|---|
| NYC → London | 5,570 | ~56 ms | 70-90 ms |
| NYC → San Francisco | 4,130 | ~41 ms | 60-80 ms |
| London → Frankfurt | 650 | ~7 ms | 10-15 ms |
| London → Singapore | 10,870 | ~109 ms | 160-200 ms |
| Tokyo → Sydney | 7,820 | ~78 ms | 100-130 ms |
| NYC → Sydney | 15,990 | ~160 ms | 200-250 ms |
| London → São Paulo | 9,470 | ~95 ms | 180-220 ms |
Why Real RTT Exceeds Theoretical:
Non-straight paths: Fiber cables follow coastlines, avoid mountains, pass through cable landing stations. The actual path length is 1.5-2× straight-line distance.
Network equipment: Every router, switch, and amplifier adds processing delay (microseconds to milliseconds each).
Routing inefficiency: Traffic may route through intermediate cities, adding distance.
Congestion: Queuing at busy network nodes adds variable delay.
Protocol overhead: TCP acknowledgments, retransmissions, and flow control add rounds.
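These overheads can be folded into a rough estimator. The path factor and per-hop cost below are illustrative assumptions, not measured values:

```javascript
// Rough RTT estimate combining the factors above: propagation over a
// non-straight fiber path plus per-hop equipment delay. The defaults
// (1.5x path factor, 10 hops at 0.5ms) are assumptions for illustration.
function estimateRttMs(straightLineKm, { pathFactor = 1.5, hops = 10, perHopMs = 0.5 } = {}) {
  const fiberKm = straightLineKm * pathFactor;
  const propagation = (2 * fiberKm) / 200; // round trip at 200 km/ms
  return propagation + hops * perHopMs;
}

console.log(estimateRttMs(5570).toFixed(0) + ' ms'); // NYC→London: ~89 ms, inside the 70-90 ms typical range
```

Congestion and protocol overhead add variable delay on top, which is why measured RTTs scatter above any static estimate.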
Submarine Cable Reality:
Transcontinental and transoceanic latency is constrained by submarine cables:
North Atlantic Cables
New York ←→ London: Multiple cable systems
Capacity: Petabits per second
Latency: 35-40ms one-way typical
Trans-Pacific Cables
Los Angeles ←→ Tokyo: Multiple systems
Latency: 55-60ms one-way typical
Cable routes follow ocean floor geography,
may detour significantly from great circle path
The submarine cable map defines the real topology of the internet and constrains achievable latencies.
Latency Implications for Architecture:
These physical constraints have direct architectural implications:
1. Single-Region for Global Users Is Insufficient
If your servers are in Virginia (us-east-1), users in London see roughly 70-90ms RTTs, while users in Sydney face 200-250ms on every single round trip. For interactive applications, Sydney users feel the service is sluggish.
2. Synchronous Cross-Region Calls Are Expensive
A write that requires synchronous confirmation from a DR region adds a full RTT:
Write in US-East without cross-region sync: 5ms
Write in US-East with sync to EU-West: 5ms + 75ms = 80ms
This is why cross-region database replication is usually asynchronous.
3. Microservices Chains Multiply Latency
If each service call adds 10ms, a chain of 10 services adds 100ms—and that's before accounting for any cross-AZ or cross-region hops.
Client → API → Auth → User → Permissions → Product → Inventory
→ Pricing → Cart → Checkout → Response
If each hop is 10ms locally, chain is 100ms
If some hops are cross-region (50ms each), chain could be 300ms+
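As a back-of-the-envelope check, summing per-hop costs reproduces the figures above:

```javascript
// Total added latency of a synchronous service chain is simply the
// sum of per-hop costs; cross-region hops dominate when present.
function chainLatencyMs(hops) {
  return hops.reduce((total, hop) => total + hop.ms, 0);
}

const localChain = Array(10).fill({ ms: 10 });          // 10 local hops
console.log(chainLatencyMs(localChain));                // 100

const mixedChain = [
  ...Array(6).fill({ ms: 10 }),                         // 6 local hops
  ...Array(5).fill({ ms: 50 })                          // 5 cross-region hops
];
console.log(chainLatencyMs(mixedChain));                // 310
```

The remedy is usually structural: collapse chains, parallelize independent calls, or keep chatty services co-located.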
No caching, optimization, or CDN can reduce latency below the speed of light constraint. If your user is 10,000 km from your server, you're starting with ~100ms RTT floor. The only solutions are: move compute closer (edge/multi-region) or accept the latency.
With an understanding of latency components, let's explore systematic strategies for reducing user-perceived latency.
Strategy 1: Reduce Distance (Deploy Closer)
The most effective latency reduction is deploying compute and data closer to users.
Options by Distance:
| Deployment | Distance to User | Typical Latency | What Can Run There |
|---|---|---|---|
| User's device | 0 | 0 | Client-side logic |
| Edge (CDN PoP) | 10-100 km | 1-10 ms | Static content, edge functions |
| Local Region | 100-3000 km | 10-50 ms | Full application, regional DB |
| Central Region | 3000-15000 km | 50-200 ms | Global services, primary DB |
Strategy 2: CDN for Static Content
Without CDN:
User (Sydney) → Origin (Virginia)
RTT: 250ms × multiple requests for images, CSS, JS
With CDN:
User (Sydney) → CDN Edge (Sydney) → Origin (Virginia)
Static content: 10ms (cached at edge)
Dynamic content: Still 250ms, but fewer requests
CDNs cache static assets at hundreds of edge locations worldwide, dramatically reducing latency for content that doesn't change frequently.
Strategy 3: Edge Computing for Dynamic Content
For personalized or dynamic content, edge computing runs application logic at edge locations:
┌─────────────────────────────────────────────────────────────┐
│ Edge Computing Architecture │
│ │
│ User → Edge Location → Edge Function → (if needed) Origin │
│ (Sydney) (runs logic) (Virginia) │
│ │
│ Common edge use cases: │
│ • A/B testing logic (no origin needed) │
│ • Authentication/authorization (verify JWT at edge) │
│ • Personalization (user segment → cached variant) │
│ • API response assembly (aggregate cached fragments) │
│ • Geo-based routing decisions │
└─────────────────────────────────────────────────────────────┘
Platforms: Cloudflare Workers, AWS Lambda@Edge and CloudFront Functions, Fastly Compute, and Vercel Edge Functions all run application logic at edge locations.
Strategy 4: Connection Optimization
HTTP/2 and HTTP/3:
| Feature | HTTP/1.1 | HTTP/2 | HTTP/3 (QUIC) |
|---|---|---|---|
| Connections | Multiple (6-8) | Single multiplexed | Single multiplexed |
| Head-of-line blocking | Yes (per connection) | Yes (TCP-level) | No (per-stream) |
| Handshake RTTs | 1 TCP + 2 TLS (TLS 1.2) = 3 | 1 TCP + 1 TLS (TLS 1.3) = 2 | 1 (QUIC combines transport + TLS 1.3) |
| 0-RTT resumption | No | No | Yes |
HTTP/3 with QUIC is particularly valuable for mobile users on lossy connections.
Connection Keep-Alive:
Maintain persistent connections to avoid repeated TCP/TLS handshakes:
First request: DNS + TCP + TLS + HTTP = 500ms total
Subsequent requests (keep-alive): HTTP only = 100ms
Strategy 5: Reduce Payload Size
Smaller payloads = less transmission time
Compression:
- gzip: 70-90% reduction for text
- Brotli: 15-25% better than gzip
Efficient formats:
- JSON → Protocol Buffers (50-80% smaller)
- Images: WebP, AVIF over JPEG/PNG
Minimize responses:
- GraphQL: Request only needed fields
- Pagination: Don't return 10,000 items
- Omit nulls, defaults
Strategy 6: Caching at Every Layer
┌──────────────────────────────────────────────────────────┐
│ Caching Layers │
│ │
│ Browser Cache (milliseconds) │
│ ↓ miss │
│ CDN Edge Cache (1-10ms) │
│ ↓ miss │
│ API Gateway Cache (5-20ms) │
│ ↓ miss │
│ Application Cache - Redis (1-5ms) │
│ ↓ miss │
│ Database (10-100ms) │
└──────────────────────────────────────────────────────────┘
Each cache hit avoids the latency of all subsequent layers.
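The read-through pattern behind these layers can be sketched with in-memory maps standing in for each tier (a simplification: real tiers live on different machines, with the latencies shown above):

```javascript
// Multi-tier read-through cache (sketch). `tiers` is ordered fastest
// to slowest; `fetchFromDb` stands in for the database.
async function readThrough(key, tiers, fetchFromDb) {
  for (let i = 0; i < tiers.length; i++) {
    if (tiers[i].has(key)) {
      const value = tiers[i].get(key);
      for (let j = 0; j < i; j++) tiers[j].set(key, value); // backfill faster tiers
      return value;
    }
  }
  const value = await fetchFromDb(key);   // slowest path: every tier missed
  tiers.forEach(t => t.set(key, value));  // populate all tiers for next time
  return value;
}
```

Real deployments also need TTLs and invalidation at each tier, which this sketch omits.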
Strategy 7: Asynchronous and Background Processing
Synchronous (slow perceived latency):
User submits → Process → Save → Email → Notify → Respond
Total: 500ms
Asynchronous (fast perceived latency):
User submits → Save → Respond → (async) Process, Email, Notify
Perceived: 50ms
Move non-essential processing out of the critical path.
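A minimal sketch of moving work off the critical path; the in-process array stands in for a real message broker:

```javascript
// Respond after the essential write; defer everything else to a queue
// consumed by background workers (here, just an array for illustration).
const backgroundQueue = [];

async function handleSubmit(order, save) {
  await save(order);                 // critical path: must succeed before responding
  backgroundQueue.push(              // deferred: processed asynchronously
    { task: 'process', order },
    { task: 'email', order },
    { task: 'notify', order }
  );
  return { status: 'accepted', id: order.id }; // user sees this immediately
}
```

The trade-off: the user gets an "accepted" response before the deferred work completes, so failures there must be handled out of band (retries, dead-letter queues, notifications).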
Define a latency budget for each user interaction (e.g., 'Page load < 2 seconds'). Allocate that budget across components (Network: 500ms, Server: 300ms, Rendering: 1200ms). When a component exceeds its budget, you know where to focus optimization efforts.
Databases often dominate application latency. Understanding database latency—especially in distributed scenarios—is essential for system design.
Local Database Latency:
Same-AZ database (typical):
Network: 0.5ms
Query execution: 1-100ms (depends on query)
Total: 2-100ms
Cross-AZ database:
Network: 1-2ms
Query execution: 1-100ms
Total: 3-102ms
Cross-Region Database Latency:
Read from local replica (eventual consistency):
Network: 1ms (local)
Query: 5ms
Total: 6ms
Read from remote primary (strong consistency):
Network: 100ms (cross-region)
Query: 5ms
Total: 105ms
Write to remote primary:
Network: 100ms (cross-region round trip)
Write operation: 10ms
Total: 110ms
Synchronous cross-region write (primary waits for replica acknowledgment):
Network to primary: 100ms round trip
Write: 10ms
Sync to replica: 100ms round trip
Total: 210ms
| Pattern | Read Latency | Write Latency | Consistency |
|---|---|---|---|
| Single-region primary | Low (local) | Low (local) | Strong |
| Multi-AZ (sync standby) | Low (local) | Low + 1-2ms | Strong |
| Cross-region read replica | Low (local) | High (remote primary) | Read: Eventual, Write: Strong |
| Cross-region active-active | Low (local) | Low (local) | Eventual (conflicts possible) |
| Global DB (sync replication) | Low (local) | High (wait for quorum) | Strong |
Database Latency Optimization Patterns:
1. Read-Local, Write-Remote
For read-heavy workloads, deploy read replicas in each region:
Writes Reads
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Primary │────→│ Local │
│ (US-East) │async│ Replica │
│ │ │ (EU-West) │
└─────────────┘ └─────────────┘
↑
Writes route
to primary
(100ms+ RTT)
2. Follower Reads
Some databases support reading from followers with bounded staleness:
-- CockroachDB example: serve the read from the nearest replica
SELECT * FROM users
AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 123;
Reads from the nearest replica, accepting that the returned data may be a few seconds stale.
3. Geo-Partitioned Data
Partition data by geography so each region is authoritative for its data:
┌─────────────────────────────────────────────────────────┐
│ Global Table │
│ │
│ EU Users (partition) US Users (partition) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Stored in │ │ Stored in │ │
│ │ EU-West │ │ US-East │ │
│ │ Full R/W │ │ Full R/W │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ EU users' data never leaves EU → GDPR compliant │
│ US users' data never crosses ocean → low latency │
└─────────────────────────────────────────────────────────┘
Most applications are read-heavy (90%+ reads). Optimizing read latency with local replicas while accepting higher write latency is often the right trade-off. Know your read/write ratio before designing your database topology.
Certain architectural patterns are specifically designed to minimize or manage latency in distributed systems.
Pattern 1: Client-Side Caching and Offline-First
┌─────────────────────────────────────────────────────────┐
│ Offline-First Architecture │
│ │
│ User interacts with local cache/database │
│ (0ms latency - instant feedback) │
│ │ │
│ ▼ │
│ Sync engine reconciles with server │
│ (happens in background, async) │
│ │ │
│ ▼ │
│ Conflicts resolved (merge, last-write-wins, etc.) │
└─────────────────────────────────────────────────────────┘
Pattern 2: Optimistic Updates
1. User clicks "Add to Cart"
2. UI immediately shows item in cart (optimistic)
3. Background: API call to server
4. If success: No change needed, already showing correct state
5. If failure: Roll back UI, show error
User-perceived latency: ~0ms (instant feedback)
Actual operation: 100-500ms (happening in background)
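A sketch of the optimistic-update flow, with the cart structure and API call as stand-ins:

```javascript
// Optimistic update (sketch): mutate local state immediately,
// roll back if the background confirmation fails.
async function addToCart(cart, item, apiCall) {
  cart.push(item);           // optimistic: user sees the item instantly
  try {
    await apiCall(item);     // background confirmation
  } catch (err) {
    cart.pop();              // failure: roll back and let the caller show an error
    throw err;
  }
  return cart;
}
```

Optimistic updates suit operations that almost always succeed (adding to a cart, liking a post); for operations with meaningful failure rates, a pending state is usually kinder than a rollback.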
Pattern 3: Prefetching and Preloading
Predictive Prefetching:
User on product listing page
↓
System predicts user might click on first few products
↓
Prefetch product detail pages in background
↓
When user clicks, data already in cache → instant display
Pattern 4: Speculative Execution
┌─────────────────────────────────────────────────────────┐
│ Speculative Execution Example │
│ │
│ User types in search box: │
│ │
│ After 'iph' typed: │
│ → Speculatively execute search for 'iphone' │
│ → Prepare suggestions, don't display yet │
│ │
│ User types 'o' (now 'ipho'): │
│ → 'iphone' speculation likely correct │
│ → Display pre-fetched results instantly │
│ │
│ User types 't' (now 'iphot'): │
│ → 'iphone' speculation wrong │
│ → Discard result, start new search │
└─────────────────────────────────────────────────────────┘
Pattern 5: Edge State Machine
┌─────────────────────────────────────────────────────────┐
│ Edge State Machine │
│ │
│ Edge Location Origin │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ State Machine │←── sync ───│ Source of │ │
│ │ (User prefs, │ │ Truth │ │
│ │ feature flags,│ │ │ │
│ │ session) │ │ │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ Request processed entirely at edge with current state │
│ No round-trip to origin for common operations │
└─────────────────────────────────────────────────────────┘
Users don't experience latency in milliseconds—they experience it as responsiveness. A 300ms operation with immediate visual feedback (spinner, optimistic update) feels faster than a 100ms operation with no feedback. Design for perceived latency, not just measured latency.
Latency optimization without targets is aimless. Latency budgets and SLOs provide the structure needed for systematic improvement.
Defining Latency SLOs:
A latency SLO specifies the maximum acceptable latency for a given percentile:
Example SLOs:
• Homepage load: P50 < 500ms, P95 < 2s, P99 < 5s
• API response: P50 < 50ms, P95 < 200ms, P99 < 500ms
• Search results: P50 < 100ms, P95 < 300ms, P99 < 1s
• Checkout completion: P50 < 2s, P95 < 5s, P99 < 10s
Setting Appropriate Targets:
| Application Type | P50 Target | P95 Target | P99 Target |
|---|---|---|---|
| Real-time gaming | < 10ms | < 30ms | < 50ms |
| Voice/video call | < 50ms | < 100ms | < 150ms |
| Interactive web app | < 100ms | < 300ms | < 500ms |
| Standard web page | < 200ms | < 500ms | < 1s |
| Complex dashboard | < 500ms | < 2s | < 5s |
| Batch/async operation | N/A | N/A | Minutes OK |
Latency Budget Allocation:
A latency budget divides the total allowed latency across system components:
Example: Homepage Load Budget (2 second total)
┌────────────────────────────────────────────────────────┐
│ Component │ Budget │ Actual │ Status │
├─────────────────────────┼───────────┼──────────┼────────┤
│ DNS Resolution │ 50ms │ 30ms │ ✓ │
│ Connection Setup │ 100ms │ 95ms │ ✓ │
│ Server Processing │ 300ms │ 250ms │ ✓ │
│ Data Transfer │ 500ms │ 400ms │ ✓ │
│ DOM Parsing │ 200ms │ 180ms │ ✓ │
│ JavaScript Execution │ 400ms │ 600ms │ ✗ │
│ Rendering │ 450ms │ 300ms │ ✓ │
├─────────────────────────┼───────────┼──────────┼────────┤
│ Total │ 2000ms │ 1855ms │ ✓ │
│ (but JS over budget) │ │ │ │
└────────────────────────────────────────────────────────┘
Even though total is under budget, JS exceeding its allocation signals a problem to address.
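Budget checking is mechanical enough to automate. This sketch uses the numbers from the table above:

```javascript
// Flag every component whose measured latency exceeds its budget slice,
// independent of whether the overall total is still within budget.
function overBudget(rows) {
  return rows.filter(r => r.actualMs > r.budgetMs).map(r => r.component);
}

const homepage = [
  { component: 'DNS Resolution',       budgetMs: 50,  actualMs: 30 },
  { component: 'Connection Setup',     budgetMs: 100, actualMs: 95 },
  { component: 'Server Processing',    budgetMs: 300, actualMs: 250 },
  { component: 'Data Transfer',        budgetMs: 500, actualMs: 400 },
  { component: 'DOM Parsing',          budgetMs: 200, actualMs: 180 },
  { component: 'JavaScript Execution', budgetMs: 400, actualMs: 600 },
  { component: 'Rendering',            budgetMs: 450, actualMs: 300 }
];

console.log(overBudget(homepage)); // [ 'JavaScript Execution' ]
```

Running a check like this in CI against synthetic measurements turns the budget from a document into an enforced contract.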
Monitoring Latency SLOs:
┌─────────────────────────────────────────────────────────┐
│ Latency SLO Dashboard │
│ │
│ API Endpoint: /api/users │
│ │
│ SLO: P95 < 200ms │
│ │
│ Current (24h): P95 = 185ms ✓ │
│ │
│ Error Budget: 5% of requests can exceed 200ms │
│ Consumed: 3.2% of requests (36% of budget remaining) │
│ │
│ Trend: ↑ 5ms from last week (monitoring) │
└─────────────────────────────────────────────────────────┘
Error budget tracks how much of your SLO headroom has been consumed.
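The budget math itself is a one-liner; a sketch using the dashboard's figures (5% budget, 3.2% consumed):

```javascript
// Latency error budget: the SLO allows `budgetPct` of requests to exceed
// the threshold; `slowPct` is the share that actually exceeded it.
function errorBudgetRemaining(budgetPct, slowPct) {
  return Math.max(0, 1 - slowPct / budgetPct);
}

console.log(errorBudgetRemaining(5, 3.2).toFixed(2)); // 0.36 → 36% of budget remaining
```

When remaining budget approaches zero, teams typically freeze risky launches and prioritize latency work, the same discipline used for availability error budgets.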
Latency Regression Detection:
Catch latency regressions before they affect users:
Example Alert Rules:
alerts:
- name: API Latency SLO Breach
condition: p95_latency > 200ms for 5 minutes
severity: high
- name: API Latency Degradation
condition: p95_latency > 1.5 * baseline(7d) for 15 minutes
severity: medium
- name: API Latency Trending Up
condition: linear_trend(p95_latency, 7d) > 5ms/day
severity: low
Latency SLOs should inform architectural decisions. If your SLO requires P99 < 100ms globally, you know you need multi-region deployment—physics prevents meeting that SLO from a single region for distant users. Let SLOs guide investment in infrastructure.
Latency is a fundamental constraint in distributed systems, governed by immutable physics. Understanding its components, measurement, and optimization is essential for building systems that feel responsive to users worldwide.
Module Complete:
You have now completed the Regions and Availability Zones module. You understand how to select cloud regions strategically, design for availability zone fault isolation, deploy applications across multiple AZs, extend resilience to cross-region architectures, and reason about the latency implications of geographic distribution.
These concepts form the foundation of cloud-native infrastructure design. Every system you build in the cloud benefits from their thoughtful application: choosing the right regions for your users, designing for AZ-level fault tolerance, and understanding the latency constraints that shape user experience. You can now make informed decisions about where and how to deploy infrastructure for availability, performance, and compliance. This knowledge is foundational for the remaining cloud architecture topics.