In the world of distributed systems, latency is physics. No amount of engineering brilliance can make data travel faster than light, and in fiber optic cables, light travels at roughly 200,000 kilometers per second (about two-thirds of its vacuum speed due to the refractive index of glass). This fundamental constraint shapes every decision we make about cloud geography.
Consider: A roundtrip from New York to London traverses approximately 11,000 kilometers of fiber—that's a minimum of 55 milliseconds just for light to make the journey, with no processing at either end. Add network equipment, TCP handshakes, TLS negotiation, and application processing, and you're easily looking at 100-150ms for a single API call.
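The arithmetic behind these figures is worth internalizing. A two-line sketch, using the ~200,000 km/s fiber speed from above:

```javascript
// Propagation delay through fiber: light covers ~200,000 km/s,
// i.e. 200 km per millisecond, so each km costs ~0.005 ms one-way.
const FIBER_KM_PER_MS = 200;

function propagationMs(distanceKm) {
  return distanceKm / FIBER_KM_PER_MS;
}

console.log(propagationMs(11000)); // NYC→London round trip: 55 ms minimum
```

This is a floor, not an estimate: every real request adds equipment, protocol, and processing delays on top.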
For real-time applications—voice calls, video games, financial trading—these milliseconds determine success or failure. For web applications, they determine whether users perceive your service as "snappy" or "sluggish." Understanding latency isn't optional for engineers building global systems; it's fundamental.
By the end of this page, you will understand the physics and engineering of network latency, how to measure and decompose latency in distributed systems, strategies for reducing latency through architecture and positioning, and how to reason about latency trade-offs when designing global systems.
Latency is not a single metric but a composition of multiple delays. To optimize latency, you must understand where time is spent.
The Anatomy of a Network Request:
When a client makes a request to a server, time is consumed at multiple stages:
┌──────────────────────────────────────────────────────────────────┐
│ Total Request Latency │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ DNS │→ │ TCP │→ │ TLS │→ │ Request │→ │Response │ │
│ │ Lookup │ │Handshake│ │Handshake│ │ Transit │ │ Transit │ │
│ │ │ │ │ │ │ │+ Process│ │ │ │
│ │ 0-100ms │ │ 1 RTT │ │ 1-2 RTT │ │ 1+ RTT │ │ 1+ RTT │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└──────────────────────────────────────────────────────────────────┘
| Component | Description | Typical Impact | Optimization |
|---|---|---|---|
| Propagation Delay | Speed of light through fiber | ~5ms per 1,000 km | Deploy closer to users |
| Transmission Delay | Time to push bits onto wire | <1ms typically | Increase bandwidth |
| Queuing Delay | Time waiting in router/switch buffers | Variable (0-100s ms) | Reduce congestion, QoS |
| Processing Delay | Time for routers to process packets | <1ms per hop | Fewer hops, better hardware |
| DNS Lookup | Translating domain to IP | 0-100ms | DNS caching, longer TTLs |
| TCP Handshake | Establishing connection (3-way) | 1 RTT | Connection pooling, keep-alive |
| TLS Handshake | Establishing secure connection | 1-2 RTT | TLS 1.3, session resumption |
| Server Processing | Application logic execution | Variable (1-1000s ms) | Optimize code, caching |
Round Trip Time (RTT):
RTT is the time for a packet to travel from client to server and back. It's the fundamental unit of latency measurement because most protocols require acknowledgments:
First Request vs. Subsequent Requests:
| Request Type | Components | RTTs | Notes |
|---|---|---|---|
| First request (HTTP/1.1) | DNS + TCP + TLS + HTTP | 5+ RTT | Full connection setup |
| First request (HTTP/3) | DNS + QUIC + HTTP | 2-3 RTT | 0-RTT possible with cache |
| Subsequent (keep-alive) | HTTP only | 1 RTT | Connection already open |
| Subsequent (HTTP/2 or 3) | Multiplexed HTTP | 1 RTT | Parallel requests share one connection |
For distant users, these RTTs accumulate dramatically. A user 150ms away (RTT) making a first HTTP/1.1 request could wait 750ms before any application logic even runs.
A rough rule of thumb: latency increases by approximately 1ms for every 100km of distance (accounting for non-straight paths, network equipment, and practical overhead). A user 3,000km from your server starts with a ~30ms baseline before any processing begins.
You cannot optimize what you don't measure. Effective latency measurement requires understanding what to measure, how to measure it, and how to interpret the results.
What to Measure:
1. End-to-End Latency (Most Important)
The time from when a user initiates an action to when they see the result. This is what users actually experience.
User clicks button
│
▼
┌──── Total End-to-End Latency ────┐
│ │
│ Client → Network → Server → │
│ Processing → Network → Client │
│ │
└──────────────────────────────────┘
│
▼
User sees result
2. Server-Side Latency
Time from request received to response sent. What your observability stack typically measures.
3. Network Latency
Time spent in transit, excluding processing. End-to-end minus server-side.
4. Per-Component Latency
Database queries, cache lookups, external API calls—each component's contribution.
Measurement Techniques:
Real User Monitoring (RUM):
// Browser-based timing via the Navigation Timing Level 2 API
// (the older performance.timing is deprecated)
const [nav] = performance.getEntriesByType('navigation');
const metrics = {
  dns: nav.domainLookupEnd - nav.domainLookupStart,
  tcp: nav.connectEnd - nav.connectStart,
  ttfb: nav.responseStart - nav.requestStart, // Time to First Byte
  download: nav.responseEnd - nav.responseStart,
  domReady: nav.domContentLoadedEventEnd - nav.startTime,
  load: nav.loadEventEnd - nav.startTime
};
// Send metrics to analytics backend
RUM captures real user experience from actual browsers/devices worldwide.
Synthetic Monitoring:
Scripted probes run on a schedule from fixed vantage points worldwide, measuring critical paths under consistent conditions even when no real users are active. Synthetic checks give you stable baselines and catch regressions; RUM tells you what users actually experienced.
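A minimal synthetic probe can be sketched with the Fetch API and `performance.now()`; the health-check URL and scheduling interval below are placeholders:

```javascript
// Minimal synthetic latency probe (sketch): time one HTTP request,
// draining the body so the measurement covers the full download.
async function probe(url) {
  const start = performance.now();
  const response = await fetch(url, { method: 'GET' });
  await response.arrayBuffer();
  return {
    url,
    status: response.status,
    latencyMs: performance.now() - start,
    timestamp: new Date().toISOString()
  };
}

// Run on a schedule from each vantage point, e.g.:
// setInterval(() => probe('https://example.com/health').then(report), 60_000);
```

A real synthetic monitor would also separate DNS/TCP/TLS phases and report to a metrics backend; this sketch captures only total request time.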
Distributed Tracing:
Trace ID: abc123
│
├── Frontend (50ms)
│ └── API Gateway (5ms)
│ └── Auth Service (15ms)
│ └── User Service (120ms)
│ └── Database (80ms)
│ └── Cache Miss (35ms)
│ └── Response serialization (5ms)
Tracing shows exactly where latency accumulates, essential for debugging slow requests.
Statistical Analysis of Latency:
Latency is not normally distributed—it has a long tail. Mean and median can be misleading.
Key Percentiles:
| Percentile | Meaning | Importance |
|---|---|---|
| P50 (Median) | 50% of requests faster | Typical user experience |
| P90 | 90% of requests faster | Starting to see slow users |
| P95 | 95% of requests faster | Many users; often used for SLOs |
| P99 | 99% of requests faster | Worst-case typical operations |
| P99.9 | 99.9% of requests faster | Edge cases, debugging |
Why P99 Matters:
If P99 is 500ms and median is 50ms, that 1-in-100 slow request affects real users: in a session of 100 page views, roughly 63% of users (1 - 0.99^100) encounter at least one slow page.
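To make the percentile definitions concrete, here is a small sketch using the nearest-rank method, plus the session arithmetic above:

```javascript
// Nearest-rank percentile over a sample of latencies (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// Probability of hitting at least one request slower than P99
// in n requests: 1 - 0.99^n. For n = 100, roughly 63%.
const pSlowSession = 1 - Math.pow(0.99, 100);
console.log(pSlowSession.toFixed(2)); // 0.63
```

Production systems usually compute percentiles from streaming sketches (t-digest, HDR histograms) rather than sorting raw samples, but the definition is the same.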
Example Latency Distribution:
Requests
│
│ ████████████
│ ██████████████████
│ ████████████████████████
│██████████████████████████████████ ▪▪▪ tail
└────────────────────────────────────────────────────────────────
10ms 50ms 100ms 200ms 500ms 1s 5s
↑ ↑ ↑
P50 P95 P99.9
The tail contains your worst user experiences and often reveals architectural problems.
Service Level Objectives based on averages hide poor tail latency. Set SLOs on percentiles: 'P95 latency < 200ms' ensures 95% of users have that experience. Use P99 for critical user flows where even rare slow requests impact business outcomes.
To reason about latency in global systems, you need to understand the immutable physical constraints. No amount of optimization can violate physics.
Speed of Light in Fiber:
Light in fiber covers roughly 200,000 km/s (two-thirds of its vacuum speed), which works out to about 1ms of round-trip time per 100km of cable.
Global Distance Reference:
| Route | Distance (km) | Theoretical Min RTT | Typical Real RTT |
|---|---|---|---|
| NYC → London | 5,570 | ~56 ms | 70-90 ms |
| NYC → San Francisco | 4,130 | ~41 ms | 60-80 ms |
| London → Frankfurt | 650 | ~7 ms | 10-15 ms |
| London → Singapore | 10,870 | ~109 ms | 160-200 ms |
| Tokyo → Sydney | 7,820 | ~78 ms | 100-130 ms |
| NYC → Sydney | 15,990 | ~160 ms | 200-250 ms |
| London → São Paulo | 9,470 | ~95 ms | 180-220 ms |
Why Real RTT Exceeds Theoretical:
Non-straight paths: Fiber cables follow coastlines, avoid mountains, pass through cable landing stations. The actual path length is 1.5-2× straight-line distance.
Network equipment: Every router, switch, and amplifier adds processing delay (microseconds to milliseconds each).
Routing inefficiency: Traffic may route through intermediate cities, adding distance.
Congestion: Queuing at busy network nodes adds variable delay.
Protocol overhead: TCP acknowledgments, retransmissions, and flow control add rounds.
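These overheads can be folded into a rough estimator. The path factor and per-hop cost below are illustrative assumptions, not measured values:

```javascript
// Rough RTT estimate combining the factors above: propagation over a
// non-straight fiber path plus per-hop equipment delay. The defaults
// (1.5x path factor, 10 hops at 0.5ms) are assumptions for illustration.
function estimateRttMs(straightLineKm, { pathFactor = 1.5, hops = 10, perHopMs = 0.5 } = {}) {
  const fiberKm = straightLineKm * pathFactor;
  const propagation = (2 * fiberKm) / 200; // round trip at 200 km/ms
  return propagation + hops * perHopMs;
}

console.log(estimateRttMs(5570).toFixed(0) + ' ms'); // NYC→London: ~89 ms, inside the 70-90 ms typical range
```

Congestion and protocol overhead add variable delay on top, which is why measured RTTs scatter above any static estimate.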
Submarine Cable Reality:
Transcontinental and transoceanic latency is constrained by submarine cables:
North Atlantic Cables
New York ←→ London: Multiple cable systems
Capacity: Petabits per second
Latency: 35-40ms one-way typical
Trans-Pacific Cables
Los Angeles ←→ Tokyo: Multiple systems
Latency: 55-60ms one-way typical
Cable routes follow ocean floor geography,
may detour significantly from great circle path
The submarine cable map defines the real topology of the internet and constrains achievable latencies.
Latency Implications for Architecture:
These physical constraints have direct architectural implications:
1. Single-Region for Global Users Is Insufficient
If your servers are in Virginia (us-east-1), users in London see roughly 70-90ms RTTs, while users in Sydney face 200-250ms on every single round trip. For interactive applications, Sydney users feel the service is sluggish.
2. Synchronous Cross-Region Calls Are Expensive
A write that requires synchronous confirmation from a DR region adds a full RTT:
Write in US-East without cross-region sync: 5ms
Write in US-East with sync to EU-West: 5ms + 75ms = 80ms
This is why cross-region database replication is usually asynchronous.
3. Microservices Chains Multiply Latency
If each service call adds 10ms, a chain of 10 services adds 100ms—and that's before accounting for any cross-AZ or cross-region hops.
Client → API → Auth → User → Permissions → Product → Inventory
→ Pricing → Cart → Checkout → Response
If each hop is 10ms locally, chain is 100ms
If some hops are cross-region (50ms each), chain could be 300ms+
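As a back-of-the-envelope check, summing per-hop costs reproduces the figures above:

```javascript
// Total added latency of a synchronous service chain is simply the
// sum of per-hop costs; cross-region hops dominate when present.
function chainLatencyMs(hops) {
  return hops.reduce((total, hop) => total + hop.ms, 0);
}

const localChain = Array(10).fill({ ms: 10 });          // 10 local hops
console.log(chainLatencyMs(localChain));                // 100

const mixedChain = [
  ...Array(6).fill({ ms: 10 }),                         // 6 local hops
  ...Array(5).fill({ ms: 50 })                          // 5 cross-region hops
];
console.log(chainLatencyMs(mixedChain));                // 310
```

The remedy is usually structural: collapse chains, parallelize independent calls, or keep chatty services co-located.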
No caching, optimization, or CDN can reduce latency below the speed of light constraint. If your user is 10,000 km from your server, you're starting with ~100ms RTT floor. The only solutions are: move compute closer (edge/multi-region) or accept the latency.
With an understanding of latency components, let's explore systematic strategies for reducing user-perceived latency.
Strategy 1: Reduce Distance (Deploy Closer)
The most effective latency reduction is deploying compute and data closer to users.
Options by Distance:
| Deployment | Distance to User | Typical Latency | What Can Run There |
|---|---|---|---|
| User's device | 0 | 0 | Client-side logic |
| Edge (CDN PoP) | 10-100 km | 1-10 ms | Static content, edge functions |
| Local Region | 100-3000 km | 10-50 ms | Full application, regional DB |
| Central Region | 3000-15000 km | 50-200 ms | Global services, primary DB |
Strategy 2: CDN for Static Content
Without CDN:
User (Sydney) → Origin (Virginia)
RTT: 250ms × multiple requests for images, CSS, JS
With CDN:
User (Sydney) → CDN Edge (Sydney) → Origin (Virginia)
Static content: 10ms (cached at edge)
Dynamic content: Still 250ms, but fewer requests
CDNs cache static assets at hundreds of edge locations worldwide, dramatically reducing latency for content that doesn't change frequently.
Strategy 3: Edge Computing for Dynamic Content
For personalized or dynamic content, edge computing runs application logic at edge locations:
┌─────────────────────────────────────────────────────────────┐
│ Edge Computing Architecture │
│ │
│ User → Edge Location → Edge Function → (if needed) Origin │
│ (Sydney) (runs logic) (Virginia) │
│ │
│ Common edge use cases: │
│ • A/B testing logic (no origin needed) │
│ • Authentication/authorization (verify JWT at edge) │
│ • Personalization (user segment → cached variant) │
│ • API response assembly (aggregate cached fragments) │
│ • Geo-based routing decisions │
└─────────────────────────────────────────────────────────────┘
Platforms: Cloudflare Workers, AWS Lambda@Edge and CloudFront Functions, Fastly Compute, and Vercel Edge Functions all run application logic at edge locations.
Strategy 4: Connection Optimization
HTTP/2 and HTTP/3:
| Feature | HTTP/1.1 | HTTP/2 | HTTP/3 (QUIC) |
|---|---|---|---|
| Connections | Multiple (6-8) | Single multiplexed | Single multiplexed |
| Head-of-line blocking | Yes (per connection) | Yes (TCP-level) | No (per-stream) |
| Handshake RTTs | 1 TCP + 2 TLS (TLS 1.2) = 3 | 1 TCP + 1 TLS (TLS 1.3) = 2 | 1 (QUIC combines transport + TLS 1.3) |
| 0-RTT resumption | No | No | Yes |
HTTP/3 with QUIC is particularly valuable for mobile users on lossy connections.
Connection Keep-Alive:
Maintain persistent connections to avoid repeated TCP/TLS handshakes:
First request: DNS + TCP + TLS + HTTP = 500ms total
Subsequent requests (keep-alive): HTTP only = 100ms
Strategy 5: Reduce Payload Size
Smaller payloads = less transmission time
Compression:
- gzip: 70-90% reduction for text
- Brotli: 15-25% better than gzip
Efficient formats:
- JSON → Protocol Buffers (50-80% smaller)
- Images: WebP, AVIF over JPEG/PNG
Minimize responses:
- GraphQL: Request only needed fields
- Pagination: Don't return 10,000 items
- Omit nulls, defaults
Strategy 6: Caching at Every Layer
┌──────────────────────────────────────────────────────────┐
│ Caching Layers │
│ │
│ Browser Cache (milliseconds) │
│ ↓ miss │
│ CDN Edge Cache (1-10ms) │
│ ↓ miss │
│ API Gateway Cache (5-20ms) │
│ ↓ miss │
│ Application Cache - Redis (1-5ms) │
│ ↓ miss │
│ Database (10-100ms) │
└──────────────────────────────────────────────────────────┘
Each cache hit avoids the latency of all subsequent layers.
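The read-through pattern behind these layers can be sketched with in-memory maps standing in for each tier (a simplification: real tiers live on different machines, with the latencies shown above):

```javascript
// Multi-tier read-through cache (sketch). `tiers` is ordered fastest
// to slowest; `fetchFromDb` stands in for the database.
async function readThrough(key, tiers, fetchFromDb) {
  for (let i = 0; i < tiers.length; i++) {
    if (tiers[i].has(key)) {
      const value = tiers[i].get(key);
      for (let j = 0; j < i; j++) tiers[j].set(key, value); // backfill faster tiers
      return value;
    }
  }
  const value = await fetchFromDb(key);   // slowest path: every tier missed
  tiers.forEach(t => t.set(key, value));  // populate all tiers for next time
  return value;
}
```

Real deployments also need TTLs and invalidation at each tier, which this sketch omits.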
Strategy 7: Asynchronous and Background Processing
Synchronous (slow perceived latency):
User submits → Process → Save → Email → Notify → Respond
Total: 500ms
Asynchronous (fast perceived latency):
User submits → Save → Respond → (async) Process, Email, Notify
Perceived: 50ms
Move non-essential processing out of the critical path.
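A minimal sketch of moving work off the critical path; the in-process array stands in for a real message broker:

```javascript
// Respond after the essential write; defer everything else to a queue
// consumed by background workers (here, just an array for illustration).
const backgroundQueue = [];

async function handleSubmit(order, save) {
  await save(order);                 // critical path: must succeed before responding
  backgroundQueue.push(              // deferred: processed asynchronously
    { task: 'process', order },
    { task: 'email', order },
    { task: 'notify', order }
  );
  return { status: 'accepted', id: order.id }; // user sees this immediately
}
```

The trade-off: the user gets an "accepted" response before the deferred work completes, so failures there must be handled out of band (retries, dead-letter queues, notifications).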
Define a latency budget for each user interaction (e.g., 'Page load < 2 seconds'). Allocate that budget across components (Network: 500ms, Server: 300ms, Rendering: 1200ms). When a component exceeds its budget, you know where to focus optimization efforts.
Databases often dominate application latency. Understanding database latency—especially in distributed scenarios—is essential for system design.
Local Database Latency:
Same-AZ database (typical):
Network: 0.5ms
Query execution: 1-100ms (depends on query)
Total: 2-100ms
Cross-AZ database:
Network: 1-2ms
Query execution: 1-100ms
Total: 3-102ms
Cross-Region Database Latency:
Read from local replica (eventual consistency):
Network: 1ms (local)
Query: 5ms
Total: 6ms
Read from remote primary (strong consistency):
Network: 100ms (cross-region)
Query: 5ms
Total: 105ms
Write to remote primary:
Network: 100ms (cross-region round trip)
Write operation: 10ms
Total: 110ms
Synchronous cross-region write (primary waits for replica acknowledgment):
Network to primary: 100ms round trip
Write: 10ms
Sync to replica: 100ms round trip
Total: 210ms
| Pattern | Read Latency | Write Latency | Consistency |
|---|---|---|---|
| Single-region primary | Low (local) | Low (local) | Strong |
| Multi-AZ (sync standby) | Low (local) | Low + 1-2ms | Strong |
| Cross-region read replica | Low (local) | High (remote primary) | Read: Eventual, Write: Strong |
| Cross-region active-active | Low (local) | Low (local) | Eventual (conflicts possible) |
| Global DB (sync replication) | Low (local) | High (wait for quorum) | Strong |
Database Latency Optimization Patterns:
1. Read-Local, Write-Remote
For read-heavy workloads, deploy read replicas in each region:
Writes Reads
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Primary │────→│ Local │
│ (US-East) │async│ Replica │
│ │ │ (EU-West) │
└─────────────┘ └─────────────┘
↑
Writes route
to primary
(100ms+ RTT)
2. Follower Reads
Some databases support reading from followers with bounded staleness:
-- CockroachDB example: serve the read from the nearest replica
SELECT * FROM users
AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 123;
Reads from the nearest replica, accepting that the returned data may be a few seconds stale.
3. Geo-Partitioned Data
Partition data by geography so each region is authoritative for its data:
┌─────────────────────────────────────────────────────────┐
│ Global Table │
│ │
│ EU Users (partition) US Users (partition) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Stored in │ │ Stored in │ │
│ │ EU-West │ │ US-East │ │
│ │ Full R/W │ │ Full R/W │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ EU users' data never leaves EU → GDPR compliant │
│ US users' data never crosses ocean → low latency │
└─────────────────────────────────────────────────────────┘
Most applications are read-heavy (90%+ reads). Optimizing read latency with local replicas while accepting higher write latency is often the right trade-off. Know your read/write ratio before designing your database topology.
Certain architectural patterns are specifically designed to minimize or manage latency in distributed systems.
Pattern 1: Client-Side Caching and Offline-First
┌─────────────────────────────────────────────────────────┐
│ Offline-First Architecture │
│ │
│ User interacts with local cache/database │
│ (0ms latency - instant feedback) │
│ │ │
│ ▼ │
│ Sync engine reconciles with server │
│ (happens in background, async) │
│ │ │
│ ▼ │
│ Conflicts resolved (merge, last-write-wins, etc.) │
└─────────────────────────────────────────────────────────┘
Pattern 2: Optimistic Updates
1. User clicks "Add to Cart"
2. UI immediately shows item in cart (optimistic)
3. Background: API call to server
4. If success: No change needed, already showing correct state
5. If failure: Roll back UI, show error
User-perceived latency: ~0ms (instant feedback)
Actual operation: 100-500ms (happening in background)
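A sketch of the optimistic-update flow, with the cart structure and API call as stand-ins:

```javascript
// Optimistic update (sketch): mutate local state immediately,
// roll back if the background confirmation fails.
async function addToCart(cart, item, apiCall) {
  cart.push(item);           // optimistic: user sees the item instantly
  try {
    await apiCall(item);     // background confirmation
  } catch (err) {
    cart.pop();              // failure: roll back and let the caller show an error
    throw err;
  }
  return cart;
}
```

Optimistic updates suit operations that almost always succeed (adding to a cart, liking a post); for operations with meaningful failure rates, a pending state is usually kinder than a rollback.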
Pattern 3: Prefetching and Preloading
Predictive Prefetching:
User on product listing page
↓
System predicts user might click on first few products
↓
Prefetch product detail pages in background
↓
When user clicks, data already in cache → instant display
Pattern 4: Speculative Execution
┌─────────────────────────────────────────────────────────┐
│ Speculative Execution Example │
│ │
│ User types in search box: │
│ │
│ After 'iph' typed: │
│ → Speculatively execute search for 'iphone' │
│ → Prepare suggestions, don't display yet │
│ │
│ User types 'o' (now 'ipho'): │
│ → 'iphone' speculation likely correct │
│ → Display pre-fetched results instantly │
│ │
│ User types 't' (now 'iphot'): │
│ → 'iphone' speculation wrong │
│ → Discard result, start new search │
└─────────────────────────────────────────────────────────┘
Pattern 5: Edge State Machine
┌─────────────────────────────────────────────────────────┐
│ Edge State Machine │
│ │
│ Edge Location Origin │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ State Machine │←── sync ───│ Source of │ │
│ │ (User prefs, │ │ Truth │ │
│ │ feature flags,│ │ │ │
│ │ session) │ │ │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ Request processed entirely at edge with current state │
│ No round-trip to origin for common operations │
└─────────────────────────────────────────────────────────┘
Users don't experience latency in milliseconds—they experience it as responsiveness. A 300ms operation with immediate visual feedback (spinner, optimistic update) feels faster than a 100ms operation with no feedback. Design for perceived latency, not just measured latency.
Latency optimization without targets is aimless. Latency budgets and SLOs provide the structure needed for systematic improvement.
Defining Latency SLOs:
A latency SLO specifies the maximum acceptable latency for a given percentile:
Example SLOs:
• Homepage load: P50 < 500ms, P95 < 2s, P99 < 5s
• API response: P50 < 50ms, P95 < 200ms, P99 < 500ms
• Search results: P50 < 100ms, P95 < 300ms, P99 < 1s
• Checkout completion: P50 < 2s, P95 < 5s, P99 < 10s
Setting Appropriate Targets:
| Application Type | P50 Target | P95 Target | P99 Target |
|---|---|---|---|
| Real-time gaming | < 10ms | < 30ms | < 50ms |
| Voice/video call | < 50ms | < 100ms | < 150ms |
| Interactive web app | < 100ms | < 300ms | < 500ms |
| Standard web page | < 200ms | < 500ms | < 1s |
| Complex dashboard | < 500ms | < 2s | < 5s |
| Batch/async operation | N/A | N/A | Minutes OK |
Latency Budget Allocation:
A latency budget divides the total allowed latency across system components:
Example: Homepage Load Budget (2 second total)
┌────────────────────────────────────────────────────────┐
│ Component │ Budget │ Actual │ Status │
├─────────────────────────┼───────────┼──────────┼────────┤
│ DNS Resolution │ 50ms │ 30ms │ ✓ │
│ Connection Setup │ 100ms │ 95ms │ ✓ │
│ Server Processing │ 300ms │ 250ms │ ✓ │
│ Data Transfer │ 500ms │ 400ms │ ✓ │
│ DOM Parsing │ 200ms │ 180ms │ ✓ │
│ JavaScript Execution │ 400ms │ 600ms │ ✗ │
│ Rendering │ 450ms │ 300ms │ ✓ │
├─────────────────────────┼───────────┼──────────┼────────┤
│ Total │ 2000ms │ 1855ms │ ✓ │
│ (but JS over budget) │ │ │ │
└────────────────────────────────────────────────────────┘
Even though total is under budget, JS exceeding its allocation signals a problem to address.
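Budget checking is mechanical enough to automate. This sketch uses the numbers from the table above:

```javascript
// Flag every component whose measured latency exceeds its budget slice,
// independent of whether the overall total is still within budget.
function overBudget(rows) {
  return rows.filter(r => r.actualMs > r.budgetMs).map(r => r.component);
}

const homepage = [
  { component: 'DNS Resolution',       budgetMs: 50,  actualMs: 30 },
  { component: 'Connection Setup',     budgetMs: 100, actualMs: 95 },
  { component: 'Server Processing',    budgetMs: 300, actualMs: 250 },
  { component: 'Data Transfer',        budgetMs: 500, actualMs: 400 },
  { component: 'DOM Parsing',          budgetMs: 200, actualMs: 180 },
  { component: 'JavaScript Execution', budgetMs: 400, actualMs: 600 },
  { component: 'Rendering',            budgetMs: 450, actualMs: 300 }
];

console.log(overBudget(homepage)); // [ 'JavaScript Execution' ]
```

Running a check like this in CI against synthetic measurements turns the budget from a document into an enforced contract.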
Monitoring Latency SLOs:
┌─────────────────────────────────────────────────────────┐
│ Latency SLO Dashboard │
│ │
│ API Endpoint: /api/users │
│ │
│ SLO: P95 < 200ms │
│ │
│ Current (24h): P95 = 185ms ✓ │
│ │
│ Error Budget: 5% of requests can exceed 200ms │
│ Consumed: 3.2% of requests (36% of budget remaining) │
│ │
│ Trend: ↑ 5ms from last week (monitoring) │
└─────────────────────────────────────────────────────────┘
Error budget tracks how much of your SLO headroom has been consumed.
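The budget math itself is a one-liner; a sketch using the dashboard's figures (5% budget, 3.2% consumed):

```javascript
// Latency error budget: the SLO allows `budgetPct` of requests to exceed
// the threshold; `slowPct` is the share that actually exceeded it.
function errorBudgetRemaining(budgetPct, slowPct) {
  return Math.max(0, 1 - slowPct / budgetPct);
}

console.log(errorBudgetRemaining(5, 3.2).toFixed(2)); // 0.36 → 36% of budget remaining
```

When remaining budget approaches zero, teams typically freeze risky launches and prioritize latency work, the same discipline used for availability error budgets.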
Latency Regression Detection:
Catch latency regressions before they affect users:
Example Alert Rules:
alerts:
- name: API Latency SLO Breach
condition: p95_latency > 200ms for 5 minutes
severity: high
- name: API Latency Degradation
condition: p95_latency > 1.5 * baseline(7d) for 15 minutes
severity: medium
- name: API Latency Trending Up
condition: linear_trend(p95_latency, 7d) > 5ms/day
severity: low
Latency SLOs should inform architectural decisions. If your SLO requires P99 < 100ms globally, you know you need multi-region deployment—physics prevents meeting that SLO from a single region for distant users. Let SLOs guide investment in infrastructure.
Latency is a fundamental constraint in distributed systems, governed by immutable physics. Understanding its components, measurement, and optimization is essential for building systems that feel responsive to users worldwide.
Module Complete:
You have now completed the Regions and Availability Zones module. You understand how to select cloud regions strategically, design for availability zone fault isolation, deploy applications across multiple AZs, extend resilience to cross-region architectures, and reason about the latency implications of geographic distribution.
These concepts form the foundation of cloud-native infrastructure design. Every system you build in the cloud benefits from their thoughtful application: choosing the right regions for your users, designing for AZ-level fault tolerance, and understanding the latency constraints that shape user experience. You can now make informed decisions about where and how to deploy infrastructure for availability, performance, and compliance. This knowledge is foundational for the remaining cloud architecture topics.