We've established that geo-distribution fundamentally addresses latency—but deploying to multiple regions is just the foundation. Extracting maximum benefit requires systematic optimization across every layer of the stack, from network protocols to application architecture.
Latency optimization is not a single technique but a discipline: understanding where time is spent, identifying bottlenecks, and applying appropriate optimizations. Some techniques save milliseconds; others save hundreds of milliseconds. Knowing which optimizations matter for your specific workload separates efficient engineering from wasted effort.
On this page, we'll build a comprehensive toolkit for minimizing user-perceived latency in geo-distributed systems.
By the end of this page, you'll understand how to decompose latency into actionable components, apply edge caching strategies and CDN optimization, use connection and protocol optimizations, choose among geographic traffic routing approaches, and employ application-level techniques for latency reduction.
Before optimizing latency, we must understand where time is actually spent. User-perceived latency is the sum of many components, each requiring different optimization approaches.
A typical HTTP request from browser to server and back involves:
DNS Resolution (0-100ms)
TCP Connection Establishment (1 RTT)
TLS Handshake (1-2 RTTs)
Request Transmission (varies by payload)
Server Processing (application-dependent)
Response Transmission (varies by payload)
Client Processing (browser/app)
| Component | Time (Typical) | Optimization Approach |
|---|---|---|
| DNS Resolution | 0-50ms (cached) | DNS prefetching, low TTL for failover |
| TCP Handshake | 150ms (1 RTT) | Connection reuse, QUIC |
| TLS Handshake | 300ms (2 RTT, TLS 1.2) | TLS 1.3, session resumption, 0-RTT |
| Request Transmission | 10ms | Compression, payload minimization |
| Server Processing | 50ms | Code optimization, caching, async |
| Response Transmission | 100ms | Compression, chunked encoding, streaming |
| Client Processing | 100ms | Smaller JS bundles, lazy loading |
| Total | 760ms | Edge deployment achieves ~100ms |
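As a back-of-the-envelope check, the budget above can be expressed as a quick sketch. The component values are the hypothetical typicals from the table, in milliseconds:

```python
# Hypothetical per-request latency budget (ms), mirroring the table above.
BUDGET = {
    "dns": 50, "tcp_handshake": 150, "tls_handshake": 300,
    "request": 10, "server": 50, "response": 100, "client": 100,
}

def total_latency(budget: dict) -> int:
    return sum(budget.values())

def with_connection_reuse(budget: dict) -> int:
    """A reused (kept-alive) connection skips DNS and both handshakes."""
    return total_latency({**budget, "dns": 0, "tcp_handshake": 0, "tls_handshake": 0})

print(total_latency(BUDGET))          # 760
print(with_connection_reuse(BUDGET))  # 260
```

Even this crude model shows why connection reuse alone removes roughly two-thirds of the per-request overhead on a cold path.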
Optimizations should be prioritized by impact:
Tier 1: Geographic Proximity (100s of milliseconds)
Tier 2: Connection Optimization (10s to 100s of milliseconds)
Tier 3: Caching (10s to 100s of milliseconds)
Tier 4: Payload Optimization (10s of milliseconds)
Tier 5: Application Optimization (milliseconds to 10s of milliseconds)
Engineers often start with Tier 5 (it's comfortable) when Tier 1 would provide 10x the benefit. Work from the top of the hierarchy down.
Use Real User Monitoring (RUM), synthetic monitoring, and distributed tracing to measure actual latency from users' perspectives. Tools like Lighthouse, WebPageTest, or custom tracing reveal where time is actually spent—which often differs from intuition.
Content Delivery Networks (CDNs) place content on edge servers close to users, eliminating the need to traverse the global internet for each request. Properly configured, CDNs provide the single largest latency reduction for many applications.
Edge Locations (PoPs - Points of Presence):
Origin:
Cache Hierarchy:
| Content Type | Cache Strategy | Typical TTL | Invalidation Approach |
|---|---|---|---|
| Static Assets (JS, CSS, images) | Aggressive caching with versioned URLs | 1 year | New version = new URL |
| Media (video, audio) | Aggressive caching, range request support | 1 year | New upload = new URL |
| HTML (static pages) | Short cache with stale-while-revalidate | 5-60 seconds | TTL expiry or purge |
| API Responses (public) | Vary by relevant headers, short TTL | 1-60 seconds | TTL expiry or event-driven |
| API Responses (personalized) | Generally not cacheable at CDN | N/A | N/A |
| Real-time data | Not cacheable | N/A | N/A |
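The strategies in the table can be sketched as a header-selection helper. The content-type names and TTL values here are illustrative, not prescriptive:

```python
def cdn_cache_headers(kind: str) -> dict:
    """Pick response headers per the caching-strategy table (simplified)."""
    if kind in ("static_asset", "media"):
        # Versioned URLs make year-long TTLs safe: a new version is a new URL.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if kind == "html":
        # Short TTL, but keep serving stale content while revalidating.
        return {"Cache-Control": "public, max-age=30, stale-while-revalidate=60"}
    if kind == "api_public":
        return {"Cache-Control": "public, max-age=10", "Vary": "Accept-Encoding"}
    # Personalized and real-time responses bypass the CDN cache entirely.
    return {"Cache-Control": "private, no-store"}
```

A helper like this centralizes the policy so individual endpoints can't accidentally make personalized responses publicly cacheable.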
The cache key determines whether requests share cached responses:
Too Narrow:
Too Broad:
Best Practices:
Vary header correctly (but sparingly)
Time-Based (TTL):
Purge on Event:
Stale-While-Revalidate:
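A cache key that is neither too narrow nor too broad can be sketched as a normalization function. The query-parameter allow-list here is a hypothetical example:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical allow-list: only params that actually change the response.
RELEVANT_PARAMS = {"page", "lang"}

def cache_key(url: str, request_headers: dict) -> str:
    """Normalize the URL so tracking params don't fragment the cache,
    while response-relevant params and headers still split it."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k in RELEVANT_PARAMS)
    # Include only headers the response actually varies on.
    vary = request_headers.get("Accept-Encoding", "")
    return f"{parts.netloc}{parts.path}?{urlencode(query)}|{vary}"

print(cache_key("https://ex.com/a?utm_source=mail&page=2", {"Accept-Encoding": "gzip"}))
# ex.com/a?page=2|gzip
```

Two requests differing only in tracking parameters now share one cached response instead of each missing the cache.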
When choosing a CDN, evaluate:
Major Providers:
Monitor your CDN cache hit ratio continuously. Target 90%+ for static assets, and as high as possible for cacheable dynamic content. A 10% improvement in hit rate can translate to significant latency reduction and origin load reduction. Investigate cache misses—they're often due to misconfigured headers or cache key fragmentation.
Connection establishment and protocol overhead add significant latency, especially over high-latency paths. Modern protocols like HTTP/2, HTTP/3, and TLS 1.3 address many historical inefficiencies.
HTTP/2 addresses HTTP/1.1 limitations:
Multiplexing:
Header Compression (HPACK):
Server Push:
Stream Prioritization:
HTTP/3 uses QUIC (UDP-based transport) instead of TCP:
0-RTT Connection Establishment:
No Head-of-Line Blocking:
Connection Migration:
Built-in Encryption:
Latency Comparison (cross-continental):
| Protocol | Connection Overhead | Multiplexing | Lossy Network Behavior | Adoption Considerations |
|---|---|---|---|---|
| HTTP/1.1 + TLS 1.2 | High (3-4 RTTs) | No | Poor (HOL blocking) | Legacy, full support |
| HTTP/2 + TLS 1.3 | Medium (2 RTTs) | Yes | TCP HOL still exists | Widely supported now |
| HTTP/3 + QUIC | Low (0-1 RTT) | Yes | Excellent | Growing support, some middlebox issues |
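The overhead column can be made concrete with a simplified calculator. It ignores DNS and assumes the round-trip counts from the table:

```python
def connection_setup_ms(rtt_ms: float, protocol: str, resumed: bool = False) -> float:
    """Round trips spent before the first request byte, per the table above."""
    if protocol == "http1.1+tls1.2":
        rtts = 1 + 2                 # TCP handshake + full TLS 1.2 handshake
    elif protocol == "http2+tls1.3":
        rtts = 1 + 1                 # TCP handshake + TLS 1.3 handshake
    elif protocol == "http3":
        rtts = 0 if resumed else 1   # QUIC combines transport + TLS; 0-RTT on resumption
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    return rtts * rtt_ms

# Cross-continental path, ~150ms RTT:
print(connection_setup_ms(150, "http1.1+tls1.2"))       # 450
print(connection_setup_ms(150, "http2+tls1.3"))         # 300
print(connection_setup_ms(150, "http3", resumed=True))  # 0
```

On a 150ms path, the protocol upgrade alone shaves hundreds of milliseconds off every cold connection.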
TLS 1.3:
Session Resumption:
OCSP Stapling:
HTTP Keep-Alive:
Connection Pooling:
Persistent Connections:
Low TTL for Failover:
DNS Pre-resolution:
<link rel="dns-prefetch" href="//api.example.com">
GeoDNS:
Upgrading from HTTP/1.1 to HTTP/2, from TLS 1.2 to 1.3, and eventually to HTTP/3 provides significant latency improvements, especially on high-latency paths. These are typically infrastructure changes that benefit all traffic without application changes.
Getting users to the right region is a solved problem in principle but nuanced in practice. Multiple approaches exist, each with different characteristics.
Mechanism:
Advantages:
Limitations:
Mechanism:
Advantages:
Limitations:
Mechanism:
Advantages:
| Approach | Failover Time | Granularity | Complexity | Cost |
|---|---|---|---|---|
| GeoDNS | TTL-dependent (seconds-minutes) | Per domain/subdomain | Low | Low |
| Anycast | Seconds (BGP reconvergence) | Per IP address | High (BGP) | Medium |
| Global Load Balancer | Seconds (health-based) | Per request | Medium | Medium-High |
| CDN Routing | Seconds | Per request | Low | Varies |
| Client-side logic | Immediate | Per request | Medium | Low |
AWS Global Accelerator:
Google Cloud Global Load Balancing:
Azure Front Door:
Cloudflare:
Latency-Based:
Geographic:
Geofenced:
Weighted:
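These routing policies can be sketched as small selection functions. The region names and latency numbers are hypothetical:

```python
import random

# Hypothetical per-region median latencies from RUM data (ms).
REGION_LATENCY_MS = {"us-east": 40, "eu-west": 120, "ap-south": 220}

def latency_based(latencies: dict) -> str:
    """Route to the region with the lowest measured latency."""
    return min(latencies, key=latencies.get)

def weighted(weights: dict, rng: random.Random) -> str:
    """Split traffic by static weights, e.g. for a gradual regional rollout."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

print(latency_based(REGION_LATENCY_MS))  # us-east
print(weighted({"us-east": 90, "eu-west": 10}, random.Random(0)))
```

Real implementations live in DNS providers or global load balancers, but the decision logic reduces to selections like these.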
VPNs and Proxies:
Mobile Network NAT:
Traveling Users:
Routing decisions are only as good as geo-IP databases and network routing. Use synthetic monitoring from actual user locations to verify routing is working as expected. Tools like Catchpoint, Pingdom, or RUM data reveal when routing goes wrong.
While infrastructure optimizations provide the largest latency wins, application-level techniques further reduce latency and improve user experience.
Parallelization:
Example: Page Load
Sequential: User fetch (50ms) → Posts fetch (100ms) → Ads fetch (50ms) = 200ms
Parallel: User + Posts + Ads all start → Wait for all = 100ms
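The sequential-versus-parallel difference can be sketched with asyncio. The fetchers are stand-ins for real service calls:

```python
import asyncio

async def fetch(name: str, delay_ms: int) -> str:
    """Stand-in for a network call that takes delay_ms."""
    await asyncio.sleep(delay_ms / 1000)
    return name

async def load_page() -> list:
    # All three fetches start at once; elapsed time is roughly the
    # slowest call (100ms), not the 200ms sum of the sequential version.
    return await asyncio.gather(fetch("user", 50), fetch("posts", 100), fetch("ads", 50))

results = asyncio.run(load_page())
print(results)  # ['user', 'posts', 'ads']
```

The key discipline is structural: issue independent requests before awaiting any of them.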
Async Processing:
Example: Write Operations
Sync: Accept order → Charge card → Send email → Update analytics → Response
Async: Accept order → Charge card → Response → (background: email, analytics)
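A minimal in-process sketch of this pattern, using a worker thread as a stand-in for a real message queue. The function names are hypothetical:

```python
import queue
import threading

events = []  # records side effects, for illustration only

def charge_card(order): events.append(("charged", order))
def send_email(order): events.append(("emailed", order))
def update_analytics(order): events.append(("analytics", order))

background = queue.Queue()

def worker():
    while True:
        fn, arg = background.get()
        fn(arg)
        background.task_done()

threading.Thread(target=worker, daemon=True).start()

def accept_order(order: str) -> dict:
    """Only the charge stays on the critical path; the rest is deferred."""
    charge_card(order)                        # must succeed before responding
    background.put((send_email, order))       # runs after the response
    background.put((update_analytics, order))
    return {"status": "accepted", "order": order}

response = accept_order("order-42")
background.join()  # demo only: wait so the deferred work is observable
```

In production the queue would be durable (e.g. a message broker), so deferred work survives process crashes.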
Prefetching:
Examples:
Hedged Requests:
Example:
Single request: p50=50ms, p99=500ms
Hedged (2 backends): p50=50ms, p99≈100ms (wait only for faster)
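A hedged request can be sketched with asyncio: fire the primary, and only if it misses the hedge deadline, fire a duplicate and take whichever answers first. The backends here are stand-ins:

```python
import asyncio

async def backend(delay_ms: int, payload: str) -> str:
    """Stand-in for a backend call with the given latency."""
    await asyncio.sleep(delay_ms / 1000)
    return payload

async def hedged(primary, secondary, hedge_after_ms: float = 75):
    """If the primary hasn't answered within the hedge delay, also send a
    second request and return whichever finishes first."""
    t1 = asyncio.ensure_future(primary)
    try:
        # shield() keeps the timeout from cancelling the primary request.
        return await asyncio.wait_for(asyncio.shield(t1), hedge_after_ms / 1000)
    except asyncio.TimeoutError:
        t2 = asyncio.ensure_future(secondary)
        done, pending = await asyncio.wait({t1, t2}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # drop the slower duplicate
        return done.pop().result()

# A slow primary (the tail-latency case) is rescued by the hedge.
result = asyncio.run(hedged(backend(500, "slow"), backend(50, "fast")))
print(result)  # fast
```

Set the hedge delay near your p95 so duplicates are sent for only the slowest few percent of requests, keeping the extra backend load small.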
Optimistic Updates:
Local-First Architecture:
Skeleton Screens:
Progressive Loading:
Move Logic to the Edge:
Constraints:
Best Suited For:
Server-side latency measurements miss client-perceived latency. Monitor Time to First Byte (TTFB) alongside the Core Web Vitals: Largest Contentful Paint (LCP), Interaction to Next Paint (INP, the successor to First Input Delay), and Cumulative Layout Shift (CLS). Together these capture what users actually experience.
Effective latency optimization requires comprehensive measurement and the ability to debug latency issues when they arise.
What RUM Captures:
Key Metrics:
Segmentation:
What Synthetic Provides:
Best Practices:
| Approach | Strengths | Limitations | Example Tools |
|---|---|---|---|
| RUM | Real user data, actual experience | Requires traffic, privacy considerations | Datadog RUM, New Relic Browser, SpeedCurve |
| Synthetic | Consistent, proactive, controllable | Not real users, may miss edge cases | Catchpoint, Pingdom, WebPageTest |
| APM/Tracing | Server-side detail, root cause | Misses client-side | Datadog APM, Jaeger, Zipkin |
| Log Analysis | Deep detail, custom metrics | Requires aggregation, post-hoc | Elasticsearch, Splunk, CloudWatch |
In geo-distributed systems, requests may span multiple regions. Distributed tracing tracks requests across the entire lifecycle:
Components:
What to Trace:
Cross-Region Tracing:
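Propagation is the crux of cross-region tracing: every hop must forward the trace context. Below is a simplified sketch of W3C-style `traceparent` propagation; the header format is real, but the helper itself is illustrative:

```python
import uuid

def inject_trace_context(headers: dict, trace_id=None) -> dict:
    """Attach a W3C Trace Context header so downstream regions join the same trace."""
    trace_id = trace_id or uuid.uuid4().hex  # 32 hex chars, shared by the whole request
    span_id = uuid.uuid4().hex[:16]          # 16 hex chars, unique to this hop
    return {**headers, "traceparent": f"00-{trace_id}-{span_id}-01"}

# Region A starts the trace; region B would reuse the same trace_id for its hop.
outbound = inject_trace_context({"Accept": "application/json"})
version, trace_id, span_id, flags = outbound["traceparent"].split("-")
```

In practice a tracing library (OpenTelemetry, Jaeger, Zipkin clients) handles this automatically; the point is that the trace ID must cross every region boundary or the trace fragments.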
Systematic Approach:
Quantify the problem
Isolate the component
Examine the evidence
Form and test hypotheses
Apply fix and verify
Common Root Causes:
Average latency hides problems. A p50 of 50ms and p99 of 5000ms (average: ~100ms) means 1% of users wait 5 seconds—a terrible experience. Monitor p50, p95, and p99. Optimize for the percentile that matches your user experience goals.
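A toy dataset in the spirit of the example above shows how the mean hides the tail. This uses the nearest-rank percentile definition, with illustrative numbers:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, -(-p * len(ranked) // 100) - 1))  # ceil rank, 0-based
    return ranked[k]

# 98 fast requests and 2 slow outliers.
latencies = [50] * 98 + [5000] * 2
mean = sum(latencies) / len(latencies)
print(mean)                       # 149.0 -- looks acceptable
print(percentile(latencies, 50))  # 50
print(percentile(latencies, 99))  # 5000 -- 1 in 50 users waits 5 seconds
```

Dashboards built on means would show this service as healthy; a p99 panel immediately exposes the problem.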
We've comprehensively explored latency optimization for geo-distributed systems. Let's consolidate the key insights:
Completing the Module:
We've now covered the full scope of geo-distributed architecture:
You now have a comprehensive foundation for designing and operating geo-distributed systems at scale.
Congratulations! You've completed the Geo-Distributed Architecture module. You now understand the full spectrum of considerations for building systems that serve users globally with low latency and high availability. Apply these principles to design systems that perform excellently regardless of where users are located.