Amazon famously discovered that every 100ms of latency cost them 1% in sales. Google found that an extra 0.5 seconds in search results generation caused a 20% drop in traffic. Walmart determined that for every 1 second of improvement in page load time, conversions increased by 2%.
These aren't edge cases—they're universal patterns. Latency is the invisible tax on every user interaction. Users don't consciously calculate response times, but their brains register delays and translate them into frustration, impatience, and ultimately, abandonment.
Latency requirements specify the performance boundaries that distinguish a responsive, delightful experience from one that tests user patience. They determine whether your system feels fast, acceptable, or frustratingly slow. In a world where competitors are one click away, latency requirements are competitive differentiators.
By the end of this page, you will master the science of latency requirements: understanding latency anatomy, specifying percentile targets, building latency budgets, accounting for tail latencies, and translating user experience needs into precise technical specifications.
Before specifying latency requirements, you must understand what latency actually measures and where delays accumulate.
Latency Definition:
Latency is the time elapsed from when a request is initiated to when the response is received. This seemingly simple definition hides complexity because 'initiated' and 'received' can be measured at different points.
The Request Journey:
User clicks button
│
▼ Client processing time (JS render, validation)
│
▼ Network: Client → Edge (DNS, TCP handshake, TLS)
│
▼ Edge processing (CDN, WAF, load balancer)
│
▼ Network: Edge → Application server
│
▼ Application processing (business logic)
│
▼ Network: App → Database
│
▼ Database processing (query execution)
│
▼ Network: Database → App
│
▼ Application processing (response assembly)
│
▼ Network: App → Edge
│
▼ Edge processing (compression, caching)
│
▼ Network: Edge → Client
│
▼ Client processing (render, hydration)
│
User sees result
Each step contributes to total latency. Your requirements must specify which segments you're measuring and targeting.
| Measurement | Start Point | End Point | What It Captures |
|---|---|---|---|
| User-perceived latency | User action | Visual feedback | Complete experience, includes client rendering |
| Page load time (PLT) | Navigation start | Page interactive | Full page including assets |
| Time to First Byte (TTFB) | Request sent | First byte received | Network + server processing |
| Server-side latency | Request received at server | Response sent | Pure server processing |
| Database latency | Query issued | Result returned | Database layer only |
| API latency | API call initiated | API response complete | Single API endpoint |
Latency requirements become meaningless if measurement points aren't explicit. 'API latency under 200ms' measured at the server differs dramatically from 'API latency under 200ms' measured at the client (which includes network round-trip). Always specify: 'Server-side API latency, measured from request arrival at application server to response completion, excluding network transit.'
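To make the measurement point unambiguous, instrument at the exact boundaries the requirement names. A minimal sketch, assuming a dictionary-based request/response shape and an illustrative X-Server-Latency-Ms header (neither is tied to any particular framework):

```python
import time
from functools import wraps

def measure_server_side_latency(handler):
    """Wrap a request handler so latency is measured from request arrival at the
    application layer to response completion, excluding network transit."""
    @wraps(handler)
    def wrapped(request):
        arrived = time.perf_counter()          # request arrival at the app server
        response = handler(request)            # business logic + response assembly
        elapsed_ms = (time.perf_counter() - arrived) * 1000
        # Expose the measurement so dashboards can distinguish it from
        # client-observed latency, which also includes network round-trips.
        response.setdefault("headers", {})["X-Server-Latency-Ms"] = f"{elapsed_ms:.1f}"
        return response
    return wrapped

@measure_server_side_latency
def get_product(request):
    # Placeholder handler: a real service would call business logic here.
    return {"status": 200, "body": {"product_id": request["product_id"]}}

print(get_product({"product_id": 42})["headers"]["X-Server-Latency-Ms"])
```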
Average latency is arguably the most misleading metric in system design. It masks the experience of the users who matter most—those encountering the worst performance.
Why Average Fails:
Consider a system where 99 out of every 100 requests complete in 50ms and 1 request takes 5,000ms:
Average latency: (99 × 50 + 1 × 5,000) ÷ 100 = 99.5ms
The average looks great—under 100ms! But 1 in 100 users waits 5 seconds. If your service handles 1 million requests per day, that's 10,000 users experiencing unacceptable latency daily.
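A few lines of code make the gap between the average and the tail concrete for exactly this distribution:

```python
# 99 requests at 50 ms and 1 request at 5,000 ms: the distribution described above.
latencies_ms = [50] * 99 + [5000]

def percentile(values, p):
    """Nearest-rank percentile: smallest value such that p% of samples are <= it."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

average = sum(latencies_ms) / len(latencies_ms)
slow = [x for x in latencies_ms if x >= 1000]
print(f"average: {average:.1f} ms")                     # 99.5 ms -- looks healthy
print(f"median:  {percentile(latencies_ms, 50)} ms")    # 50 ms
print(f"P99.9:   {percentile(latencies_ms, 99.9)} ms")  # 5000 ms -- the hidden tail
print(f"share of requests >= 1 s: {len(slow) / len(latencies_ms):.0%}")  # 1%
```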
Percentile Metrics:
Percentiles reveal the distribution of latency, not just the center:
| Percentile | Meaning | Users Affected by Worse Performance |
|---|---|---|
| P50 (median) | Half of requests are faster | 50% |
| P90 | 90% of requests are faster | 10% |
| P95 | 95% of requests are faster | 5% |
| P99 | 99% of requests are faster | 1% |
| P99.9 | 99.9% of requests are faster | 0.1% |
| P99.99 | 99.99% of requests are faster | 0.01% |
Which Percentile to Specify:
| Percentile | When to Use | Rationale |
|---|---|---|
| P50 | Internal dashboards, general trends | Shows typical experience |
| P90 | Standard SLOs for most services | Good balance of representativeness |
| P95 | User-facing services, e-commerce | Captures meaningful tail |
| P99 | Critical paths, payment flows | Ensures rare bad experiences are bounded |
| P99.9 | Financial systems, trading platforms | Extreme predictability required |
Specifying Multiple Percentiles:
Robust latency requirements specify multiple percentiles:
API Latency Requirements:
- P50 (median): ≤ 50ms
- P90: ≤ 100ms
- P99: ≤ 500ms
- P99.9: ≤ 2,000ms
This specification says: "Most users experience <50ms, the vast majority <100ms, almost everyone <500ms, and essentially nobody waits more than 2 seconds."
A common mistake is specifying unrealistic percentile relationships. If your P50 is 50ms, expecting P99 of 60ms is often impossible without fundamental architectural changes. Realistic ratios: P90 ≈ 2x P50, P99 ≈ 5-10x P50, P99.9 ≈ 10-50x P50. If your measured ratios exceed these, investigate tail latency causes.
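These ratios can be turned into an automated sanity check on proposed targets. A sketch, with the ratio ranges taken from the rule of thumb above (they are heuristics, not hard limits, and the P90 band is widened slightly for slack):

```python
# Typical percentile-to-median ratios from the rule of thumb above.
# Calibrate these against your own measured baselines before enforcing them.
TYPICAL_RATIO = {
    "p90": (1.5, 3.0),     # P90 is usually around 2x P50
    "p99": (5.0, 10.0),    # P99 is usually 5-10x P50
    "p99.9": (10.0, 50.0), # P99.9 is usually 10-50x P50
}

def review_latency_spec(spec_ms: dict) -> list:
    """Flag percentile targets whose ratio to P50 falls outside typical ranges.
    Tighter-than-typical targets are usually unachievable without architectural
    change; looser ratios in measured data usually point at tail-latency problems."""
    notes = []
    p50 = spec_ms["p50"]
    for name, (typ_lo, typ_hi) in TYPICAL_RATIO.items():
        if name not in spec_ms:
            continue
        ratio = spec_ms[name] / p50
        if ratio < typ_lo:
            notes.append(f"{name} = {ratio:.1f}x P50: tighter than the typical "
                         f"{typ_lo:g}x-{typ_hi:g}x range; likely unachievable")
        elif ratio > typ_hi:
            notes.append(f"{name} = {ratio:.1f}x P50: looser than the typical "
                         f"{typ_hi:g}x upper bound; investigate tail causes")
    return notes

# The unrealistic example from the text: P50 of 50ms with a P99 of only 60ms.
print(review_latency_spec({"p50": 50, "p90": 100, "p99": 60, "p99.9": 2000}))
```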
Latency requirements should be grounded in human perception, not arbitrary technical targets. Research on user perception provides concrete guidance:
The 100ms/1s/10s Rule:
Jakob Nielsen's research established thresholds that remain valid today: around 100ms feels instantaneous, around 1 second keeps the user's train of thought intact even though the delay is noticed, and around 10 seconds is the limit for holding attention before users switch tasks or abandon.
Mapping Latency to User Actions:
| Interaction Type | User Expectation | Target Latency (P95) | Rationale |
|---|---|---|---|
| Keystroke response | Instant | <50ms | Typing must feel immediate |
| Button click feedback | Instant | <100ms | Visual acknowledgment needed |
| Autosuggest/autocomplete | Fast | <200ms | Must beat typing speed |
| Page navigation | Quick | <500ms | Maintain browsing flow |
| Search results | Fast | <1000ms | User expects instant answers |
| Form submission | Quick | <2000ms | User committed to waiting |
| Report generation | Tolerant | <5000ms | User knows this needs processing |
| Bulk operations | Patient | <30000ms | Progress indicator essential |
The First Meaningful Paint (FMP) Principle:
For page loads, users don't need everything—they need progress. Modern performance metrics reflect this:
| Metric | Definition | Target |
|---|---|---|
| First Contentful Paint (FCP) | First content renders | <1.8s |
| Largest Contentful Paint (LCP) | Main content visible | <2.5s |
| First Input Delay (FID) | Time to first interactivity | <100ms |
| Cumulative Layout Shift (CLS) | Visual stability | <0.1 |
| Time to Interactive (TTI) | Page fully interactive | <3.8s |
Incorporating User Context:
Latency tolerance varies with user state: a user actively completing a task (logging in, paying) is far less patient than one casually browsing or waiting on an operation they know is heavy.
Your requirements should acknowledge this: "Authentication latency: <500ms P95 (users are focused and impatient). Dashboard load: <2s P95 (users expect first paint quickly but tolerate complete load)."
A 3-second operation with a progress bar feels faster than a 2-second operation with a blank screen. Consider specifying both absolute latency targets AND perceived performance requirements: 'If operation exceeds 500ms, display progress indicator within 100ms of action.'
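That perceived-performance requirement translates naturally into a timeout-based rule. A minimal asyncio sketch of the pattern; the 500ms threshold and the spinner callbacks are illustrative stand-ins for real UI code:

```python
import asyncio

PROGRESS_THRESHOLD_S = 0.5   # show progress if the operation exceeds 500 ms

async def run_with_progress(operation, show_progress, hide_progress):
    """Run `operation`; if it is still pending after the threshold, surface a
    progress indicator so the wait feels intentional rather than broken."""
    task = asyncio.create_task(operation())
    try:
        return await asyncio.wait_for(asyncio.shield(task), PROGRESS_THRESHOLD_S)
    except asyncio.TimeoutError:
        show_progress()            # the indicator appears as soon as the threshold is crossed
        try:
            return await task      # keep waiting for the real result
        finally:
            hide_progress()

async def slow_operation():
    await asyncio.sleep(1.2)       # simulates a 1.2 s backend call
    return "done"

async def main():
    result = await run_with_progress(
        slow_operation,
        show_progress=lambda: print("showing spinner"),
        hide_progress=lambda: print("hiding spinner"),
    )
    print(result)

asyncio.run(main())
```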
Complex systems involve multiple components. A latency budget explicitly allocates the total latency target across each component, ensuring the sum of parts doesn't exceed the whole.
Building a Latency Budget:
Consider an e-commerce product page with a 500ms P95 target:
Total Budget: 500ms P95
├── Network (client → edge): 50ms
├── Edge processing (CDN, WAF): 10ms
├── Network (edge → app): 5ms
├── Load balancer: 5ms
├── Application server: 200ms
│ ├── Auth validation: 20ms
│ ├── Session lookup: 10ms
│ ├── Product fetch: 50ms
│ ├── Pricing calculation: 30ms
│ ├── Inventory check: 40ms
│ ├── Recommendations: 30ms (async, non-blocking)
│ └── Response assembly: 20ms
├── Aggregated DB calls: 150ms
│ ├── Product data: 50ms
│ ├── Pricing data: 30ms
│ └── Inventory data: 70ms
├── Network (app → edge): 5ms
├── Edge processing (compression): 5ms
└── Network (edge → client): 70ms
Total: 500ms ✓
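A budget like this is easy to keep honest in code: represent each allocation as a node and assert that blocking children never exceed their parent. A sketch mirroring the breakdown above (the BudgetNode structure is illustrative, not a real library):

```python
from dataclasses import dataclass, field

@dataclass
class BudgetNode:
    """One component in a latency budget, in milliseconds at the target percentile."""
    name: str
    allocation_ms: float
    children: list = field(default_factory=list)
    blocking: bool = True   # async/non-blocking children don't consume the parent's budget

    def validate(self):
        """Recursively check that blocking children fit inside each parent's allocation."""
        child_total = sum(c.allocation_ms for c in self.children if c.blocking)
        assert child_total <= self.allocation_ms, (
            f"{self.name}: children need {child_total}ms but only {self.allocation_ms}ms allocated"
        )
        for child in self.children:
            child.validate()

page = BudgetNode("product page (P95)", 500, [
    BudgetNode("network client->edge", 50),
    BudgetNode("edge processing", 10),
    BudgetNode("network edge->app", 5),
    BudgetNode("load balancer", 5),
    BudgetNode("application server", 200, [
        BudgetNode("auth validation", 20),
        BudgetNode("session lookup", 10),
        BudgetNode("product fetch", 50),
        BudgetNode("pricing calculation", 30),
        BudgetNode("inventory check", 40),
        BudgetNode("recommendations", 30, blocking=False),  # async, non-blocking
        BudgetNode("response assembly", 20),
    ]),
    BudgetNode("aggregated DB calls", 150),
    BudgetNode("network app->edge", 5),
    BudgetNode("edge compression", 5),
    BudgetNode("network edge->client", 70),
])

page.validate()
print("budget is internally consistent")
```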
Latency Budget Requirements Specification:
Document latency budgets in your requirements:
Latency Budget Requirements (Product Detail Page):
Overall Target: 500ms P95 (server-side), 800ms P95 (user-perceived)
Component Budgets:
1. API Gateway Layer
- Allocation: 15ms P95
- Includes: TLS termination, rate limiting, routing
- Owner: Platform Team
2. Application Layer
- Allocation: 200ms P95
- Includes: Business logic, response assembly
- Owner: Product Team
3. Database Layer
- Allocation: 150ms P95 (aggregate of all queries)
- Individual query limit: 50ms P95
- Owner: Data Team
4. External Service Calls
- Allocation: 100ms P95 (aggregate)
- Fallback: Return cached/default data if exceeded
- Owner: Integration Team
5. Serialization/Transport
- Allocation: 35ms P95
- Owner: Platform Team
Buffer: 0ms (fully allocated)
Escalation: If any component exceeds budget by 20%, trigger architectural review.
| Strategy | Description | When to Use |
|---|---|---|
| Even distribution | Divide equally among components | Components are similar in complexity |
| Proportional allocation | Allocate based on component complexity | Clear understanding of relative costs |
| Slack buffer | Reserve 10-20% unallocated | Uncertain or volatile systems |
| Critical path priority | Allocate most to slowest component | Known bottleneck architecture |
| Parallel processing | Budget total, not sum of parallel calls | Independent concurrent operations |
If three sequential components each have 100ms P99 targets, you cannot derive the end-to-end P99 by adding percentiles. Percentiles don't compose: roughly 3% of requests will hit at least one component's slow tail, so the end-to-end distribution degrades faster than any single component suggests, yet the naive 300ms sum overstates the independent case because all three components rarely hit their worst latency on the same request. Use Monte Carlo simulation over measured per-component latency distributions to set realistic composite targets, and budget extra margin for correlated slowness (shared hosts, GC pauses, overload).
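A rough Monte Carlo sketch of this effect, assuming three independent components with lognormal latency tuned to roughly a 20ms median and 100ms P99 each (the distribution choice and parameters are illustrative):

```python
import math
import random

random.seed(7)

def sample_component_ms():
    """One component's latency: lognormal tuned so the median is ~20 ms
    and the P99 is ~100 ms (sigma stretches the tail accordingly)."""
    median_ms = 20.0
    sigma = math.log(100.0 / median_ms) / 2.326   # 2.326 ≈ z-score of the 99th percentile
    return random.lognormvariate(math.log(median_ms), sigma)

def percentile(values, p):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

N = 200_000
single = [sample_component_ms() for _ in range(N)]
end_to_end = [sum(sample_component_ms() for _ in range(3)) for _ in range(N)]

print(f"single component  P50={percentile(single, 50):6.1f}  P99={percentile(single, 99):6.1f}")
print(f"three in sequence P50={percentile(end_to_end, 50):6.1f}  P99={percentile(end_to_end, 99):6.1f}")
# Typical output: the end-to-end P99 lands well above any single component's P50
# but below the naive 300 ms sum of P99s, because all three components rarely
# hit their worst case on the same request.
```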
Tail latency—the latency experienced by the slowest requests (P99, P99.9, P99.99)—is disproportionately important in modern distributed systems. Understanding and specifying tail latency requirements is essential.
Why Tail Latency Matters:
Fan-Out Amplification:
When a single user request fans out to multiple backend services (common in microservices), tail latency compounds:
This compounding is the tail-at-scale problem: services that look healthy in isolation produce a poor user experience in aggregate, as the table below shows.
| Backend Services per Request | Per-Backend Slow Probability (P99 tail) | Probability a Request Hits at Least One Slow Backend |
|---|---|---|
| 1 | 1% | 1% |
| 10 | 1% | 9.6% |
| 50 | 1% | 39.5% |
| 100 | 1% | 63.4% |
| 500 | 1% | 99.3% |
| 1000 | 1% | 99.99% |
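The right-hand column is simply 1 − 0.99^N; a few lines reproduce the table (up to rounding):

```python
# Probability that a request touching N backends hits at least one backend's
# slow tail, when each backend independently has a 1% chance of being slow.
for n in (1, 10, 50, 100, 500, 1000):
    p_slow = 1 - 0.99 ** n
    print(f"{n:5d} backends -> {p_slow:8.3%} of user requests see at least one slow call")
```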
Sources of Tail Latency:
| Source | Impact | Mitigation |
|---|---|---|
| Garbage collection | Pause times in P99+ | Tune GC, use low-latency collectors |
| Background tasks | Compaction, indexing | Schedule during low traffic |
| Resource contention | Lock waiting, thread pool exhaustion | Separate pools, reduce sharing |
| Network jitter | Variable transmission times | Hedged requests, timeouts |
| Cold cache | Cache misses cause slow path | Pre-warming, consistent hashing |
| Query variability | Some queries touch more data | Query limits, pagination |
| Noisy neighbors | Shared infrastructure contention | Dedicated resources, isolation |
Google's tail-tolerance research ('The Tail at Scale') showed that issuing a duplicate request once the first copy has been outstanding longer than the 95th-percentile expected latency (hedging) dramatically improves tail latency while adding only a few percent of extra load. Specify: 'For operations with >100 backend calls, implement hedged requests issued after the 95th-percentile latency threshold, cancelling the slower copy once either responds.'
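A hedged request is straightforward to sketch with asyncio: fire the primary call, and if it is still pending at the hedge threshold, fire a duplicate and take whichever finishes first. The threshold value, replica names, and simulated backend below are placeholders:

```python
import asyncio
import random

HEDGE_AFTER_S = 0.095   # e.g. the measured 95th-percentile latency of this call

async def backend_call(replica: str) -> str:
    """Stand-in for a real RPC: usually fast, occasionally in the slow tail."""
    await asyncio.sleep(random.choice([0.03] * 95 + [0.8] * 5))
    return f"response from {replica}"

async def hedged_call() -> str:
    primary = asyncio.create_task(backend_call("replica-a"))
    done, _ = await asyncio.wait({primary}, timeout=HEDGE_AFTER_S)
    if done:
        return primary.result()
    # Primary is past the hedge threshold: issue a duplicate and race them.
    hedge = asyncio.create_task(backend_call("replica-b"))
    done, pending = await asyncio.wait({primary, hedge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()               # reclaim resources from the loser
    return done.pop().result()

print(asyncio.run(hedged_call()))
```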
Systems don't maintain constant latency as load increases. Understanding and specifying latency behavior under varying load is critical.
The Latency-Throughput Relationship:
As load approaches capacity, latency typically follows a hockey-stick curve:
Latency
│
│                             ╱
│                            ╱
│                           ╱   ← Capacity exceeded
│                          ╱
│ ────────────────────────╱     ← Knee of curve
│          ↑ Optimal operating range (50-70% capacity)
└──────────────────────────────────────── Load
  0%          50%         80%        100%
Latency Behavior Specification:
Your requirements should specify latency at different load levels:
Latency Under Load Requirements:
1. Baseline (0-50% capacity):
- P50: 30ms
- P95: 80ms
- P99: 200ms
2. Normal (50-70% capacity):
- P50: 40ms (1.3x baseline)
- P95: 120ms (1.5x baseline)
- P99: 350ms (1.75x baseline)
3. High (70-90% capacity):
- P50: 60ms (2x baseline)
- P95: 200ms (2.5x baseline)
- P99: 700ms (3.5x baseline)
4. Critical (90-100% capacity):
- P50: 100ms (3.3x baseline)
- P95: 500ms (6.25x baseline)
- P99: 2000ms (10x baseline)
- Action: Auto-scaling triggered
5. Overload (>100% capacity):
- Load shedding activates
- Accepted requests maintain High-load latency
- Rejected requests receive 503 within 10ms
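The overload tier above calls for shedding excess load with a fast 503. A minimal admission-control sketch; the in-flight capacity figure is a placeholder for a measured limit:

```python
import threading

MAX_IN_FLIGHT = 100          # placeholder for the measured safe capacity

class AdmissionController:
    """Reject work beyond capacity immediately so accepted requests keep
    their latency targets and rejected ones get a fast, cheap 503."""
    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False     # shed: caller returns 503 within milliseconds
            self.in_flight += 1
            return True

    def release(self):
        with self.lock:
            self.in_flight -= 1

controller = AdmissionController(MAX_IN_FLIGHT)

def handle_request(do_work):
    if not controller.try_admit():
        return {"status": 503, "body": "overloaded, retry later"}   # fast rejection path
    try:
        return {"status": 200, "body": do_work()}
    finally:
        controller.release()

print(handle_request(lambda: "ok"))
```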
| Load Level | Acceptable Degradation | Action If Exceeded |
|---|---|---|
| 0-50% | Baseline (1.0x) | None |
| 50-70% | Up to 1.5x baseline | Monitor |
| 70-85% | Up to 2.5x baseline | Prepare scale-up |
| 85-95% | Up to 5x baseline | Scale-up initiated |
| 95-100% | Up to 10x baseline | Load shedding enabled |
| >100% | Shed excess load; accepted requests maintain limits | Alert on-call |
Your system's latency-load curve is unique. Load testing should map this curve explicitly: 'Load testing shall characterize P50/P95/P99 latency at 25%, 50%, 75%, 85%, 95%, and 100% of target capacity. Results shall inform scaling trigger thresholds.'
Light travels at approximately 200,000 km/s through fiber optics. This creates a hard physical limit on latency between geographically distant points.
Minimum Network Latency by Distance:
| Route | Distance (km) | Theoretical Min RTT | Practical RTT |
|---|---|---|---|
| Same datacenter | ~0 | <1ms | 1-2ms |
| Same region (multi-AZ) | ~50 | 0.5ms | 1-5ms |
| Coast to coast (US) | ~4,000 | 40ms | 60-80ms |
| US to Europe | ~6,000 | 60ms | 80-120ms |
| US to Asia | ~12,000 | 120ms | 150-200ms |
| Antipodal round trip | ~40,000 (total path) | 200ms | 300-500ms |
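The theoretical minimums in this table fall straight out of distance and the speed of light in fiber; a quick check:

```python
FIBER_SPEED_KM_PER_S = 200_000   # light in fiber travels at roughly 2/3 of c

def min_round_trip_ms(one_way_km: float) -> float:
    """Theoretical floor on round-trip time: out and back at fiber speed,
    ignoring routing detours, switching, and queuing."""
    return 2 * one_way_km / FIBER_SPEED_KM_PER_S * 1000

for route, km in [("same region", 50), ("US coast to coast", 4000),
                  ("US to Europe", 6000), ("US to Asia", 12000),
                  ("antipodal", 20000)]:
    print(f"{route:20s} {min_round_trip_ms(km):6.1f} ms minimum RTT")
```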
Implications for Latency Requirements:
Geographic latency is a floor—you cannot serve users faster than the speed of light allows.
Multi-Region Requirements:
For global services, specify latency requirements per region:
Geographic Latency Requirements:
1. Primary Region (US-East):
- User-perceived latency: <100ms P95
- Server-side latency: <50ms P95
2. Secondary Regions:
- US-West: User <150ms P95 (includes 40ms RTT)
- Europe: User <200ms P95 (includes 80ms RTT)
- Asia-Pacific: User <250ms P95 (includes 150ms RTT)
3. Global Fallback:
- Users hitting non-local region: <500ms P95
- Acceptable only during regional failure
4. CDN Edge:
- Static assets served from edge: <50ms P95 globally
- Dynamic edge compute: <100ms P95 globally
Design Implications:
If you're seeing 75ms latency to users in Europe from US servers, you cannot optimize your way to 50ms—the speed of light won't allow it. Your options are: multi-region deployment, relaxed latency requirements for distant users, or edge computing. Acknowledge physics in your requirements.
We have covered the complete framework for latency requirements: explicit measurement points, percentile targets grounded in human perception, latency budgets, tail latency and fan-out effects, behavior under load, and the geographic floor set by the speed of light.
What's Next:
With latency requirements mastered, we turn to Consistency Requirements. While latency determines how quickly users receive responses, consistency determines how accurate those responses are—whether users see stale data, conflicting information, or perfectly synchronized state. In distributed systems, consistency and latency often trade off directly, making this topic essential for complete non-functional requirement specification.
You now have a comprehensive framework for defining latency requirements. These specifications drive every decision from infrastructure placement to caching strategy to service architecture.