In real-time systems, latency is the currency by which all architectural decisions are evaluated. While conventional systems might accept latency variability in exchange for throughput or cost efficiency, real-time systems treat latency as a hard constraint that shapes every design choice.
But what does "low latency" actually mean? The answer depends entirely on context: a microsecond is an eternity to a high-frequency trading system, while a full second can feel instant to someone waiting for a report to generate.
This page establishes a rigorous framework for understanding latency expectations: how they vary across domains, how they're measured, and how system architects budget latency across components to meet end-to-end requirements.
By the end of this page, you will understand how latency expectations vary across application domains by orders of magnitude, master the vocabulary and mathematics of latency characterization (percentiles, distributions, jitter), and learn practical techniques for creating and managing latency budgets in complex systems.
Different application domains operate at vastly different latency scales. Understanding where your system falls on this spectrum is the first step in setting appropriate expectations.
The latency hierarchy:
| Latency Range | Domain Examples | Typical Constraint Source |
|---|---|---|
| < 1 microsecond (< 1μs) | High-frequency trading, hardware interrupts, FPGA logic | Speed of light, electronic switching times |
| 1-100 microseconds | Kernel operations, device drivers, network packet processing | CPU cycles, memory access, bus speeds |
| 100μs - 1 millisecond | Database queries, in-memory caching, audio processing | I/O latency, algorithm complexity |
| 1-10 milliseconds | Interactive UI, gaming physics, industrial control systems | Human perception thresholds, control stability |
| 10-100 milliseconds | Web API responses, mobile apps, collaborative editing | Human attention, conversational flow |
| 100ms - 1 second | Page loads, search results, email delivery | User patience, perception of 'instant' |
| 1-10 seconds | File uploads, complex queries, report generation | Task completion expectations |
| > 10 seconds | Batch processing, background sync, analytics | Background operation tolerance |
Physical constraints at the extremes:
At the lowest latencies, physics becomes the limiting factor:
Speed of light: Light travels approximately 30cm per nanosecond. A round-trip across a 10-meter data center network cable takes at least 67 nanoseconds—just for photons to travel, ignoring all processing.
Memory access hierarchy: L1 cache hit: ~1ns. L2 cache: ~4ns. L3 cache: ~15ns. Main memory: ~100ns. SSD: ~100μs. HDD: ~10ms. Each level represents roughly an order of magnitude increase.
CPU clock cycles: At 4GHz, each cycle is 0.25 nanoseconds. Even a "simple" operation requiring 100 cycles takes 25 nanoseconds.
These physical realities create hard floors below which no software optimization can push latency. High-frequency trading firms spend enormous sums locating their servers meters closer to exchanges because at their scale, nanoseconds matter.
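The propagation floor is easy to compute directly. A minimal sketch (the 0.30 m/ns figure is light in vacuum; signals in fiber or copper travel at roughly two-thirds of that):

```python
# Back-of-envelope latency floors from signal propagation alone.
C_VACUUM_M_PER_NS = 0.30  # light in vacuum: ~30 cm per nanosecond
C_FIBER_M_PER_NS = 0.20   # in fiber: roughly two-thirds of c

def round_trip_ns(distance_m: float, speed_m_per_ns: float = C_VACUUM_M_PER_NS) -> float:
    """Minimum round-trip time for a signal, ignoring all processing."""
    return 2 * distance_m / speed_m_per_ns

print(f"10 m in vacuum: {round_trip_ns(10):.0f} ns round trip")
print(f"10 m in fiber:  {round_trip_ns(10, C_FIBER_M_PER_NS):.0f} ns round trip")
```

No amount of software tuning moves these numbers; they are the floor that everything else stacks on top of.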
Before optimizing, identify which latency order of magnitude your domain requires. Optimizing a batch system to sub-millisecond response is wasted effort; failing to achieve sub-100ms for interactive UI destroys user experience. Match your investment to your requirements.
For systems that interact with humans, human perception sets the latency targets. Decades of human-computer interaction research have established clear thresholds that should guide your requirements.
The foundational research:
Jakob Nielsen's "Response Time Limits" (derived from Miller's and Card's earlier research) established three fundamental thresholds that remain relevant today:

- 0.1 seconds: the limit for a response to feel instantaneous; no feedback beyond the result itself is needed.
- 1 second: the limit for the user's flow of thought to stay uninterrupted, though the delay is noticed.
- 10 seconds: the limit for keeping the user's attention; beyond this, users expect progress indicators and will switch to other tasks.
Domain-specific perception thresholds:
Different human senses have different latency sensitivity:
| Modality | Perceptible Delay | Disruptive Delay | Application Impact |
|---|---|---|---|
| Visual (motion) | < 16ms (60fps) | 33ms (30fps) | Choppy video, laggy animations |
| Visual (interaction) | < 50ms | 100ms | Perceived lag in UI, gaming |
| Audio | < 10-20ms | 30ms | Audible echo, desynchronization |
| Audio-visual sync | < 45ms audio leading | 125ms lag | Lip sync perception |
| Haptic (touch) | < 5-10ms | 25ms | Tactile feedback feels disconnected |
| Keyboard input | < 50ms | 100ms | Typing feels sluggish |
The conversation threshold:
For real-time communication systems (video calls, VoIP, gaming voice chat), the ITU-T G.114 standard establishes critical thresholds: one-way (mouth-to-ear) delay under 150ms is acceptable for most conversations, 150-400ms is tolerable but noticeably degrades interactivity, and delays above 400ms are generally unacceptable for interactive speech.
These thresholds explain why satellite phone calls feel unnatural (600ms+ round trip due to geostationary orbit distance) despite high audio quality.
The difference between 50ms and 100ms latency is far more perceptible than the difference between 500ms and 550ms. Human perception follows a logarithmic sensitivity curve—users notice delays relative to their expectations, not in absolute terms. Optimize aggressively at the low end of your latency range.
Accurate latency measurement is surprisingly difficult. Many teams measure latency incorrectly, leading to false confidence in their systems' performance characteristics.
What to measure:
End-to-end latency is the gold standard—the time from when a user initiates an action to when they perceive the result. This includes:

- Client-side input handling and request serialization
- Network transit in both directions
- Queueing time at every hop (load balancers, gateways, thread pools)
- Server-side processing
- Client-side deserialization and rendering
Measuring only server-side processing time can dramatically underestimate actual user-perceived latency.
Coordinated Omission:
One of the most insidious measurement errors is "coordinated omission," identified by Gil Tene. This occurs when a load generator waits for a response before sending the next request. If responses are delayed, the load generator sends fewer requests, hiding the true queuing delays.
Example: suppose the system under test stalls completely for one second. A coordinated load generator waits for the stalled response, records a single slow sample, and then resumes, while a real open-loop workload would have stacked up hundreds of requests behind the stall, every one of them delayed. Tools that "fire and forget" requests on a fixed schedule (regardless of responses) expose the true latency including queue wait times.
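A toy simulation makes the effect concrete. All numbers here are invented (a 1ms service time and a one-second stall), and the server model is deliberately simplistic:

```python
STALL_START, STALL_END = 100, 1100  # server frozen for 1,000 ms (e.g. a GC pause)
SERVICE_MS = 1                      # normal service time

def response_time(send_ms: float) -> float:
    """Latency of a request sent at send_ms. Toy model: requests sent
    during the stall all complete when the stall ends."""
    if STALL_START <= send_ms < STALL_END:
        return (STALL_END - send_ms) + SERVICE_MS
    return SERVICE_MS

# Open loop: fire on a fixed 1 ms schedule, regardless of responses.
open_loop = [response_time(t) for t in range(0, 2000)]

# Closed loop (coordinated omission): wait for each response before sending.
closed_loop, t = [], 0.0
while t < 2000:
    lat = response_time(t)
    closed_loop.append(lat)
    t += lat  # next request only after this one returns

def p99(samples):
    return sorted(samples)[len(samples) * 99 // 100]

print(f"closed-loop p99: {p99(closed_loop):.0f} ms")  # stall nearly invisible
print(f"open-loop   p99: {p99(open_loop):.0f} ms")    # queueing exposed
```

The closed-loop generator records a p99 of 1ms because the stall collapses into a single slow sample; the open-loop schedule reports a p99 near a full second, which is what real clients would have experienced.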
Measurement itself adds latency. High-resolution timestamp calls, logging, and metrics collection consume CPU cycles and memory bandwidth. For microsecond-scale systems, measurement overhead can be significant. Consider sampling and ensure your production measurements don't materially affect the thing you're measuring.
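To see what measurement itself costs on a given machine, you can time the timer. A sketch using Python's time.perf_counter_ns (results vary widely by hardware and OS):

```python
import time

def timestamp_overhead_ns(samples: int = 100_000) -> int:
    """Median cost of one high-resolution timestamp call on this machine."""
    deltas = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        deltas.append(t1 - t0)
    deltas.sort()
    return deltas[len(deltas) // 2]

print(f"one perf_counter_ns() call costs roughly {timestamp_overhead_ns()} ns here")
```

If each measurement costs tens of nanoseconds, instrumenting a microsecond-scale code path at every step distorts the very numbers you are collecting; sampling a fraction of requests is usually the right compromise.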
Latency is not a single number—it's a distribution. Understanding this distribution is essential for real-time system design because the tail of the distribution often determines whether you meet your requirements.
Why averages lie:
Consider two systems:

- System A: every request completes in a steady 100ms.
- System B: 99% of requests complete in 50ms, but 1% take 5,050ms.

Both have the same 100ms average, but System B delivers an unusable experience for 1% of users. At 1 million requests per day, that's 10,000 terrible experiences daily.
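The point is easy to reproduce. Here, two invented workloads share a 100ms mean but have very different tails (numbers chosen for illustration):

```python
import statistics

# System A: perfectly steady. System B: fast for 99%, terrible for 1%.
system_a = [100.0] * 10_000
system_b = [50.0] * 9_900 + [5_050.0] * 100  # same 100 ms mean

def percentile(samples, q):
    """q-th percentile by rank (q is an integer, e.g. 99)."""
    return sorted(samples)[q * len(samples) // 100]

for name, s in (("A", system_a), ("B", system_b)):
    print(f"System {name}: mean={statistics.mean(s):.0f} ms  "
          f"p50={percentile(s, 50):.0f} ms  p99={percentile(s, 99):.0f} ms")
```

The means are identical; only the percentiles reveal that System B's worst 1% of users wait over five seconds.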
Percentile thinking:
Real-time requirements should be specified using percentiles:
| Percentile | Also Called | Interpretation | Common Usage |
|---|---|---|---|
| p50 | Median | 50% of requests faster than this | Baseline/typical experience |
| p90 | 90th percentile | 90% of requests faster; 10% slower | Good user experience threshold |
| p99 | 99th percentile | 1 in 100 requests slower | Critical for consistent UX |
| p99.9 | Three nines | 1 in 1,000 requests slower | High-value transactions |
| p99.99 | Four nines | 1 in 10,000 requests slower | Ultra-low-latency systems |
| Max | Maximum | Single worst observation | Debugging, not SLOs |
The amplification problem:
In distributed systems, latency compounds across service calls. If your request touches 10 backend services in parallel, the end-to-end latency is the maximum of the 10 individual latencies.
Mathematical example: if each of 10 parallel dependencies meets a 100ms target 99% of the time, the probability that all 10 do is 0.99^10 ≈ 90.4%. Roughly 1 request in 10, not 1 in 100, misses the target: each dependency's p99 has become the system's p90.
This is why systems with many dependencies struggle with tail latency—the "long tail" of each dependency contributes to a very long tail at the system level.
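The arithmetic behind tail amplification is a one-liner, assuming the dependencies are independent:

```python
def prob_all_fast(n_services: int, per_service: float = 0.99) -> float:
    """Chance that every one of n independent parallel calls meets a target
    that each call individually meets with probability per_service."""
    return per_service ** n_services

for n in (1, 10, 100):
    print(f"{n:3d} parallel calls: {prob_all_fast(n):.1%} of requests beat the target")
```

With 100 parallel dependencies each at p99, only about 37% of requests beat the target, which is why wide fan-out architectures need far stricter per-service percentiles.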
Google's Jeff Dean popularized the rule that if your service calls N other services, your p99 latency target requires individual services to hit roughly p(100 - 1/N). For 100 dependencies, each needs p99.99 to achieve system-level p99. Design for fewer dependencies or accept tail latency.
For many real-time applications, consistency matters as much as speed. A system with steady 50ms latency often provides better user experience than one oscillating between 10ms and 100ms, even though the latter has better average latency.
Defining jitter:
Jitter is the variation in latency over time. Formally, it's often measured as the standard deviation of latency samples over a window, or, in packet networks, as the average absolute difference in transit time between consecutive packets (the quantity tracked by the smoothed interarrival jitter estimator in RFC 3550).
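Both common definitions take only a few lines to compute. The sample values below are invented; note how a single 95ms spike dominates either measure:

```python
import statistics

# Invented one-way latency samples (ms); one 95 ms spike among ~50 ms values.
latencies = [50, 52, 49, 51, 95, 50, 48, 53, 50, 51]

# Definition 1: standard deviation of the samples in the window.
jitter_stddev = statistics.stdev(latencies)

# Definition 2: mean absolute difference between consecutive samples
# (the quantity RFC 3550's smoothed estimator tracks).
diffs = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
jitter_consecutive = statistics.mean(diffs)

print(f"stddev jitter: {jitter_stddev:.1f} ms")
print(f"consecutive-difference jitter: {jitter_consecutive:.1f} ms")
```

The two definitions respond differently to the same data, so always state which one a jitter figure refers to.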
Why jitter matters:
| Application | Jitter Impact | Jitter Tolerance |
|---|---|---|
| Audio streaming | Audible clicks, pops, gaps | < 10-30ms typically buffered |
| Video playback | Frame drops, stutter | < 30ms for smooth playback |
| Real-time communication | Echo, conversation overlap | < 30ms for natural speech |
| Gaming | Rubber-banding, teleporting | < 20ms for competitive play |
| Control systems | Oscillation, instability | Application-specific, often < 1ms |
| Financial trading | Unfair execution order | Microseconds matter |
| VR/AR | Motion sickness, disorientation | < 20ms end-to-end including rendering |
Jitter buffering:
The classic solution to jitter is buffering—accumulate some data before processing/playback to smooth out variations. But buffering adds latency:
Effective latency = Transmission latency + Buffer depth (expressed in time, e.g. milliseconds of buffered audio)
There's a fundamental tradeoff: a larger buffer absorbs more delay variation but adds latency for every packet, while a smaller buffer keeps latency low but underruns (producing gaps or glitches) whenever a delay spike exceeds what's buffered.
Adaptive jitter buffers dynamically adjust based on observed jitter, but they can't eliminate the underlying constraint.
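The buffering tradeoff can be sketched numerically: pick a buffer just deep enough to absorb a chosen fraction of the observed delay variation, and the effective latency follows. The delay samples below are invented:

```python
# Invented one-way transit delays (ms) with occasional spikes.
delays_ms = sorted([40, 42, 41, 60, 43, 41, 55, 42, 44, 41, 43, 90, 42, 41])
base = delays_ms[0]  # fastest observed transit

def quantile(sorted_samples, q):
    """Value at quantile q (0..1) of an already-sorted list."""
    idx = min(int(q * len(sorted_samples)), len(sorted_samples) - 1)
    return sorted_samples[idx]

# Deeper buffers absorb more variation but add latency for every packet.
for q in (0.50, 0.90, 0.99):
    buffer_ms = quantile(delays_ms, q) - base
    print(f"absorb {q:.0%} of variation: {buffer_ms:2d} ms buffer "
          f"-> {base + buffer_ms} ms effective latency")
```

Covering the worst 1% of spikes here more than doubles the effective latency, which is exactly the constraint adaptive buffers negotiate in real time.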
Sources of jitter:

- Garbage collection and other runtime pauses
- OS scheduling, context switches, and interrupt handling
- Network queueing, congestion, and route changes
- Contention for shared resources (locks, caches, disks)
- Batch or cron workloads sharing hardware with latency-sensitive paths
For jitter-sensitive applications, focus on eliminating variance sources rather than just reducing average latency. Dedicated resources, priority scheduling, traffic shaping, and careful selection of network paths can reduce jitter even if average latency increases slightly.
A latency budget divides an end-to-end latency requirement into allocations for each component of the system. This is the foundational artifact of real-time system design.
The budgeting process:

1. Fix the end-to-end target from user or business requirements.
2. Enumerate every component on the critical path.
3. Allocate a slice of the budget to each component, based on measurements or estimates.
4. Reserve headroom for spikes and future growth.
5. Monitor each component against its allocation and treat overruns as violations.
Example: E-commerce search latency budget
Target: p99 search results < 200ms from keypress to results displayed
| Component | Budget | Rationale |
|---|---|---|
| Client processing (keystroke to request) | 10ms | JavaScript debouncing, serialization |
| Network to CDN/Edge | 15ms | Geographic edge presence |
| Edge processing (routing, caching) | 10ms | Cache lookup, request forwarding |
| Network to origin | 30ms | Inter-datacenter if cache miss |
| API Gateway | 5ms | Routing, auth validation |
| Search service | 50ms | Query parsing, index lookup, ranking |
| Result aggregation | 10ms | Combining from multiple shards |
| Response serialization | 5ms | JSON formatting |
| Network to client | 30ms | Return path |
| Client rendering | 15ms | DOM update, display |
| Sum of allocations | 180ms | Must fit within working budget |
| Reserved headroom | 20ms | For spikes, variation |
| Total budget | 200ms | End-to-end target |
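A budget is most useful when it lives as data with an automated check rather than as a static document. A minimal sketch (the component names and allocations here are illustrative, not a restatement of the table above):

```python
# A latency budget kept as data, with the sum checked automatically.
TARGET_MS = 200
HEADROOM_MS = 20

# Illustrative allocations (invented for this sketch).
budget_ms = {
    "network_round_trip": 60,
    "service_processing": 80,
    "client_rendering": 30,
}

working_ms = TARGET_MS - HEADROOM_MS
allocated_ms = sum(budget_ms.values())
if allocated_ms > working_ms:
    raise ValueError(f"over budget: {allocated_ms} ms > {working_ms} ms working budget")
print(f"{allocated_ms} ms allocated, {working_ms - allocated_ms} ms still unassigned")
```

Running a check like this in CI means a new component, or a grown allocation, cannot silently push the total past the end-to-end target.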
Latency budgets erode over time as features are added and systems evolve. A 5ms component grows to 15ms over two years as edge cases are handled. Establish ongoing monitoring and treat budget violations as production incidents requiring immediate attention, not tech debt to address later.
Latency requirements are formalized through Service Level Indicators (SLIs) and Service Level Objectives (SLOs), which provide measurable, enforceable targets.
Terminology:
SLI (Service Level Indicator): A quantitative measure of some aspect of service quality. For latency: "The proportion of requests that complete within X milliseconds."
SLO (Service Level Objective): A target value or range for an SLI. For latency: "99% of requests will complete within 100ms."
SLA (Service Level Agreement): A contract specifying what happens when SLOs are not met (usually external, with financial consequences).
Defining latency SLOs:
| Service | SLI Definition | SLO Target |
|---|---|---|
| API Gateway | % of requests with latency < 10ms | p99 < 10ms for 99.5% of 5-min windows |
| Search API | % of searches with latency < 200ms | p95 < 200ms measured hourly |
| Payment Processing | % of transactions completing < 500ms | p99.9 < 500ms; p99 < 300ms |
| Real-time Messaging | End-to-end message delivery latency | p99 < 100ms; p50 < 30ms |
| CDN Edge | Time to first byte | p90 < 50ms for cache hits |
Best practices for latency SLOs:

- Specify the percentile, the threshold, and the measurement window together; "p99 < 100ms" is meaningless without a window.
- Measure as close to the user as possible; server-side-only SLIs hide network and client time.
- Derive targets from user needs, not from current performance.
- Leave an error budget so routine deploys and incidents don't immediately breach the SLO.
- Alert on error-budget burn rate rather than on individual slow requests.
Consider having different SLOs for different user segments. Premium customers might have a p99 < 50ms SLO while free tier has p99 < 200ms. Traffic prioritization and resource allocation can then be tuned to each tier's contracted expectations.
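Evaluating a latency SLI against an SLO is straightforward once you have raw samples for a window. A sketch with invented samples (SLO: 99% of requests under 100ms):

```python
def latency_sli(samples_ms, threshold_ms):
    """SLI: proportion of requests completing under the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)

# Invented window of 1,000 samples; SLO: 99% of requests under 100 ms.
window = [20] * 980 + [150] * 20
SLO_TARGET = 0.99

sli = latency_sli(window, 100)
print(f"SLI = {sli:.1%} -> SLO {'met' if sli >= SLO_TARGET else 'violated'}")
```

Here 98% of requests beat the threshold, so the 99% objective is violated even though the vast majority of users saw 20ms responses: the SLI/SLO pair turns "mostly fast" into a pass/fail signal.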
We've established a comprehensive framework for understanding, measuring, and managing latency in real-time systems. Let's consolidate the key concepts:

- Latency expectations span orders of magnitude across domains, with physics setting hard floors at the low end.
- For human-facing systems, perception research sets the thresholds, not engineering convenience.
- Measure end-to-end, guard against coordinated omission, and keep measurement overhead negligible.
- Latency is a distribution: specify requirements as percentiles, and expect fan-out to amplify the tail.
- Jitter can matter as much as average latency; buffering trades latency for consistency.
- Latency budgets allocate the end-to-end target across components, and SLIs/SLOs make those targets measurable and enforceable.
What's next:
With latency expectations established, the next page explores the critical distinction between soft real-time and hard real-time systems. This classification determines the appropriate architecture, implementation technologies, and failure handling strategies for your real-time application.
You now understand how latency expectations vary across domains, how to measure latency correctly, why distributions and percentiles matter, and how to create actionable latency budgets and SLOs. This knowledge is essential for making informed architectural decisions in real-time system design.