In real-time systems, latency is the currency by which all architectural decisions are evaluated. While conventional systems might accept latency variability in exchange for throughput or cost efficiency, real-time systems treat latency as a hard constraint that shapes every design choice.
But what does "low latency" actually mean? The answer depends entirely on context: a microsecond is an eternity to a high-frequency trading system, while a full second can feel instant to someone waiting for a report to generate.
This page establishes a rigorous framework for understanding latency expectations: how they vary across domains, how they're measured, and how system architects budget latency across components to meet end-to-end requirements.
By the end of this page, you will understand how latency expectations vary across application domains by orders of magnitude, master the vocabulary and mathematics of latency characterization (percentiles, distributions, jitter), and learn practical techniques for creating and managing latency budgets in complex systems.
Different application domains operate at vastly different latency scales. Understanding where your system falls on this spectrum is the first step in setting appropriate expectations.
The latency hierarchy:
| Latency Range | Domain Examples | Typical Constraint Source |
|---|---|---|
| < 1 microsecond (< 1μs) | High-frequency trading, hardware interrupts, FPGA logic | Speed of light, electronic switching times |
| 1-100 microseconds | Kernel operations, device drivers, network packet processing | CPU cycles, memory access, bus speeds |
| 100μs - 1 millisecond | Database queries, in-memory caching, audio processing | I/O latency, algorithm complexity |
| 1-10 milliseconds | Interactive UI, gaming physics, industrial control systems | Human perception thresholds, control stability |
| 10-100 milliseconds | Web API responses, mobile apps, collaborative editing | Human attention, conversational flow |
| 100ms - 1 second | Page loads, search results, email delivery | User patience, perception of 'instant' |
| 1-10 seconds | File uploads, complex queries, report generation | Task completion expectations |
| > 10 seconds | Batch processing, background sync, analytics | Background operation tolerance |
Physical constraints at the extremes:
At the lowest latencies, physics becomes the limiting factor:
Speed of light: Light travels approximately 30cm per nanosecond. A round-trip across a 10-meter data center network cable takes at least 67 nanoseconds—just for photons to travel, ignoring all processing.
Memory access hierarchy: L1 cache hit: ~1ns. L2 cache: ~4ns. L3 cache: ~15ns. Main memory: ~100ns. SSD: ~100μs. HDD: ~10ms. Each level represents roughly an order of magnitude increase.
CPU clock cycles: At 4GHz, each cycle is 0.25 nanoseconds. Even a "simple" operation requiring 100 cycles takes 25 nanoseconds.
These physical realities create hard floors below which no software optimization can push latency. High-frequency trading firms spend enormous sums locating their servers meters closer to exchanges because at their scale, nanoseconds matter.
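The propagation floor is easy to compute directly. A minimal sketch (the 0.30 m/ns figure is light in vacuum; signals in fiber or copper travel at roughly two-thirds of that):

```python
# Back-of-envelope latency floors from signal propagation alone.
C_VACUUM_M_PER_NS = 0.30  # light in vacuum: ~30 cm per nanosecond
C_FIBER_M_PER_NS = 0.20   # in fiber: roughly two-thirds of c

def round_trip_ns(distance_m: float, speed_m_per_ns: float = C_VACUUM_M_PER_NS) -> float:
    """Minimum round-trip time for a signal, ignoring all processing."""
    return 2 * distance_m / speed_m_per_ns

print(f"10 m in vacuum: {round_trip_ns(10):.0f} ns round trip")
print(f"10 m in fiber:  {round_trip_ns(10, C_FIBER_M_PER_NS):.0f} ns round trip")
```

No amount of software tuning moves these numbers; they are the floor that everything else stacks on top of.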
Before optimizing, identify which latency order of magnitude your domain requires. Optimizing a batch system to sub-millisecond response is wasted effort; failing to achieve sub-100ms for interactive UI destroys user experience. Match your investment to your requirements.
For systems that interact with humans, human perception sets the latency targets. Decades of human-computer interaction research have established clear thresholds that should guide your requirements.
The foundational research:
Jakob Nielsen's "Response Time Limits" (derived from Miller's and Card's earlier research) established three fundamental thresholds that remain relevant today:

- 0.1 seconds: the limit for a response to feel instantaneous; no feedback beyond the result itself is needed.
- 1 second: the limit for the user's flow of thought to stay uninterrupted, though the delay is noticed.
- 10 seconds: the limit for keeping the user's attention; beyond this, users expect progress indicators and will switch to other tasks.
Domain-specific perception thresholds:
Different human senses have different latency sensitivity:
| Modality | Perceptible Delay | Disruptive Delay | Application Impact |
|---|---|---|---|
| Visual (motion) | < 16ms (60fps) | 33ms (30fps) | Choppy video, laggy animations |
| Visual (interaction) | < 50ms | 100ms | Perceived lag in UI, gaming |
| Audio | < 10-20ms | 30ms | Audible echo, desynchronization |
| Audio-visual sync | < 45ms audio leading | 125ms lag | Lip sync perception |
| Haptic (touch) | < 5-10ms | 25ms | Tactile feedback feels disconnected |
| Keyboard input | < 50ms | 100ms | Typing feels sluggish |
The conversation threshold:
For real-time communication systems (video calls, VoIP, gaming voice chat), the ITU-T G.114 standard establishes critical thresholds: one-way (mouth-to-ear) delay under 150ms is acceptable for most conversations, 150-400ms is tolerable but noticeably degrades interactivity, and delays above 400ms are generally unacceptable for interactive speech.
These thresholds explain why satellite phone calls feel unnatural (600ms+ round trip due to geostationary orbit distance) despite high audio quality.
The difference between 50ms and 100ms latency is far more perceptible than the difference between 500ms and 550ms. Human perception follows a logarithmic sensitivity curve—users notice delays relative to their expectations, not in absolute terms. Optimize aggressively at the low end of your latency range.
Accurate latency measurement is surprisingly difficult. Many teams measure latency incorrectly, leading to false confidence in their systems' performance characteristics.
What to measure:
End-to-end latency is the gold standard—the time from when a user initiates an action to when they perceive the result. This includes:

- Client-side input handling and request serialization
- Network transit in both directions
- Queueing time at every hop (load balancers, gateways, thread pools)
- Server-side processing
- Client-side deserialization and rendering
Measuring only server-side processing time can dramatically underestimate actual user-perceived latency.
Coordinated Omission:
One of the most insidious measurement errors is "coordinated omission," identified by Gil Tene. This occurs when a load generator waits for a response before sending the next request. If responses are delayed, the load generator sends fewer requests, hiding the true queuing delays.
Example: suppose the system under test stalls completely for one second. A coordinated load generator waits for the stalled response, records a single slow sample, and then resumes, while a real open-loop workload would have stacked up hundreds of requests behind the stall, every one of them delayed. Tools that "fire and forget" requests on a fixed schedule (regardless of responses) expose the true latency including queue wait times.
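A toy simulation makes the effect concrete. All numbers here are invented (a 1ms service time and a one-second stall), and the server model is deliberately simplistic:

```python
STALL_START, STALL_END = 100, 1100  # server frozen for 1,000 ms (e.g. a GC pause)
SERVICE_MS = 1                      # normal service time

def response_time(send_ms: float) -> float:
    """Latency of a request sent at send_ms. Toy model: requests sent
    during the stall all complete when the stall ends."""
    if STALL_START <= send_ms < STALL_END:
        return (STALL_END - send_ms) + SERVICE_MS
    return SERVICE_MS

# Open loop: fire on a fixed 1 ms schedule, regardless of responses.
open_loop = [response_time(t) for t in range(0, 2000)]

# Closed loop (coordinated omission): wait for each response before sending.
closed_loop, t = [], 0.0
while t < 2000:
    lat = response_time(t)
    closed_loop.append(lat)
    t += lat  # next request only after this one returns

def p99(samples):
    return sorted(samples)[len(samples) * 99 // 100]

print(f"closed-loop p99: {p99(closed_loop):.0f} ms")  # stall nearly invisible
print(f"open-loop   p99: {p99(open_loop):.0f} ms")    # queueing exposed
```

The closed-loop generator records a p99 of 1ms because the stall collapses into a single slow sample; the open-loop schedule reports a p99 near a full second, which is what real clients would have experienced.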
Measurement itself adds latency. High-resolution timestamp calls, logging, and metrics collection consume CPU cycles and memory bandwidth. For microsecond-scale systems, measurement overhead can be significant. Consider sampling and ensure your production measurements don't materially affect the thing you're measuring.
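To see what measurement itself costs on a given machine, you can time the timer. A sketch using Python's time.perf_counter_ns (results vary widely by hardware and OS):

```python
import time

def timestamp_overhead_ns(samples: int = 100_000) -> int:
    """Median cost of one high-resolution timestamp call on this machine."""
    deltas = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        deltas.append(t1 - t0)
    deltas.sort()
    return deltas[len(deltas) // 2]

print(f"one perf_counter_ns() call costs roughly {timestamp_overhead_ns()} ns here")
```

If each measurement costs tens of nanoseconds, instrumenting a microsecond-scale code path at every step distorts the very numbers you are collecting; sampling a fraction of requests is usually the right compromise.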
Latency is not a single number—it's a distribution. Understanding this distribution is essential for real-time system design because the tail of the distribution often determines whether you meet your requirements.
Why averages lie:
Consider two systems:

- System A: every request completes in a steady 100ms.
- System B: 99% of requests complete in 50ms, but 1% take 5,050ms.

Both have the same 100ms average, but System B delivers an unusable experience for 1% of users. At 1 million requests per day, that's 10,000 terrible experiences daily.
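The point is easy to reproduce. Here, two invented workloads share a 100ms mean but have very different tails (numbers chosen for illustration):

```python
import statistics

# System A: perfectly steady. System B: fast for 99%, terrible for 1%.
system_a = [100.0] * 10_000
system_b = [50.0] * 9_900 + [5_050.0] * 100  # same 100 ms mean

def percentile(samples, q):
    """q-th percentile by rank (q is an integer, e.g. 99)."""
    return sorted(samples)[q * len(samples) // 100]

for name, s in (("A", system_a), ("B", system_b)):
    print(f"System {name}: mean={statistics.mean(s):.0f} ms  "
          f"p50={percentile(s, 50):.0f} ms  p99={percentile(s, 99):.0f} ms")
```

The means are identical; only the percentiles reveal that System B's worst 1% of users wait over five seconds.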
Percentile thinking:
Real-time requirements should be specified using percentiles:
| Percentile | Also Called | Interpretation | Common Usage |
|---|---|---|---|
| p50 | Median | 50% of requests faster than this | Baseline/typical experience |
| p90 | 90th percentile | 90% of requests faster; 10% slower | Good user experience threshold |
| p99 | 99th percentile | 1 in 100 requests slower | Critical for consistent UX |
| p99.9 | Three nines | 1 in 1,000 requests slower | High-value transactions |
| p99.99 | Four nines | 1 in 10,000 requests slower | Ultra-low-latency systems |
| Max | Maximum | Single worst observation | Debugging, not SLOs |
The amplification problem:
In distributed systems, latency compounds across service calls. If your request touches 10 backend services in parallel, the end-to-end latency is the maximum of the 10 individual latencies.
Mathematical example: if each of 10 parallel dependencies meets a 100ms target 99% of the time, the probability that all 10 do is 0.99^10 ≈ 90.4%. Roughly 1 request in 10, not 1 in 100, misses the target: each dependency's p99 has become the system's p90.
This is why systems with many dependencies struggle with tail latency—the "long tail" of each dependency contributes to a very long tail at the system level.
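The arithmetic behind tail amplification is a one-liner, assuming the dependencies are independent:

```python
def prob_all_fast(n_services: int, per_service: float = 0.99) -> float:
    """Chance that every one of n independent parallel calls meets a target
    that each call individually meets with probability per_service."""
    return per_service ** n_services

for n in (1, 10, 100):
    print(f"{n:3d} parallel calls: {prob_all_fast(n):.1%} of requests beat the target")
```

With 100 parallel dependencies each at p99, only about 37% of requests beat the target, which is why wide fan-out architectures need far stricter per-service percentiles.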
Google's Jeff Dean popularized the rule that if your service calls N other services, your p99 latency target requires individual services to hit roughly p(100 - 1/N). For 100 dependencies, each needs p99.99 to achieve system-level p99. Design for fewer dependencies or accept tail latency.
For many real-time applications, consistency matters as much as speed. A system with steady 50ms latency often provides better user experience than one oscillating between 10ms and 100ms, even though the latter has better average latency.
Defining jitter:
Jitter is the variation in latency over time. Formally, it's often measured as the standard deviation of latency samples over a window, or, in packet networks, as the average absolute difference in transit time between consecutive packets (the quantity tracked by the smoothed interarrival jitter estimator in RFC 3550).
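Both common definitions take only a few lines to compute. The sample values below are invented; note how a single 95ms spike dominates either measure:

```python
import statistics

# Invented one-way latency samples (ms); one 95 ms spike among ~50 ms values.
latencies = [50, 52, 49, 51, 95, 50, 48, 53, 50, 51]

# Definition 1: standard deviation of the samples in the window.
jitter_stddev = statistics.stdev(latencies)

# Definition 2: mean absolute difference between consecutive samples
# (the quantity RFC 3550's smoothed estimator tracks).
diffs = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
jitter_consecutive = statistics.mean(diffs)

print(f"stddev jitter: {jitter_stddev:.1f} ms")
print(f"consecutive-difference jitter: {jitter_consecutive:.1f} ms")
```

The two definitions respond differently to the same data, so always state which one a jitter figure refers to.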
Why jitter matters:
| Application | Jitter Impact | Jitter Tolerance |
|---|---|---|
| Audio streaming | Audible clicks, pops, gaps | < 10-30ms typically buffered |
| Video playback | Frame drops, stutter | < 30ms for smooth playback |
| Real-time communication | Echo, conversation overlap | < 30ms for natural speech |
| Gaming | Rubber-banding, teleporting | < 20ms for competitive play |
| Control systems | Oscillation, instability | Application-specific, often < 1ms |
| Financial trading | Unfair execution order | Microseconds matter |
| VR/AR | Motion sickness, disorientation | < 20ms end-to-end including rendering |
Jitter buffering:
The classic solution to jitter is buffering—accumulate some data before processing/playback to smooth out variations. But buffering adds latency:
Effective latency = Transmission latency + Buffer depth (expressed in time, e.g. milliseconds of buffered audio)
There's a fundamental tradeoff: a larger buffer absorbs more delay variation but adds latency for every packet, while a smaller buffer keeps latency low but underruns (producing gaps or glitches) whenever a delay spike exceeds what's buffered.
Adaptive jitter buffers dynamically adjust based on observed jitter, but they can't eliminate the underlying constraint.
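The buffering tradeoff can be sketched numerically: pick a buffer just deep enough to absorb a chosen fraction of the observed delay variation, and the effective latency follows. The delay samples below are invented:

```python
# Invented one-way transit delays (ms) with occasional spikes.
delays_ms = sorted([40, 42, 41, 60, 43, 41, 55, 42, 44, 41, 43, 90, 42, 41])
base = delays_ms[0]  # fastest observed transit

def quantile(sorted_samples, q):
    """Value at quantile q (0..1) of an already-sorted list."""
    idx = min(int(q * len(sorted_samples)), len(sorted_samples) - 1)
    return sorted_samples[idx]

# Deeper buffers absorb more variation but add latency for every packet.
for q in (0.50, 0.90, 0.99):
    buffer_ms = quantile(delays_ms, q) - base
    print(f"absorb {q:.0%} of variation: {buffer_ms:2d} ms buffer "
          f"-> {base + buffer_ms} ms effective latency")
```

Covering the worst 1% of spikes here more than doubles the effective latency, which is exactly the constraint adaptive buffers negotiate in real time.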
Sources of jitter:

- Garbage collection and other runtime pauses
- OS scheduling, context switches, and interrupt handling
- Network queueing, congestion, and route changes
- Contention for shared resources (locks, caches, disks)
- Batch or cron workloads sharing hardware with latency-sensitive paths
For jitter-sensitive applications, focus on eliminating variance sources rather than just reducing average latency. Dedicated resources, priority scheduling, traffic shaping, and careful selection of network paths can reduce jitter even if average latency increases slightly.
A latency budget divides an end-to-end latency requirement into allocations for each component of the system. This is the foundational artifact of real-time system design.
The budgeting process:

1. Fix the end-to-end target from user or business requirements.
2. Enumerate every component on the critical path.
3. Allocate a slice of the budget to each component, based on measurements or estimates.
4. Reserve headroom for spikes and future growth.
5. Monitor each component against its allocation and treat overruns as violations.
Example: E-commerce search latency budget
Target: p99 search results < 200ms from keypress to results displayed
| Component | Budget | Rationale |
|---|---|---|
| Client processing (keystroke to request) | 10ms | JavaScript debouncing, serialization |
| Network to CDN/Edge | 15ms | Geographic edge presence |
| Edge processing (routing, caching) | 10ms | Cache lookup, request forwarding |
| Network to origin | 30ms | Inter-datacenter if cache miss |
| API Gateway | 5ms | Routing, auth validation |
| Search service | 50ms | Query parsing, index lookup, ranking |
| Result aggregation | 10ms | Combining from multiple shards |
| Response serialization | 5ms | JSON formatting |
| Network to client | 30ms | Return path |
| Client rendering | 15ms | DOM update, display |
| Sum of allocations | 180ms | Must fit within working budget |
| Reserved headroom | 20ms | For spikes, variation |
| Total budget | 200ms | End-to-end target |
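A budget is most useful when it lives as data with an automated check rather than as a static document. A minimal sketch (the component names and allocations here are illustrative, not a restatement of the table above):

```python
# A latency budget kept as data, with the sum checked automatically.
TARGET_MS = 200
HEADROOM_MS = 20

# Illustrative allocations (invented for this sketch).
budget_ms = {
    "network_round_trip": 60,
    "service_processing": 80,
    "client_rendering": 30,
}

working_ms = TARGET_MS - HEADROOM_MS
allocated_ms = sum(budget_ms.values())
if allocated_ms > working_ms:
    raise ValueError(f"over budget: {allocated_ms} ms > {working_ms} ms working budget")
print(f"{allocated_ms} ms allocated, {working_ms - allocated_ms} ms still unassigned")
```

Running a check like this in CI means a new component, or a grown allocation, cannot silently push the total past the end-to-end target.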
Latency budgets erode over time as features are added and systems evolve. A 5ms component grows to 15ms over two years as edge cases are handled. Establish ongoing monitoring and treat budget violations as production incidents requiring immediate attention, not tech debt to address later.
Latency requirements are formalized through Service Level Indicators (SLIs) and Service Level Objectives (SLOs), which provide measurable, enforceable targets.
Terminology:
SLI (Service Level Indicator): A quantitative measure of some aspect of service quality. For latency: "The proportion of requests that complete within X milliseconds."
SLO (Service Level Objective): A target value or range for an SLI. For latency: "99% of requests will complete within 100ms."
SLA (Service Level Agreement): A contract specifying what happens when SLOs are not met (usually external, with financial consequences).
Defining latency SLOs:
| Service | SLI Definition | SLO Target |
|---|---|---|
| API Gateway | % of requests with latency < 10ms | p99 < 10ms for 99.5% of 5-min windows |
| Search API | % of searches with latency < 200ms | p95 < 200ms measured hourly |
| Payment Processing | % of transactions completing < 500ms | p99.9 < 500ms; p99 < 300ms |
| Real-time Messaging | End-to-end message delivery latency | p99 < 100ms; p50 < 30ms |
| CDN Edge | Time to first byte | p90 < 50ms for cache hits |
Best practices for latency SLOs:

- Specify the percentile, the threshold, and the measurement window together; "p99 < 100ms" is meaningless without a window.
- Measure as close to the user as possible; server-side-only SLIs hide network and client time.
- Derive targets from user needs, not from current performance.
- Leave an error budget so routine deploys and incidents don't immediately breach the SLO.
- Alert on error-budget burn rate rather than on individual slow requests.
Consider having different SLOs for different user segments. Premium customers might have a p99 < 50ms SLO while free tier has p99 < 200ms. Traffic prioritization and resource allocation can then be tuned to each tier's contracted expectations.
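Evaluating a latency SLI against an SLO is straightforward once you have raw samples for a window. A sketch with invented samples (SLO: 99% of requests under 100ms):

```python
def latency_sli(samples_ms, threshold_ms):
    """SLI: proportion of requests completing under the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)

# Invented window of 1,000 samples; SLO: 99% of requests under 100 ms.
window = [20] * 980 + [150] * 20
SLO_TARGET = 0.99

sli = latency_sli(window, 100)
print(f"SLI = {sli:.1%} -> SLO {'met' if sli >= SLO_TARGET else 'violated'}")
```

Here 98% of requests beat the threshold, so the 99% objective is violated even though the vast majority of users saw 20ms responses: the SLI/SLO pair turns "mostly fast" into a pass/fail signal.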
We've established a comprehensive framework for understanding, measuring, and managing latency in real-time systems. Let's consolidate the key concepts:

- Latency expectations span orders of magnitude across domains, with physics setting hard floors at the low end.
- For human-facing systems, perception research sets the thresholds, not engineering convenience.
- Measure end-to-end, guard against coordinated omission, and keep measurement overhead negligible.
- Latency is a distribution: specify requirements as percentiles, and expect fan-out to amplify the tail.
- Jitter can matter as much as average latency; buffering trades latency for consistency.
- Latency budgets allocate the end-to-end target across components, and SLIs/SLOs make those targets measurable and enforceable.
What's next:
With latency expectations established, the next page explores the critical distinction between soft real-time and hard real-time systems. This classification determines the appropriate architecture, implementation technologies, and failure handling strategies for your real-time application.
You now understand how latency expectations vary across domains, how to measure latency correctly, why distributions and percentiles matter, and how to create actionable latency budgets and SLOs. This knowledge is essential for making informed architectural decisions in real-time system design.