When engineers talk about 'performance,' they often conflate two fundamentally different concepts: latency and throughput. Optimizing for one often comes at the expense of the other, and understanding this trade-off is essential for designing systems that actually meet your performance requirements.
Consider an airport security checkpoint. You can optimize for latency (how quickly an individual passenger gets through) or for throughput (how many passengers get through per hour).
These goals often conflict. Batch processing (grouping passengers, consolidated scanning) can increase throughput but adds latency for individual passengers who must wait for their batch. Dedicated express lanes reduce latency for priority passengers but may reduce overall throughput.
This same tension exists in every layer of software systems—from CPU pipelines to database queries to API designs. Mastering it is essential for building systems that perform well under the specific conditions that matter to your users.
By the end of this page, you will deeply understand the relationship between latency and throughput, why they often trade off against each other, and how to make informed decisions about which to optimize in different contexts. You'll learn measurement techniques, optimization patterns, and how to communicate these trade-offs to stakeholders.
Before we can optimize, we need precise definitions. Vague language leads to miscommunication and misaligned optimization efforts.
Latency:
Latency is the time between initiating a request and receiving the complete response. It measures how long an individual operation takes.
More precisely, latency decomposes into components: time spent waiting in queues, network transit time for the request, processing time, and network transit time for the response.
Total latency is the sum of all components: Total = Queue + Network (request) + Processing + Network (response)
Average latency is almost never the right metric. Use percentiles: p50 (median), p95, p99, p99.9. A system with 10ms average latency but 5-second p99 is very different from one with 50ms average and 60ms p99. Users experience the tail latencies, especially for multi-service requests where the slowest call dominates.
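To make this concrete, here is a minimal pure-Python sketch of percentile computation. The sample latencies and the nearest-rank method are illustrative; production systems typically use histogram-based estimators:

```python
# Sketch: computing latency percentiles from raw samples (pure stdlib).
# The sample values here are illustrative, not from a real system.

def percentile(samples, p):
    """Return the p-th percentile (0-100) using nearest-rank on sorted data."""
    ordered = sorted(samples)
    # Nearest-rank: index of the smallest value covering fraction p of samples.
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [12, 11, 13, 12, 14, 11, 15, 13, 12, 5000]  # one outlier

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"avg={avg:.0f}ms p50={p50}ms p99={p99}ms")
# The average (~511ms) wildly misrepresents the typical request (p50=12ms),
# while p99 exposes the 5-second outlier users actually hit.
```

Notice how a single outlier drags the average to roughly 40 times the median: this is why the median plus tail percentiles, not the mean, should drive alerting and SLOs.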
Throughput:
Throughput is the number of operations completed per unit time. It measures how much work the system can handle.
Common throughput metrics include requests per second (RPS), queries per second (QPS), transactions per second (TPS), messages or records processed per second, and bytes transferred per second.
Throughput can be measured at different points: what the system is capable of (capacity), what it's currently doing (utilization), or what it achieved (historical).
| Dimension | Latency | Throughput |
|---|---|---|
| Unit of Measure | Time (ms, seconds) | Operations per time (RPS, QPS) |
| Perspective | Individual request experience | System aggregate capacity |
| User Impact | Perceived responsiveness | Concurrent user capacity |
| Optimization Goal | Minimize | Maximize |
| Key Constraints | Speed of light, processing time | Resources (CPU, memory, I/O) |
Bandwidth vs. Throughput:
Bandwidth and throughput are related but distinct: bandwidth is the theoretical maximum rate a channel can carry, while throughput is the rate actually achieved.
A network with 1 Gbps bandwidth might achieve only 100 Mbps throughput due to protocol overhead, latency effects, or congestion. Similarly, a database capable of 10,000 QPS might handle only 5,000 QPS due to locking, query complexity, or resource contention.
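The gap between bandwidth and throughput can be made concrete with the classic window-limit calculation: a sender that keeps only one window of data in flight cannot exceed window size divided by round-trip time, no matter how fast the link is. A sketch with illustrative numbers:

```python
# Sketch: why a 1 Gbps link may deliver far less throughput.
# A window-limited sender cannot exceed window_size / RTT, regardless of
# link bandwidth. All numbers below are illustrative.

link_bandwidth_bps = 1_000_000_000      # 1 Gbps link
window_bytes = 64 * 1024                # classic 64 KiB TCP window
rtt_s = 0.050                           # 50 ms round-trip time

# Maximum throughput for a window-limited flow: one window per RTT.
window_limited_bps = window_bytes * 8 / rtt_s
achievable_bps = min(link_bandwidth_bps, window_limited_bps)

print(f"window-limited: {window_limited_bps / 1e6:.1f} Mbps")
# About 10.5 Mbps: roughly 1% of the link's 1 Gbps bandwidth.
```

This is also why high-latency links need large windows (or many parallel connections) to fill their bandwidth.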
The latency-throughput trade-off emerges from fundamental properties of computing systems. Understanding why it exists helps you navigate it intelligently.
Reason 1: Batching Increases Throughput, Adds Latency
Processing items in batches is almost always more efficient than processing them individually: fixed per-operation costs (system calls, network round trips, disk seeks, transaction commits, lock acquisitions) are paid once per batch instead of once per item.
But batching requires waiting to collect a batch, adding latency to individual items: the first item in a batch waits for the last to arrive before any processing begins.
Reason 2: Parallelism Increases Throughput, May Increase Latency
Parallel processing increases throughput by handling multiple requests simultaneously: more workers, more cores, and more replicas all raise the number of operations completed per second.
But parallelism can increase individual request latency: coordination overhead, lock contention, context switching, and cache interference all add time to each individual operation.
Reason 3: Queuing Effects
As throughput approaches capacity, latency increases dramatically. This is described by queuing theory:
Little's Law: L = λW
As arrival rate (λ) approaches service rate, queue length (L) grows, and wait time (W) increases non-linearly. Systems pushed to high utilization have dramatically higher latencies.
Queuing theory shows that latencies hockey-stick around 70-80% capacity. A system running at 50% utilization might have 10ms latency. At 80%, it might be 50ms. At 95%, it could be 500ms. Never run production systems near capacity if latency matters.
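A quick way to see the hockey-stick is the simplest queuing model, M/M/1, where the mean time in system is W = 1/(mu - lambda). The service rate below (100 requests/sec, i.e. 10 ms of pure service time) is illustrative:

```python
# Sketch: the latency hockey-stick from queuing theory, using the simplest
# model (M/M/1): mean time in system W = 1 / (mu - lambda).
# Service rate is illustrative: 100 requests/sec => 10 ms of pure service time.

service_rate = 100.0  # requests per second (mu)

def mean_latency_ms(utilization):
    """Mean time in system for an M/M/1 queue at the given utilization."""
    arrival_rate = utilization * service_rate          # lambda = rho * mu
    return 1000.0 / (service_rate - arrival_rate)      # W = 1/(mu - lambda)

for rho in (0.50, 0.80, 0.95, 0.99):
    print(f"utilization {rho:.0%}: {mean_latency_ms(rho):6.0f} ms")
# Mean latency: 20 ms at 50%, 50 ms at 80%, 200 ms at 95%, 1000 ms at 99%.
```

Real systems are not M/M/1, but the shape (flat, then knee, then vertical) is remarkably universal.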
Reason 4: Resource Allocation Trade-offs
You can often trade resources between latency and throughput: over-provisioned dedicated capacity, precomputed results, and aggressive caching buy low latency, while shared, highly utilized capacity and deferred work buy high throughput.
Every resource allocated to reducing latency is a resource not available for increasing throughput capacity, and vice versa.
Every system has a characteristic latency-throughput curve that describes how latency changes as throughput increases. Understanding this curve is essential for capacity planning and performance optimization.
The Typical Shape:
Most systems follow a predictable pattern:
Low Load Region: Latency is constant and low. The system processes requests immediately with no queuing.
Linear Region: Latency increases slightly as load increases. Some queuing begins, but the system handles it gracefully.
Knee Region: Latency begins increasing faster than linearly. Queues are building. This is the 'knee' of the curve—the point of diminishing returns.
Saturation Region: Latency increases dramatically with small throughput increases. The system is at or near capacity.
Degradation Region: Throughput may actually decrease as the system becomes overloaded. Requests fail, retries add load, and the system spirals.
| Region | Utilization | Latency Behavior | Operational State |
|---|---|---|---|
| Low Load | 0-30% | Constant, minimal | Idle capacity (potentially inefficient) |
| Linear | 30-60% | Slight linear increase | Healthy operating range |
| Knee | 60-80% | Non-linear increase begins | Approaching capacity limits |
| Saturation | 80-95% | Dramatic exponential increase | Over-utilized, add capacity |
| Degradation | >95% | Latency → ∞, throughput drops | System failure, immediate action needed |
Operating Point Selection:
Where on the curve should your system operate? The answer depends on your priorities:
Latency-Optimized Systems (30-50% utilization): generous headroom keeps queues empty, so tail latencies stay flat; the cost is paying for idle capacity.
Balanced Systems (50-70% utilization): reasonable latency at reasonable cost; the sensible default for most services.
Throughput-Optimized Systems (70-85% utilization): maximum work per dollar, accepted at the cost of higher and more variable latency; appropriate for batch and background workloads.
Load test your system to characterize its latency-throughput curve before production. Plot latency percentiles (p50, p95, p99) against throughput. Identify the knee point. Set capacity alerts below the knee. This is one of the most valuable performance engineering exercises you can do.
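One hedged sketch of the analysis step: given (throughput, p99) points from a load test, flag the first point where the latency slope jumps well above the baseline slope. The threshold and the measurements below are illustrative:

```python
# Sketch: locating the knee of a measured latency-throughput curve.
# Input points are (throughput_rps, p99_ms) pairs from a load test; the
# knee is flagged where marginal latency per unit of throughput jumps.
# The measurements and the growth_factor threshold are illustrative.

def find_knee(points, growth_factor=3.0):
    """Return the first point where latency slope exceeds growth_factor
    times the baseline slope between the first two points."""
    base_slope = (points[1][1] - points[0][1]) / (points[1][0] - points[0][0])
    for (t0, l0), (t1, l1) in zip(points, points[1:]):
        slope = (l1 - l0) / (t1 - t0)
        if base_slope > 0 and slope > growth_factor * base_slope:
            return (t1, l1)
    return None

measured = [(100, 11), (200, 12), (300, 14), (400, 22), (500, 80), (550, 400)]
print("knee near:", find_knee(measured))
```

With these sample points the knee lands around 400 RPS; capacity alerts would then sit comfortably below that, e.g. at 300 RPS.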
When low latency is the priority—interactive applications, real-time systems, user-facing APIs—specific optimization strategies apply.
Strategy 1: Reduce Network Round Trips
Network latency is often the dominant factor: a single cross-region round trip can cost more than all server-side processing combined. Combine related calls into one request, fetch independent data in parallel, and cache responses close to the user.
Example: Latency Optimization for a User Profile API
Original design: the handler fetched the user record, preferences, and avatar in three sequential backend calls, so the network round trips added up.
Optimized design: issue the backend calls concurrently so total latency approaches the slowest single call, and cache slow-changing data (such as avatar URLs) to skip round trips entirely.
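A minimal sketch of the concurrency half of this optimization, with hypothetical backend calls simulated by sleeps standing in for network round trips:

```python
# Sketch: collapsing sequential backend calls into concurrent ones.
# The fetch_* coroutines are hypothetical stand-ins (simulated with sleeps);
# in a real service they would be HTTP or RPC calls.
import asyncio
import time

async def fetch_user():
    await asyncio.sleep(0.05)   # simulated 50 ms backend call
    return {"id": 1}

async def fetch_preferences():
    await asyncio.sleep(0.04)   # simulated 40 ms backend call
    return {"theme": "dark"}

async def fetch_avatar():
    await asyncio.sleep(0.03)   # simulated 30 ms backend call
    return {"url": "https://example.com/a.png"}

async def profile_sequential():
    # Latencies add: roughly 50 + 40 + 30 = ~120 ms total.
    return [await fetch_user(), await fetch_preferences(), await fetch_avatar()]

async def profile_parallel():
    # Latency approaches the slowest single call: ~50 ms total.
    return await asyncio.gather(fetch_user(), fetch_preferences(), fetch_avatar())

for handler in (profile_sequential, profile_parallel):
    start = time.perf_counter()
    asyncio.run(handler())
    print(f"{handler.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")
```

The parallel version's latency is bounded by the slowest dependency, which is why slow outlier backends dominate fan-out requests.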
Every latency optimization has costs: caching requires memory and cache invalidation logic. Precomputation requires storage and consistency management. Async work requires eventual consistency reasoning. Always weigh the complexity cost against the latency benefit.
When throughput is the priority—data pipelines, batch processing, analytics workloads—different optimization strategies apply.
Strategy 1: Batching
Batching is the most powerful throughput optimization: it amortizes fixed per-operation costs (statement parsing, network round trips, commits) across many items.
Batch size selection is critical: larger batches amortize overhead better, but they add latency, consume more memory, and enlarge the blast radius of a failed batch. Tune empirically against your own workload.
Example: Throughput Optimization for a Data Import Pipeline
Original design: insert records one at a time, paying statement, network, and transaction-commit overhead on every row.
Optimized design: accumulate records and write them in batches, one transaction per batch, amortizing the fixed costs across hundreds or thousands of rows.
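A minimal sketch of the batching pattern, using an in-memory SQLite database as a stand-in for the real target store:

```python
# Sketch: batching inserts to raise import throughput. An in-memory SQLite
# database stands in for the real target; row counts are illustrative.
import sqlite3
import time

rows = [(i, f"record-{i}") for i in range(10_000)]

def import_per_row(conn):
    # One statement and one commit per record: per-item overhead dominates.
    for row in rows:
        conn.execute("INSERT INTO items VALUES (?, ?)", row)
        conn.commit()

def import_batched(conn, batch_size=1_000):
    # Amortize statement and commit overhead across each batch.
    for i in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO items VALUES (?, ?)", rows[i:i + batch_size])
        conn.commit()

for importer in (import_per_row, import_batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
    start = time.perf_counter()
    importer(conn)
    elapsed = time.perf_counter() - start
    print(f"{importer.__name__}: {len(rows) / elapsed:,.0f} rows/sec")
```

Against a real networked database the gap is far larger than in this in-memory sketch, because each per-row call also pays a network round trip.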
Sometimes throughput optimizations also improve latency. Batching reduces per-item overhead. Caching serves requests faster and reduces backend load. But when they conflict, be clear about your priority and accept the trade-off.
In many scenarios, optimizing for latency directly conflicts with optimizing for throughput. Here's how to make the choice.
Framework for Choosing:
| Factor | Prioritize Latency | Prioritize Throughput |
|---|---|---|
| User Interaction | Synchronous, interactive | Asynchronous, background |
| Request Pattern | Real-time, streaming | Batch, periodic jobs |
| SLA Type | 'Respond within X ms' | 'Process N records per hour' |
| User Perception | Waiting for response | Not directly waiting |
| Business Model | User-facing, engagement | Backend, data processing |
| Cost Structure | Can afford over-provisioning | Resource-constrained |
Real-World Conflict Examples:
Example 1: API Request Handling. Handling each request immediately on arrival minimizes latency; queuing requests and processing them in batches raises throughput but makes individual callers wait.
Example 2: Database Query Optimization. Extra indexes make point reads fast (latency) but slow down every write and consume resources, reducing write throughput; dropping them reverses the trade.
Example 3: Network Communication. Nagle's algorithm coalesces small packets to improve throughput at the cost of added latency per message; disabling it (TCP_NODELAY) does the opposite.
You don't have to choose one optimization for your entire system. Separate request types and optimize each path appropriately—latency-optimized for user-facing requests, throughput-optimized for background jobs, using different queues, workers, and tuning.
You can't optimize what you can't measure. Proper instrumentation of latency and throughput is essential for data-driven optimization.
Latency Measurement Best Practices: measure at the client as well as the server (the gap between them is the network and queue time users actually feel), use a monotonic clock, record full distributions rather than averages, and include time spent queued, not just time spent processing.
Key Dashboard Views:
Latency Over Time: Line chart of p50, p95, p99 with time on X-axis. Immediately shows regressions and spikes.
Latency vs Throughput Scatter: Plot requests with throughput on X, latency on Y. Reveals system's characteristic curve.
Utilization Gauges: Current CPU, memory, disk, network as percentages. Quick health check.
Queue Depth Trends: Time series of queue sizes. Leading indicator of throughput problems.
Error Rate Correlation: Error rate plotted against latency and throughput. Often reveals when pushback begins.
Instrumentation itself has costs. High-cardinality metrics consume memory. Per-request logging impacts latency. Profiling adds overhead. Measure the metrics that matter and sample high-volume data. Over-instrumentation can cause the performance problems you're trying to detect.
Let's examine how major systems navigate the latency-throughput trade-off.
Case Study 1: High-Frequency Trading (Latency at All Costs)
HFT firms optimize for minimum latency, often at extreme cost: co-locating servers inside exchange data centers, bypassing the kernel network stack, implementing strategies in FPGAs, and leasing microwave links that shave milliseconds off fiber routes.
Throughput is secondary—they process one order at a time, extremely fast. This trade-off makes economic sense when a 1μs advantage means millions in profit.
Case Study 2: Data Analytics Pipelines (Throughput at All Costs)
Analytics systems optimize for maximum throughput: columnar storage, large sequential reads, aggressive batching and compression, and horizontal scaling across many nodes.
Latency is secondary—no one cares if yesterday's analytics report takes 30 minutes vs. 35 minutes to generate. What matters is processing petabytes per day.
Case Study 3: Web Application APIs (Balanced Approach)
Typical web APIs balance both concerns: latency SLOs on user-facing percentiles (for example, p99 under a few hundred milliseconds), combined with caching, connection pooling, and autoscaling to sustain throughput within those SLOs.
The balance point depends on specific requirements. E-commerce checkout might prioritize latency (users convert better). Internal APIs might accept higher latency for better efficiency.
HFT firms spend millions on microseconds because that's where their profit comes from. Analytics platforms tolerate hours of latency because batch processing is far cheaper. Your optimization strategy should follow business value—don't over-engineer latency for batch jobs or under-serve interactive users.
Understanding common mistakes helps you avoid them in your own systems.
Microservices architectures often multiply latency problems. A single user request fans out to 10 services, each adding network overhead. Tail latencies compound: if each service is slow 1% of the time, roughly 10% of fan-out requests hit at least one slow call, and a sequential chain pushes worst-case latency toward the sum of the per-service tails. Design for parallel calls, set aggressive timeouts, and measure end-to-end.
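The compounding effect is easy to quantify for the parallel fan-out case. Assuming each backend independently exceeds its p99 threshold 1% of the time:

```python
# Sketch: how tail latencies compound across a parallel fan-out.
# Assume each of 10 backends independently exceeds 50 ms on 1% of calls.
n_services = 10
p_fast = 0.99  # per-backend probability of a fast (sub-threshold) response

# The request is slow if ANY backend is slow.
p_request_slow = 1 - p_fast ** n_services
print(f"{p_request_slow:.1%} of requests see at least one slow backend")
# About 9.6%: a per-service p99 becomes roughly a request-level p90.
```

This is why per-service SLOs must be much tighter than the end-to-end SLO they roll up into.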
We've deeply explored the latency-throughput trade-off that shapes all performance engineering. The key insights: latency and throughput are distinct metrics that often conflict; batching, parallelism, and queuing effects explain why; every system has a characteristic curve with a knee you should find by load testing; measure percentiles, not averages; and let business value, not habit, decide which metric to optimize.
What's Next:
We've covered two major trade-off pairs: consistency vs. availability, and latency vs. throughput. Next, we'll examine Cost vs. Performance—the trade-off that grounds all engineering decisions in economic reality. Every optimization has a price, and every saving has a cost. Understanding this trade-off is essential for building systems that are both effective and sustainable.
You now understand the latency-throughput trade-off deeply—why it exists, how to measure it, and how to optimize for each extreme. You can characterize systems by their latency-throughput curves and make informed decisions about operating points. Next, we'll add the economic dimension to our trade-off analysis.