If latency tells us how fast a single request completes, throughput tells us how much work the system handles over time. It's the difference between asking "How quickly can you run to the store?" versus "How many trips to the store can you make in an hour?"
Throughput—measured in requests per second (RPS), queries per second (QPS), or transactions per second (TPS)—defines your system's capacity. It determines how many users you can serve, how much data you can process, and ultimately, how big your business can grow before infrastructure becomes the bottleneck.
Understanding throughput is essential because it governs capacity planning, cost optimization, and scalability decisions. A system with beautiful 10ms latency but maxing out at 100 RPS won't survive a product launch. Conversely, a system capable of 1 million RPS but with 5-second latency delivers an unusable experience.
This page will teach you how to reason about throughput: what it measures, how to calculate theoretical and practical limits, how throughput interacts with latency, and how to optimize for higher throughput while maintaining acceptable response times. You'll develop the mental models for capacity planning that distinguish experienced system architects.
Throughput measures the rate of completed work over time. While the concept seems simple, precision matters significantly in system design conversations.
Common Throughput Metrics
| Metric | Full Name | Unit | Typical Context |
|---|---|---|---|
| RPS | Requests Per Second | req/s | Web servers, API gateways, microservices |
| QPS | Queries Per Second | queries/s | Databases, search engines, cache systems |
| TPS | Transactions Per Second | txn/s | Payment systems, databases, trading platforms |
| MPS | Messages Per Second | msg/s | Message queues, event streaming, pub/sub |
| IOPS | I/O Operations Per Second | ops/s | Storage systems, disk arrays, databases |
| BPS | Bytes Per Second | bytes/s | Network links, data pipelines, streaming |
What Counts as a "Request"?
Throughput measurements require clarity on what constitutes a unit of work:
Different operations have vastly different costs. A read-only cache hit costs microseconds; a complex database transaction costs hundreds of milliseconds. 10,000 cache RPS represents less load than 1,000 complex-write RPS.
The Solution: Weighted Throughput
Sophisticated capacity planning weights operations by their cost:
Effective Load = Σ (operation_count × operation_cost)
Rather than reporting "1000 RPS," you might report "1000 RPS (70% reads, 25% simple writes, 5% complex transactions)" to distinguish from "1000 RPS (100% complex transactions)."
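The weighting idea above can be sketched in a few lines. The operation names and cost weights here are illustrative assumptions (a cache read is taken as the unit of work):

```python
# Weighted throughput: convert a raw request mix into "effective load"
# by weighting each operation type by its relative cost.

def effective_load(mix, costs):
    """Sum of operation_count * operation_cost across the mix."""
    return sum(count * costs[op] for op, count in mix.items())

# Hypothetical cost weights, relative to a cache read = 1.
costs = {"read": 1, "simple_write": 5, "complex_txn": 50}

# Same headline "1000 RPS", very different real load:
mixed = {"read": 700, "simple_write": 250, "complex_txn": 50}
heavy = {"complex_txn": 1000}

print(effective_load(mixed, costs))  # 4450
print(effective_load(heavy, costs))  # 50000
```

Here the all-transaction workload imposes more than ten times the effective load of the mixed workload, even though both report the same RPS.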
Throughput and bandwidth are related but distinct. Bandwidth is the theoretical maximum data transfer rate of a channel (e.g., a 1 Gbps network link). Throughput is the actual rate achieved, which is always less than bandwidth due to overhead, latency, and protocol inefficiencies. A 1 Gbps link might achieve 800 Mbps throughput under ideal conditions and 200 Mbps under high packet loss.
Every system has theoretical maximum throughput determined by its bottleneck resource. Understanding these limits helps you predict capacity and identify optimization targets.
Little's Law: The Fundamental Relationship
Little's Law is one of the most important equations in system performance:
L = λ × W
Where L is the average number of requests in the system (concurrency), λ is the average arrival rate (throughput), and W is the average time a request spends in the system (latency).
Rearranged for throughput:
Throughput = Concurrency / Latency
Or: λ = L / W
This fundamental law has profound implications:
For a server with N worker threads and average latency L, maximum throughput ≈ N / L. If you have 50 worker threads and 25ms average latency: max throughput ≈ 50 / 0.025 = 2000 RPS. This simple calculation helps validate capacity claims and plan scaling.
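The calculation above is simple enough to sketch directly:

```python
def max_throughput(concurrency, avg_latency_s):
    """Little's Law rearranged: throughput (λ) = concurrency (L) / latency (W)."""
    return concurrency / avg_latency_s

# 50 worker threads, 25 ms average latency:
print(round(max_throughput(50, 0.025)))  # 2000 RPS
```

If a vendor claims 10,000 RPS from 50 threads at 25ms latency, this one-liner shows the claim violates Little's Law.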
CPU-Bound Throughput Limits
For compute-intensive workloads, CPU becomes the bottleneck:
Max Throughput = (CPU_cores × CPU_utilization_target) / CPU_time_per_request
Example: 8 CPU cores, a 75% utilization target, and 10ms (0.01s) of CPU time per request:
Max Throughput = (8 × 0.75) / 0.01 = 600 RPS
I/O-Bound Throughput Limits
For I/O-intensive workloads, storage or network becomes the bottleneck:
Max Throughput = Available_IOPS / IOPS_per_request
Example: 50,000 available IOPS and 5 I/O operations per request:
Max Throughput = 50,000 / 5 = 10,000 RPS
The actual limit is the minimum of the CPU and I/O limits.
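The two limits can be combined to find the binding constraint; the function names here are illustrative:

```python
def cpu_limit(cores, utilization_target, cpu_s_per_request):
    """Max RPS before CPU saturates."""
    return cores * utilization_target / cpu_s_per_request

def io_limit(available_iops, iops_per_request):
    """Max RPS before storage saturates."""
    return available_iops / iops_per_request

cpu = cpu_limit(8, 0.75, 0.01)   # ≈ 600 RPS
io = io_limit(50_000, 5)         # 10,000 RPS
print(round(min(cpu, io)))       # 600 (CPU is the bottleneck here)
```

With these numbers, optimizing storage would be wasted effort: the system is CPU-bound long before it exhausts its IOPS budget.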
Latency and throughput are deeply interconnected, but their relationship is often misunderstood. They are not simply inversely proportional—the relationship is more nuanced and has critical implications for system design.
The Queueing Theory Perspective
As throughput approaches system capacity, latency doesn't just increase—it explodes. This is the queueing theory effect:
The mathematical reason: at high utilization, requests spend most of their time waiting in queues rather than being processed.
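This effect can be made concrete with a simplified M/M/1 queueing model (an assumption; real systems are messier, but the shape of the curve holds):

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time in an M/M/1 queue: W = S / (1 - ρ)."""
    if utilization >= 1:
        raise ValueError("queue is unstable at utilization >= 1")
    return service_time_s / (1 - utilization)

service = 0.010  # 10 ms of actual work per request
for rho in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.2f}: {mm1_response_time(service, rho) * 1000:.0f} ms")
```

A request needing only 10ms of work takes 100ms at 90% utilization and 1000ms at 99%: the extra time is pure queueing.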
Never plan for more than 70-80% sustained utilization. Beyond this threshold, latency becomes unpredictable and the system becomes fragile. Traffic spikes that push you to 95% utilization will cause severe latency degradation even if the system technically survives.
Understanding the Curve
The latency-throughput relationship follows characteristic patterns:
| Utilization Zone | Latency Behavior | System State | Engineering Response |
|---|---|---|---|
| 0-50% | Stable, near-optimal latency | Healthy, headroom available | Cost optimization opportunity |
| 50-70% | Slight increase, still predictable | Normal production operation | Target zone for most systems |
| 70-85% | Noticeable increase, some spikes | Caution zone | Monitor closely, plan scaling |
| 85-95% | Rapid increase, high variability | Danger zone | Immediate action needed |
| 95-100% | Exponential increase, unpredictable | Critical failure imminent | Shed load, emergency response |
The Throughput-Latency Trade-off
You often cannot maximize both throughput and minimize latency simultaneously:
Optimizing for throughput: Batch operations, full utilization, queue depth management. Latency suffers because requests wait in batches.
Optimizing for latency: Dedicated resources, immediate processing, over-provisioning. Throughput suffers because resources sit idle.
The right balance depends on your use case: batch analytics and data pipelines favor throughput, while interactive APIs and user-facing pages favor latency.
Accurate throughput measurement requires attention to methodology. Common mistakes lead to overly optimistic or misleading results.
Measurement Intervals Matter
Throughput varies over time. The interval you choose for measurement affects the story:
Reporting "1000 RPS" could mean:
Benchmark throughput often exceeds production throughput by 2-5x. Benchmarks use ideal conditions: uniform requests, warm caches, no network variance, no garbage collection pressure. Production has all these factors plus real-world complexity. Never promise benchmark numbers in SLAs.
Successful vs. Total Throughput
A critical distinction that catches teams off guard:
A system handling 10,000 RPS with 30% error rate has only 7,000 successful RPS. If 20% exceed latency SLOs, goodput is only 5,600 RPS.
Always measure goodput—it's what actually matters for users.
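The goodput arithmetic from the example above, as a sketch (this assumes SLO misses are measured among successful requests):

```python
def goodput(total_rps, error_rate, slo_miss_rate):
    """Requests that both succeeded and met the latency SLO."""
    successful = total_rps * (1 - error_rate)
    return successful * (1 - slo_miss_rate)

# 10,000 RPS total, 30% errors, 20% of successes miss the latency SLO:
print(round(goodput(10_000, 0.30, 0.20)))  # 5600
```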
Load Testing for Throughput
To find maximum throughput, use proper load testing methodology:
| Approach | Best For | Pitfall To Avoid |
|---|---|---|
| Open-loop | Finding breaking point | Overwhelming the system before observing degradation |
| Closed-loop | Simulating dependent clients | Hiding true capacity limits due to coordinated omission |
| Stepped load | Finding optimal operating point | Steps too large missing the critical threshold |
| Soak testing | Long-term stability validation | Too short duration missing slow resource leaks |
Every system has a bottleneck—the component limiting overall throughput. Following Amdahl's Law, improving non-bottleneck components provides diminishing returns. You must identify and address the actual bottleneck.
Common Throughput Bottlenecks
Typical bottlenecks include CPU saturation, disk IOPS, network bandwidth, database connection pools, and lock contention. Profile under realistic load to find which one binds first.
The Universal Scalability Law (USL) models throughput as: X(N) = N / (1 + α(N-1) + β·N·(N-1)), where α represents contention and β represents coherence overhead. This explains why adding more capacity sometimes decreases throughput—coordination overhead eventually dominates.
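The USL curve is easy to explore numerically. The α and β values below are illustrative, and the formula is normalized so X(1) = 1:

```python
def usl(n, alpha=0.05, beta=0.001):
    """Universal Scalability Law: X(N) = N / (1 + α(N-1) + β·N·(N-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

for n in (1, 8, 16, 32, 64):
    print(n, round(usl(n), 1))
# Throughput rises, peaks near N = sqrt((1-α)/β) ≈ 31, then declines.
```

With these coefficients, 64 nodes deliver less throughput than 32: coherence overhead has overtaken the benefit of added capacity.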
When current throughput is insufficient, you have two fundamental scaling options: vertical (bigger machines) and horizontal (more machines). Each has distinct throughput implications.
Throughput Scaling Techniques
Common techniques include horizontal scaling behind a load balancer, partitioning (sharding), read replicas, caching, request batching, and asynchronous processing. Each trades added complexity for capacity.
A system's throughput under normal conditions is interesting; its throughput under failure conditions is what matters for reliability. How does your system behave when things go wrong?
Graceful Degradation of Throughput
Well-designed systems degrade gracefully rather than collapsing: they shed low-priority work first, serve cached or stale responses where acceptable, and disable expensive optional features before failing core requests.
This graceful degradation shouldn't happen by accident—it requires explicit design.
When a system exceeds throughput capacity, failures cascade: Server A overloaded → Server A starts timing out → Clients retry → Load increases further → More servers overload → System-wide failure. One bottleneck can take down the entire system if not properly managed.
Load Shedding for Throughput Protection
When incoming throughput exceeds capacity, you must shed load deliberately: reject excess requests early and cheaply, drop low-priority traffic first, and return explicit overload signals (such as HTTP 429 or 503) so clients can back off.
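A minimal load-shedding sketch, using a bounded queue as the admission gate (the class and depth limit are illustrative):

```python
import queue

class LoadShedder:
    """Reject new work once pending depth exceeds a limit, so the
    requests we do accept keep predictable latency."""

    def __init__(self, max_depth):
        self.pending = queue.Queue(maxsize=max_depth)

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
            return True   # accepted for processing
        except queue.Full:
            return False  # shed: caller should return 429/503

shedder = LoadShedder(max_depth=2)
results = [shedder.submit(i) for i in range(4)]
print(results)  # [True, True, False, False]
```

Rejecting the third and fourth requests immediately is cheaper than queueing them, timing out, and absorbing the retries.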
Backpressure Propagation
Rather than silently dropping requests, signal overload upstream: bounded queues, HTTP 429 responses with Retry-After headers, and TCP flow control all tell callers to slow down.
Backpressure allows the entire system to slow down gracefully rather than suffering uncontrolled failure.
Capacity planning translates business growth projections into infrastructure requirements. Throughput metrics are the bridge between business metrics and engineering decisions.
From Business Metrics to Throughput
| Business Metric | Conversion Factor | Throughput Implication |
|---|---|---|
| Monthly Active Users | ×10 (avg page views) | ≈30 RPS per 1M MAU (rough) |
| Orders per Hour | ×5 (API calls per order) | 1 order/s = 5 RPS |
| Messages per Day | ÷86,400 (seconds/day) | 1M msgs/day ≈ 12 MPS average |
| Video Hours Watched | ×3600 (seconds/hour) | 1M hours = 3.6B seconds of streaming |
| Peak-to-Average Ratio | ×3-10 typical | Design for peak, not average |
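Two of the conversions above, worked through in code. The 5× peak-to-average ratio is an illustrative assumption:

```python
def avg_rate_per_second(daily_count):
    """Average events/second from a daily total (86,400 seconds per day)."""
    return daily_count / 86_400

def peak_rate(avg_rate, peak_to_avg_ratio):
    """Design target: peak load, not the average."""
    return avg_rate * peak_to_avg_ratio

avg_mps = avg_rate_per_second(1_000_000)
print(round(avg_mps, 1))              # 11.6 MPS average
print(round(peak_rate(avg_mps, 5)))   # 58 MPS (what you actually size for)
```

Sizing for the 11.6 MPS average would leave the system roughly 5× under-provisioned at peak.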
The Capacity Planning Formula
Required Capacity = (Peak Throughput × Safety Margin) / (1 - Failure Headroom)
Example (illustrative numbers): 10,000 RPS peak, a 1.25× safety margin, and 20% failure headroom: Required Capacity = (10,000 × 1.25) / (1 - 0.20) ≈ 15,625 RPS.
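The formula can be sketched as a small calculator, with illustrative inputs:

```python
def required_capacity(peak_rps, safety_margin, failure_headroom):
    """Required Capacity = (Peak Throughput × Safety Margin) / (1 - Failure Headroom)."""
    return peak_rps * safety_margin / (1 - failure_headroom)

# Hypothetical: 10,000 RPS peak, 1.25x safety margin, and 20% headroom
# reserved so the fleet survives losing a slice of its nodes.
print(round(required_capacity(10_000, 1.25, 0.20)))  # 15625
```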
Growth Projections
Forward-looking capacity planning accounts for growth:
Rule of thumb: Plan for 12-18 months of projected growth plus a 2x buffer for unknowns.
Track cost-per-request (total infrastructure cost / throughput). As you scale, this should remain stable or decrease. If cost-per-request increases with scale, you have scaling inefficiencies—coordination overhead, underutilized resources, or architectural limitations.
Throughput defines your system's capacity—how much work it can handle over time. Key takeaways: Little's Law ties throughput to concurrency and latency; every system's ceiling is set by its bottleneck resource; latency degrades sharply as utilization approaches capacity, so plan for 70-80% sustained utilization; measure goodput rather than raw request counts; and size capacity for peak load with explicit headroom for failures and growth.
What's Next:
We've covered latency (the time dimension) and throughput (the volume dimension) as individual metrics. But single-number summaries hide important details. The next page explores percentiles—how to understand and communicate the full distribution of performance, not just averages.
You now understand throughput as a fundamental performance metric. You can calculate theoretical limits, identify bottlenecks, plan capacity, and design for graceful degradation under failure. Next, we'll explore how percentiles reveal the full picture of system performance that averages hide.