If latency tells us how fast a single request completes, throughput tells us how much work the system handles over time. It's the difference between asking "How quickly can you run to the store?" versus "How many trips to the store can you make in an hour?"
Throughput—measured in requests per second (RPS), queries per second (QPS), or transactions per second (TPS)—defines your system's capacity. It determines how many users you can serve, how much data you can process, and ultimately, how big your business can grow before infrastructure becomes the bottleneck.
Understanding throughput is essential because it governs capacity planning, cost optimization, and scalability decisions. A system with beautiful 10ms latency but maxing out at 100 RPS won't survive a product launch. Conversely, a system capable of 1 million RPS but with 5-second latency delivers an unusable experience.
This page will teach you how to reason about throughput: what it measures, how to calculate theoretical and practical limits, how throughput interacts with latency, and how to optimize for higher throughput while maintaining acceptable response times. You'll develop the mental models for capacity planning that distinguish experienced system architects.
Throughput measures the rate of completed work over time. While the concept seems simple, precision matters significantly in system design conversations.
Common Throughput Metrics
| Metric | Full Name | Unit | Typical Context |
|---|---|---|---|
| RPS | Requests Per Second | req/s | Web servers, API gateways, microservices |
| QPS | Queries Per Second | queries/s | Databases, search engines, cache systems |
| TPS | Transactions Per Second | txn/s | Payment systems, databases, trading platforms |
| MPS | Messages Per Second | msg/s | Message queues, event streaming, pub/sub |
| IOPS | I/O Operations Per Second | ops/s | Storage systems, disk arrays, databases |
| BPS | Bytes Per Second | bytes/s | Network links, data pipelines, streaming |
What Counts as a "Request"?
Throughput measurements require clarity on what constitutes a unit of work:
Different operations have vastly different costs. A read-only cache hit costs microseconds; a complex database transaction costs hundreds of milliseconds. 10,000 cache RPS represents less load than 1,000 complex-write RPS.
The Solution: Weighted Throughput
Sophisticated capacity planning weights operations by their cost:
Effective Load = Σ (operation_count × operation_cost)
Rather than reporting "1000 RPS," you might report "1000 RPS (70% reads, 25% simple writes, 5% complex transactions)" to distinguish from "1000 RPS (100% complex transactions)."
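The weighting idea above can be sketched in a few lines. The operation names and cost weights here are illustrative assumptions (a cache read is taken as the unit of work):

```python
# Weighted throughput: convert a raw request mix into "effective load"
# by weighting each operation type by its relative cost.

def effective_load(mix, costs):
    """Sum of operation_count * operation_cost across the mix."""
    return sum(count * costs[op] for op, count in mix.items())

# Hypothetical cost weights, relative to a cache read = 1.
costs = {"read": 1, "simple_write": 5, "complex_txn": 50}

# Same headline "1000 RPS", very different real load:
mixed = {"read": 700, "simple_write": 250, "complex_txn": 50}
heavy = {"complex_txn": 1000}

print(effective_load(mixed, costs))  # 4450
print(effective_load(heavy, costs))  # 50000
```

Here the all-transaction workload imposes more than ten times the effective load of the mixed workload, even though both report the same RPS.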
Throughput and bandwidth are related but distinct. Bandwidth is the theoretical maximum data transfer rate of a channel (e.g., a 1 Gbps network link). Throughput is the actual rate achieved, which is always less than bandwidth due to overhead, latency, and protocol inefficiencies. A 1 Gbps link might achieve 800 Mbps throughput under ideal conditions and 200 Mbps under high packet loss.
Every system has theoretical maximum throughput determined by its bottleneck resource. Understanding these limits helps you predict capacity and identify optimization targets.
Little's Law: The Fundamental Relationship
Little's Law is one of the most important equations in system performance:
L = λ × W
Where L is the average number of requests in the system (concurrency), λ is the average arrival rate (throughput), and W is the average time a request spends in the system (latency).
Rearranged for throughput:
Throughput = Concurrency / Latency
Or: λ = L / W
This fundamental law has profound implications:
For a server with N worker threads and average latency L, maximum throughput ≈ N / L. If you have 50 worker threads and 25ms average latency: max throughput ≈ 50 / 0.025 = 2000 RPS. This simple calculation helps validate capacity claims and plan scaling.
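The calculation above is simple enough to sketch directly:

```python
def max_throughput(concurrency, avg_latency_s):
    """Little's Law rearranged: throughput (λ) = concurrency (L) / latency (W)."""
    return concurrency / avg_latency_s

# 50 worker threads, 25 ms average latency:
print(round(max_throughput(50, 0.025)))  # 2000 RPS
```

If a vendor claims 10,000 RPS from 50 threads at 25ms latency, this one-liner shows the claim violates Little's Law.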
CPU-Bound Throughput Limits
For compute-intensive workloads, CPU becomes the bottleneck:
Max Throughput = (CPU_cores × CPU_utilization_target) / CPU_time_per_request
Example: 8 CPU cores, a 75% utilization target, and 10ms (0.01s) of CPU time per request:
Max Throughput = (8 × 0.75) / 0.01 = 600 RPS
I/O-Bound Throughput Limits
For I/O-intensive workloads, storage or network becomes the bottleneck:
Max Throughput = Available_IOPS / IOPS_per_request
Example: 50,000 available IOPS and 5 I/O operations per request:
Max Throughput = 50,000 / 5 = 10,000 RPS
The actual limit is the minimum of the CPU and I/O limits.
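The two limits can be combined to find the binding constraint; the function names here are illustrative:

```python
def cpu_limit(cores, utilization_target, cpu_s_per_request):
    """Max RPS before CPU saturates."""
    return cores * utilization_target / cpu_s_per_request

def io_limit(available_iops, iops_per_request):
    """Max RPS before storage saturates."""
    return available_iops / iops_per_request

cpu = cpu_limit(8, 0.75, 0.01)   # ≈ 600 RPS
io = io_limit(50_000, 5)         # 10,000 RPS
print(round(min(cpu, io)))       # 600 (CPU is the bottleneck here)
```

With these numbers, optimizing storage would be wasted effort: the system is CPU-bound long before it exhausts its IOPS budget.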
Latency and throughput are deeply interconnected, but their relationship is often misunderstood. They are not simply inversely proportional—the relationship is more nuanced and has critical implications for system design.
The Queueing Theory Perspective
As throughput approaches system capacity, latency doesn't just increase—it explodes. This is the queueing theory effect:
The mathematical reason: at high utilization, requests spend most of their time waiting in queues rather than being processed.
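This effect can be made concrete with a simplified M/M/1 queueing model (an assumption; real systems are messier, but the shape of the curve holds):

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time in an M/M/1 queue: W = S / (1 - ρ)."""
    if utilization >= 1:
        raise ValueError("queue is unstable at utilization >= 1")
    return service_time_s / (1 - utilization)

service = 0.010  # 10 ms of actual work per request
for rho in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.2f}: {mm1_response_time(service, rho) * 1000:.0f} ms")
```

A request needing only 10ms of work takes 100ms at 90% utilization and 1000ms at 99%: the extra time is pure queueing.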
Never plan for more than 70-80% sustained utilization. Beyond this threshold, latency becomes unpredictable and the system becomes fragile. Traffic spikes that push you to 95% utilization will cause severe latency degradation even if the system technically survives.
Understanding the Curve
The latency-throughput relationship follows characteristic patterns:
| Utilization Zone | Latency Behavior | System State | Engineering Response |
|---|---|---|---|
| 0-50% | Stable, near-optimal latency | Healthy, headroom available | Cost optimization opportunity |
| 50-70% | Slight increase, still predictable | Normal production operation | Target zone for most systems |
| 70-85% | Noticeable increase, some spikes | Caution zone | Monitor closely, plan scaling |
| 85-95% | Rapid increase, high variability | Danger zone | Immediate action needed |
| 95-100% | Exponential increase, unpredictable | Critical failure imminent | Shed load, emergency response |
The Throughput-Latency Trade-off
You often cannot maximize both throughput and minimize latency simultaneously:
Optimizing for throughput: Batch operations, full utilization, queue depth management. Latency suffers because requests wait in batches.
Optimizing for latency: Dedicated resources, immediate processing, over-provisioning. Throughput suffers because resources sit idle.
The right balance depends on your use case: batch analytics and data pipelines favor throughput, while interactive APIs and user-facing pages favor latency.
Accurate throughput measurement requires attention to methodology. Common mistakes lead to overly optimistic or misleading results.
Measurement Intervals Matter
Throughput varies over time. The interval you choose for measurement affects the story:
Reporting "1000 RPS" could mean:
Benchmark throughput often exceeds production throughput by 2-5x. Benchmarks use ideal conditions: uniform requests, warm caches, no network variance, no garbage collection pressure. Production has all these factors plus real-world complexity. Never promise benchmark numbers in SLAs.
Successful vs. Total Throughput
A critical distinction that catches teams off guard:
A system handling 10,000 RPS with 30% error rate has only 7,000 successful RPS. If 20% exceed latency SLOs, goodput is only 5,600 RPS.
Always measure goodput—it's what actually matters for users.
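The goodput arithmetic from the example above, as a sketch (this assumes SLO misses are measured among successful requests):

```python
def goodput(total_rps, error_rate, slo_miss_rate):
    """Requests that both succeeded and met the latency SLO."""
    successful = total_rps * (1 - error_rate)
    return successful * (1 - slo_miss_rate)

# 10,000 RPS total, 30% errors, 20% of successes miss the latency SLO:
print(round(goodput(10_000, 0.30, 0.20)))  # 5600
```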
Load Testing for Throughput
To find maximum throughput, use proper load testing methodology:
| Approach | Best For | Pitfall To Avoid |
|---|---|---|
| Open-loop | Finding breaking point | Overwhelming the system before observing degradation |
| Closed-loop | Simulating dependent clients | Hiding true capacity limits due to coordinated omission |
| Stepped load | Finding optimal operating point | Steps too large missing the critical threshold |
| Soak testing | Long-term stability validation | Too short duration missing slow resource leaks |
Every system has a bottleneck—the component limiting overall throughput. Following Amdahl's Law, improving non-bottleneck components provides diminishing returns. You must identify and address the actual bottleneck.
Common Throughput Bottlenecks
Typical bottlenecks include CPU saturation, disk IOPS, network bandwidth, database connection pools, and lock contention. Profile under realistic load to find which one binds first.
The Universal Scalability Law (USL) models throughput as: X(N) = N / (1 + α(N-1) + β·N·(N-1)), where α represents contention and β represents coherence overhead. This explains why adding more capacity sometimes decreases throughput—coordination overhead eventually dominates.
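The USL curve is easy to explore numerically. The α and β values below are illustrative, and the formula is normalized so X(1) = 1:

```python
def usl(n, alpha=0.05, beta=0.001):
    """Universal Scalability Law: X(N) = N / (1 + α(N-1) + β·N·(N-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

for n in (1, 8, 16, 32, 64):
    print(n, round(usl(n), 1))
# Throughput rises, peaks near N = sqrt((1-α)/β) ≈ 31, then declines.
```

With these coefficients, 64 nodes deliver less throughput than 32: coherence overhead has overtaken the benefit of added capacity.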
When current throughput is insufficient, you have two fundamental scaling options: vertical (bigger machines) and horizontal (more machines). Each has distinct throughput implications.
Throughput Scaling Techniques
Common techniques include horizontal scaling behind a load balancer, partitioning (sharding), read replicas, caching, request batching, and asynchronous processing. Each trades added complexity for capacity.
A system's throughput under normal conditions is interesting; its throughput under failure conditions is what matters for reliability. How does your system behave when things go wrong?
Graceful Degradation of Throughput
Well-designed systems degrade gracefully rather than collapsing: they shed low-priority work first, serve cached or stale responses where acceptable, and disable expensive optional features before failing core requests.
This graceful degradation shouldn't happen by accident—it requires explicit design.
When a system exceeds throughput capacity, failures cascade: Server A overloaded → Server A starts timing out → Clients retry → Load increases further → More servers overload → System-wide failure. One bottleneck can take down the entire system if not properly managed.
Load Shedding for Throughput Protection
When incoming throughput exceeds capacity, you must shed load deliberately: reject excess requests early and cheaply, drop low-priority traffic first, and return explicit overload signals (such as HTTP 429 or 503) so clients can back off.
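A minimal load-shedding sketch, using a bounded queue as the admission gate (the class and depth limit are illustrative):

```python
import queue

class LoadShedder:
    """Reject new work once pending depth exceeds a limit, so the
    requests we do accept keep predictable latency."""

    def __init__(self, max_depth):
        self.pending = queue.Queue(maxsize=max_depth)

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
            return True   # accepted for processing
        except queue.Full:
            return False  # shed: caller should return 429/503

shedder = LoadShedder(max_depth=2)
results = [shedder.submit(i) for i in range(4)]
print(results)  # [True, True, False, False]
```

Rejecting the third and fourth requests immediately is cheaper than queueing them, timing out, and absorbing the retries.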
Backpressure Propagation
Rather than silently dropping requests, signal overload upstream: bounded queues, HTTP 429 responses with Retry-After headers, and TCP flow control all tell callers to slow down.
Backpressure allows the entire system to slow down gracefully rather than suffering uncontrolled failure.
Capacity planning translates business growth projections into infrastructure requirements. Throughput metrics are the bridge between business metrics and engineering decisions.
From Business Metrics to Throughput
| Business Metric | Conversion Factor | Throughput Implication |
|---|---|---|
| Monthly Active Users | ×10 (avg page views) | ≈30 RPS per 1M MAU (rough) |
| Orders per Hour | ×5 (API calls per order) | 1 order/s = 5 RPS |
| Messages per Day | ÷86,400 (seconds/day) | 1M msgs/day ≈ 12 MPS average |
| Video Hours Watched | ×3600 (seconds/hour) | 1M hours = 3.6B seconds of streaming |
| Peak-to-Average Ratio | ×3-10 typical | Design for peak, not average |
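Two of the conversions above, worked through in code. The 5× peak-to-average ratio is an illustrative assumption:

```python
def avg_rate_per_second(daily_count):
    """Average events/second from a daily total (86,400 seconds per day)."""
    return daily_count / 86_400

def peak_rate(avg_rate, peak_to_avg_ratio):
    """Design target: peak load, not the average."""
    return avg_rate * peak_to_avg_ratio

avg_mps = avg_rate_per_second(1_000_000)
print(round(avg_mps, 1))              # 11.6 MPS average
print(round(peak_rate(avg_mps, 5)))   # 58 MPS (what you actually size for)
```

Sizing for the 11.6 MPS average would leave the system roughly 5× under-provisioned at peak.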
The Capacity Planning Formula
Required Capacity = (Peak Throughput × Safety Margin) / (1 - Failure Headroom)
Example (illustrative numbers): 10,000 RPS peak, a 1.25× safety margin, and 20% failure headroom: Required Capacity = (10,000 × 1.25) / (1 - 0.20) ≈ 15,625 RPS.
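The formula can be sketched as a small calculator, with illustrative inputs:

```python
def required_capacity(peak_rps, safety_margin, failure_headroom):
    """Required Capacity = (Peak Throughput × Safety Margin) / (1 - Failure Headroom)."""
    return peak_rps * safety_margin / (1 - failure_headroom)

# Hypothetical: 10,000 RPS peak, 1.25x safety margin, and 20% headroom
# reserved so the fleet survives losing a slice of its nodes.
print(round(required_capacity(10_000, 1.25, 0.20)))  # 15625
```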
Growth Projections
Forward-looking capacity planning accounts for growth:
Rule of thumb: Plan for 12-18 months of projected growth plus a 2x buffer for unknowns.
Track cost-per-request (total infrastructure cost / throughput). As you scale, this should remain stable or decrease. If cost-per-request increases with scale, you have scaling inefficiencies—coordination overhead, underutilized resources, or architectural limitations.
Throughput defines your system's capacity—how much work it can handle over time. Key takeaways: Little's Law ties throughput to concurrency and latency; every system's ceiling is set by its bottleneck resource; latency degrades sharply as utilization approaches capacity, so plan for 70-80% sustained utilization; measure goodput rather than raw request counts; and size capacity for peak load with explicit headroom for failures and growth.
What's Next:
We've covered latency (the time dimension) and throughput (the volume dimension) as individual metrics. But single-number summaries hide important details. The next page explores percentiles—how to understand and communicate the full distribution of performance, not just averages.
You now understand throughput as a fundamental performance metric. You can calculate theoretical limits, identify bottlenecks, plan capacity, and design for graceful degradation under failure. Next, we'll explore how percentiles reveal the full picture of system performance that averages hide.