Every computer system—from a developer's laptop to a hyperscale data center—is fundamentally bounded by four resources: CPU (compute), Memory (RAM), Network (bandwidth and latency), and Disk (storage I/O). These are the physical constraints from which there is no escape. No amount of clever architecture, elegant code, or sophisticated algorithms can violate the limits imposed by hardware.
Understanding these four bottleneck categories deeply is essential for any system designer. Each has distinct characteristics, different symptoms when saturated, different measurement approaches, and different mitigation strategies. A CPU bottleneck feels entirely different from a disk bottleneck, even though both might manifest as 'the system is slow.'
This page will give you the mental models to recognize each bottleneck type, the tools to measure them, and the architectural patterns to address them. By the end, you'll be able to diagnose a slow system like a seasoned infrastructure engineer and design systems that avoid common bottleneck traps.
You will understand the fundamental characteristics of CPU, memory, network, and disk bottlenecks; learn how to identify which resource is constraining your system; discover measurement tools and key metrics for each resource type; and master architectural patterns that mitigate each bottleneck category.
What Is a CPU Bottleneck?
A CPU bottleneck occurs when the central processing unit cannot execute instructions fast enough to meet demand. The processor is working at or near 100% utilization, and adding more work simply increases queue depth rather than throughput. This is the quintessential 'compute-bound' condition.
Modern CPU Architecture Context:
Modern CPUs are extraordinarily fast—billions of operations per second. A single core on a modern processor can execute several instructions per nanosecond under ideal conditions. Yet CPU bottlenecks remain common because workloads grow to fill available capacity, inefficient algorithms and serialization/encryption overhead waste cycles, and many applications cannot spread work across all available cores.
Characteristics of CPU Bottlenecks:
When CPU is the bottleneck, you'll observe specific patterns:
| Symptom | Metric to Check | Healthy Range | Bottleneck Range |
|---|---|---|---|
| Slow response times | CPU utilization (%) | < 70% | > 90% |
| Request queuing | Load average | < number of cores | > number of cores |
| Predictable degradation | CPU steal time (VMs) | < 2% | > 10% |
| High throughput despite latency | Instructions per cycle (IPC) | Workload-dependent | Low IPC indicates inefficiency |
Measuring CPU Bottlenecks:
The primary tools for CPU bottleneck identification are top/htop (overall and per-process usage, with a per-core view), mpstat -P ALL 1 (per-core breakdown), pidstat 1 (per-process CPU time), and perf top (which functions are consuming cycles).
Key metrics to monitor:
- %user — CPU time spent in user-space code (your application)
- %system — CPU time spent in kernel operations (often I/O related)
- %iowait — CPU time waiting for I/O (low during CPU bottlenecks, high during disk bottlenecks)
- %idle — Unutilized CPU (should be near zero during a true CPU bottleneck)
- Load average — Number of processes waiting for CPU (should be less than the core count on a healthy system)

Many legacy applications (and some modern ones, like single-threaded Node.js event loops) can use only one CPU core effectively. Having 64 cores means nothing if your application pins at 100% on one core while 63 others sit idle. Always check per-core utilization, not just aggregate CPU usage.
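The per-core check described above can be sketched in a few lines of Python. This reads a single `cpuN` line in the format of Linux's /proc/stat; the function name and sample values are mine, and real monitoring would diff two samples rather than use the since-boot totals.

```python
def cpu_busy_fraction(stat_line: str) -> float:
    """Busy fraction for one 'cpuN' line of /proc/stat.

    Fields after the label are cumulative jiffies:
    user, nice, system, idle, iowait, irq, softirq, steal.
    """
    fields = [int(x) for x in stat_line.split()[1:9]]
    not_busy = fields[3] + fields[4]   # idle + iowait count as not busy
    return 1.0 - not_busy / sum(fields)

# Counters are cumulative, so a single sample gives the average since
# boot; real monitoring takes two samples and computes the delta.
sample = "cpu0 100 0 50 800 50 0 0 0"
print(cpu_busy_fraction(sample))  # 0.15 — this core was busy 15% of the time
```

Running this per `cpuN` line (rather than on the aggregate `cpu` line) is what exposes the single-core pinning problem described above.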
What Is a Memory Bottleneck?
A memory bottleneck occurs when the system's RAM is insufficient for the current workload. Unlike CPU bottlenecks (where the resource is simply exhausted), memory bottlenecks have more nuanced failure modes: processes can be killed outright by the OOM killer, garbage collection overhead can balloon, or the system can begin swapping to disk.
Modern Memory Architecture Context:
RAM is blazingly fast—nanosecond access times—but finite. A typical cloud VM might have 8-64 GB of RAM. This sounds like a lot until you're serving thousands of concurrent users, each with session state, request context, and cached data structures.
Memory bottlenecks are particularly dangerous because they often manifest suddenly. A system might run fine at 60% memory utilization, then fall off a cliff at 85% as garbage collection overhead explodes or swap begins.
Characteristics of Memory Bottlenecks:
| Symptom | Metric to Check | Healthy Range | Bottleneck Range |
|---|---|---|---|
| Sporadic extreme latency | Memory utilization (%) | < 75% | > 85% |
| Process crashes (OOM) | Swap usage | 0 | Any significant usage |
| GC-related pauses | Major page faults | Near zero | Sustained activity |
| Degraded cache hit rates | Application heap size | Within limits | Approaching or exceeding limits |
Measuring Memory Bottlenecks:
Memory bottleneck tools and metrics: free -h for a quick RAM and swap summary, vmstat 1 for paging and swap activity, /proc/meminfo for detailed counters, and ps aux --sort=-rss to find the processes using the most memory.
Understanding the Numbers:
- MemTotal — Total physical RAM
- MemFree — Completely unused memory (often low due to caching, which is normal)
- MemAvailable — Memory available for applications (accounts for reclaimable cache)
- SwapTotal / SwapUsed — Swap configuration and usage (any significant swap use is a potential problem)
- Buffers / Cached — Memory used by the kernel for caching (this is good and can be reclaimed)

The free memory myth: a system showing low 'free' memory is not necessarily bottlenecked. Linux aggressively uses RAM for disk caching, which improves performance. Look at MemAvailable (the available column in free output)—this is the true indicator of capacity.
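To make the free-memory myth concrete, here is a small Python sketch (the function name and sample values are mine) that computes the true available ratio from text in the format of /proc/meminfo:

```python
def mem_available_ratio(meminfo_text: str) -> float:
    """MemAvailable / MemTotal from the contents of /proc/meminfo."""
    kb = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            kb[key.strip()] = int(rest.split()[0])  # values are in kB
    return kb["MemAvailable"] / kb["MemTotal"]

# MemFree looks alarming (2.5%), but half the machine is actually
# available once reclaimable cache is accounted for.
sample = """MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:    4000000 kB"""
print(mem_available_ratio(sample))  # 0.5
```

Alerting on this ratio, rather than on MemFree, avoids false alarms caused by healthy page-cache usage.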
Swapping to disk is catastrophic for performance. Disk access is ~100,000x slower than RAM access. A memory access that takes 100 nanoseconds from RAM takes ~10 milliseconds from a spinning disk (or ~100 microseconds from SSD). If your system is swapping, your performance isn't just degraded—it's effectively broken. Configure systems with sufficient RAM to avoid any swap usage under normal load.
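The ratios above follow directly from the access times; a quick arithmetic check (rounded order-of-magnitude figures from the paragraph above):

```python
# Rough access times from the paragraph above, in nanoseconds
RAM_NS = 100            # ~100 ns RAM access
SSD_NS = 100_000        # ~100 µs SSD access
HDD_NS = 10_000_000     # ~10 ms spinning-disk access

print(SSD_NS // RAM_NS)  # 1000   — swapping to SSD is ~1,000x slower than RAM
print(HDD_NS // RAM_NS)  # 100000 — swapping to HDD is ~100,000x slower than RAM
```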
What Is a Network Bottleneck?
Network bottlenecks occur when communication between components is limited by either insufficient bandwidth (a throughput ceiling) or excessive latency (slow round trips).
Network bottlenecks are particularly insidious in distributed systems because almost every operation involves network communication. A microservices architecture with 20+ service-to-service calls per request is fundamentally network-constrained.
Modern Network Architecture Context:
Within a single data center, network bandwidth is typically abundant (10-100 Gbps between hosts). Latency is low (microseconds to low milliseconds). But cross-region and internet paths are orders of magnitude slower, and even fast in-data-center links add up when a single request fans out across many services.
Characteristics of Network Bottlenecks:
| Symptom | Metric to Check | Healthy Range | Bottleneck Range |
|---|---|---|---|
| Slow request/response | Round-trip latency (RTT) | LAN: < 1ms, WAN: varies | Significantly elevated from baseline |
| Throughput ceiling | Bandwidth utilization (%) | < 70% | > 85% |
| Sporadic failures | TCP retransmissions | < 1% | > 5% |
| Connection errors | Ephemeral port exhaustion | Many available | Approaching 65535 limit |
| Inconsistent response times | Network jitter | Low variance | High variance |
Measuring Network Bottlenecks:
Useful tools include sar -n DEV 1 (interface throughput), ss -s (socket statistics), ping and mtr (latency and path loss), and iftop or tcpdump for per-connection inspection.
Understanding Network Metrics:
- RX/TX bytes — Total data received/transmitted
- TCP retransmits — Packets that had to be resent (indicates loss or congestion)
- Connection states (TIME_WAIT, ESTABLISHED, etc.) — Socket lifecycle; too many TIME_WAIT sockets suggest high connection churn
- Socket buffer sizes — Determine how much data can be in flight

The Bandwidth-Latency Product:
An important network concept: the amount of data 'in flight' on a connection is limited by bandwidth × latency. A 1 Gbps link with 50ms latency can have at most ~6.25 MB in-flight. Insufficient socket buffer sizes can limit throughput well below the link capacity.
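The calculation from that example, in Python (the function name is mine):

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: maximum bytes in flight on a path."""
    return bandwidth_bps / 8 * rtt_seconds  # divide by 8: bits -> bytes

# 1 Gbps link with 50 ms round-trip time
print(bdp_bytes(1e9, 0.050))  # 6250000.0 bytes ≈ 6.25 MB
```

If socket buffers are smaller than this product, throughput is capped at roughly buffer_size / RTT regardless of link speed.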
Each service-to-service call adds latency. A request that traverses 10 services, each adding 2ms of network overhead, incurs 20ms of pure network tax before any business logic executes. Fine-grained microservices architectures can become latency-constrained even with fast networks. This is a key reason why service mesh, co-located services, and request batching are essential in microservices deployments.
What Is a Disk Bottleneck?
A disk bottleneck occurs when the storage subsystem cannot read or write data fast enough to satisfy the workload. This can manifest as slow reads, slow writes, operations queuing behind a saturated device, or latency spikes on individual I/O operations.
Modern Storage Architecture Context:
Storage technology has evolved dramatically: spinning HDDs deliver on the order of 100-200 random IOPS, SATA SSDs tens of thousands, and NVMe drives hundreds of thousands to millions.
The difference is enormous—NVMe is 1,000-5,000x faster than HDD for random I/O. Yet disk bottlenecks remain common because datasets outgrow RAM caches, databases generate heavy random I/O, and cloud volumes impose their own IOPS limits regardless of the underlying hardware.
Characteristics of Disk Bottlenecks:
| Symptom | Metric to Check | Healthy Range | Bottleneck Range |
|---|---|---|---|
| System feels slow despite low CPU | I/O wait (%) | < 10% | > 30% |
| Disk activity continuous | Disk utilization (%) | < 70% | > 90% |
| Operations queuing | Average queue depth | < 2 for SSD, < 8 for HDD | > healthy range |
| Latency on disk operations | Average I/O latency | SSD: < 1ms, HDD: < 20ms | 2-3x healthy range |
| Throughput plateau | IOPS or MB/s | Below disk limits | At or near disk limits |
Measuring Disk Bottlenecks:
The primary tools are iostat -xz 1 (per-device utilization, IOPS, and latency), iotop (per-process I/O), and vmstat 1, whose wa column shows the I/O wait percentage.

Key iostat Columns:
- %util — Percentage of time the device was busy (saturation indicator)
- r/s, w/s — Read and write IOPS
- rkB/s, wkB/s — Read and write throughput in KB/s
- await — Average time for I/O requests including queue time (latency)
- avgqu-sz — Average queue depth (higher = more contention)

The Random vs. Sequential I/O Distinction:
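The thresholds from the table above can be expressed as a small health check. This is a hypothetical helper (name and exact cutoffs are mine, taken from the table's rough guidance); real tuning should use baselines measured on your own hardware.

```python
def disk_saturated(util_pct: float, await_ms: float, queue_depth: float,
                   device: str = "ssd") -> bool:
    """Flag a device as likely saturated based on iostat-style metrics.

    Uses the rough per-device-type thresholds from the table above:
    SSD healthy latency < 1 ms and queue < 2; HDD < 20 ms and queue < 8.
    """
    healthy_latency_ms = 1.0 if device == "ssd" else 20.0
    healthy_queue = 2.0 if device == "ssd" else 8.0
    return (util_pct > 90.0                       # device almost always busy
            or await_ms > 2 * healthy_latency_ms  # latency 2x+ healthy range
            or queue_depth > healthy_queue)       # requests piling up

print(disk_saturated(95.0, 0.5, 1.0))          # True  — device nearly always busy
print(disk_saturated(40.0, 0.4, 1.0))          # False — comfortably healthy
print(disk_saturated(50.0, 45.0, 3.0, "hdd"))  # True  — latency far above HDD norm
```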
This distinction is crucial: sequential I/O (reading or writing contiguous blocks) is fast on every device class, while random I/O (scattered accesses) forces an HDD to pay a mechanical seek penalty on every operation.
A workload that's fine on SSD might completely break on HDD due to random access patterns. Databases, in particular, often generate significant random I/O.
The best disk I/O is the one that never happens. Maximizing RAM utilization for caching—both OS-level page cache and application-level caching—can reduce disk read load by 90%+. Before tuning disk performance, ask: Can I cache this data in memory instead?
When a system is performing poorly, how do you determine which resource is constrained? The symptoms can overlap, and multiple bottlenecks can exist simultaneously. Here's a systematic approach:
The USE Method:
Developed by Brendan Gregg (author of Systems Performance), USE stands for Utilization (how busy the resource is), Saturation (how much work is queued waiting for it), and Errors (the count of error events).
Apply USE to each resource type:
| Resource | Utilization Metric | Saturation Metric | Error Metric |
|---|---|---|---|
| CPU | CPU utilization % | Load average, run queue | Machine check exceptions |
| Memory | Memory usage % | Swap usage, page faults | OOM kills |
| Network | Bandwidth usage | Socket buffers, connection queues | TCP retransmits |
| Disk | Disk utilization % | I/O queue depth | Device errors |
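The USE table above can be turned into a minimal executable checklist. This is a sketch under my own assumptions: the threshold values are illustrative examples drawn loosely from this page's tables, not canonical limits, and the input format is invented.

```python
# The USE table as data; thresholds are illustrative, not canonical.
THRESHOLDS = {
    "cpu":     {"utilization": 90.0, "saturation": 1.0},  # load average per core
    "memory":  {"utilization": 85.0, "saturation": 0.0},  # any swap use
    "network": {"utilization": 85.0, "saturation": 5.0},  # retransmit %
    "disk":    {"utilization": 90.0, "saturation": 2.0},  # queue depth (SSD)
}

def use_findings(samples: dict) -> list:
    """Return (resource, dimension) pairs that breach a threshold.

    samples: {"cpu": {"utilization": 95, "saturation": 0.4, "errors": 0}, ...}
    """
    findings = []
    for resource, limits in THRESHOLDS.items():
        s = samples.get(resource, {})
        if s.get("errors", 0) > 0:
            findings.append((resource, "errors"))      # errors checked first
        elif s.get("utilization", 0.0) > limits["utilization"]:
            findings.append((resource, "utilization"))
        elif s.get("saturation", 0.0) > limits["saturation"]:
            findings.append((resource, "saturation"))
    return findings

print(use_findings({"cpu": {"utilization": 95.0},
                    "memory": {"utilization": 60.0, "saturation": 128.0}}))
# [('cpu', 'utilization'), ('memory', 'saturation')]
```

The value of encoding the checklist this way is consistency: every resource gets the same three questions in the same order, which is the core discipline of the USE method.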
Quick Diagnostic Checklist:
When a system is slow, run through this sequence:
1. Check CPU: Is %idle near zero? Is the load average >> core count?
2. Check memory: Is swap in use? Is MemAvailable very low?
3. Check disk: Is %util high? Is await elevated? Is wa in vmstat high?
4. Check network: Are TCP retransmits elevated? Is bandwidth near the link limit?

Essential commands:

- top / htop — CPU and memory overview
- vmstat 1 — CPU, memory, swap, disk I/O
- iostat -xz 1 — Disk utilization and latency
- free -h — Memory and swap usage
- sar -n DEV 1 — Network throughput
- ss -s — Socket statistics summary

As you fix one bottleneck, another becomes binding. This is normal and expected. A system that was CPU-bound might become disk-bound once you optimize the algorithm. A system that was disk-bound might become network-bound once you add SSD storage. Performance optimization is an iterative process of identifying and addressing the current binding constraint.
Resources don't exist in isolation. They interact in complex ways, and understanding these interactions is key to advanced performance engineering.
Memory ↔ Disk:
The operating system uses available RAM as a disk cache. When RAM is plentiful, frequently accessed data stays in memory, and disk reads are minimized. When RAM is constrained, the page cache shrinks, cache hit rates fall, disk reads multiply, and in the worst case the system starts swapping—turning a memory shortage into a disk bottleneck.
CPU ↔ Memory:
CPU performance depends heavily on memory access patterns: cache-friendly sequential access keeps the processor fed, while pointer-chasing and random access leave it stalled waiting on RAM. This is one reason a low instructions-per-cycle (IPC) figure often signals a memory-access problem rather than a pure compute problem.
Network ↔ CPU:
Network operations consume CPU: interrupt handling, TCP/IP processing, serialization, and encryption all burn cycles. Under heavy traffic, the network stack itself can become a significant CPU load.
Disk ↔ CPU:
Disk operations can create CPU load: filesystem bookkeeping, checksumming, and compression all consume cycles, and heavy I/O drives up interrupt handling. Meanwhile, high I/O wait leaves the CPU idle yet the system slow.
| Primary Bottleneck | Secondary Effect | Mitigation |
|---|---|---|
| Low memory | Increased disk I/O (cache eviction, swapping) | Add RAM, reduce memory footprint |
| High CPU | Network timeouts (can't process fast enough) | Scale horizontally, optimize code |
| Slow disk | CPU I/O wait (idle while waiting) | Use SSD, increase RAM cache |
| Network latency | Thread blocking (waiting for responses) | Async I/O, connection pooling |
| High network | CPU overhead (encryption, serialization) | Offload TLS, use efficient protocols |
Experienced engineers rarely look at resources in isolation. They mentally model the entire system and trace how a constraint in one area ripples through others. When you see high I/O wait, also check available memory. When you see high CPU, check if it's application code or network/serialization overhead.
Cloud environments introduce additional complexity to bottleneck analysis. Resources are virtualized, shared, and often subject to hidden throttling.
CPU Throttling and Burst Credits:
Many cloud instance types (AWS t-series, Azure B-series) use burst credit models. You get a baseline CPU allocation with 'credits' that allow temporary bursting. When credits are exhausted, you're throttled to baseline—which can be 5-10% of peak capacity.
IOPS Provisioning:
Cloud storage often has IOPS limits that are independent of the underlying hardware capability: a volume is capped at its provisioned IOPS level, and exceeding it results in throttling rather than errors.
Monitor IOPS consumption relative to provisioned limits, not just raw disk utilization.
Network Bandwidth Limits:
Cloud instances have network bandwidth allocations that may not be obvious: smaller instance sizes often receive only a fraction of the advertised bandwidth, sometimes as a burst allowance that depletes under sustained load.
The 'Noisy Neighbor' Problem:
In multi-tenant cloud environments, your 'neighbors' on the same physical hardware can affect your performance: their bursts of CPU, disk, or network activity contend for the same shared resources.
Watch for unusual performance variance that doesn't correlate with your workload—it might be neighbors.
Cloud marketing emphasizes 'unlimited scale,' but every resource has limits—they're just enforced differently. Hitting a cloud throttling limit feels like a sudden performance cliff, not a gradual degradation. Design your monitoring and alerting to detect these limits before users experience them.
The ability to quickly identify which resource is constraining your system is a superpower for any system designer. Let's consolidate the key insights:
What's Next:
With a solid understanding of the four fundamental bottleneck categories, we'll dive deep into one of the most common bottlenecks in distributed systems: the database. The next page explores database bottlenecks in detail—why databases so often become the constraint, and architectural patterns to address this challenge.
You now understand the four fundamental resource bottlenecks—CPU, Memory, Network, and Disk—their characteristics, measurement approaches, and mitigation strategies. This knowledge forms the foundation for diagnosing any system performance issue. Next, we'll examine the database as a bottleneck in depth.