We've explored bandwidth, throughput, latency, and jitter individually. But networks don't exist in isolation—these metrics interact in complex ways. A high-bandwidth link with terrible latency. Low latency with high jitter. Good throughput that collapses under load. Understanding network performance requires seeing the complete picture.
Performance metrics are the language of network engineering. They allow us to describe problems precisely, compare alternatives objectively, set meaningful SLAs, and make data-driven decisions. This page synthesizes everything into a comprehensive framework for performance analysis.
By the end of this page, you will understand how performance metrics relate to each other, apply structured measurement methodologies, interpret results correctly, design meaningful SLAs, and make informed capacity planning decisions.
Performance metrics are not independent—they form an interconnected system where changes to one affect others:
The Fundamental Relationships:
| When This Increases... | This Happens... | Because... |
|---|---|---|
| Bandwidth (upgrade) | Throughput can increase | Larger pipe, more capacity |
| Bandwidth (upgrade) | Latency unchanged | Propagation delay still same |
| Latency | TCP throughput decreases | BDP limits window efficiency |
| Utilization (load) | Latency increases | Queuing delay grows |
| Utilization (load) | Jitter increases | Variable queue depths |
| Utilization (near 100%) | Loss increases | Queue overflow |
| Loss | Throughput decreases | Retransmissions, congestion control |
| Jitter | Buffer requirements increase | Must absorb variation |
The Queuing Theory Connection:
Queuing theory provides mathematical models for understanding latency as a function of load:
M/M/1 Queue: Average delay grows hyperbolically as utilization approaches 100%
Avg Delay = Service Time / (1 - Utilization)
Example: At 50% utilization, delay is 2× service time. At 90%, it's 10×. At 99%, it's 100×.
This explains why networks become unusable at high utilization—the last 10% of capacity costs enormous latency.
Degradation is not linear. At 80% utilization, queuing delay is already 5× the service time; at 95%, it's 20×. This is why capacity planning targets 60-70% peak utilization: headroom for bursts without catastrophic latency.
```python
"""
Demonstrate relationships between network performance metrics.
Shows how they interact under different conditions.
"""

import math
from typing import List, Dict


def mm1_delay(service_time_ms: float, utilization: float) -> float:
    """
    M/M/1 queuing model: average delay as function of utilization.
    As utilization → 1, delay → infinity.
    """
    if utilization >= 1.0:
        return float('inf')
    return service_time_ms / (1 - utilization)


def tcp_throughput_mathis(rtt_ms: float, loss_rate: float, mss: int = 1460) -> float:
    """
    Mathis equation for TCP throughput.
    Throughput ≈ (MSS × C) / (RTT × √p)
    """
    if loss_rate <= 0:
        return float('inf')
    c = 1.22
    rtt_sec = rtt_ms / 1000
    return (mss * 8 * c) / (rtt_sec * math.sqrt(loss_rate))


def calculate_effective_throughput(
        bandwidth_mbps: float,
        base_latency_ms: float,
        utilization: float,
        loss_rate: float) -> Dict:
    """
    Calculate effective throughput considering all factors.
    """
    # Queuing adds to latency
    service_time = (1500 * 8) / (bandwidth_mbps * 1e6) * 1000  # ms
    queue_delay = mm1_delay(service_time, min(utilization, 0.99)) - service_time
    total_rtt = 2 * (base_latency_ms + queue_delay)

    # TCP limited by RTT and loss
    tcp_max = tcp_throughput_mathis(total_rtt, max(loss_rate, 0.0001))

    # Available bandwidth
    available = bandwidth_mbps * 1e6 * (1 - utilization)

    # Effective throughput is minimum of limits
    effective = min(tcp_max, available, bandwidth_mbps * 1e6)

    return {
        "bandwidth_mbps": bandwidth_mbps,
        "base_latency_ms": base_latency_ms,
        "utilization_pct": utilization * 100,
        "loss_rate_pct": loss_rate * 100,
        "queue_delay_ms": round(queue_delay, 2),
        "total_rtt_ms": round(total_rtt, 2),
        "tcp_limited_mbps": round(tcp_max / 1e6, 2) if tcp_max != float('inf') else "unlimited",
        "congestion_limited_mbps": round(available / 1e6, 2),
        "effective_mbps": round(effective / 1e6, 2),
    }


# Show the effects of utilization on latency
print("=== Effect of Utilization on Queuing Delay ===")
print("(1 Gbps link, 1500-byte packets, 12μs service time)\n")
print(f"{'Utilization':>12} {'Queue Delay':>12} {'Relative':>10}")
print("-" * 36)

service_time = 0.012  # ~12 microseconds for 1500 bytes on 1 Gbps
for util in [0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    delay = mm1_delay(service_time, util)
    relative = delay / service_time
    print(f"{util*100:>11.0f}% {delay:>11.3f}ms {relative:>9.1f}×")

print("\n=== Combined Effects Example ===")
print("(100 Mbps link, 20ms base latency)\n")

scenarios = [
    ("Light load, no loss", 0.3, 0.0001),
    ("Moderate load, low loss", 0.6, 0.001),
    ("Heavy load, moderate loss", 0.8, 0.01),
    ("Near-saturated, high loss", 0.95, 0.03),
]

for name, util, loss in scenarios:
    result = calculate_effective_throughput(100, 20, util, loss)
    print(f"{name}:")
    print(f"  Queue delay: {result['queue_delay_ms']}ms, Total RTT: {result['total_rtt_ms']}ms")
    print(f"  TCP limit: {result['tcp_limited_mbps']} Mbps, Congestion limit: {result['congestion_limited_mbps']} Mbps")
    print(f"  → Effective throughput: {result['effective_mbps']} Mbps")
    print()
```

Random spot-checks don't constitute meaningful measurement. Professional network performance analysis requires structured methodology:
The Measurement Framework:
Define Objectives — What questions are you answering? Capacity? Baseline? Troubleshooting? Comparison?
Select Metrics — Choose metrics relevant to objectives:
Choose Tools — Match tools to metrics and constraints:
Establish Baseline — Measure 'normal' under known conditions
Control Variables — Isolate what you're testing
Sufficient Samples — Statistical significance requires volume
Document Conditions — Time, load, configurations
Analyze and Report — Derive actionable conclusions
| Metric | Primary Tool | Secondary Tool | Passive Alternative |
|---|---|---|---|
| Bandwidth | iperf3 -u | nuttcp | Interface counters |
| Throughput (TCP) | iperf3 | curl -w | NetFlow |
| Latency (RTT) | ping, hping3 | mtr | TCP timestamps |
| Jitter | iperf3 -u | D-ITG | RTP stats |
| Packet Loss | iperf3 -u | mtr | Interface counters |
| Path Analysis | traceroute, mtr | Paris traceroute | NetFlow |
| Application | curl, wrk, k6 | Custom scripts | APM tools |
Active vs. Passive Measurement:
Measuring a network changes it. Active measurements consume bandwidth and add queuing. Heavy testing can make a network appear worse than it is. Always consider measurement load when interpreting results, especially on constrained links.
Raw numbers are meaningless without proper interpretation. Common pitfalls and how to avoid them:
Statistical Analysis:
Network measurements are noisy. Proper analysis requires:
Percentile-Based Reporting:
For latency and jitter, always report percentiles:
| Percentile | Meaning | When It Matters | Example Use |
|---|---|---|---|
| P50 (Median) | Half of values below/above | Typical experience | General performance |
| P90 | 9 in 10 values below | Most users' experience | User-facing services |
| P95 | 19 in 20 values below | Almost all users | SLA targets |
| P99 | 99 in 100 values below | Tail latency | High-volume services |
| P99.9 | 1 in 1000 exceeds | Extreme tail | Payment processing |
For services making multiple serial calls, tail latency compounds. If each of 10 calls has 1% chance of being slow, the request has ~10% chance of being slow. For 100 calls: ~63% chance. At scale, P99 latency becomes P50 experience. This is why high-fan-out architectures obsess over tail latency.
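This compounding is easy to verify numerically. A minimal sketch (the 1% slow-call probability mirrors the figures above; the function name is illustrative):

```python
def p_request_slow(p_call_slow: float, num_serial_calls: int) -> float:
    """Probability that at least one of N independent serial calls is slow."""
    return 1 - (1 - p_call_slow) ** num_serial_calls

# Each call has a 1% chance of landing in its slow tail (beyond P99)
for n in [1, 10, 100]:
    print(f"{n:>3} calls: {p_request_slow(0.01, n) * 100:.1f}% chance the request is slow")
# →   1 calls: 1.0%, 10 calls: 9.6%, 100 calls: 63.4%
```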
Common Interpretation Mistakes:
Averaging Latency: A bimodal distribution with 10ms and 100ms values averages to 55ms—but no one experiences 55ms. Report percentiles.
Single-Point Testing: One good/bad result doesn't indicate typical performance. Variance matters.
Ignoring Context: 50ms latency to London is excellent; to the DNS server in the next room, it's a problem.
Confusing Bandwidth and Throughput: 'We have 1 Gbps' ≠ 'We get 1 Gbps.'
Overlooking Asymmetry: Upload vs. download, outbound vs. inbound latency may differ.
Testing at Wrong Time: Off-peak testing misses congestion; test during representative load.
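The first pitfall is easy to demonstrate with synthetic data. A minimal sketch, assuming an illustrative 90/10 split between fast cache hits and slow misses:

```python
import statistics

# Bimodal latency: 90% of requests are fast (10 ms), 10% slow (100 ms)
samples = sorted([10.0] * 900 + [100.0] * 100)

mean = statistics.mean(samples)           # 19.0 ms -- a latency no request experiences
median = statistics.median(samples)       # 10.0 ms -- the typical request
p99 = samples[int(0.99 * len(samples))]   # 100.0 ms -- the slow tail

print(f"Mean: {mean} ms, P50: {median} ms, P99: {p99} ms")
```

The mean lands between the two modes; the percentiles describe what users actually see.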
```python
"""
Proper statistical analysis of network measurement data.
Shows how to extract meaningful insights from raw measurements.
"""

import math
import statistics
from typing import List, Dict


def comprehensive_analysis(measurements: List[float], label: str = "Value") -> Dict:
    """
    Comprehensive statistical analysis of measurement data.
    Returns actionable metrics, not just raw statistics.
    """
    if len(measurements) < 2:
        return {"error": "Insufficient data"}

    sorted_data = sorted(measurements)
    n = len(sorted_data)

    def percentile(p: float) -> float:
        idx = int(p / 100 * n)
        return sorted_data[min(idx, n - 1)]

    mean = statistics.mean(measurements)
    median = statistics.median(measurements)
    stdev = statistics.stdev(measurements)

    # Detect distribution skewness
    # Positive skew: mean > median (common for latency)
    skewness = (mean - median) / stdev if stdev > 0 else 0

    # Detect outliers via the 1.5×IQR rule (simple heuristic)
    q1, q3 = percentile(25), percentile(75)
    iqr = q3 - q1
    outlier_low = q1 - 1.5 * iqr
    outlier_high = q3 + 1.5 * iqr
    outliers = [v for v in measurements if v < outlier_low or v > outlier_high]

    return {
        "sample_count": n,
        # Central tendency
        "mean": round(mean, 3),
        "median": round(median, 3),
        "mode_estimate": round(3 * median - 2 * mean, 3),  # Pearson's approximation
        # Spread
        "min": round(sorted_data[0], 3),
        "max": round(sorted_data[-1], 3),
        "stdev": round(stdev, 3),
        "cv_percent": round(stdev / mean * 100, 1) if mean > 0 else 0,
        # Percentiles
        "p50": round(percentile(50), 3),
        "p90": round(percentile(90), 3),
        "p95": round(percentile(95), 3),
        "p99": round(percentile(99), 3),
        # Shape analysis
        "skewness": round(skewness, 2),
        "is_right_skewed": skewness > 0.5,
        "outlier_count": len(outliers),
        "outlier_percent": round(len(outliers) / n * 100, 1),
        # Recommendations
        "use_median": abs(skewness) > 0.5,  # If skewed, prefer median
        "high_variance": stdev / mean > 0.3 if mean > 0 else False,
    }


def compare_measurements(baseline: List[float], comparison: List[float]) -> Dict:
    """
    Compare two sets of measurements to detect significant changes.
    """
    baseline_stats = comprehensive_analysis(baseline, "Baseline")
    comparison_stats = comprehensive_analysis(comparison, "Comparison")

    # Calculate relative changes
    def rel_change(b, c):
        return ((c - b) / b * 100) if b != 0 else float('inf')

    return {
        "baseline_p50": baseline_stats["p50"],
        "comparison_p50": comparison_stats["p50"],
        "p50_change_pct": round(rel_change(baseline_stats["p50"], comparison_stats["p50"]), 1),
        "baseline_p99": baseline_stats["p99"],
        "comparison_p99": comparison_stats["p99"],
        "p99_change_pct": round(rel_change(baseline_stats["p99"], comparison_stats["p99"]), 1),
        "variance_change": ("increased" if comparison_stats["stdev"] > baseline_stats["stdev"] * 1.2
                            else "decreased" if comparison_stats["stdev"] < baseline_stats["stdev"] * 0.8
                            else "stable"),
        "significant_degradation": comparison_stats["p95"] > baseline_stats["p95"] * 1.5,
    }


# Example analysis
import random
random.seed(42)

# Typical latency distribution (right-skewed)
normal_latency = [20 + random.expovariate(0.1) for _ in range(1000)]

print("=== Latency Measurement Analysis ===\n")
stats = comprehensive_analysis(normal_latency, "Latency (ms)")

print(f"Sample Size: {stats['sample_count']}")
print(f"\nCentral Tendency:")
print(f"  Mean:   {stats['mean']}ms")
print(f"  Median: {stats['median']}ms")
print(f"  Skewness: {stats['skewness']} ({'right-skewed' if stats['is_right_skewed'] else 'symmetric'})")

print(f"\nPercentiles:")
print(f"  P50: {stats['p50']}ms")
print(f"  P90: {stats['p90']}ms")
print(f"  P95: {stats['p95']}ms")
print(f"  P99: {stats['p99']}ms")

print(f"\nRecommendation: {'Use median (skewed data)' if stats['use_median'] else 'Mean is acceptable'}")
print(f"Variance concern: {'Yes - investigate' if stats['high_variance'] else 'No'}")
```

Benchmarking establishes baseline performance and enables meaningful comparisons. Proper benchmarking requires rigor:
Benchmarking Principles:
| Benchmark Type | What It Measures | Tool/Method | Key Metric |
|---|---|---|---|
| Maximum Throughput | Link capacity under ideal conditions | iperf3, multi-stream | Mbps/Gbps |
| Latency Baseline | Minimum achievable RTT | ping, low rate | ms (P50, P99) |
| Latency Under Load | RTT at various utilizations | iperf3 + ping | ms at 50%, 80% |
| Jitter Profile | Delay variation pattern | iperf3 -u, RTP | ms jitter, stability |
| Loss Threshold | Utilization where loss begins | iperf3 -u, increasing rate | % utilization |
| Concurrent Flows | Performance with multiple streams | iperf3 -P | Aggregate, per-flow |
| Long-duration | Stability over time | Multi-hour tests | Variance, degradation |
Creating a Benchmark Suite:
A comprehensive network benchmark suite should include:
```bash
#!/bin/bash
# Comprehensive network benchmark suite
# Run against iperf3 server on remote host

SERVER="benchmark-server.example.com"
DURATION=30
OUTPUT_DIR="./benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUTPUT_DIR"

echo "=== Network Benchmark Suite ==="
echo "Server: $SERVER"
echo "Duration per test: ${DURATION}s"
echo "Output: $OUTPUT_DIR"
echo ""

# === 1. Latency Baseline (no load) ===
echo "1. Latency Baseline..."
ping -c 100 -i 0.1 $SERVER > "$OUTPUT_DIR/latency_baseline.txt"

# === 2. Maximum TCP Throughput (single stream) ===
echo "2. Maximum Throughput (single stream)..."
iperf3 -c $SERVER -t $DURATION -J > "$OUTPUT_DIR/throughput_single.json"

# === 3. Maximum TCP Throughput (8 streams) ===
echo "3. Maximum Throughput (8 streams)..."
iperf3 -c $SERVER -t $DURATION -P 8 -J > "$OUTPUT_DIR/throughput_8stream.json"

# === 4. Reverse Direction ===
echo "4. Maximum Throughput (reverse)..."
iperf3 -c $SERVER -t $DURATION -R -J > "$OUTPUT_DIR/throughput_reverse.json"

# === 5-6. UDP Jitter/Loss Tests ===
echo "5. UDP Jitter Test (10 Mbps)..."
iperf3 -c $SERVER -u -b 10M -t $DURATION -J > "$OUTPUT_DIR/udp_10mbps.json"

echo "6. UDP Jitter Test (100 Mbps)..."
iperf3 -c $SERVER -u -b 100M -t $DURATION -J > "$OUTPUT_DIR/udp_100mbps.json"

# === 7. Latency Under Load ===
echo "7. Latency Under Load..."
# Start background load
iperf3 -c $SERVER -t 60 &
IPERF_PID=$!
sleep 5  # Let it ramp up
ping -c 50 -i 0.1 $SERVER > "$OUTPUT_DIR/latency_under_load.txt"
wait $IPERF_PID

# === 8. Bidirectional Test ===
echo "8. Bidirectional Test..."
iperf3 -c $SERVER -t $DURATION --bidir -J > "$OUTPUT_DIR/bidirectional.json"

# === 9. Long Duration Stability ===
echo "9. Long Duration (5 minutes)..."
iperf3 -c $SERVER -t 300 -i 1 -J > "$OUTPUT_DIR/long_duration.json"

echo ""
echo "=== Benchmark Complete ==="
echo "Results saved to: $OUTPUT_DIR"

# Generate summary
echo ""
echo "=== Quick Summary ==="
echo "Latency (idle):"
grep "rtt" "$OUTPUT_DIR/latency_baseline.txt" | tail -1

echo "Throughput (single):"
jq '.end.sum_received.bits_per_second / 1000000 | floor' "$OUTPUT_DIR/throughput_single.json" 2>/dev/null || echo "Parse manually"

echo "Throughput (8-stream):"
jq '.end.sum_received.bits_per_second / 1000000 | floor' "$OUTPUT_DIR/throughput_8stream.json" 2>/dev/null || echo "Parse manually"
```

Benchmark results are only valid for the specific configuration tested. Document everything: OS version, NIC driver, TCP stack settings, cable type, time of day, other traffic. Changes to any factor can invalidate comparisons.
Service Level Agreements translate performance requirements into contractual commitments. Well-designed SLAs are specific, measurable, achievable, and aligned with business needs.
SLA Components:
| Metric | Target | Measurement | Typical Penalty |
|---|---|---|---|
| Availability | 99.99% monthly | Uptime monitoring | Credit per hour down |
| Latency (RTT) | P95 < 50ms | Synthetic probes, 5min intervals | Credit if exceeded 5%+ of period |
| Jitter | P95 < 10ms | UDP stream analysis | Credit if exceeded |
| Packet Loss | < 0.1% monthly | Probe packets | Credit if exceeded |
| Throughput | 95% CIR | 5-minute samples | Credit proportional to shortfall |
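Targets like these are usually checked programmatically at the end of each measurement period. A minimal sketch, assuming sorted-index percentiles and illustrative probe values (`check_sla` and its default targets are hypothetical, not any provider's API):

```python
from typing import List, Dict

def check_sla(rtt_samples_ms: List[float], loss_rate: float,
              p95_target_ms: float = 50.0, loss_target: float = 0.001) -> Dict:
    """Evaluate one period of probe measurements against SLA targets."""
    data = sorted(rtt_samples_ms)
    p95 = data[int(0.95 * len(data))]  # simple sorted-index percentile
    return {
        "latency_p95_ms": p95,
        "latency_ok": p95 < p95_target_ms,
        "loss_ok": loss_rate < loss_target,
        "compliant": p95 < p95_target_ms and loss_rate < loss_target,
    }

# One month of 5-minute synthetic probes (illustrative values)
probes = [22.0] * 8000 + [48.0] * 600 + [75.0] * 340
print(check_sla(probes, loss_rate=0.0004))
```

Note that the P95 target is evaluated over the whole period, matching the "exceeded 5%+ of period" framing in the table.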
SLA Anti-Patterns:
Tiered SLA Design:
Often, different traffic classes warrant different SLAs:
| Service Class | DSCP | Latency P95 | Loss | Use Cases |
|---|---|---|---|---|
| Real-time | EF (46) | < 20ms | < 0.01% | Voice, video |
| Business Critical | AF41 (34) | < 50ms | < 0.1% | ERP, trading |
| Standard | AF21 (18) | < 100ms | < 0.5% | Web, email |
| Bulk | BE (0) | Best effort | Best effort | Backup, updates |
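Monitoring systems often encode a table like this directly. A minimal sketch mapping DSCP codepoints to the per-class targets above (the dictionary layout and helper name are illustrative choices):

```python
# DSCP value -> (class name, P95 latency target in ms, loss target fraction)
# None means best effort, per the Bulk row of the table.
SLA_CLASSES = {
    46: ("Real-time", 20.0, 0.0001),
    34: ("Business Critical", 50.0, 0.001),
    18: ("Standard", 100.0, 0.005),
    0:  ("Bulk", None, None),
}

def targets_for_dscp(dscp: int):
    """Look up the SLA class for a DSCP codepoint, defaulting to best effort."""
    return SLA_CLASSES.get(dscp, SLA_CLASSES[0])

name, latency_ms, loss = targets_for_dscp(46)
print(f"DSCP 46 -> {name}: P95 < {latency_ms} ms, loss < {loss:.2%}")
```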
When designing SLAs, start with achievable targets based on baseline data, then tighten as you improve. An SLA you can't meet damages credibility. An SLA you consistently beat builds trust and allows later tightening.
Capacity planning ensures networks can meet current and future demands. It's part science (traffic analysis), part art (predicting the future).
The Capacity Planning Process:
Measure Current Utilization
Forecast Demand
Determine Requirements
Plan Upgrades
| Threshold | Status | Action | Timeline |
|---|---|---|---|
| < 50% peak utilization | Comfortable | Monitor | Annual review |
| 50-70% peak | Healthy | Plan for growth | 6-12 month horizon |
| 70-80% peak | Warning | Budget for upgrade | 3-6 month horizon |
| 80-90% peak | Critical | Urgent upgrade needed | Immediate action |
| > 90% peak | Emergency | Service impact likely | ASAP |
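These thresholds translate directly into an alerting rule. A minimal sketch (`capacity_status` is a hypothetical helper following the table above):

```python
def capacity_status(peak_utilization: float) -> tuple:
    """Map peak utilization (0..1) to a status and recommended action."""
    if peak_utilization < 0.50:
        return ("Comfortable", "Monitor; annual review")
    if peak_utilization < 0.70:
        return ("Healthy", "Plan for growth (6-12 month horizon)")
    if peak_utilization < 0.80:
        return ("Warning", "Budget for upgrade (3-6 month horizon)")
    if peak_utilization < 0.90:
        return ("Critical", "Urgent upgrade needed")
    return ("Emergency", "Service impact likely; act ASAP")

for util in [0.45, 0.65, 0.75, 0.85, 0.95]:
    status, action = capacity_status(util)
    print(f"{util:.0%} peak -> {status}: {action}")
```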
Capacity vs. Demand Modeling:
Simple model: extrapolate the current growth rate forward (compound growth)
Future Capacity Need = Current Usage × (1 + Growth Rate)^Years
Examples: at 30% annual growth, demand doubles in under three years (1.3³ ≈ 2.2×); at 50% annual growth, it doubles in under two (1.5² = 2.25×).
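The model can also be inverted to estimate when a link fills up. A minimal sketch (the 400 Mbps current usage, 1 Gbps capacity, and 30% growth rate are illustrative):

```python
import math

def future_need(current_usage: float, growth_rate: float, years: float) -> float:
    """Future Capacity Need = Current Usage x (1 + Growth Rate)^Years."""
    return current_usage * (1 + growth_rate) ** years

def years_until_exhausted(current_usage: float, capacity: float, growth_rate: float) -> float:
    """Solve current x (1 + g)^t = capacity for t."""
    return math.log(capacity / current_usage) / math.log(1 + growth_rate)

# 400 Mbps used today on a 1 Gbps link, 30% annual growth (illustrative)
print(f"Need in 3 years: {future_need(400, 0.30, 3):.0f} Mbps")   # ~879 Mbps
print(f"Link full in:    {years_until_exhausted(400, 1000, 0.30):.1f} years")  # ~3.5 years
```

Combine the exhaustion estimate with procurement lead time to decide when to start the upgrade process.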
Planning for Bursts:
Average utilization doesn't capture peaks. Factor in:
Right-Sizing Decisions:
Why 80% utilization is the threshold: at 80%, a small burst can cause congestion, and the M/M/1 model puts average delay at 5× the unloaded service time. You need 20% headroom for bursts, measurement error, and growth between planning cycles.
When performance degrades, systematic troubleshooting isolates the cause efficiently:
The Troubleshooting Framework:
| Symptom | Likely Causes | Diagnostic Steps | Solutions |
|---|---|---|---|
| High latency (constant) | Distance, routing | Traceroute, check path | CDN, peering, closer server |
| High latency (variable) | Congestion, bufferbloat | Check utilization, queue depth | QoS, upgrade, AQM |
| Low throughput (single flow) | TCP limits, loss | Check window, RTT, loss; iperf | Tune TCP, fix loss |
| Low throughput (aggregate) | Bandwidth exhaustion | Interface utilization | Upgrade, traffic engineering |
| Packet loss (constant) | Link errors, congestion | Interface errors, queue drops | Replace cable, upgrade, QoS |
| Packet loss (periodic) | Maintenance, routing | Correlate timing, check logs | Improve redundancy |
| High jitter | Contention, interference | Check queuing, WiFi analysis | QoS, wired connection |
| Intermittent issues | Many possibilities | Continuous monitoring, correlation | Depends on cause |
The Divide-and-Conquer Approach:
Isolate the problem by testing segments:
Tools for Each Layer:
Just because two events coincide doesn't mean one caused the other. The network may appear slow because the server is slow. High utilization may be an effect, not a cause. Always verify causal relationships by testing hypotheses.
This page synthesized everything we've learned about network performance into a practical framework. Understanding metrics individually is necessary but not sufficient—seeing how they interact enables effective network engineering.
Module Complete:
You've completed the Network Performance module. You now understand the core metrics that define network quality—bandwidth, throughput, latency, jitter—and how to measure, analyze, and optimize them in practice.
This foundation prepares you for deeper study of how these metrics manifest in specific protocols (TCP congestion control, QoS mechanisms) and technologies (wireless, WAN optimization, cloud networking) throughout the curriculum.
Congratulations on completing the Network Performance module! Next, explore network hardware to understand the physical and logical devices that create the networks we've been measuring.