We've explored bandwidth, throughput, latency, and jitter individually. But networks don't exist in isolation—these metrics interact in complex ways. A high-bandwidth link with terrible latency. Low latency with high jitter. Good throughput that collapses under load. Understanding network performance requires seeing the complete picture.
Performance metrics are the language of network engineering. They allow us to describe problems precisely, compare alternatives objectively, set meaningful SLAs, and make data-driven decisions. This page synthesizes everything into a comprehensive framework for performance analysis.
By the end of this page, you will understand how performance metrics relate to each other, apply structured measurement methodologies, interpret results correctly, design meaningful SLAs, and make informed capacity planning decisions.
Performance metrics are not independent—they form an interconnected system where changes to one affect others:
The Fundamental Relationships:
| When This Increases... | This Happens... | Because... |
|---|---|---|
| Bandwidth (upgrade) | Throughput can increase | Larger pipe, more capacity |
| Bandwidth (upgrade) | Latency unchanged | Propagation delay still same |
| Latency | TCP throughput decreases | BDP limits window efficiency |
| Utilization (load) | Latency increases | Queuing delay grows |
| Utilization (load) | Jitter increases | Variable queue depths |
| Utilization (near 100%) | Loss increases | Queue overflow |
| Loss | Throughput decreases | Retransmissions, congestion control |
| Jitter | Buffer requirements increase | Must absorb variation |
The Queuing Theory Connection:
Queuing theory provides mathematical models for understanding latency as a function of load:
M/M/1 Queue: Average delay grows hyperbolically as utilization approaches 100%
Avg Delay = Service Time / (1 - Utilization)
Example: At 50% utilization, delay is 2× service time. At 90%, it's 10×. At 99%, it's 100×.
This explains why networks become unusable at high utilization—the last 10% of capacity costs enormous latency.
Degradation is not linear. At 80% utilization, queuing delay is already 5× the service time; at 95%, it's 20×. This is why capacity planning targets 60-70% peak utilization: headroom for bursts without catastrophic latency.
```python
"""
Demonstrate relationships between network performance metrics.
Shows how they interact under different conditions.
"""

import math
from typing import List, Dict


def mm1_delay(service_time_ms: float, utilization: float) -> float:
    """
    M/M/1 queuing model: average delay as function of utilization.
    As utilization → 1, delay → infinity.
    """
    if utilization >= 1.0:
        return float('inf')
    return service_time_ms / (1 - utilization)


def tcp_throughput_mathis(rtt_ms: float, loss_rate: float, mss: int = 1460) -> float:
    """
    Mathis equation for TCP throughput.
    Throughput ≈ (MSS × C) / (RTT × √p)
    """
    if loss_rate <= 0:
        return float('inf')
    c = 1.22
    rtt_sec = rtt_ms / 1000
    return (mss * 8 * c) / (rtt_sec * math.sqrt(loss_rate))


def calculate_effective_throughput(
        bandwidth_mbps: float,
        base_latency_ms: float,
        utilization: float,
        loss_rate: float) -> Dict:
    """
    Calculate effective throughput considering all factors.
    """
    # Queuing adds to latency
    service_time = (1500 * 8) / (bandwidth_mbps * 1e6) * 1000  # ms
    queue_delay = mm1_delay(service_time, min(utilization, 0.99)) - service_time
    total_rtt = 2 * (base_latency_ms + queue_delay)

    # TCP limited by RTT and loss
    tcp_max = tcp_throughput_mathis(total_rtt, max(loss_rate, 0.0001))

    # Available bandwidth
    available = bandwidth_mbps * 1e6 * (1 - utilization)

    # Effective throughput is minimum of limits
    effective = min(tcp_max, available, bandwidth_mbps * 1e6)

    return {
        "bandwidth_mbps": bandwidth_mbps,
        "base_latency_ms": base_latency_ms,
        "utilization_pct": utilization * 100,
        "loss_rate_pct": loss_rate * 100,
        "queue_delay_ms": round(queue_delay, 2),
        "total_rtt_ms": round(total_rtt, 2),
        "tcp_limited_mbps": round(tcp_max / 1e6, 2) if tcp_max != float('inf') else "unlimited",
        "congestion_limited_mbps": round(available / 1e6, 2),
        "effective_mbps": round(effective / 1e6, 2),
    }


# Show the effects of utilization on latency
print("=== Effect of Utilization on Queuing Delay ===")
print("(1 Gbps link, 1500-byte packets, 12μs service time)\n")
print(f"{'Utilization':>12} {'Queue Delay':>12} {'Relative':>10}")
print("-" * 36)

service_time = 0.012  # ~12 microseconds for 1500 bytes on 1 Gbps
for util in [0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    delay = mm1_delay(service_time, util)
    relative = delay / service_time
    print(f"{util*100:>11.0f}% {delay:>11.3f}ms {relative:>9.1f}×")

print("\n=== Combined Effects Example ===")
print("(100 Mbps link, 20ms base latency)\n")

scenarios = [
    ("Light load, no loss", 0.3, 0.0001),
    ("Moderate load, low loss", 0.6, 0.001),
    ("Heavy load, moderate loss", 0.8, 0.01),
    ("Near-saturated, high loss", 0.95, 0.03),
]

for name, util, loss in scenarios:
    result = calculate_effective_throughput(100, 20, util, loss)
    print(f"{name}:")
    print(f"  Queue delay: {result['queue_delay_ms']}ms, Total RTT: {result['total_rtt_ms']}ms")
    print(f"  TCP limit: {result['tcp_limited_mbps']} Mbps, Congestion limit: {result['congestion_limited_mbps']} Mbps")
    print(f"  → Effective throughput: {result['effective_mbps']} Mbps")
    print()
```

Random spot-checks don't constitute meaningful measurement. Professional network performance analysis requires structured methodology:
The Measurement Framework:
Define Objectives — What questions are you answering? Capacity? Baseline? Troubleshooting? Comparison?
Select Metrics — Choose metrics relevant to objectives:
Choose Tools — Match tools to metrics and constraints:
Establish Baseline — Measure 'normal' under known conditions
Control Variables — Isolate what you're testing
Sufficient Samples — Statistical significance requires volume
Document Conditions — Time, load, configurations
Analyze and Report — Derive actionable conclusions
| Metric | Primary Tool | Secondary Tool | Passive Alternative |
|---|---|---|---|
| Bandwidth | iperf3 -u | nuttcp | Interface counters |
| Throughput (TCP) | iperf3 | curl -w | NetFlow |
| Latency (RTT) | ping, hping3 | mtr | TCP timestamps |
| Jitter | iperf3 -u | D-ITG | RTP stats |
| Packet Loss | iperf3 -u | mtr | Interface counters |
| Path Analysis | traceroute, mtr | Paris traceroute | NetFlow |
| Application | curl, wrk, k6 | Custom scripts | APM tools |
Active vs. Passive Measurement:
Measuring a network changes it. Active measurements consume bandwidth and add queuing. Heavy testing can make a network appear worse than it is. Always consider measurement load when interpreting results, especially on constrained links.
Raw numbers are meaningless without proper interpretation. Common pitfalls and how to avoid them:
Statistical Analysis:
Network measurements are noisy. Proper analysis requires:
Percentile-Based Reporting:
For latency and jitter, always report percentiles:
| Percentile | Meaning | When It Matters | Example Use |
|---|---|---|---|
| P50 (Median) | Half of values below/above | Typical experience | General performance |
| P90 | 9 in 10 values below | Most users' experience | User-facing services |
| P95 | 19 in 20 values below | Almost all users | SLA targets |
| P99 | 99 in 100 values below | Tail latency | High-volume services |
| P99.9 | 1 in 1000 exceeds | Extreme tail | Payment processing |
For services making multiple serial calls, tail latency compounds. If each of 10 calls has 1% chance of being slow, the request has ~10% chance of being slow. For 100 calls: ~63% chance. At scale, P99 latency becomes P50 experience. This is why high-fan-out architectures obsess over tail latency.
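This compounding is easy to verify numerically. A minimal sketch (the 1% slow-call probability mirrors the figures above; the function name is illustrative):

```python
def p_request_slow(p_call_slow: float, num_serial_calls: int) -> float:
    """Probability that at least one of N independent serial calls is slow."""
    return 1 - (1 - p_call_slow) ** num_serial_calls

# Each call has a 1% chance of landing in its slow tail (beyond P99)
for n in [1, 10, 100]:
    print(f"{n:>3} calls: {p_request_slow(0.01, n) * 100:.1f}% chance the request is slow")
# →   1 calls: 1.0%, 10 calls: 9.6%, 100 calls: 63.4%
```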
Common Interpretation Mistakes:
Averaging Latency: A bimodal distribution with 10ms and 100ms values averages to 55ms—but no one experiences 55ms. Report percentiles.
Single-Point Testing: One good/bad result doesn't indicate typical performance. Variance matters.
Ignoring Context: 50ms latency to London is excellent; to the DNS server in the next room, it's a problem.
Confusing Bandwidth and Throughput: 'We have 1 Gbps' ≠ 'We get 1 Gbps.'
Overlooking Asymmetry: Upload vs. download, outbound vs. inbound latency may differ.
Testing at Wrong Time: Off-peak testing misses congestion; test during representative load.
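The first pitfall is easy to demonstrate with synthetic data. A minimal sketch, assuming an illustrative 90/10 split between fast cache hits and slow misses:

```python
import statistics

# Bimodal latency: 90% of requests are fast (10 ms), 10% slow (100 ms)
samples = sorted([10.0] * 900 + [100.0] * 100)

mean = statistics.mean(samples)           # 19.0 ms -- a latency no request experiences
median = statistics.median(samples)       # 10.0 ms -- the typical request
p99 = samples[int(0.99 * len(samples))]   # 100.0 ms -- the slow tail

print(f"Mean: {mean} ms, P50: {median} ms, P99: {p99} ms")
```

The mean lands between the two modes; the percentiles describe what users actually see.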
```python
"""
Proper statistical analysis of network measurement data.
Shows how to extract meaningful insights from raw measurements.
"""

import math
import statistics
from typing import List, Dict


def comprehensive_analysis(measurements: List[float], label: str = "Value") -> Dict:
    """
    Comprehensive statistical analysis of measurement data.
    Returns actionable metrics, not just raw statistics.
    """
    if len(measurements) < 2:
        return {"error": "Insufficient data"}

    sorted_data = sorted(measurements)
    n = len(sorted_data)

    def percentile(p: float) -> float:
        idx = int(p / 100 * n)
        return sorted_data[min(idx, n - 1)]

    mean = statistics.mean(measurements)
    median = statistics.median(measurements)
    stdev = statistics.stdev(measurements)

    # Detect distribution skewness
    # Positive skew: mean > median (common for latency)
    skewness = (mean - median) / stdev if stdev > 0 else 0

    # Detect outliers via the 1.5×IQR rule (simple heuristic)
    q1, q3 = percentile(25), percentile(75)
    iqr = q3 - q1
    outlier_low = q1 - 1.5 * iqr
    outlier_high = q3 + 1.5 * iqr
    outliers = [v for v in measurements if v < outlier_low or v > outlier_high]

    return {
        "sample_count": n,
        # Central tendency
        "mean": round(mean, 3),
        "median": round(median, 3),
        "mode_estimate": round(3 * median - 2 * mean, 3),  # Pearson's approximation
        # Spread
        "min": round(sorted_data[0], 3),
        "max": round(sorted_data[-1], 3),
        "stdev": round(stdev, 3),
        "cv_percent": round(stdev / mean * 100, 1) if mean > 0 else 0,
        # Percentiles
        "p50": round(percentile(50), 3),
        "p90": round(percentile(90), 3),
        "p95": round(percentile(95), 3),
        "p99": round(percentile(99), 3),
        # Shape analysis
        "skewness": round(skewness, 2),
        "is_right_skewed": skewness > 0.5,
        "outlier_count": len(outliers),
        "outlier_percent": round(len(outliers) / n * 100, 1),
        # Recommendations
        "use_median": abs(skewness) > 0.5,  # If skewed, prefer median
        "high_variance": stdev / mean > 0.3 if mean > 0 else False,
    }


def compare_measurements(baseline: List[float], comparison: List[float]) -> Dict:
    """
    Compare two sets of measurements to detect significant changes.
    """
    baseline_stats = comprehensive_analysis(baseline, "Baseline")
    comparison_stats = comprehensive_analysis(comparison, "Comparison")

    # Calculate relative changes
    def rel_change(b, c):
        return ((c - b) / b * 100) if b != 0 else float('inf')

    return {
        "baseline_p50": baseline_stats["p50"],
        "comparison_p50": comparison_stats["p50"],
        "p50_change_pct": round(rel_change(baseline_stats["p50"], comparison_stats["p50"]), 1),
        "baseline_p99": baseline_stats["p99"],
        "comparison_p99": comparison_stats["p99"],
        "p99_change_pct": round(rel_change(baseline_stats["p99"], comparison_stats["p99"]), 1),
        "variance_change": ("increased" if comparison_stats["stdev"] > baseline_stats["stdev"] * 1.2
                            else "decreased" if comparison_stats["stdev"] < baseline_stats["stdev"] * 0.8
                            else "stable"),
        "significant_degradation": comparison_stats["p95"] > baseline_stats["p95"] * 1.5,
    }


# Example analysis
import random
random.seed(42)

# Typical latency distribution (right-skewed)
normal_latency = [20 + random.expovariate(0.1) for _ in range(1000)]

print("=== Latency Measurement Analysis ===\n")
stats = comprehensive_analysis(normal_latency, "Latency (ms)")

print(f"Sample Size: {stats['sample_count']}")
print(f"\nCentral Tendency:")
print(f"  Mean:   {stats['mean']}ms")
print(f"  Median: {stats['median']}ms")
print(f"  Skewness: {stats['skewness']} ({'right-skewed' if stats['is_right_skewed'] else 'symmetric'})")

print(f"\nPercentiles:")
print(f"  P50: {stats['p50']}ms")
print(f"  P90: {stats['p90']}ms")
print(f"  P95: {stats['p95']}ms")
print(f"  P99: {stats['p99']}ms")

print(f"\nRecommendation: {'Use median (skewed data)' if stats['use_median'] else 'Mean is acceptable'}")
print(f"Variance concern: {'Yes - investigate' if stats['high_variance'] else 'No'}")
```

Benchmarking establishes baseline performance and enables meaningful comparisons. Proper benchmarking requires rigor:
Benchmarking Principles:
| Benchmark Type | What It Measures | Tool/Method | Key Metric |
|---|---|---|---|
| Maximum Throughput | Link capacity under ideal conditions | iperf3, multi-stream | Mbps/Gbps |
| Latency Baseline | Minimum achievable RTT | ping, low rate | ms (P50, P99) |
| Latency Under Load | RTT at various utilizations | iperf3 + ping | ms at 50%, 80% |
| Jitter Profile | Delay variation pattern | iperf3 -u, RTP | ms jitter, stability |
| Loss Threshold | Utilization where loss begins | iperf3 -u, increasing rate | % utilization |
| Concurrent Flows | Performance with multiple streams | iperf3 -P | Aggregate, per-flow |
| Long-duration | Stability over time | Multi-hour tests | Variance, degradation |
Creating a Benchmark Suite:
A comprehensive network benchmark suite should include:
```bash
#!/bin/bash
# Comprehensive network benchmark suite
# Run against iperf3 server on remote host

SERVER="benchmark-server.example.com"
DURATION=30
OUTPUT_DIR="./benchmark_results/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUTPUT_DIR"

echo "=== Network Benchmark Suite ==="
echo "Server: $SERVER"
echo "Duration per test: ${DURATION}s"
echo "Output: $OUTPUT_DIR"
echo ""

# === 1. Latency Baseline (no load) ===
echo "1. Latency Baseline..."
ping -c 100 -i 0.1 $SERVER > "$OUTPUT_DIR/latency_baseline.txt"

# === 2. Maximum TCP Throughput (single stream) ===
echo "2. Maximum Throughput (single stream)..."
iperf3 -c $SERVER -t $DURATION -J > "$OUTPUT_DIR/throughput_single.json"

# === 3. Maximum TCP Throughput (8 streams) ===
echo "3. Maximum Throughput (8 streams)..."
iperf3 -c $SERVER -t $DURATION -P 8 -J > "$OUTPUT_DIR/throughput_8stream.json"

# === 4. Reverse Direction ===
echo "4. Maximum Throughput (reverse)..."
iperf3 -c $SERVER -t $DURATION -R -J > "$OUTPUT_DIR/throughput_reverse.json"

# === 5-6. UDP Jitter/Loss Tests ===
echo "5. UDP Jitter Test (10 Mbps)..."
iperf3 -c $SERVER -u -b 10M -t $DURATION -J > "$OUTPUT_DIR/udp_10mbps.json"

echo "6. UDP Jitter Test (100 Mbps)..."
iperf3 -c $SERVER -u -b 100M -t $DURATION -J > "$OUTPUT_DIR/udp_100mbps.json"

# === 7. Latency Under Load ===
echo "7. Latency Under Load..."
# Start background load
iperf3 -c $SERVER -t 60 &
IPERF_PID=$!
sleep 5  # Let it ramp up
ping -c 50 -i 0.1 $SERVER > "$OUTPUT_DIR/latency_under_load.txt"
wait $IPERF_PID

# === 8. Bidirectional Test ===
echo "8. Bidirectional Test..."
iperf3 -c $SERVER -t $DURATION --bidir -J > "$OUTPUT_DIR/bidirectional.json"

# === 9. Long Duration Stability ===
echo "9. Long Duration (5 minutes)..."
iperf3 -c $SERVER -t 300 -i 1 -J > "$OUTPUT_DIR/long_duration.json"

echo ""
echo "=== Benchmark Complete ==="
echo "Results saved to: $OUTPUT_DIR"

# Generate summary
echo ""
echo "=== Quick Summary ==="
echo "Latency (idle):"
grep "rtt" "$OUTPUT_DIR/latency_baseline.txt" | tail -1

echo "Throughput (single):"
jq '.end.sum_received.bits_per_second / 1000000 | floor' "$OUTPUT_DIR/throughput_single.json" 2>/dev/null || echo "Parse manually"

echo "Throughput (8-stream):"
jq '.end.sum_received.bits_per_second / 1000000 | floor' "$OUTPUT_DIR/throughput_8stream.json" 2>/dev/null || echo "Parse manually"
```

Benchmark results are only valid for the specific configuration tested. Document everything: OS version, NIC driver, TCP stack settings, cable type, time of day, other traffic. Changes to any factor can invalidate comparisons.
Service Level Agreements translate performance requirements into contractual commitments. Well-designed SLAs are specific, measurable, achievable, and aligned with business needs.
SLA Components:
| Metric | Target | Measurement | Typical Penalty |
|---|---|---|---|
| Availability | 99.99% monthly | Uptime monitoring | Credit per hour down |
| Latency (RTT) | P95 < 50ms | Synthetic probes, 5min intervals | Credit if exceeded 5%+ of period |
| Jitter | P95 < 10ms | UDP stream analysis | Credit if exceeded |
| Packet Loss | < 0.1% monthly | Probe packets | Credit if exceeded |
| Throughput | 95% CIR | 5-minute samples | Credit proportional to shortfall |
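Targets like these are usually checked programmatically at the end of each measurement period. A minimal sketch, assuming sorted-index percentiles and illustrative probe values (`check_sla` and its default targets are hypothetical, not any provider's API):

```python
from typing import List, Dict

def check_sla(rtt_samples_ms: List[float], loss_rate: float,
              p95_target_ms: float = 50.0, loss_target: float = 0.001) -> Dict:
    """Evaluate one period of probe measurements against SLA targets."""
    data = sorted(rtt_samples_ms)
    p95 = data[int(0.95 * len(data))]  # simple sorted-index percentile
    return {
        "latency_p95_ms": p95,
        "latency_ok": p95 < p95_target_ms,
        "loss_ok": loss_rate < loss_target,
        "compliant": p95 < p95_target_ms and loss_rate < loss_target,
    }

# One month of 5-minute synthetic probes (illustrative values)
probes = [22.0] * 8000 + [48.0] * 600 + [75.0] * 340
print(check_sla(probes, loss_rate=0.0004))
```

Note that the P95 target is evaluated over the whole period, matching the "exceeded 5%+ of period" framing in the table.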
SLA Anti-Patterns:
Tiered SLA Design:
Often, different traffic classes warrant different SLAs:
| Service Class | DSCP | Latency P95 | Loss | Use Cases |
|---|---|---|---|---|
| Real-time | EF (46) | < 20ms | < 0.01% | Voice, video |
| Business Critical | AF41 (34) | < 50ms | < 0.1% | ERP, trading |
| Standard | AF21 (18) | < 100ms | < 0.5% | Web, email |
| Bulk | BE (0) | Best effort | Best effort | Backup, updates |
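Monitoring systems often encode a table like this directly. A minimal sketch mapping DSCP codepoints to the per-class targets above (the dictionary layout and helper name are illustrative choices):

```python
# DSCP value -> (class name, P95 latency target in ms, loss target fraction)
# None means best effort, per the Bulk row of the table.
SLA_CLASSES = {
    46: ("Real-time", 20.0, 0.0001),
    34: ("Business Critical", 50.0, 0.001),
    18: ("Standard", 100.0, 0.005),
    0:  ("Bulk", None, None),
}

def targets_for_dscp(dscp: int):
    """Look up the SLA class for a DSCP codepoint, defaulting to best effort."""
    return SLA_CLASSES.get(dscp, SLA_CLASSES[0])

name, latency_ms, loss = targets_for_dscp(46)
print(f"DSCP 46 -> {name}: P95 < {latency_ms} ms, loss < {loss:.2%}")
```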
When designing SLAs, start with achievable targets based on baseline data, then tighten as you improve. An SLA you can't meet damages credibility. An SLA you consistently beat builds trust and allows later tightening.
Capacity planning ensures networks can meet current and future demands. It's part science (traffic analysis), part art (predicting the future).
The Capacity Planning Process:
Measure Current Utilization
Forecast Demand
Determine Requirements
Plan Upgrades
| Threshold | Status | Action | Timeline |
|---|---|---|---|
| < 50% peak utilization | Comfortable | Monitor | Annual review |
| 50-70% peak | Healthy | Plan for growth | 6-12 month horizon |
| 70-80% peak | Warning | Budget for upgrade | 3-6 month horizon |
| 80-90% peak | Critical | Urgent upgrade needed | Immediate action |
| > 90% peak | Emergency | Service impact likely | ASAP |
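These thresholds translate directly into an alerting rule. A minimal sketch (`capacity_status` is a hypothetical helper following the table above):

```python
def capacity_status(peak_utilization: float) -> tuple:
    """Map peak utilization (0..1) to a status and recommended action."""
    if peak_utilization < 0.50:
        return ("Comfortable", "Monitor; annual review")
    if peak_utilization < 0.70:
        return ("Healthy", "Plan for growth (6-12 month horizon)")
    if peak_utilization < 0.80:
        return ("Warning", "Budget for upgrade (3-6 month horizon)")
    if peak_utilization < 0.90:
        return ("Critical", "Urgent upgrade needed")
    return ("Emergency", "Service impact likely; act ASAP")

for util in [0.45, 0.65, 0.75, 0.85, 0.95]:
    status, action = capacity_status(util)
    print(f"{util:.0%} peak -> {status}: {action}")
```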
Capacity vs. Demand Modeling:
Simple model: extrapolate the current growth rate forward (compound growth)
Future Capacity Need = Current Usage × (1 + Growth Rate)^Years
Examples: at 30% annual growth, demand doubles in under three years (1.3³ ≈ 2.2×); at 50% annual growth, it doubles in under two (1.5² = 2.25×).
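The model can also be inverted to estimate when a link fills up. A minimal sketch (the 400 Mbps current usage, 1 Gbps capacity, and 30% growth rate are illustrative):

```python
import math

def future_need(current_usage: float, growth_rate: float, years: float) -> float:
    """Future Capacity Need = Current Usage x (1 + Growth Rate)^Years."""
    return current_usage * (1 + growth_rate) ** years

def years_until_exhausted(current_usage: float, capacity: float, growth_rate: float) -> float:
    """Solve current x (1 + g)^t = capacity for t."""
    return math.log(capacity / current_usage) / math.log(1 + growth_rate)

# 400 Mbps used today on a 1 Gbps link, 30% annual growth (illustrative)
print(f"Need in 3 years: {future_need(400, 0.30, 3):.0f} Mbps")   # ~879 Mbps
print(f"Link full in:    {years_until_exhausted(400, 1000, 0.30):.1f} years")  # ~3.5 years
```

Combine the exhaustion estimate with procurement lead time to decide when to start the upgrade process.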
Planning for Bursts:
Average utilization doesn't capture peaks. Factor in:
Right-Sizing Decisions:
Why 80% utilization is the threshold: at 80%, a small burst can cause congestion, and the M/M/1 model puts average delay at 5× the unloaded service time. You need 20% headroom for bursts, measurement error, and growth between planning cycles.
When performance degrades, systematic troubleshooting isolates the cause efficiently:
The Troubleshooting Framework:
| Symptom | Likely Causes | Diagnostic Steps | Solutions |
|---|---|---|---|
| High latency (constant) | Distance, routing | Traceroute, check path | CDN, peering, closer server |
| High latency (variable) | Congestion, bufferbloat | Check utilization, queue depth | QoS, upgrade, AQM |
| Low throughput (single flow) | TCP limits, loss | Check window, RTT, loss; iperf | Tune TCP, fix loss |
| Low throughput (aggregate) | Bandwidth exhaustion | Interface utilization | Upgrade, traffic engineering |
| Packet loss (constant) | Link errors, congestion | Interface errors, queue drops | Replace cable, upgrade, QoS |
| Packet loss (periodic) | Maintenance, routing | Correlate timing, check logs | Improve redundancy |
| High jitter | Contention, interference | Check queuing, WiFi analysis | QoS, wired connection |
| Intermittent issues | Many possibilities | Continuous monitoring, correlation | Depends on cause |
The Divide-and-Conquer Approach:
Isolate the problem by testing segments:
Tools for Each Layer:
Just because two events coincide doesn't mean one caused the other. The network may appear slow because the server is slow. High utilization may be an effect, not a cause. Always verify causal relationships by testing hypotheses.
This page synthesized everything we've learned about network performance into a practical framework. Understanding metrics individually is necessary but not sufficient—seeing how they interact enables effective network engineering.
Module Complete:
You've completed the Network Performance module. You now understand the core metrics that define network quality—bandwidth, throughput, latency, jitter—and how to measure, analyze, and optimize them in practice.
This foundation prepares you for deeper study of how these metrics manifest in specific protocols (TCP congestion control, QoS mechanisms) and technologies (wireless, WAN optimization, cloud networking) throughout the curriculum.
Congratulations on completing the Network Performance module! Next, explore network hardware to understand the physical and logical devices that create the networks we've been measuring.