In distributed systems, the network is the connective tissue that binds everything together. Every microservice call, every database query, every cache lookup, every message queue interaction—all traverse the network. Yet the network is often the least understood component of system performance.
Network monitoring is the discipline of observing, measuring, and analyzing network behavior to understand latency, detect anomalies, and diagnose connectivity issues. It transforms the network from an opaque "it just works" assumption into a well-understood, observable component of your system.
When an application is slow, the network is often blamed first but investigated last. Network monitoring provides the evidence to either confirm network issues or eliminate them from investigation.
By the end of this page, you will understand network fundamentals relevant to monitoring, key network metrics, latency analysis techniques, connection state tracking, and bandwidth monitoring. You'll learn to diagnose network issues systematically using the same approaches employed by principal engineers and network specialists.
Effective network monitoring requires understanding what you're measuring. Let's establish the foundational concepts that inform monitoring strategies.
The OSI/TCP-IP Layers That Matter:
While the 7-layer OSI model is academically complete, practical network monitoring focuses on specific layers:
| Layer | Protocol Examples | What to Monitor | Common Issues |
|---|---|---|---|
| Layer 7 (Application) | HTTP, gRPC, DNS | Request latency, error rates, throughput | Slow responses, 5xx errors, timeouts |
| Layer 4 (Transport) | TCP, UDP | Connection states, retransmissions, RTT | Connection failures, high retransmit rate |
| Layer 3 (Network) | IP, ICMP | Packet loss, routing, latency | Unreachable hosts, asymmetric routing |
| Layer 2 (Data Link) | Ethernet, ARP | ARP cache, MAC issues | Usually managed by cloud/infra providers |
TCP Connection Lifecycle:
Most distributed system communication uses TCP. Understanding the TCP lifecycle is essential for interpreting connection metrics:
Each phase can introduce latency and each has distinct failure modes.
```text
TCP Connection State Diagram (Simplified)
=========================================

  Client                           Server
    |                                |
    |------------ SYN ------------->|   Client: SYN_SENT
    |                                |   Server: SYN_RECEIVED
    |<--------- SYN-ACK ------------|
    |                                |
    |------------ ACK ------------->|   Both: ESTABLISHED
    |                                |
    |<========== DATA =============>|   Normal data transfer
    |                                |
    |------------ FIN ------------->|   Client: FIN_WAIT_1
    |                                |   Server: CLOSE_WAIT
    |<----------- ACK --------------|   Client: FIN_WAIT_2
    |                                |
    |<----------- FIN --------------|   Server: LAST_ACK
    |                                |
    |------------ ACK ------------->|   Client: TIME_WAIT (waits 2*MSL)
    |                                |   Server: CLOSED
    |                                |
    | (after timeout)                |
    |                                |   Client: CLOSED

Key States to Monitor:
----------------------
- ESTABLISHED: Active connections (expected state during use)
- TIME_WAIT: Recently closed connections, waiting for late packets
  High count = rapid connection cycling (connection pooling helps)
- CLOSE_WAIT: Server received FIN but hasn't closed (application hung?)
  High count = potential application bug or resource leak
- SYN_SENT/SYN_RECEIVED: Handshake in progress
  Accumulation = network issues or firewall problems

Monitoring Query (Linux):
$ ss -s        # Summary of socket states
$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn
# Example output:
#   1523 ESTABLISHED
#    892 TIME_WAIT
#     45 CLOSE_WAIT   <-- Investigate if this grows
#     12 SYN_SENT
```

Connections in TIME_WAIT consume local ports for typically 60-120 seconds. High-throughput systems making many short-lived connections can exhaust available ports, causing connection failures. Solutions: connection pooling, SO_REUSEADDR, or reducing MSL (with caution).
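For a programmatic view of the same state counts (for example, to export them to a metrics system), here is a minimal sketch using the third-party `psutil` package; the library choice and the 100-connection threshold are assumptions for illustration, not part of the tooling above:

```python
# Minimal sketch: count TCP connection states with psutil (third-party package,
# an assumption -- any source of socket states would do). May need elevated
# privileges on some platforms to see sockets owned by other users.
from collections import Counter

import psutil


def tcp_state_counts() -> Counter:
    """Return a Counter mapping TCP state names to connection counts."""
    return Counter(c.status for c in psutil.net_connections(kind="tcp"))


if __name__ == "__main__":
    counts = tcp_state_counts()
    for state, count in counts.most_common():
        print(f"{state:<15} {count}")

    # A growing CLOSE_WAIT count usually means the application received FIN
    # but never closed its side of the socket. The threshold is illustrative.
    if counts.get(psutil.CONN_CLOSE_WAIT, 0) > 100:
        print("WARNING: high CLOSE_WAIT count -- check for unclosed sockets")
```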
Effective network monitoring requires tracking the right metrics. The following represent the essential measurements for distributed system health:
Latency Metrics: round-trip time (RTT), RTT variance (jitter), and request latency percentiles (P50/P95/P99) between services.
Reliability Metrics: packet loss, TCP retransmission rate, and connection failure rate.
Throughput Metrics: bytes sent and received per second, and bandwidth utilization as a share of link capacity.
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Packet Loss | < 0.01% | 0.01% - 0.1% | > 0.1% |
| Retransmission Rate | < 1% | 1% - 3% | > 3% |
| RTT Variance | < 10ms jitter | 10-50ms jitter | > 50ms jitter |
| Bandwidth Utilization | < 60% | 60% - 80% | > 80% |
| Connection Failure Rate | < 0.1% | 0.1% - 1% | > 1% |
The Bandwidth-Delay Product (BDP) describes how much data can be 'in flight' on a network path. High-latency links (e.g., cross-continent) require larger TCP receive windows to saturate bandwidth. Monitor both latency and bandwidth together to understand true capacity.
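To make the BDP concrete, here is a small worked example in Python; the 1 Gbps / 80 ms figures are illustrative, not taken from the text above:

```python
# Bandwidth-Delay Product: how much data can be "in flight" on a path.
# BDP (bytes) = bandwidth (bytes/second) x round-trip time (seconds)

def bdp_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return (bandwidth_bits_per_sec / 8) * rtt_seconds


# Illustrative cross-continent link: 1 Gbps with an 80 ms RTT.
bdp = bdp_bytes(1_000_000_000, 0.080)
print(f"BDP: {bdp / 1_000_000:.0f} MB in flight")   # -> 10 MB

# To keep this link saturated, the TCP receive window must be at least the
# BDP (~10 MB here); default windows are often much smaller, which is why
# high-latency links frequently underutilize their nominal bandwidth.
```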
Effective network monitoring combines multiple tools, each providing different perspectives on network behavior.
System-Level Monitoring:
```bash
#!/bin/bash
# Essential Network Monitoring Commands

# ============================================
# ss (Socket Statistics) - Modern netstat replacement
# ============================================

# Connection summary by state
ss -s

# TCP connections with timing details
ss -ti

# Connections to specific port
ss -tn state established '( dport = :443 )'

# Show process owning connection
ss -tnp | grep ESTAB

# Output example:
# ESTAB 0 0 10.0.1.45:52341 10.0.2.100:5432
#   cubic wscale:7,7 rto:204 rtt:3.5/1.3 ato:40 mss:1448
#   pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:15234
#   bytes_acked:15234 bytes_received:45678 segs_out:102 segs_in:98
#
# Key values:
# - rtt:3.5/1.3 = RTT and variance in ms
# - cwnd:10 = Congestion window (segments)
# - bytes_sent/received = Data transferred

# ============================================
# tcpdump - Packet capture
# ============================================

# Capture HTTP traffic on eth0
tcpdump -i eth0 -n -c 100 'port 80 or port 443'

# Capture with timing (for latency analysis)
tcpdump -i eth0 -n -tt host 10.0.2.100

# Capture for later analysis (save to file)
tcpdump -i eth0 -w capture.pcap -c 10000

# Capture only SYN packets (connection attempts)
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

# ============================================
# ping / mtr - Basic latency measurement
# ============================================

# Basic ping with timing
ping -c 10 10.0.2.100

# MTR: Combined traceroute + ping (shows each hop)
mtr -rw 10.0.2.100

# Example mtr output:
# Host              Loss%  Snt  Last   Avg  Best  Wrst  StDev
# 1. gateway.local   0.0%   10   0.5   0.4   0.3   0.7    0.1
# 2. isp-router.net  0.0%   10   8.3   8.1   7.8   8.9    0.4
# 3. backbone.net    0.0%   10  15.2  15.0  14.8  15.5    0.2
# 4. target.server   0.0%   10  18.4  18.2  18.0  18.8    0.3

# ============================================
# nethogs / iftop - Bandwidth by process/connection
# ============================================

# Network usage per process
nethogs eth0

# Network usage per connection (live)
iftop -i eth0

# ============================================
# /proc/net/snmp - Kernel network statistics
# ============================================

# TCP statistics (retransmissions, etc.)
cat /proc/net/snmp | grep Tcp:

# Extract retransmit ratio (the first Tcp: line is the header row):
awk '/^Tcp:/ {
    if (!header_seen) {
        for (i = 1; i <= NF; i++) col[$i] = i
        header_seen = 1
        next
    }
    retrans = $(col["RetransSegs"])
    outseg  = $(col["OutSegs"])
    if (outseg > 0)
        printf "Retransmit ratio: %.4f%%\n", (retrans / outseg) * 100
}' /proc/net/snmp
```

Application-Level Network Monitoring:
System tools show raw network behavior. Application-level instrumentation shows how the network impacts user experience:
```python
# Python Example: Instrumenting HTTP Client for Network Metrics

import time

import requests
from prometheus_client import Counter, Histogram

# Define metrics
HTTP_REQUEST_DURATION = Histogram(
    'http_client_request_duration_seconds',
    'HTTP request latency',
    ['method', 'host', 'status'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

HTTP_REQUEST_SIZE = Histogram(
    'http_client_request_size_bytes',
    'HTTP request body size',
    ['method', 'host'],
    buckets=[100, 1000, 10000, 100000, 1000000]
)

HTTP_CONNECTION_ERRORS = Counter(
    'http_client_connection_errors_total',
    'HTTP client connection failures',
    ['host', 'error_type']
)

# Defined for completeness; not populated by this example client.
DNS_RESOLUTION_TIME = Histogram(
    'dns_resolution_duration_seconds',
    'DNS resolution latency',
    ['host'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25]
)


class InstrumentedHTTPClient:
    """HTTP client with network-level metrics collection."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()

    def request(self, method: str, path: str, **kwargs) -> requests.Response:
        url = f"{self.base_url}{path}"
        host = self.base_url.split('//')[1].split('/')[0]

        start_time = time.perf_counter()

        try:
            response = self.session.request(method, url, **kwargs)

            # Record successful request metrics
            duration = time.perf_counter() - start_time
            HTTP_REQUEST_DURATION.labels(
                method=method,
                host=host,
                status=str(response.status_code)
            ).observe(duration)

            if kwargs.get('data') or kwargs.get('json'):
                request_size = len(kwargs.get('data', '') or str(kwargs.get('json', '')))
                HTTP_REQUEST_SIZE.labels(method=method, host=host).observe(request_size)

            return response

        except requests.exceptions.Timeout:
            HTTP_CONNECTION_ERRORS.labels(
                host=host,
                error_type='timeout'
            ).inc()
            raise

        except requests.exceptions.ConnectionError:
            # Note: requests surfaces DNS resolution failures as ConnectionError;
            # the library has no separate DNS exception type.
            HTTP_CONNECTION_ERRORS.labels(
                host=host,
                error_type='connection_error'
            ).inc()
            raise


# Usage:
# client = InstrumentedHTTPClient("https://api.example.com")
# response = client.request("GET", "/users")
#
# Resulting metrics exported to Prometheus:
# http_client_request_duration_seconds_bucket{method="GET",host="api.example.com",status="200",le="0.1"} 842
# http_client_connection_errors_total{host="api.example.com",error_type="timeout"} 3
```

System-level tools (tcpdump, ss) provide a 'black-box' network view—what's happening on the wire. Application-level instrumentation provides a 'white-box' view—how the network affects application behavior. Both perspectives are necessary for complete understanding.
Network latency is often the largest contributor to request latency in distributed systems. Understanding latency components enables targeted optimization.
Latency Components:
A simple HTTP request involves multiple latency sources:
```text
HTTP Request Latency Breakdown
==============================

  Client                                               Server
    |                                                    |
    |-- DNS Resolution (10-100ms, cached: 0ms) --------->|
    |                                                    |
    |-- TCP Handshake (1.5 × RTT) ---------------------->|
    |                                                    |
    |-- TLS Handshake (1-2 × RTT for TLS 1.2) ---------->|
    |   (0-1 × RTT for TLS 1.3 or session resume)        |
    |                                                    |
    |-- Request Transmission (size ÷ bandwidth) -------->|
    |                                                    |
    |              [Server Processing Time]              |
    |                                                    |
    |<----- Response First Byte (1 × RTT) ---------------|
    |                                                    |
    |<----- Response Transmission (size ÷ bandwidth) ----|
    |                                                    |

Example Breakdown (cross-region request to US-East from EU-West):

Component               Time (ms)   Notes
-----------             ---------   -----
DNS Resolution          0           Cached
TCP Handshake           120         RTT = 80ms, so 1.5 × 80 = 120ms
TLS Handshake           80          TLS 1.3 with session ticket
Request Transmission    2           1KB request ÷ ~500KB/s available
Server Processing       50          Application + database
Response First Byte     80          One RTT
Response Transmission   20          10KB response ÷ ~500KB/s
-----------             ---------
TOTAL                   352ms

Optimization Opportunities:
- Connection reuse eliminates TCP + TLS handshakes for subsequent requests
- Edge deployment reduces RTT (EU edge server: RTT = 10ms → huge savings)
- Response compression reduces transmission time
- Server-side optimization reduces processing time
```

Analyzing Latency Distribution:
Latency is rarely uniform. Analyzing distributions, not just averages, reveals problems that a single mean conceals.
The Tail Latency Problem:
In systems making multiple parallel requests, tail latency dominates user experience. If a page requires 10 backend calls and each has 1% chance of 1-second latency, the page will be slow ~10% of the time.
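The arithmetic behind that claim, as a quick sketch (the fan-out values are illustrative):

```python
# Probability that at least one of N parallel backend calls is slow,
# given each call independently has probability p of hitting tail latency.
def p_any_slow(p: float, fanout: int) -> float:
    return 1 - (1 - p) ** fanout


print(f"fan-out 10:  {p_any_slow(0.01, 10):.1%}")    # ~9.6% of pages are slow
print(f"fan-out 100: {p_any_slow(0.01, 100):.1%}")   # ~63.4% -- the tail dominates
```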
```python
# Latency Analysis with Histograms

import numpy as np
from scipy import stats


class LatencyAnalyzer:
    """Analyze latency samples to understand distribution patterns."""

    def __init__(self, samples: list[float]):
        self.samples = np.array(samples)

    def summary(self) -> dict:
        """Compute comprehensive latency statistics."""
        return {
            'count': len(self.samples),
            'min': np.min(self.samples),
            'max': np.max(self.samples),
            'mean': np.mean(self.samples),
            'median': np.median(self.samples),
            'std_dev': np.std(self.samples),
            'p50': np.percentile(self.samples, 50),
            'p75': np.percentile(self.samples, 75),
            'p90': np.percentile(self.samples, 90),
            'p95': np.percentile(self.samples, 95),
            'p99': np.percentile(self.samples, 99),
            'p999': np.percentile(self.samples, 99.9),
        }

    def detect_bimodal(self) -> bool:
        """
        Detect if latency distribution is bimodal (two distinct modes).
        This often indicates cache hit/miss or different code paths.
        """
        # Use kernel density estimation
        kde = stats.gaussian_kde(self.samples)
        x = np.linspace(np.min(self.samples), np.max(self.samples), 1000)
        density = kde(x)

        # Find local maxima (peaks)
        peaks = []
        for i in range(1, len(density) - 1):
            if density[i] > density[i-1] and density[i] > density[i+1]:
                peaks.append(x[i])

        return len(peaks) >= 2

    def tail_ratio(self) -> float:
        """
        Compute ratio of P99 to P50.
        High ratio indicates significant tail latency.
        """
        p50 = np.percentile(self.samples, 50)
        p99 = np.percentile(self.samples, 99)
        return p99 / p50 if p50 > 0 else float('inf')


# Example usage and interpretation:
#
# samples = collect_latency_samples('api/users', duration='5m')
# analyzer = LatencyAnalyzer(samples)
#
# stats = analyzer.summary()
# print(f"P50: {stats['p50']:.2f}ms")
# print(f"P99: {stats['p99']:.2f}ms")
# print(f"Tail ratio: {analyzer.tail_ratio():.1f}x")
#
# Example output:
# P50: 45.23ms
# P99: 892.45ms
# Tail ratio: 19.7x
#
# INTERPRETATION:
# - P99 is 20x P50 → significant tail latency problem
# - 1% of users experience ~1 second latency vs typical ~50ms
# - Investigate: cache misses, GC pauses, lock contention,
#   or network variability affecting 1% of requests
```

Average latency hides tail latency. A service with 90% of requests at 10ms and 10% at 1000ms has an 'average' of 109ms, which represents nobody's actual experience. Always report percentiles (P50, P95, P99) for meaningful latency metrics.
Connections are expensive. Each TCP connection requires a three-way handshake, consumes memory (kernel socket buffers), and may require TLS negotiation. Connection tracking monitors connection behavior to optimize resource usage.
Why Connection Pooling Matters:
Without pooling, each request incurs connection establishment overhead: a DNS lookup (if not cached), a TCP three-way handshake (roughly 1.5 × RTT), and a TLS handshake (1-2 × RTT for TLS 1.2).
With an RTT of 80ms (cross-region), this adds 200-280ms to every request. Connection pooling amortizes this cost across many requests.
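As a quick illustration of the effect, here is a minimal sketch using `requests.Session`, which reuses connections via urllib3's keep-alive pooling; the endpoint URL is a placeholder, and the absolute timings depend on RTT and TLS version:

```python
# Comparing per-request connections vs. a pooled Session.
# The URL is a placeholder; numbers will vary with network conditions.
import time

import requests

URL = "https://api.example.com/health"   # placeholder endpoint


def timed_get(get_func) -> float:
    start = time.perf_counter()
    get_func(URL, timeout=5)
    return time.perf_counter() - start


# No reuse: requests.get() builds (and tears down) a connection each time.
no_reuse = [timed_get(requests.get) for _ in range(3)]

# Pooled: the Session keeps the TCP/TLS connection open between calls,
# so only the first request pays the handshake cost.
with requests.Session() as session:
    pooled = [timed_get(session.get) for _ in range(3)]

print("no reuse:", [f"{t * 1000:.0f}ms" for t in no_reuse])
print("pooled:  ", [f"{t * 1000:.0f}ms" for t in pooled])
```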
```python
# Connection Pool Monitoring

import threading
import time

from prometheus_client import Counter, Gauge, Histogram

# PoolExhaustedError is assumed to be imported from the underlying pool library.


class MonitoredConnectionPool:
    """
    Connection pool with comprehensive monitoring.
    Wraps actual pool implementation with metrics.
    """

    # Prometheus metrics
    POOL_SIZE = Gauge(
        'connection_pool_size',
        'Current connections in pool',
        ['pool_name', 'state']  # state: idle, active
    )

    POOL_WAIT_TIME = Histogram(
        'connection_pool_wait_seconds',
        'Time waiting to acquire connection',
        ['pool_name'],
        buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
    )

    POOL_EXHAUSTION = Counter(
        'connection_pool_exhausted_total',
        'Times pool was exhausted (no connections available)',
        ['pool_name']
    )

    CONNECTION_LIFETIME = Histogram(
        'connection_lifetime_seconds',
        'How long connections are used before return',
        ['pool_name'],
        buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 30.0, 60.0, 300.0]
    )

    def __init__(self, name: str, pool):
        # `pool` is the wrapped implementation; it is expected to expose
        # acquire(timeout), release(conn), idle_count(), and to raise
        # PoolExhaustedError when no connection is available.
        self.name = name
        self.pool = pool
        self._lock = threading.Lock()
        self._active_connections = {}

    def acquire(self):
        """Acquire a connection from the pool, record metrics."""
        start = time.perf_counter()

        try:
            conn = self.pool.acquire(timeout=30)
            wait_time = time.perf_counter() - start
            self.POOL_WAIT_TIME.labels(pool_name=self.name).observe(wait_time)

            # Track when this connection was acquired
            with self._lock:
                self._active_connections[id(conn)] = time.perf_counter()

            self._update_size_metrics()
            return conn

        except PoolExhaustedError:
            self.POOL_EXHAUSTION.labels(pool_name=self.name).inc()
            raise

    def release(self, conn):
        """Return a connection to the pool, record metrics."""
        with self._lock:
            acquired_at = self._active_connections.pop(id(conn), None)
        if acquired_at is not None:
            lifetime = time.perf_counter() - acquired_at
            self.CONNECTION_LIFETIME.labels(pool_name=self.name).observe(lifetime)

        self.pool.release(conn)
        self._update_size_metrics()

    def _update_size_metrics(self):
        """Update pool size gauges."""
        self.POOL_SIZE.labels(pool_name=self.name, state='active').set(
            len(self._active_connections)
        )
        self.POOL_SIZE.labels(pool_name=self.name, state='idle').set(
            self.pool.idle_count()
        )


# Alerting thresholds:
#
# 1. Pool Wait Time > 50ms (P95)
#    → Pool may be too small for load
#
# 2. Pool Exhaustion > 0 in 5 minutes
#    → Pool size definitely too small
#
# 3. Connection Lifetime > 5 seconds (P50)
#    → Connections held too long (missing releases? slow operations?)
#
# 4. Active/Total > 90% sustained
#    → Pool near saturation, scale up pool size
```

Connection Pool Sizing:
Pool sizing balances efficiency against resource consumption:
| Factor | Implication |
|---|---|
| Pool too small | Requests wait for connections; latency spikes; pool exhaustion errors |
| Pool too large | Wasted memory; database connection limits; more TIME_WAIT states |
| Minimum connections | Keeps 'warm' connections ready; costs idle resources |
| Maximum connections | Caps resource usage; must handle exhaustion gracefully |
| Idle timeout | Closes unused connections; balances freshness vs reconnection cost |
| Max lifetime | Forces reconnection to rebalance load; prevents stale connections |
A reasonable starting point: Pool Size = (RPS × Average Connection Hold Time) + Headroom. For example, 100 requests/second with 50ms average hold time = 5 concurrent connections needed. Add 2-3x headroom for variance. Monitor and adjust based on actual wait times and utilization.
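That rule of thumb translates directly into a small helper; a minimal sketch where the headroom factor is a judgment call, not a fixed constant:

```python
import math


def recommended_pool_size(requests_per_second: float,
                          avg_hold_time_seconds: float,
                          headroom_factor: float = 2.5) -> int:
    """Little's-law style estimate: concurrent connections = arrival rate x
    hold time, padded with headroom for traffic variance."""
    concurrent = requests_per_second * avg_hold_time_seconds
    return math.ceil(concurrent * headroom_factor)


# The example from the text: 100 RPS with a 50 ms average hold time
# needs ~5 concurrent connections before headroom.
print(recommended_pool_size(100, 0.050))        # -> 13 with 2.5x headroom
print(recommended_pool_size(100, 0.050, 3.0))   # -> 15 with 3x headroom
```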
While latency concerns dominate most distributed system discussions, bandwidth constraints become critical for data-intensive operations: bulk data transfers, database replications, backup operations, and media streaming.
Understanding Bandwidth vs. Throughput:
Bandwidth is the theoretical capacity of a link; throughput is the data rate actually achieved. The gap between the two reveals network efficiency or problems.
```bash
#!/bin/bash
# Bandwidth Monitoring Commands and Metrics

# ===========================================
# Real-time bandwidth monitoring
# ===========================================

# iftop: Live bandwidth by connection
iftop -i eth0 -nNP

# nload: Simple bandwidth graph
nload eth0

# bmon: Bandwidth monitor with graphs
bmon -p eth0

# ===========================================
# Interface statistics from kernel
# ===========================================

# Current interface statistics
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/tx_bytes

# Calculate throughput over an interval:
INTERVAL=5
RX1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
TX1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
sleep $INTERVAL
RX2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
TX2=$(cat /sys/class/net/eth0/statistics/tx_bytes)

RX_RATE=$(( (RX2 - RX1) / INTERVAL ))
TX_RATE=$(( (TX2 - TX1) / INTERVAL ))

echo "Receive:  $(numfmt --to=iec-i --suffix=B/s $RX_RATE)"
echo "Transmit: $(numfmt --to=iec-i --suffix=B/s $TX_RATE)"

# ===========================================
# Detailed interface statistics
# ===========================================

# ethtool for driver-level statistics
ethtool -S eth0 | grep -E "(rx_|tx_|drop|error)"

# Example output (important metrics):
# rx_packets: 234567890
# tx_packets: 123456789
# rx_bytes: 345678901234
# tx_bytes: 234567890123
# rx_dropped: 0          # <-- Should be zero
# tx_dropped: 0
# rx_errors: 0           # <-- Should be zero
# tx_errors: 0
# rx_over_errors: 0      # <-- Ring buffer overflow
# tx_carrier_errors: 0   # <-- Link problems

# ===========================================
# Prometheus node_exporter provides these automatically:
# ===========================================
# node_network_receive_bytes_total{device="eth0"}
# node_network_transmit_bytes_total{device="eth0"}
# node_network_receive_drop_total{device="eth0"}
# node_network_transmit_drop_total{device="eth0"}
# node_network_receive_errors_total{device="eth0"}
```

Bandwidth Saturation Symptoms:
Recognizing bandwidth saturation early prevents cascading problems. Typical symptoms are rising interface drop counters, climbing retransmission rates, and latency that increases with traffic volume even while CPU and memory look healthy.
Cloud instances have bandwidth limits tied to instance size. An AWS t3.micro can burst up to ~5 Gbps but sustains a far lower baseline; an m5.24xlarge gets 25 Gbps. These limits are often not documented precisely. Monitor instance-level network metrics and right-size for bandwidth needs, not just CPU/memory.
Effective network monitoring combines metrics into dashboards that reveal issues at a glance. A well-designed network dashboard answers a handful of questions: Is the network healthy right now? Where is latency coming from, and is it getting worse? Are we approaching bandwidth or connection pool limits? Are errors and drops trending upward?
```yaml
# Grafana Dashboard Configuration (Prometheus Data Source)
# Key panels for comprehensive network monitoring

# ====================================================
# Row 1: Network Health Overview
# ====================================================

# Panel: Connection Success Rate (Gauge)
- title: "Connection Success Rate"
  type: gauge
  query: |
    1 - (
      rate(tcp_connection_errors_total[5m])
      /
      rate(tcp_connection_attempts_total[5m])
    )
  thresholds:            # higher is better
    - value: 0           # < 95%
      color: red
    - value: 0.95        # 95-99%
      color: yellow
    - value: 0.99        # >= 99%
      color: green

# Panel: Current Active Connections (Stat)
- title: "Active Connections"
  type: stat
  query: sum(node_netstat_Tcp_CurrEstab)

# Panel: Retransmission Rate (Gauge)
- title: "TCP Retransmit Rate"
  type: gauge
  query: |
    rate(node_netstat_Tcp_RetransSegs[5m])
    /
    rate(node_netstat_Tcp_OutSegs[5m])
  thresholds:
    - value: 0           # < 1%
      color: green
    - value: 0.01        # 1-3%
      color: yellow
    - value: 0.03        # > 3%
      color: red

# ====================================================
# Row 2: Latency Analysis
# ====================================================

# Panel: Request Latency Heatmap
- title: "Request Latency Distribution"
  type: heatmap
  query: |
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
  yAxis:
    format: seconds

# Panel: Latency Percentiles
- title: "Latency Percentiles (P50, P95, P99)"
  type: timeseries
  queries:
    - query: histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P50"
    - query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P95"
    - query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P99"

# ====================================================
# Row 3: Throughput and Bandwidth
# ====================================================

# Panel: Network Traffic (Bytes/sec)
- title: "Network Traffic"
  type: timeseries
  queries:
    - query: rate(node_network_receive_bytes_total{device="eth0"}[5m])
      legend: "Inbound"
    - query: rate(node_network_transmit_bytes_total{device="eth0"}[5m])
      legend: "Outbound"
  unit: bytes/sec

# Panel: Bandwidth Utilization %
- title: "Bandwidth Utilization"
  type: gauge
  query: |
    rate(node_network_transmit_bytes_total{device="eth0"}[5m])
    /
    node_network_speed_bytes{device="eth0"}
  thresholds:
    - value: 0           # < 60%
      color: green
    - value: 0.60        # 60-80%
      color: yellow
    - value: 0.80        # > 80%
      color: red

# ====================================================
# Row 4: Connection Pool Health
# ====================================================

# Panel: Pool Utilization by Service
- title: "Connection Pool Utilization"
  type: timeseries
  query: |
    connection_pool_size{state="active"}
    /
    (connection_pool_size{state="active"} + connection_pool_size{state="idle"})
  legend: "{{pool_name}}"

# Panel: Pool Wait Time P95
- title: "Pool Wait Time (P95)"
  type: timeseries
  query: |
    histogram_quantile(0.95, rate(connection_pool_wait_seconds_bucket[5m]))
  legend: "{{pool_name}}"

# ====================================================
# Row 5: Errors and Anomalies
# ====================================================

# Panel: Network Errors
- title: "Network Errors (Rate)"
  type: timeseries
  queries:
    - query: rate(node_network_receive_errors_total[5m])
      legend: "RX Errors"
    - query: rate(node_network_transmit_errors_total[5m])
      legend: "TX Errors"
    - query: rate(node_network_receive_drop_total[5m])
      legend: "RX Drops"

# Panel: Connection Failures by Type
- title: "Connection Failures"
  type: timeseries
  query: |
    rate(http_client_connection_errors_total[5m])
  legend: "{{host}} - {{error_type}}"
```

The most valuable feature of a network dashboard is correlating metrics with events. Add deployment markers, incident annotations, and traffic overlays. When latency spikes, you should immediately see whether it correlates with a deploy or a traffic surge, or is isolated to specific services.
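One way to get deployment markers onto panels is Grafana's annotations HTTP API. Here is a minimal sketch, assuming a Grafana instance at `GRAFANA_URL` and a service-account token (both placeholders); a deploy pipeline would call this at release time:

```python
# Post a deployment marker to Grafana's annotations HTTP API so it appears
# on dashboard panels. GRAFANA_URL and API_TOKEN are placeholders.
import time

import requests

GRAFANA_URL = "https://grafana.example.com"           # placeholder
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"      # placeholder


def annotate_deploy(service: str, version: str) -> None:
    """Create a tagged annotation at the current time."""
    payload = {
        "time": int(time.time() * 1000),   # epoch milliseconds
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()


# Example (hypothetical service and version):
# annotate_deploy("checkout-service", "v2.14.0")
```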
We've explored comprehensive network monitoring—from TCP fundamentals to production dashboards used by principal engineers.
What's Next:
We've covered application profiling, database analysis, and network monitoring. The next page explores continuous performance testing—ensuring performance remains excellent across every code change.
You now understand network monitoring at the depth required for production systems. The network is no longer an opaque 'it just works' assumption—it's an observable, measurable component of your distributed system.