In distributed systems, the network is the connective tissue that binds everything together. Every microservice call, every database query, every cache lookup, every message queue interaction—all traverse the network. Yet the network is often the least understood component of system performance.
Network monitoring is the discipline of observing, measuring, and analyzing network behavior to understand latency, detect anomalies, and diagnose connectivity issues. It transforms the network from an opaque "it just works" assumption into a well-understood, observable component of your system.
When an application is slow, the network is often blamed first but investigated last. Network monitoring provides the evidence to either confirm network issues or eliminate them from investigation.
By the end of this page, you will understand network fundamentals relevant to monitoring, key network metrics, latency analysis techniques, connection state tracking, and bandwidth monitoring. You'll learn to diagnose network issues systematically using the same approaches employed by principal engineers and network specialists.
Effective network monitoring requires understanding what you're measuring. Let's establish the foundational concepts that inform monitoring strategies.
The OSI/TCP-IP Layers That Matter:
While the 7-layer OSI model is academically complete, practical network monitoring focuses on specific layers:
| Layer | Protocol Examples | What to Monitor | Common Issues |
|---|---|---|---|
| Layer 7 (Application) | HTTP, gRPC, DNS | Request latency, error rates, throughput | Slow responses, 5xx errors, timeouts |
| Layer 4 (Transport) | TCP, UDP | Connection states, retransmissions, RTT | Connection failures, high retransmit rate |
| Layer 3 (Network) | IP, ICMP | Packet loss, routing, latency | Unreachable hosts, asymmetric routing |
| Layer 2 (Data Link) | Ethernet, ARP | ARP cache, MAC issues | Usually managed by cloud/infra providers |
TCP Connection Lifecycle:
Most distributed system communication uses TCP. Understanding the TCP lifecycle is essential for interpreting connection metrics:
Each phase can introduce latency and each has distinct failure modes.
```text
TCP Connection State Diagram (Simplified)
=========================================

  Client                           Server
    |                                |
    |------------ SYN ------------->|   Client: SYN_SENT
    |                                |   Server: SYN_RECEIVED
    |<--------- SYN-ACK ------------|
    |                                |
    |------------ ACK ------------->|   Both: ESTABLISHED
    |                                |
    |<========== DATA =============>|   Normal data transfer
    |                                |
    |------------ FIN ------------->|   Client: FIN_WAIT_1
    |                                |   Server: CLOSE_WAIT
    |<----------- ACK --------------|   Client: FIN_WAIT_2
    |                                |
    |<----------- FIN --------------|   Server: LAST_ACK
    |                                |
    |------------ ACK ------------->|   Client: TIME_WAIT (waits 2*MSL)
    |                                |   Server: CLOSED
    |                                |
    | (after timeout)                |
    |                                |   Client: CLOSED

Key States to Monitor:
----------------------
- ESTABLISHED: Active connections (expected state during use)
- TIME_WAIT: Recently closed connections, waiting for late packets
  High count = rapid connection cycling (connection pooling helps)
- CLOSE_WAIT: Server received FIN but hasn't closed (application hung?)
  High count = potential application bug or resource leak
- SYN_SENT/SYN_RECEIVED: Handshake in progress
  Accumulation = network issues or firewall problems

Monitoring Query (Linux):
$ ss -s        # Summary of socket states
$ netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn
# Example output:
#   1523 ESTABLISHED
#    892 TIME_WAIT
#     45 CLOSE_WAIT   <-- Investigate if this grows
#     12 SYN_SENT
```

Connections in TIME_WAIT consume local ports for typically 60-120 seconds. High-throughput systems making many short-lived connections can exhaust available ports, causing connection failures. Solutions: connection pooling, SO_REUSEADDR, or reducing MSL (with caution).
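For a programmatic view of the same state counts (for example, to export them to a metrics system), here is a minimal sketch using the third-party `psutil` package; the library choice and the 100-connection threshold are assumptions for illustration, not part of the tooling above:

```python
# Minimal sketch: count TCP connection states with psutil (third-party package,
# an assumption -- any source of socket states would do). May need elevated
# privileges on some platforms to see sockets owned by other users.
from collections import Counter

import psutil


def tcp_state_counts() -> Counter:
    """Return a Counter mapping TCP state names to connection counts."""
    return Counter(c.status for c in psutil.net_connections(kind="tcp"))


if __name__ == "__main__":
    counts = tcp_state_counts()
    for state, count in counts.most_common():
        print(f"{state:<15} {count}")

    # A growing CLOSE_WAIT count usually means the application received FIN
    # but never closed its side of the socket. The threshold is illustrative.
    if counts.get(psutil.CONN_CLOSE_WAIT, 0) > 100:
        print("WARNING: high CLOSE_WAIT count -- check for unclosed sockets")
```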
Effective network monitoring requires tracking the right metrics. The following represent the essential measurements for distributed system health:
Latency Metrics: round-trip time (RTT), RTT variance (jitter), and request latency percentiles (P50/P95/P99) between services.
Reliability Metrics: packet loss, TCP retransmission rate, and connection failure rate.
Throughput Metrics: bytes sent and received per second, and bandwidth utilization as a share of link capacity.
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Packet Loss | < 0.01% | 0.01% - 0.1% | > 0.1% |
| Retransmission Rate | < 1% | 1% - 3% | > 3% |
| RTT Variance | < 10ms jitter | 10-50ms jitter | > 50ms jitter |
| Bandwidth Utilization | < 60% | 60% - 80% | > 80% |
| Connection Failure Rate | < 0.1% | 0.1% - 1% | > 1% |
The Bandwidth-Delay Product (BDP) describes how much data can be 'in flight' on a network path. High-latency links (e.g., cross-continent) require larger TCP receive windows to saturate bandwidth. Monitor both latency and bandwidth together to understand true capacity.
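To make the BDP concrete, here is a small worked example in Python; the 1 Gbps / 80 ms figures are illustrative, not taken from the text above:

```python
# Bandwidth-Delay Product: how much data can be "in flight" on a path.
# BDP (bytes) = bandwidth (bytes/second) x round-trip time (seconds)

def bdp_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return (bandwidth_bits_per_sec / 8) * rtt_seconds


# Illustrative cross-continent link: 1 Gbps with an 80 ms RTT.
bdp = bdp_bytes(1_000_000_000, 0.080)
print(f"BDP: {bdp / 1_000_000:.0f} MB in flight")   # -> 10 MB

# To keep this link saturated, the TCP receive window must be at least the
# BDP (~10 MB here); default windows are often much smaller, which is why
# high-latency links frequently underutilize their nominal bandwidth.
```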
Effective network monitoring combines multiple tools, each providing different perspectives on network behavior.
System-Level Monitoring:
```bash
#!/bin/bash
# Essential Network Monitoring Commands

# ============================================
# ss (Socket Statistics) - Modern netstat replacement
# ============================================

# Connection summary by state
ss -s

# TCP connections with timing details
ss -ti

# Connections to specific port
ss -tn state established '( dport = :443 )'

# Show process owning connection
ss -tnp | grep ESTAB

# Output example:
# ESTAB 0 0 10.0.1.45:52341 10.0.2.100:5432
#   cubic wscale:7,7 rto:204 rtt:3.5/1.3 ato:40 mss:1448
#   pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:15234
#   bytes_acked:15234 bytes_received:45678 segs_out:102 segs_in:98
#
# Key values:
# - rtt:3.5/1.3 = RTT and variance in ms
# - cwnd:10 = Congestion window (segments)
# - bytes_sent/received = Data transferred

# ============================================
# tcpdump - Packet capture
# ============================================

# Capture HTTP traffic on eth0
tcpdump -i eth0 -n -c 100 'port 80 or port 443'

# Capture with timing (for latency analysis)
tcpdump -i eth0 -n -tt host 10.0.2.100

# Capture for later analysis (save to file)
tcpdump -i eth0 -w capture.pcap -c 10000

# Capture only SYN packets (connection attempts)
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'

# ============================================
# ping / mtr - Basic latency measurement
# ============================================

# Basic ping with timing
ping -c 10 10.0.2.100

# MTR: Combined traceroute + ping (shows each hop)
mtr -rw 10.0.2.100

# Example mtr output:
# Host              Loss%  Snt  Last   Avg  Best  Wrst  StDev
# 1. gateway.local   0.0%   10   0.5   0.4   0.3   0.7    0.1
# 2. isp-router.net  0.0%   10   8.3   8.1   7.8   8.9    0.4
# 3. backbone.net    0.0%   10  15.2  15.0  14.8  15.5    0.2
# 4. target.server   0.0%   10  18.4  18.2  18.0  18.8    0.3

# ============================================
# nethogs / iftop - Bandwidth by process/connection
# ============================================

# Network usage per process
nethogs eth0

# Network usage per connection (live)
iftop -i eth0

# ============================================
# /proc/net/snmp - Kernel network statistics
# ============================================

# TCP statistics (retransmissions, etc.)
cat /proc/net/snmp | grep Tcp:

# Extract retransmit ratio (the first Tcp: line is the header row):
awk '/^Tcp:/ {
    if (!header_seen) {
        for (i = 1; i <= NF; i++) col[$i] = i
        header_seen = 1
        next
    }
    retrans = $(col["RetransSegs"])
    outseg  = $(col["OutSegs"])
    if (outseg > 0)
        printf "Retransmit ratio: %.4f%%\n", (retrans / outseg) * 100
}' /proc/net/snmp
```

Application-Level Network Monitoring:
System tools show raw network behavior. Application-level instrumentation shows how the network impacts user experience:
```python
# Python Example: Instrumenting HTTP Client for Network Metrics

import time

import requests
from prometheus_client import Counter, Histogram

# Define metrics
HTTP_REQUEST_DURATION = Histogram(
    'http_client_request_duration_seconds',
    'HTTP request latency',
    ['method', 'host', 'status'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

HTTP_REQUEST_SIZE = Histogram(
    'http_client_request_size_bytes',
    'HTTP request body size',
    ['method', 'host'],
    buckets=[100, 1000, 10000, 100000, 1000000]
)

HTTP_CONNECTION_ERRORS = Counter(
    'http_client_connection_errors_total',
    'HTTP client connection failures',
    ['host', 'error_type']
)

# Defined for completeness; not populated by this example client.
DNS_RESOLUTION_TIME = Histogram(
    'dns_resolution_duration_seconds',
    'DNS resolution latency',
    ['host'],
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25]
)


class InstrumentedHTTPClient:
    """HTTP client with network-level metrics collection."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()

    def request(self, method: str, path: str, **kwargs) -> requests.Response:
        url = f"{self.base_url}{path}"
        host = self.base_url.split('//')[1].split('/')[0]

        start_time = time.perf_counter()

        try:
            response = self.session.request(method, url, **kwargs)

            # Record successful request metrics
            duration = time.perf_counter() - start_time
            HTTP_REQUEST_DURATION.labels(
                method=method,
                host=host,
                status=str(response.status_code)
            ).observe(duration)

            if kwargs.get('data') or kwargs.get('json'):
                request_size = len(kwargs.get('data', '') or str(kwargs.get('json', '')))
                HTTP_REQUEST_SIZE.labels(method=method, host=host).observe(request_size)

            return response

        except requests.exceptions.Timeout:
            HTTP_CONNECTION_ERRORS.labels(
                host=host,
                error_type='timeout'
            ).inc()
            raise

        except requests.exceptions.ConnectionError:
            # Note: requests surfaces DNS resolution failures as ConnectionError;
            # the library has no separate DNS exception type.
            HTTP_CONNECTION_ERRORS.labels(
                host=host,
                error_type='connection_error'
            ).inc()
            raise


# Usage:
# client = InstrumentedHTTPClient("https://api.example.com")
# response = client.request("GET", "/users")
#
# Resulting metrics exported to Prometheus:
# http_client_request_duration_seconds_bucket{method="GET",host="api.example.com",status="200",le="0.1"} 842
# http_client_connection_errors_total{host="api.example.com",error_type="timeout"} 3
```

System-level tools (tcpdump, ss) provide a 'black-box' network view—what's happening on the wire. Application-level instrumentation provides a 'white-box' view—how the network affects application behavior. Both perspectives are necessary for complete understanding.
Network latency is often the largest contributor to request latency in distributed systems. Understanding latency components enables targeted optimization.
Latency Components:
A simple HTTP request involves multiple latency sources:
```text
HTTP Request Latency Breakdown
==============================

  Client                                               Server
    |                                                    |
    |-- DNS Resolution (10-100ms, cached: 0ms) --------->|
    |                                                    |
    |-- TCP Handshake (1.5 × RTT) ---------------------->|
    |                                                    |
    |-- TLS Handshake (1-2 × RTT for TLS 1.2) ---------->|
    |   (0-1 × RTT for TLS 1.3 or session resume)        |
    |                                                    |
    |-- Request Transmission (size ÷ bandwidth) -------->|
    |                                                    |
    |              [Server Processing Time]              |
    |                                                    |
    |<----- Response First Byte (1 × RTT) ---------------|
    |                                                    |
    |<----- Response Transmission (size ÷ bandwidth) ----|
    |                                                    |

Example Breakdown (cross-region request to US-East from EU-West):

Component               Time (ms)   Notes
-----------             ---------   -----
DNS Resolution          0           Cached
TCP Handshake           120         RTT = 80ms, so 1.5 × 80 = 120ms
TLS Handshake           80          TLS 1.3 with session ticket
Request Transmission    2           1KB request ÷ ~500KB/s available
Server Processing       50          Application + database
Response First Byte     80          One RTT
Response Transmission   20          10KB response ÷ ~500KB/s
-----------             ---------
TOTAL                   352ms

Optimization Opportunities:
- Connection reuse eliminates TCP + TLS handshakes for subsequent requests
- Edge deployment reduces RTT (EU edge server: RTT = 10ms → huge savings)
- Response compression reduces transmission time
- Server-side optimization reduces processing time
```

Analyzing Latency Distribution:
Latency is rarely uniform. Analyzing distributions, not just averages, reveals problems that a single mean conceals.
The Tail Latency Problem:
In systems making multiple parallel requests, tail latency dominates user experience. If a page requires 10 backend calls and each has 1% chance of 1-second latency, the page will be slow ~10% of the time.
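The arithmetic behind that claim, as a quick sketch (the fan-out values are illustrative):

```python
# Probability that at least one of N parallel backend calls is slow,
# given each call independently has probability p of hitting tail latency.
def p_any_slow(p: float, fanout: int) -> float:
    return 1 - (1 - p) ** fanout


print(f"fan-out 10:  {p_any_slow(0.01, 10):.1%}")    # ~9.6% of pages are slow
print(f"fan-out 100: {p_any_slow(0.01, 100):.1%}")   # ~63.4% -- the tail dominates
```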
```python
# Latency Analysis with Histograms

import numpy as np
from scipy import stats


class LatencyAnalyzer:
    """Analyze latency samples to understand distribution patterns."""

    def __init__(self, samples: list[float]):
        self.samples = np.array(samples)

    def summary(self) -> dict:
        """Compute comprehensive latency statistics."""
        return {
            'count': len(self.samples),
            'min': np.min(self.samples),
            'max': np.max(self.samples),
            'mean': np.mean(self.samples),
            'median': np.median(self.samples),
            'std_dev': np.std(self.samples),
            'p50': np.percentile(self.samples, 50),
            'p75': np.percentile(self.samples, 75),
            'p90': np.percentile(self.samples, 90),
            'p95': np.percentile(self.samples, 95),
            'p99': np.percentile(self.samples, 99),
            'p999': np.percentile(self.samples, 99.9),
        }

    def detect_bimodal(self) -> bool:
        """
        Detect if latency distribution is bimodal (two distinct modes).
        This often indicates cache hit/miss or different code paths.
        """
        # Use kernel density estimation
        kde = stats.gaussian_kde(self.samples)
        x = np.linspace(np.min(self.samples), np.max(self.samples), 1000)
        density = kde(x)

        # Find local maxima (peaks)
        peaks = []
        for i in range(1, len(density) - 1):
            if density[i] > density[i-1] and density[i] > density[i+1]:
                peaks.append(x[i])

        return len(peaks) >= 2

    def tail_ratio(self) -> float:
        """
        Compute ratio of P99 to P50.
        High ratio indicates significant tail latency.
        """
        p50 = np.percentile(self.samples, 50)
        p99 = np.percentile(self.samples, 99)
        return p99 / p50 if p50 > 0 else float('inf')


# Example usage and interpretation:
#
# samples = collect_latency_samples('api/users', duration='5m')
# analyzer = LatencyAnalyzer(samples)
#
# stats = analyzer.summary()
# print(f"P50: {stats['p50']:.2f}ms")
# print(f"P99: {stats['p99']:.2f}ms")
# print(f"Tail ratio: {analyzer.tail_ratio():.1f}x")
#
# Example output:
# P50: 45.23ms
# P99: 892.45ms
# Tail ratio: 19.7x
#
# INTERPRETATION:
# - P99 is 20x P50 → significant tail latency problem
# - 1% of users experience ~1 second latency vs typical ~50ms
# - Investigate: cache misses, GC pauses, lock contention,
#   or network variability affecting 1% of requests
```

Average latency hides tail latency. A service with 90% of requests at 10ms and 10% at 1000ms has an 'average' of 109ms, which represents nobody's actual experience. Always report percentiles (P50, P95, P99) for meaningful latency metrics.
Connections are expensive. Each TCP connection requires a three-way handshake, consumes memory (kernel socket buffers), and may require TLS negotiation. Connection tracking monitors connection behavior to optimize resource usage.
Why Connection Pooling Matters:
Without pooling, each request incurs connection establishment overhead: a DNS lookup (if not cached), a TCP three-way handshake (roughly 1.5 × RTT), and a TLS handshake (1-2 × RTT for TLS 1.2).
With an RTT of 80ms (cross-region), this adds 200-280ms to every request. Connection pooling amortizes this cost across many requests.
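As a quick illustration of the effect, here is a minimal sketch using `requests.Session`, which reuses connections via urllib3's keep-alive pooling; the endpoint URL is a placeholder, and the absolute timings depend on RTT and TLS version:

```python
# Comparing per-request connections vs. a pooled Session.
# The URL is a placeholder; numbers will vary with network conditions.
import time

import requests

URL = "https://api.example.com/health"   # placeholder endpoint


def timed_get(get_func) -> float:
    start = time.perf_counter()
    get_func(URL, timeout=5)
    return time.perf_counter() - start


# No reuse: requests.get() builds (and tears down) a connection each time.
no_reuse = [timed_get(requests.get) for _ in range(3)]

# Pooled: the Session keeps the TCP/TLS connection open between calls,
# so only the first request pays the handshake cost.
with requests.Session() as session:
    pooled = [timed_get(session.get) for _ in range(3)]

print("no reuse:", [f"{t * 1000:.0f}ms" for t in no_reuse])
print("pooled:  ", [f"{t * 1000:.0f}ms" for t in pooled])
```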
```python
# Connection Pool Monitoring

import threading
import time

from prometheus_client import Counter, Gauge, Histogram

# PoolExhaustedError is assumed to be imported from the underlying pool library.


class MonitoredConnectionPool:
    """
    Connection pool with comprehensive monitoring.
    Wraps actual pool implementation with metrics.
    """

    # Prometheus metrics
    POOL_SIZE = Gauge(
        'connection_pool_size',
        'Current connections in pool',
        ['pool_name', 'state']  # state: idle, active
    )

    POOL_WAIT_TIME = Histogram(
        'connection_pool_wait_seconds',
        'Time waiting to acquire connection',
        ['pool_name'],
        buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
    )

    POOL_EXHAUSTION = Counter(
        'connection_pool_exhausted_total',
        'Times pool was exhausted (no connections available)',
        ['pool_name']
    )

    CONNECTION_LIFETIME = Histogram(
        'connection_lifetime_seconds',
        'How long connections are used before return',
        ['pool_name'],
        buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 30.0, 60.0, 300.0]
    )

    def __init__(self, name: str, pool):
        # `pool` is the wrapped implementation; it is expected to expose
        # acquire(timeout), release(conn), idle_count(), and to raise
        # PoolExhaustedError when no connection is available.
        self.name = name
        self.pool = pool
        self._lock = threading.Lock()
        self._active_connections = {}

    def acquire(self):
        """Acquire a connection from the pool, record metrics."""
        start = time.perf_counter()

        try:
            conn = self.pool.acquire(timeout=30)
            wait_time = time.perf_counter() - start
            self.POOL_WAIT_TIME.labels(pool_name=self.name).observe(wait_time)

            # Track when this connection was acquired
            with self._lock:
                self._active_connections[id(conn)] = time.perf_counter()

            self._update_size_metrics()
            return conn

        except PoolExhaustedError:
            self.POOL_EXHAUSTION.labels(pool_name=self.name).inc()
            raise

    def release(self, conn):
        """Return a connection to the pool, record metrics."""
        with self._lock:
            acquired_at = self._active_connections.pop(id(conn), None)
        if acquired_at is not None:
            lifetime = time.perf_counter() - acquired_at
            self.CONNECTION_LIFETIME.labels(pool_name=self.name).observe(lifetime)

        self.pool.release(conn)
        self._update_size_metrics()

    def _update_size_metrics(self):
        """Update pool size gauges."""
        self.POOL_SIZE.labels(pool_name=self.name, state='active').set(
            len(self._active_connections)
        )
        self.POOL_SIZE.labels(pool_name=self.name, state='idle').set(
            self.pool.idle_count()
        )


# Alerting thresholds:
#
# 1. Pool Wait Time > 50ms (P95)
#    → Pool may be too small for load
#
# 2. Pool Exhaustion > 0 in 5 minutes
#    → Pool size definitely too small
#
# 3. Connection Lifetime > 5 seconds (P50)
#    → Connections held too long (missing releases? slow operations?)
#
# 4. Active/Total > 90% sustained
#    → Pool near saturation, scale up pool size
```

Connection Pool Sizing:
Pool sizing balances efficiency against resource consumption:
| Factor | Implication |
|---|---|
| Pool too small | Requests wait for connections; latency spikes; pool exhaustion errors |
| Pool too large | Wasted memory; database connection limits; more TIME_WAIT states |
| Minimum connections | Keeps 'warm' connections ready; costs idle resources |
| Maximum connections | Caps resource usage; must handle exhaustion gracefully |
| Idle timeout | Closes unused connections; balances freshness vs reconnection cost |
| Max lifetime | Forces reconnection to rebalance load; prevents stale connections |
A reasonable starting point: Pool Size = (RPS × Average Connection Hold Time) + Headroom. For example, 100 requests/second with 50ms average hold time = 5 concurrent connections needed. Add 2-3x headroom for variance. Monitor and adjust based on actual wait times and utilization.
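That rule of thumb translates directly into a small helper; a minimal sketch where the headroom factor is a judgment call, not a fixed constant:

```python
import math


def recommended_pool_size(requests_per_second: float,
                          avg_hold_time_seconds: float,
                          headroom_factor: float = 2.5) -> int:
    """Little's-law style estimate: concurrent connections = arrival rate x
    hold time, padded with headroom for traffic variance."""
    concurrent = requests_per_second * avg_hold_time_seconds
    return math.ceil(concurrent * headroom_factor)


# The example from the text: 100 RPS with a 50 ms average hold time
# needs ~5 concurrent connections before headroom.
print(recommended_pool_size(100, 0.050))        # -> 13 with 2.5x headroom
print(recommended_pool_size(100, 0.050, 3.0))   # -> 15 with 3x headroom
```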
While latency concerns dominate most distributed system discussions, bandwidth constraints become critical for data-intensive operations: bulk data transfers, database replications, backup operations, and media streaming.
Understanding Bandwidth vs. Throughput:
Bandwidth is the theoretical capacity of a link; throughput is the data rate actually achieved. The gap between the two reveals network efficiency or problems.
```bash
#!/bin/bash
# Bandwidth Monitoring Commands and Metrics

# ===========================================
# Real-time bandwidth monitoring
# ===========================================

# iftop: Live bandwidth by connection
iftop -i eth0 -nNP

# nload: Simple bandwidth graph
nload eth0

# bmon: Bandwidth monitor with graphs
bmon -p eth0

# ===========================================
# Interface statistics from kernel
# ===========================================

# Current interface statistics
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/tx_bytes

# Calculate throughput over an interval:
INTERVAL=5
RX1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
TX1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
sleep $INTERVAL
RX2=$(cat /sys/class/net/eth0/statistics/rx_bytes)
TX2=$(cat /sys/class/net/eth0/statistics/tx_bytes)

RX_RATE=$(( (RX2 - RX1) / INTERVAL ))
TX_RATE=$(( (TX2 - TX1) / INTERVAL ))

echo "Receive:  $(numfmt --to=iec-i --suffix=B/s $RX_RATE)"
echo "Transmit: $(numfmt --to=iec-i --suffix=B/s $TX_RATE)"

# ===========================================
# Detailed interface statistics
# ===========================================

# ethtool for driver-level statistics
ethtool -S eth0 | grep -E "(rx_|tx_|drop|error)"

# Example output (important metrics):
# rx_packets: 234567890
# tx_packets: 123456789
# rx_bytes: 345678901234
# tx_bytes: 234567890123
# rx_dropped: 0          # <-- Should be zero
# tx_dropped: 0
# rx_errors: 0           # <-- Should be zero
# tx_errors: 0
# rx_over_errors: 0      # <-- Ring buffer overflow
# tx_carrier_errors: 0   # <-- Link problems

# ===========================================
# Prometheus node_exporter provides these automatically:
# ===========================================
# node_network_receive_bytes_total{device="eth0"}
# node_network_transmit_bytes_total{device="eth0"}
# node_network_receive_drop_total{device="eth0"}
# node_network_transmit_drop_total{device="eth0"}
# node_network_receive_errors_total{device="eth0"}
```

Bandwidth Saturation Symptoms:
Recognizing bandwidth saturation early prevents cascading problems. Typical symptoms are rising interface drop counters, climbing retransmission rates, and latency that increases with traffic volume even while CPU and memory look healthy.
Cloud instances have bandwidth limits tied to instance size. An AWS t3.micro can burst up to ~5 Gbps but sustains a far lower baseline; an m5.24xlarge gets 25 Gbps. These limits are often not documented precisely. Monitor instance-level network metrics and right-size for bandwidth needs, not just CPU/memory.
Effective network monitoring combines metrics into dashboards that reveal issues at a glance. A well-designed network dashboard answers a handful of questions: Is the network healthy right now? Where is latency coming from, and is it getting worse? Are we approaching bandwidth or connection pool limits? Are errors and drops trending upward?
```yaml
# Grafana Dashboard Configuration (Prometheus Data Source)
# Key panels for comprehensive network monitoring

# ====================================================
# Row 1: Network Health Overview
# ====================================================

# Panel: Connection Success Rate (Gauge)
- title: "Connection Success Rate"
  type: gauge
  query: |
    1 - (
      rate(tcp_connection_errors_total[5m])
      /
      rate(tcp_connection_attempts_total[5m])
    )
  thresholds:            # higher is better
    - value: 0           # < 95%
      color: red
    - value: 0.95        # 95-99%
      color: yellow
    - value: 0.99        # >= 99%
      color: green

# Panel: Current Active Connections (Stat)
- title: "Active Connections"
  type: stat
  query: sum(node_netstat_Tcp_CurrEstab)

# Panel: Retransmission Rate (Gauge)
- title: "TCP Retransmit Rate"
  type: gauge
  query: |
    rate(node_netstat_Tcp_RetransSegs[5m])
    /
    rate(node_netstat_Tcp_OutSegs[5m])
  thresholds:
    - value: 0           # < 1%
      color: green
    - value: 0.01        # 1-3%
      color: yellow
    - value: 0.03        # > 3%
      color: red

# ====================================================
# Row 2: Latency Analysis
# ====================================================

# Panel: Request Latency Heatmap
- title: "Request Latency Distribution"
  type: heatmap
  query: |
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
  yAxis:
    format: seconds

# Panel: Latency Percentiles
- title: "Latency Percentiles (P50, P95, P99)"
  type: timeseries
  queries:
    - query: histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P50"
    - query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P95"
    - query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      legend: "P99"

# ====================================================
# Row 3: Throughput and Bandwidth
# ====================================================

# Panel: Network Traffic (Bytes/sec)
- title: "Network Traffic"
  type: timeseries
  queries:
    - query: rate(node_network_receive_bytes_total{device="eth0"}[5m])
      legend: "Inbound"
    - query: rate(node_network_transmit_bytes_total{device="eth0"}[5m])
      legend: "Outbound"
  unit: bytes/sec

# Panel: Bandwidth Utilization %
- title: "Bandwidth Utilization"
  type: gauge
  query: |
    rate(node_network_transmit_bytes_total{device="eth0"}[5m])
    /
    node_network_speed_bytes{device="eth0"}
  thresholds:
    - value: 0           # < 60%
      color: green
    - value: 0.60        # 60-80%
      color: yellow
    - value: 0.80        # > 80%
      color: red

# ====================================================
# Row 4: Connection Pool Health
# ====================================================

# Panel: Pool Utilization by Service
- title: "Connection Pool Utilization"
  type: timeseries
  query: |
    connection_pool_size{state="active"}
    /
    (connection_pool_size{state="active"} + connection_pool_size{state="idle"})
  legend: "{{pool_name}}"

# Panel: Pool Wait Time P95
- title: "Pool Wait Time (P95)"
  type: timeseries
  query: |
    histogram_quantile(0.95, rate(connection_pool_wait_seconds_bucket[5m]))
  legend: "{{pool_name}}"

# ====================================================
# Row 5: Errors and Anomalies
# ====================================================

# Panel: Network Errors
- title: "Network Errors (Rate)"
  type: timeseries
  queries:
    - query: rate(node_network_receive_errors_total[5m])
      legend: "RX Errors"
    - query: rate(node_network_transmit_errors_total[5m])
      legend: "TX Errors"
    - query: rate(node_network_receive_drop_total[5m])
      legend: "RX Drops"

# Panel: Connection Failures by Type
- title: "Connection Failures"
  type: timeseries
  query: |
    rate(http_client_connection_errors_total[5m])
  legend: "{{host}} - {{error_type}}"
```

The most valuable feature of a network dashboard is correlating metrics with events. Add deployment markers, incident annotations, and traffic overlays. When latency spikes, you should immediately see whether it correlates with a deploy or a traffic surge, or is isolated to specific services.
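One way to get deployment markers onto panels is Grafana's annotations HTTP API. Here is a minimal sketch, assuming a Grafana instance at `GRAFANA_URL` and a service-account token (both placeholders); a deploy pipeline would call this at release time:

```python
# Post a deployment marker to Grafana's annotations HTTP API so it appears
# on dashboard panels. GRAFANA_URL and API_TOKEN are placeholders.
import time

import requests

GRAFANA_URL = "https://grafana.example.com"           # placeholder
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"      # placeholder


def annotate_deploy(service: str, version: str) -> None:
    """Create a tagged annotation at the current time."""
    payload = {
        "time": int(time.time() * 1000),   # epoch milliseconds
        "tags": ["deployment", service],
        "text": f"Deployed {service} {version}",
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()


# Example (hypothetical service and version):
# annotate_deploy("checkout-service", "v2.14.0")
```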
We've explored comprehensive network monitoring—from TCP fundamentals to production dashboards used by principal engineers.
What's Next:
We've covered application profiling, database analysis, and network monitoring. The next page explores continuous performance testing—ensuring performance remains excellent across every code change.
You now understand network monitoring at the depth required for production systems. The network is no longer an opaque 'it just works' assumption—it's an observable, measurable component of your distributed system.