A network link might boast impressive bandwidth and low delay, yet fail to deliver packets reliably. Consider a microwave link during a rainstorm: its 1 Gbps capacity and 5ms latency remain unchanged on paper, but bit errors cause packet corruption, TCP retransmissions spike, and effective throughput collapses. Reliability captures this quality dimension that bandwidth and delay metrics miss.
Reliability as a routing metric represents the probability that a packet will be successfully delivered across a link without errors. It encompasses link stability (uptime vs. downtime), error rates (bit errors, CRC failures), and the consistency of delivery over time.
By the end of this page, you will understand how reliability is defined and measured as a routing metric, how protocols like EIGRP and older IGRP incorporate reliability, the challenges of using dynamic reliability metrics, the relationship between link reliability and overall path reliability, and when reliability metrics provide genuine value versus unnecessary complexity.
Reliability in networking quantifies how dependable a link is for packet delivery. Unlike bandwidth (a theoretical maximum) or delay (a time measurement), reliability is expressed as a probability or fraction indicating successful delivery.
```
Reliability Definition:
════════════════════════════════════════════════════════
Reliability (R) = Packets Successfully Delivered / Packets Transmitted

Expressed as:
• Fraction: 0.0 to 1.0 (0% to 100%)
• EIGRP format: 0 to 255 (where 255 = 100% reliable)

Example:
─────────────────────────────────────────────────────
Link transmits 10,000 packets
Successfully delivered: 9,985 packets
Failed (errors, drops):     15 packets

Reliability = 9,985 / 10,000 = 0.9985 = 99.85%
In EIGRP format: 255 × 0.9985 ≈ 254
```
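To make the conversion concrete, here is a minimal Python sketch (purely illustrative, not taken from any router implementation) that turns packet counts into both the fractional and the EIGRP-style values:

```python
def reliability(delivered: int, transmitted: int) -> float:
    """Fraction of packets delivered successfully (0.0 to 1.0)."""
    return delivered / transmitted

def to_eigrp_scale(fraction: float) -> int:
    """Map a 0.0-1.0 reliability fraction onto EIGRP's 0-255 scale.

    Truncates rather than rounds, matching the ≈254 in the example.
    """
    return int(fraction * 255)

r = reliability(delivered=9_985, transmitted=10_000)
print(f"Reliability: {r:.4f} ({r:.2%})")    # 0.9985 (99.85%)
print(f"EIGRP scale: {to_eigrp_scale(r)}")  # 254
```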
Components Affecting Reliability:
Link reliability is influenced by multiple factors, each contributing to the probability of packet loss or corruption:
| Factor | Description | Impact | Typical Environment |
|---|---|---|---|
| Bit Error Rate (BER) | Probability of individual bit corruption | Corrupted packets fail CRC, discarded | Wireless, noisy electrical environments |
| Interface Errors | CRC errors, runts, giants, collisions | Direct packet loss at interface | Faulty hardware, cable issues |
| Buffer Overflows | Traffic exceeds queue capacity | Tail drops or active queue management | Congested network segments |
| Link Flapping | Interface repeatedly going up/down | Packet loss during state transitions | Unstable connections, marginal signal |
| Physical Medium Issues | Cable damage, connector corrosion | Intermittent errors, complete failures | Aging infrastructure, poor installation |
| Environmental Factors | Weather (wireless), EMI (copper) | Variable error rates | Outdoor wireless, industrial sites |
Path Reliability:
For multi-hop paths, reliability compounds multiplicatively—each additional hop reduces the overall reliability:
```
Path Reliability = R₁ × R₂ × R₃ × ... × Rₙ

Example: 5-hop path with each link at 99% reliability
Path Reliability = 0.99 × 0.99 × 0.99 × 0.99 × 0.99
                 = 0.99⁵
                 = 0.951 = 95.1%

Example: 10-hop path with each link at 99% reliability
Path Reliability = 0.99¹⁰ = 0.904 = 90.4%
```
This multiplicative relationship means that even small per-link reliability issues compound into significant path-level degradation. A 1% loss per hop becomes roughly a 10% loss (more precisely, 9.6%) over a 10-hop path.
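The compounding is easy to verify in a few lines of Python (an illustrative sketch, assuming independent per-link errors):

```python
from math import prod

def path_reliability(link_reliabilities: list[float]) -> float:
    """Overall delivery probability across a multi-hop path,
    assuming per-link errors are independent (so probabilities multiply)."""
    return prod(link_reliabilities)

print(f"{path_reliability([0.99] * 5):.3f}")   # 0.951 -> 95.1%
print(f"{path_reliability([0.99] * 10):.3f}")  # 0.904 -> 90.4%

# A single weak link dominates: four excellent links plus one poor one
print(f"{path_reliability([0.9999] * 4 + [0.95]):.3f}")  # ~0.950
```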
Reliability (error-free delivery probability) differs from Availability (uptime percentage). A link might be highly available (99.99% uptime) but have poor reliability when up (high error rate). Conversely, a link might have perfect reliability when functioning but poor availability (frequent outages). Both matter for end-to-end service quality.
Cisco's Interior Gateway Routing Protocol (IGRP), developed in the mid-1980s, and its successor EIGRP both include reliability as a component of their composite metrics. Understanding this implementation provides insight into both the value and challenges of reliability as a routing metric.
```
EIGRP Composite Metric Formula (Full):
════════════════════════════════════════════════════════
Metric = [(K1 × BW) + (K2 × BW)/(256 − Load) + (K3 × Delay)]
         × [K5/(K4 + Reliability)]

Where:
• K1, K2, K3, K4, K5 = Weighting constants
• BW = 10^7 / minimum bandwidth (Kbps)
• Delay = sum of interface delays (tens of μs)
• Load = interface load (1-255)
• Reliability = interface reliability (1-255, where 255 = 100%)

Default K values: K1=1, K2=0, K3=1, K4=0, K5=0

When K5 = 0, the [K5/(K4 + Reliability)] term is skipped entirely
(not evaluated as zero), so with defaults the formula simplifies to:
Metric = (BW + Delay) × 256

Note: With K4=0 and K5=0, Reliability is NOT used by default!
```
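The K5 special case is easier to see in code. Here is a small Python sketch of the composite calculation (a paraphrase of the published formula, not Cisco's actual implementation):

```python
def eigrp_metric(min_bw_kbps: int, delay_tens_usec: int,
                 load: int = 1, reliability: int = 255,
                 k1: int = 1, k2: int = 0, k3: int = 1,
                 k4: int = 0, k5: int = 0) -> int:
    """Classic EIGRP composite metric with configurable K weights."""
    bw = 10**7 // min_bw_kbps  # scaled inverse of the slowest link
    metric = k1 * bw + (k2 * bw) // (256 - load) + k3 * delay_tens_usec
    if k5 != 0:
        # Reliability term applies only when K5 is nonzero
        metric = metric * k5 / (k4 + reliability)
    return int(metric * 256)

# 100 Mbps link (100,000 Kbps) with 100 µs delay (10 tens-of-µs):
print(eigrp_metric(100_000, 10))                   # 28160
# With default K values, degraded reliability changes nothing:
print(eigrp_metric(100_000, 10, reliability=100))  # still 28160
```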
Why Reliability is Disabled by Default:
Despite EIGRP's capability to factor in reliability, Cisco disabled it by default (K4=0, K5=0). The reasons illuminate fundamental challenges with dynamic reliability metrics:
1. Routing Instability: Reliability values change as interface errors occur. If routes changed based on reliability fluctuations, transient error bursts could trigger route changes, which cause more errors during convergence, creating oscillating feedback loops.
2. Self-Correcting Nature of Errors: Many reliability issues are transient (a burst of errors during brief interference, for example). By the time reliability metrics propagate and routes converge, the issue may have resolved, making the route change unnecessary.
3. Transport Layer Compensation: TCP already handles packet loss through retransmission. Routing around unreliable links doesn't eliminate retransmissions for in-flight data and may create additional issues (reordering, asymmetric paths).
4. Measurement Challenges: Reliability measurements require tracking packet success/failure over time. The measurement window affects accuracy: too short creates noise, too long masks real problems.
```
! Show interface reliability
show interface GigabitEthernet0/0 | include reliability
! Output: reliability 255/255, txload 1/255, rxload 1/255

! Values:
! reliability: 255/255 = 100% reliable (1/255 = 0.4% reliable)
! txload/rxload: 1/255 = nearly idle (255/255 = fully loaded)

! Show EIGRP topology with metrics
show ip eigrp topology all-links
! Displays composite metric and feasible distance

! Enable reliability in EIGRP metric (NOT recommended in production)
router eigrp 100
 metric weights 0 1 0 1 0 1
 ! Weights: TOS K1 K2 K3 K4 K5
 ! This enables reliability (K5=1) in metric calculation
```
Enabling reliability (or load) in EIGRP's metric calculation is strongly discouraged in production environments. The routing instability caused by dynamic metric components typically causes more problems than the suboptimal path selection it attempts to solve. Cisco's decision to disable these by default reflects decades of operational experience.
Even when not used directly in routing metrics, reliability measurement is essential for network operations, capacity planning, and SLA management. Understanding measurement techniques helps network engineers identify and address reliability issues.
| Counter | Description | Indicates |
|---|---|---|
| CRC Errors | Frames with failed checksum | Bit corruption (noise, cable issues, hardware) |
| Input Errors | Total received frames with any error | General receive-side problems |
| Output Errors | Frames failed to transmit | Transmit-side issues, buffer overflows |
| Runts | Frames smaller than minimum (64 bytes) | Collisions, duplex mismatch |
| Giants | Frames larger than maximum | MTU mismatch, faulty equipment |
| Frame Errors | Frames with invalid format | Protocol issues, hardware problems |
| Overrun | Receiver couldn't process fast enough | CPU/memory limitations |
| Ignored | Receiver buffer full | Buffer sizing, traffic bursts |
```
! Full interface statistics
show interface GigabitEthernet0/0

! Key sections for reliability analysis:
! ─────────────────────────────────────────────────
! GigabitEthernet0/0 is up, line protocol is up
!   reliability 255/255, txload 1/255, rxload 1/255
!
!   Input queue: 0/75/0/0 (size/max/drops/flushes)
!
!   5 minute input rate 145000 bits/sec, 89 packets/sec
!   5 minute output rate 1256000 bits/sec, 423 packets/sec
!
!   125443567 packets input, 18765432123 bytes
!   0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored  ← Good!
!
!   234567890 packets output, 34567890123 bytes
!   0 output errors, 0 collisions, 0 interface resets     ← Good!

! Calculate reliability from counters:
! Reliability = 1 - (input_errors + output_errors) / total_packets
! Reliability = 1 - (0 + 0) / (125443567 + 234567890)
! Reliability = 100%

! Bad example with errors:
!   135 input errors, 127 CRC, 0 frame, 8 overrun, 0 ignored  ← Problem!
!   23 output errors, 0 collisions, 3 interface resets        ← Problem!
```
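Plugging the "bad example" counters into the formula above shows why a single snapshot can mislead (a quick illustrative Python check):

```python
# Counters from the "bad example" output above
input_errors, output_errors = 135, 23
packets_in, packets_out = 125_443_567, 234_567_890

errors = input_errors + output_errors
total = packets_in + packets_out
reliability = 1 - errors / total

print(f"Reliability: {reliability:.8f}")           # 0.99999956
print(f"EIGRP scale: {round(reliability * 255)}")  # still 255/255

# Lifetime counters dilute recent error bursts, which is why trending
# the raw error counters over time matters more than one calculation.
```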
Bit Error Rate (BER) Analysis:
For deeper reliability analysis, Bit Error Rate provides a fundamental measure of link quality:
```
BER = Number of Bit Errors / Total Bits Transmitted

Typical BER Standards:
────────────────────────────────────────────────
• Fiber optic:      BER < 10⁻¹²  (1 error per trillion bits)
• Copper Ethernet:  BER < 10⁻¹⁰  (1 error per 10 billion bits)
• Wireless LAN:     BER ~ 10⁻⁶ to 10⁻⁵  (variable)
• Satellite:        BER ~ 10⁻⁷ to 10⁻⁵  (weather-dependent)

Impact on Reliability:
────────────────────────────────────────────────
1,500-byte packet = 12,000 bits

BER 10⁻¹² (excellent fiber):
P(error) = 1 - (1 - 10⁻¹²)^12000 ≈ 0.000001% per packet

BER 10⁻⁵ (poor wireless):
P(error) = 1 - (1 - 10⁻⁵)^12000 ≈ 11.3% per packet!
```
The relationship between BER and packet error rate explains why technologies with higher BER (wireless, satellite) require more robust error correction and why reliability concerns are more pressing in certain environments.
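The BER-to-packet-error conversion is a one-liner; this short Python sketch reproduces the numbers above (assuming independent bit errors):

```python
def packet_error_rate(ber: float, packet_bytes: int = 1500) -> float:
    """Probability that at least one bit in a packet is corrupted,
    assuming bit errors occur independently."""
    bits = packet_bytes * 8  # 1,500 bytes = 12,000 bits
    return 1 - (1 - ber) ** bits

print(f"{packet_error_rate(1e-12):.2e}")  # ~1.20e-08 (excellent fiber)
print(f"{packet_error_rate(1e-5):.1%}")   # ~11.3% (poor wireless)
```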
A single reliability measurement provides limited value—transient errors are common. Monitor reliability trends over time. A reliability value that drops from 255 to 253 over months indicates gradual degradation (possibly a failing component). A sudden drop from 255 to 200 indicates an acute problem requiring immediate attention.
Using real-time reliability measurements in routing decisions sounds appealing—automatically route around problematic links. In practice, this approach creates more problems than it solves. Understanding these challenges explains why modern protocols avoid dynamic reliability metrics.
The ARPANET Experience:
The early ARPANET used dynamic metrics that incorporated queue lengths (related to congestion and reliability). The results were instructive: because every router steered traffic toward whichever path currently reported the best metric, lightly loaded links attracted traffic until they became congested, the metric flipped, and traffic swung back again, producing persistent route oscillations instead of stable load distribution.
This historical experience directly influenced the design of modern protocols. OSPF's pure bandwidth-based cost and EIGRP's default exclusion of reliability/load reflect lessons learned from ARPANET's dynamic metric experiments.
Networking often involves trade-offs between stability and optimality. A perfectly optimal routing system that constantly adapts to conditions may be less useful than a stable system that's slightly suboptimal. Users generally prefer predictable, consistent performance over theoretically better but variable performance.
Given the challenges of dynamic reliability metrics, how should network engineers address link reliability concerns? The answer involves strategic use of static configuration combined with appropriate monitoring and intervention.
Recommended Approach: Static Costs with Dynamic Monitoring
1. Baseline Configuration: Assign static routing costs that reflect each link's expected reliability (fiber low, microwave higher, satellite highest), as shown in the configuration below.
2. Continuous Monitoring: Track interface error counters with SNMP or streaming telemetry and alert on trends; measurement stays out of the routing protocol.
3. Planned Intervention: When monitoring reveals degradation, adjust costs or repair the link deliberately during a maintenance window rather than letting routes shift automatically.
4. Exception (Complete Failures): A link that goes down outright is withdrawn by normal routing convergence, and fast-reroute mechanisms (discussed below) provide rapid recovery.
```
! Static cost assignment based on expected reliability
! (Using OSPF with 10 Gbps reference bandwidth)

! Fiber links - highly reliable, use auto-calculated cost
interface TenGigabitEthernet0/0
 description Core Fiber - Primary
 ! Cost = 10000/10000 = 1 (auto-calculated)

! Microwave backhaul - weather-sensitive, less reliable
interface GigabitEthernet0/1
 description Microwave Backhaul - Weather Dependent
 ip ospf cost 500
 ! Manual high cost reflects reliability concern

! Satellite backup - high delay AND reliability concerns
interface GigabitEthernet0/2
 description Satellite Backup - Emergency Only
 ip ospf cost 10000
 ! Very high cost = last resort only

! Monitoring: Track reliability for alerting
! Use SNMP or streaming telemetry to monitor:
! - ifInErrors, ifOutErrors
! - ifInDiscards, ifOutDiscards
! - CRC error counters
```
Modern technologies like MPLS Fast Reroute (FRR) and IP FRR provide sub-50ms failover without changing IGP metrics. The IGP maintains stable costs while the FRR mechanism handles rapid recovery from failures. This combines the stability of static metrics with fast response to actual link failures.
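To complement the static costs and fast reroute, the monitoring piece can be a simple counter-trend watcher. A minimal Python sketch, where `get_if_in_errors()` is a hypothetical stand-in for an SNMP (IF-MIB ifInErrors) or streaming-telemetry query:

```python
import time

def get_if_in_errors(interface: str) -> int:
    """Hypothetical stand-in: fetch the ifInErrors counter for an
    interface via SNMP or streaming telemetry."""
    raise NotImplementedError  # wire up to your NMS of choice

def watch_errors(interface: str, interval_s: int = 300,
                 alert_threshold: int = 50) -> None:
    """Alert when errors grow faster than the threshold per interval.
    Routing costs stay untouched; humans decide on intervention."""
    last = get_if_in_errors(interface)
    while True:
        time.sleep(interval_s)
        current = get_if_in_errors(interface)
        delta = current - last
        if delta > alert_threshold:
            print(f"ALERT: {interface} logged {delta} input errors "
                  f"in {interval_s}s; schedule investigation")
        last = current
```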
While traditional IGPs have moved away from dynamic reliability metrics, modern networking technologies address reliability through alternative mechanisms that avoid the stability problems of metric-based approaches.
| Technology | Reliability Mechanism | Key Characteristics |
|---|---|---|
| SD-WAN | Real-time path quality measurement with per-packet steering | Application-aware; measures loss, latency, jitter; makes per-flow decisions |
| MPLS-TE FRR | Precomputed backup paths activated on failure detection | Sub-50ms failover; doesn't use IGP metric changes |
| Segment Routing TI-LFA | Topology-Independent Loop-Free Alternate paths | Extends FRR to complex topologies; maintains IGP stability |
| BGP Performance Routing | Overlay measurements with BGP path selection influence | Operates at inter-domain level; can prefer reliable AS paths |
| Application Layer | Retransmission, FEC, multi-path streaming | TCP handles loss; UDP applications add FEC or multi-path |
SD-WAN: Dynamic Reliability Done Right?
SD-WAN solutions actively measure path quality (including packet loss/reliability) and steer traffic accordingly. How do they avoid the oscillation problems?
1. Per-Flow Granularity: SD-WAN makes decisions per application flow, not for all traffic. Shifting one voice call doesn't cause massive load shifts.
2. Application Awareness: Different applications have different reliability requirements. Real-time apps avoid lossy paths while bulk transfers tolerate some loss.
3. Edge Intelligence: Decisions are made at the network edge with local information, not propagated via routing protocols across the network.
4. Explicit Path Control: SD-WAN often uses tunnels/overlays, so path changes don't affect the underlying IGP routing; the layers stay isolated.
5. Sophisticated Algorithms: Damping, hysteresis, and weighted moving averages prevent oscillation from transient quality changes.
```
SD-WAN Path Quality Assessment:
════════════════════════════════════════════════════════
for each path in available_paths:
    # Continuous measurement with synthetic probes
    send probe_packet to remote_site
    wait for response or timeout

    # Rolling loss rate with damping (weighted moving average)
    path.loss_rate = (old_loss_rate × 0.8) + (new_measurement × 0.2)

    # Hysteresis prevents oscillation
    if path.loss_rate > threshold_bad + hysteresis:
        mark_path_as_degraded(path)
    elif path.loss_rate < threshold_good - hysteresis:
        mark_path_as_healthy(path)
    # else: maintain current status (hysteresis zone)

# Per-application path selection
for each new_flow:
    app_requirements = classify_application(flow)
    suitable_paths = filter_by_requirements(paths, app_requirements)
    selected_path = best_of(suitable_paths)  # may use latency, loss, jitter
```
Modern networks often layer reliability mechanisms: stable IGP routing for base connectivity, fast reroute for rapid failure recovery, SD-WAN or performance routing for application-level optimization, and transport-layer retransmission as final backup. No single layer tries to solve all reliability problems.
Rather than relying on dynamic reliability metrics, network engineers use architectural approaches to ensure reliable packet delivery. These design principles provide reliability through redundancy and proper engineering rather than metric manipulation.
Four diverse paths ensure reliability through redundancy. Equal-cost primary paths provide ECMP load sharing. Higher-cost backup paths activate automatically on failure. No dynamic reliability metrics needed—the architecture provides reliability.
Five-nines availability (99.999%) allows only 5.26 minutes of downtime per year. This cannot be achieved through clever routing alone—it requires redundant hardware, diverse paths, automated failover, and rigorous operational procedures. Reliability metrics are just one small piece of a comprehensive reliability strategy.
We've thoroughly explored reliability as a routing metric, from its theoretical foundation through practical implementation challenges and modern alternatives. The essential points:
- Reliability is the probability of error-free packet delivery, distinct from availability (uptime percentage).
- Path reliability multiplies across hops, so small per-link losses compound significantly over long paths.
- EIGRP can incorporate reliability into its composite metric but disables it by default (K4=0, K5=0) to avoid routing instability.
- The ARPANET's experience with dynamic metrics showed how reactive routing produces oscillation rather than optimization.
- Modern networks achieve reliability architecturally: redundant paths, fast reroute, SD-WAN path steering, and transport-layer recovery.
Looking Ahead:
We've now covered the fundamental individual metrics: hop count, bandwidth, delay, and reliability. In the next and final page of this module, we'll explore composite metrics—how these individual components are combined into unified metrics that balance multiple concerns, as implemented in protocols like EIGRP. We'll also discuss metric tuning and the art of traffic engineering through metric manipulation.
Composite metrics represent the practical application of everything we've learned, enabling network engineers to make nuanced routing decisions that account for multiple path characteristics simultaneously.
You now have a comprehensive understanding of reliability as a routing metric: its definition and measurement, the historical attempts to incorporate it dynamically, why those attempts largely failed, and how modern networks achieve reliability through architectural design rather than metric-based routing. This perspective is essential for building robust production networks.