A network link might boast impressive bandwidth and low delay, yet fail to deliver packets reliably. Consider a microwave link during a rainstorm: its 1 Gbps capacity and 5ms latency remain unchanged on paper, but bit errors cause packet corruption, TCP retransmissions spike, and effective throughput collapses. Reliability captures this quality dimension that bandwidth and delay metrics miss.
Reliability as a routing metric represents the probability that a packet will be successfully delivered across a link without errors. It encompasses link stability (uptime vs. downtime), error rates (bit errors, CRC failures), and the consistency of delivery over time.
By the end of this page, you will understand how reliability is defined and measured as a routing metric, how protocols like EIGRP and older IGRP incorporate reliability, the challenges of using dynamic reliability metrics, the relationship between link reliability and overall path reliability, and when reliability metrics provide genuine value versus unnecessary complexity.
Reliability in networking quantifies how dependable a link is for packet delivery. Unlike bandwidth (a theoretical maximum) or delay (a time measurement), reliability is expressed as a probability or fraction indicating successful delivery.
```
Reliability Definition:
════════════════════════════════════════════════════════
Reliability (R) = Packets Successfully Delivered / Packets Transmitted

Expressed as:
• Fraction: 0.0 to 1.0 (0% to 100%)
• EIGRP format: 0 to 255 (where 255 = 100% reliable)

Example:
─────────────────────────────────────────────────────
Link transmits 10,000 packets
Successfully delivered: 9,985 packets
Failed (errors, drops):     15 packets

Reliability = 9,985 / 10,000 = 0.9985 = 99.85%
In EIGRP format: 255 × 0.9985 ≈ 254
```
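To make the conversion concrete, here is a minimal Python sketch (purely illustrative, not taken from any router implementation) that turns packet counts into both the fractional and the EIGRP-style values:

```python
def reliability(delivered: int, transmitted: int) -> float:
    """Fraction of packets delivered successfully (0.0 to 1.0)."""
    return delivered / transmitted

def to_eigrp_scale(fraction: float) -> int:
    """Map a 0.0-1.0 reliability fraction onto EIGRP's 0-255 scale.

    Truncates rather than rounds, matching the ≈254 in the example.
    """
    return int(fraction * 255)

r = reliability(delivered=9_985, transmitted=10_000)
print(f"Reliability: {r:.4f} ({r:.2%})")    # 0.9985 (99.85%)
print(f"EIGRP scale: {to_eigrp_scale(r)}")  # 254
```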
Components Affecting Reliability:
Link reliability is influenced by multiple factors, each contributing to the probability of packet loss or corruption:
| Factor | Description | Impact | Typical Environment |
|---|---|---|---|
| Bit Error Rate (BER) | Probability of individual bit corruption | Corrupted packets fail CRC, discarded | Wireless, noisy electrical environments |
| Interface Errors | CRC errors, runts, giants, collisions | Direct packet loss at interface | Faulty hardware, cable issues |
| Buffer Overflows | Traffic exceeds queue capacity | Tail drops or active queue management | Congested network segments |
| Link Flapping | Interface repeatedly going up/down | Packet loss during state transitions | Unstable connections, marginal signal |
| Physical Medium Issues | Cable damage, connector corrosion | Intermittent errors, complete failures | Aging infrastructure, poor installation |
| Environmental Factors | Weather (wireless), EMI (copper) | Variable error rates | Outdoor wireless, industrial sites |
Path Reliability:
For multi-hop paths, reliability compounds multiplicatively—each additional hop reduces the overall reliability:
```
Path Reliability = R₁ × R₂ × R₃ × ... × Rₙ

Example: 5-hop path with each link at 99% reliability
Path Reliability = 0.99 × 0.99 × 0.99 × 0.99 × 0.99
                 = 0.99⁵
                 = 0.951 = 95.1%

Example: 10-hop path with each link at 99% reliability
Path Reliability = 0.99¹⁰ = 0.904 = 90.4%
```
This multiplicative relationship means that even small per-link reliability issues compound into significant path-level degradation. A 1% loss per hop becomes roughly a 10% loss (more precisely, 9.6%) over a 10-hop path.
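The compounding is easy to verify in a few lines of Python (an illustrative sketch, assuming independent per-link errors):

```python
from math import prod

def path_reliability(link_reliabilities: list[float]) -> float:
    """Overall delivery probability across a multi-hop path,
    assuming per-link errors are independent (so probabilities multiply)."""
    return prod(link_reliabilities)

print(f"{path_reliability([0.99] * 5):.3f}")   # 0.951 -> 95.1%
print(f"{path_reliability([0.99] * 10):.3f}")  # 0.904 -> 90.4%

# A single weak link dominates: four excellent links plus one poor one
print(f"{path_reliability([0.9999] * 4 + [0.95]):.3f}")  # ~0.950
```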
Reliability (error-free delivery probability) differs from Availability (uptime percentage). A link might be highly available (99.99% uptime) but have poor reliability when up (high error rate). Conversely, a link might have perfect reliability when functioning but poor availability (frequent outages). Both matter for end-to-end service quality.
Cisco's Interior Gateway Routing Protocol (IGRP), developed in the mid-1980s, and its successor EIGRP both include reliability as a component of their composite metrics. Understanding this implementation provides insight into both the value and challenges of reliability as a routing metric.
```
EIGRP Composite Metric Formula (Full):
════════════════════════════════════════════════════════
Metric = [(K1 × BW) + (K2 × BW)/(256 − Load) + (K3 × Delay)]
         × [K5/(K4 + Reliability)]

Where:
• K1, K2, K3, K4, K5 = Weighting constants
• BW = 10^7 / minimum bandwidth (Kbps)
• Delay = sum of interface delays (tens of μs)
• Load = interface load (1-255)
• Reliability = interface reliability (1-255, where 255 = 100%)

Default K values: K1=1, K2=0, K3=1, K4=0, K5=0

When K5 = 0, the [K5/(K4 + Reliability)] term is skipped entirely
(not evaluated as zero), so with defaults the formula simplifies to:
Metric = (BW + Delay) × 256

Note: With K4=0 and K5=0, Reliability is NOT used by default!
```
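The K5 special case is easier to see in code. Here is a small Python sketch of the composite calculation (a paraphrase of the published formula, not Cisco's actual implementation):

```python
def eigrp_metric(min_bw_kbps: int, delay_tens_usec: int,
                 load: int = 1, reliability: int = 255,
                 k1: int = 1, k2: int = 0, k3: int = 1,
                 k4: int = 0, k5: int = 0) -> int:
    """Classic EIGRP composite metric with configurable K weights."""
    bw = 10**7 // min_bw_kbps  # scaled inverse of the slowest link
    metric = k1 * bw + (k2 * bw) // (256 - load) + k3 * delay_tens_usec
    if k5 != 0:
        # Reliability term applies only when K5 is nonzero
        metric = metric * k5 / (k4 + reliability)
    return int(metric * 256)

# 100 Mbps link (100,000 Kbps) with 100 µs delay (10 tens-of-µs):
print(eigrp_metric(100_000, 10))                   # 28160
# With default K values, degraded reliability changes nothing:
print(eigrp_metric(100_000, 10, reliability=100))  # still 28160
```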
Why Reliability is Disabled by Default:
Despite EIGRP's capability to factor in reliability, Cisco disabled it by default (K4=0, K5=0). The reasons illuminate fundamental challenges with dynamic reliability metrics:
1. Routing Instability: Reliability values change as interface errors occur. If routes changed based on reliability fluctuations, transient error bursts could trigger route changes, which cause more errors during convergence, creating oscillating feedback loops.
2. Self-Correcting Nature of Errors: Many reliability issues are transient (a burst of errors during brief interference, for example). By the time reliability metrics propagate and routes converge, the issue may have resolved, making the route change unnecessary.
3. Transport Layer Compensation: TCP already handles packet loss through retransmission. Routing around unreliable links doesn't eliminate retransmissions for in-flight data and may create additional issues (reordering, asymmetric paths).
4. Measurement Challenges: Reliability measurements require tracking packet success/failure over time. The measurement window affects accuracy: too short creates noise, too long masks real problems.
```
! Show interface reliability
show interface GigabitEthernet0/0 | include reliability
! Output: reliability 255/255, txload 1/255, rxload 1/255

! Values:
! reliability: 255/255 = 100% reliable (1/255 = 0.4% reliable)
! txload/rxload: 1/255 = nearly idle (255/255 = fully loaded)

! Show EIGRP topology with metrics
show ip eigrp topology all-links
! Displays composite metric and feasible distance

! Enable reliability in EIGRP metric (NOT recommended in production)
router eigrp 100
 metric weights 0 1 0 1 0 1
 ! Weights: TOS K1 K2 K3 K4 K5
 ! This enables reliability (K5=1) in metric calculation
```
Enabling reliability (or load) in EIGRP's metric calculation is strongly discouraged in production environments. The routing instability caused by dynamic metric components typically causes more problems than the suboptimal path selection it attempts to solve. Cisco's decision to disable these by default reflects decades of operational experience.
Even when not used directly in routing metrics, reliability measurement is essential for network operations, capacity planning, and SLA management. Understanding measurement techniques helps network engineers identify and address reliability issues.
| Counter | Description | Indicates |
|---|---|---|
| CRC Errors | Frames with failed checksum | Bit corruption (noise, cable issues, hardware) |
| Input Errors | Total received frames with any error | General receive-side problems |
| Output Errors | Frames failed to transmit | Transmit-side issues, buffer overflows |
| Runts | Frames smaller than minimum (64 bytes) | Collisions, duplex mismatch |
| Giants | Frames larger than maximum | MTU mismatch, faulty equipment |
| Frame Errors | Frames with invalid format | Protocol issues, hardware problems |
| Overrun | Receiver couldn't process fast enough | CPU/memory limitations |
| Ignored | Receiver buffer full | Buffer sizing, traffic bursts |
```
! Full interface statistics
show interface GigabitEthernet0/0

! Key sections for reliability analysis:
! ─────────────────────────────────────────────────
! GigabitEthernet0/0 is up, line protocol is up
!   reliability 255/255, txload 1/255, rxload 1/255
!
!   Input queue: 0/75/0/0 (size/max/drops/flushes)
!
!   5 minute input rate 145000 bits/sec, 89 packets/sec
!   5 minute output rate 1256000 bits/sec, 423 packets/sec
!
!   125443567 packets input, 18765432123 bytes
!   0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored  ← Good!
!
!   234567890 packets output, 34567890123 bytes
!   0 output errors, 0 collisions, 0 interface resets     ← Good!

! Calculate reliability from counters:
! Reliability = 1 - (input_errors + output_errors) / total_packets
! Reliability = 1 - (0 + 0) / (125443567 + 234567890)
! Reliability = 100%

! Bad example with errors:
!   135 input errors, 127 CRC, 0 frame, 8 overrun, 0 ignored  ← Problem!
!   23 output errors, 0 collisions, 3 interface resets        ← Problem!
```
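Plugging the "bad example" counters into the formula above shows why a single snapshot can mislead (a quick illustrative Python check):

```python
# Counters from the "bad example" output above
input_errors, output_errors = 135, 23
packets_in, packets_out = 125_443_567, 234_567_890

errors = input_errors + output_errors
total = packets_in + packets_out
reliability = 1 - errors / total

print(f"Reliability: {reliability:.8f}")           # 0.99999956
print(f"EIGRP scale: {round(reliability * 255)}")  # still 255/255

# Lifetime counters dilute recent error bursts, which is why trending
# the raw error counters over time matters more than one calculation.
```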
Bit Error Rate (BER) Analysis:
For deeper reliability analysis, Bit Error Rate provides a fundamental measure of link quality:
```
BER = Number of Bit Errors / Total Bits Transmitted

Typical BER Standards:
────────────────────────────────────────────────
• Fiber optic:      BER < 10⁻¹²  (1 error per trillion bits)
• Copper Ethernet:  BER < 10⁻¹⁰  (1 error per 10 billion bits)
• Wireless LAN:     BER ~ 10⁻⁶ to 10⁻⁵  (variable)
• Satellite:        BER ~ 10⁻⁷ to 10⁻⁵  (weather-dependent)

Impact on Reliability:
────────────────────────────────────────────────
1,500-byte packet = 12,000 bits

BER 10⁻¹² (excellent fiber):
P(error) = 1 - (1 - 10⁻¹²)^12000 ≈ 0.000001% per packet

BER 10⁻⁵ (poor wireless):
P(error) = 1 - (1 - 10⁻⁵)^12000 ≈ 11.3% per packet!
```
The relationship between BER and packet error rate explains why technologies with higher BER (wireless, satellite) require more robust error correction and why reliability concerns are more pressing in certain environments.
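The BER-to-packet-error conversion is a one-liner; this short Python sketch reproduces the numbers above (assuming independent bit errors):

```python
def packet_error_rate(ber: float, packet_bytes: int = 1500) -> float:
    """Probability that at least one bit in a packet is corrupted,
    assuming bit errors occur independently."""
    bits = packet_bytes * 8  # 1,500 bytes = 12,000 bits
    return 1 - (1 - ber) ** bits

print(f"{packet_error_rate(1e-12):.2e}")  # ~1.20e-08 (excellent fiber)
print(f"{packet_error_rate(1e-5):.1%}")   # ~11.3% (poor wireless)
```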
A single reliability measurement provides limited value—transient errors are common. Monitor reliability trends over time. A reliability value that drops from 255 to 253 over months indicates gradual degradation (possibly a failing component). A sudden drop from 255 to 200 indicates an acute problem requiring immediate attention.
Using real-time reliability measurements in routing decisions sounds appealing—automatically route around problematic links. In practice, this approach creates more problems than it solves. Understanding these challenges explains why modern protocols avoid dynamic reliability metrics.
The ARPANET Experience:
The early ARPANET used dynamic metrics that incorporated queue lengths (related to congestion and reliability). The results were instructive: because every router steered traffic toward whichever path currently reported the best metric, lightly loaded links attracted traffic until they became congested, the metric flipped, and traffic swung back again, producing persistent route oscillations instead of stable load distribution.
This historical experience directly influenced the design of modern protocols. OSPF's pure bandwidth-based cost and EIGRP's default exclusion of reliability/load reflect lessons learned from ARPANET's dynamic metric experiments.
Networking often involves trade-offs between stability and optimality. A perfectly optimal routing system that constantly adapts to conditions may be less useful than a stable system that's slightly suboptimal. Users generally prefer predictable, consistent performance over theoretically better but variable performance.
Given the challenges of dynamic reliability metrics, how should network engineers address link reliability concerns? The answer involves strategic use of static configuration combined with appropriate monitoring and intervention.
Recommended Approach: Static Costs with Dynamic Monitoring
1. Baseline Configuration: Assign static routing costs that reflect each link's expected reliability (fiber low, microwave higher, satellite highest), as shown in the configuration below.
2. Continuous Monitoring: Track interface error counters with SNMP or streaming telemetry and alert on trends; measurement stays out of the routing protocol.
3. Planned Intervention: When monitoring reveals degradation, adjust costs or repair the link deliberately during a maintenance window rather than letting routes shift automatically.
4. Exception (Complete Failures): A link that goes down outright is withdrawn by normal routing convergence, and fast-reroute mechanisms (discussed below) provide rapid recovery.
```
! Static cost assignment based on expected reliability
! (Using OSPF with 10 Gbps reference bandwidth)

! Fiber links - highly reliable, use auto-calculated cost
interface TenGigabitEthernet0/0
 description Core Fiber - Primary
 ! Cost = 10000/10000 = 1 (auto-calculated)

! Microwave backhaul - weather-sensitive, less reliable
interface GigabitEthernet0/1
 description Microwave Backhaul - Weather Dependent
 ip ospf cost 500
 ! Manual high cost reflects reliability concern

! Satellite backup - high delay AND reliability concerns
interface GigabitEthernet0/2
 description Satellite Backup - Emergency Only
 ip ospf cost 10000
 ! Very high cost = last resort only

! Monitoring: Track reliability for alerting
! Use SNMP or streaming telemetry to monitor:
! - ifInErrors, ifOutErrors
! - ifInDiscards, ifOutDiscards
! - CRC error counters
```
Modern technologies like MPLS Fast Reroute (FRR) and IP FRR provide sub-50ms failover without changing IGP metrics. The IGP maintains stable costs while the FRR mechanism handles rapid recovery from failures. This combines the stability of static metrics with fast response to actual link failures.
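To complement the static costs and fast reroute, the monitoring piece can be a simple counter-trend watcher. A minimal Python sketch, where `get_if_in_errors()` is a hypothetical stand-in for an SNMP (IF-MIB ifInErrors) or streaming-telemetry query:

```python
import time

def get_if_in_errors(interface: str) -> int:
    """Hypothetical stand-in: fetch the ifInErrors counter for an
    interface via SNMP or streaming telemetry."""
    raise NotImplementedError  # wire up to your NMS of choice

def watch_errors(interface: str, interval_s: int = 300,
                 alert_threshold: int = 50) -> None:
    """Alert when errors grow faster than the threshold per interval.
    Routing costs stay untouched; humans decide on intervention."""
    last = get_if_in_errors(interface)
    while True:
        time.sleep(interval_s)
        current = get_if_in_errors(interface)
        delta = current - last
        if delta > alert_threshold:
            print(f"ALERT: {interface} logged {delta} input errors "
                  f"in {interval_s}s; schedule investigation")
        last = current
```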
While traditional IGPs have moved away from dynamic reliability metrics, modern networking technologies address reliability through alternative mechanisms that avoid the stability problems of metric-based approaches.
| Technology | Reliability Mechanism | Key Characteristics |
|---|---|---|
| SD-WAN | Real-time path quality measurement with per-packet steering | Application-aware; measures loss, latency, jitter; makes per-flow decisions |
| MPLS-TE FRR | Precomputed backup paths activated on failure detection | Sub-50ms failover; doesn't use IGP metric changes |
| Segment Routing TI-LFA | Topology-Independent Loop-Free Alternate paths | Extends FRR to complex topologies; maintains IGP stability |
| BGP Performance Routing | Overlay measurements with BGP path selection influence | Operates at inter-domain level; can prefer reliable AS paths |
| Application Layer | Retransmission, FEC, multi-path streaming | TCP handles loss; UDP applications add FEC or multi-path |
SD-WAN: Dynamic Reliability Done Right?
SD-WAN solutions actively measure path quality (including packet loss/reliability) and steer traffic accordingly. How do they avoid the oscillation problems?
1. Per-Flow Granularity: SD-WAN makes decisions per application flow, not for all traffic. Shifting one voice call doesn't cause massive load shifts.
2. Application Awareness: Different applications have different reliability requirements. Real-time apps avoid lossy paths while bulk transfers tolerate some loss.
3. Edge Intelligence: Decisions are made at the network edge with local information, not propagated via routing protocols across the network.
4. Explicit Path Control: SD-WAN often uses tunnels/overlays, so path changes don't affect the underlying IGP routing; the layers stay isolated.
5. Sophisticated Algorithms: Damping, hysteresis, and weighted moving averages prevent oscillation from transient quality changes.
```
SD-WAN Path Quality Assessment:
════════════════════════════════════════════════════════
for each path in available_paths:
    # Continuous measurement with synthetic probes
    send probe_packet to remote_site
    wait for response or timeout

    # Rolling loss rate with damping (weighted moving average)
    path.loss_rate = (old_loss_rate × 0.8) + (new_measurement × 0.2)

    # Hysteresis prevents oscillation
    if path.loss_rate > threshold_bad + hysteresis:
        mark_path_as_degraded(path)
    elif path.loss_rate < threshold_good - hysteresis:
        mark_path_as_healthy(path)
    # else: maintain current status (hysteresis zone)

# Per-application path selection
for each new_flow:
    app_requirements = classify_application(flow)
    suitable_paths = filter_by_requirements(paths, app_requirements)
    selected_path = best_of(suitable_paths)  # may use latency, loss, jitter
```
Modern networks often layer reliability mechanisms: stable IGP routing for base connectivity, fast reroute for rapid failure recovery, SD-WAN or performance routing for application-level optimization, and transport-layer retransmission as final backup. No single layer tries to solve all reliability problems.
Rather than relying on dynamic reliability metrics, network engineers use architectural approaches to ensure reliable packet delivery. These design principles provide reliability through redundancy and proper engineering rather than metric manipulation.
Four diverse paths ensure reliability through redundancy. Equal-cost primary paths provide ECMP load sharing. Higher-cost backup paths activate automatically on failure. No dynamic reliability metrics needed—the architecture provides reliability.
Five-nines availability (99.999%) allows only 5.26 minutes of downtime per year. This cannot be achieved through clever routing alone—it requires redundant hardware, diverse paths, automated failover, and rigorous operational procedures. Reliability metrics are just one small piece of a comprehensive reliability strategy.
We've thoroughly explored reliability as a routing metric, from its theoretical foundation through practical implementation challenges and modern alternatives. The essential points:
- Reliability is the probability of error-free packet delivery, distinct from availability (uptime percentage).
- Path reliability multiplies across hops, so small per-link losses compound significantly over long paths.
- EIGRP can incorporate reliability into its composite metric but disables it by default (K4=0, K5=0) to avoid routing instability.
- The ARPANET's experience with dynamic metrics showed how reactive routing produces oscillation rather than optimization.
- Modern networks achieve reliability architecturally: redundant paths, fast reroute, SD-WAN path steering, and transport-layer recovery.
Looking Ahead:
We've now covered the fundamental individual metrics: hop count, bandwidth, delay, and reliability. In the next and final page of this module, we'll explore composite metrics—how these individual components are combined into unified metrics that balance multiple concerns, as implemented in protocols like EIGRP. We'll also discuss metric tuning and the art of traffic engineering through metric manipulation.
Composite metrics represent the practical application of everything we've learned, enabling network engineers to make nuanced routing decisions that account for multiple path characteristics simultaneously.
You now have a comprehensive understanding of reliability as a routing metric: its definition and measurement, the historical attempts to incorporate it dynamically, why those attempts largely failed, and how modern networks achieve reliability through architectural design rather than metric-based routing. This perspective is essential for building robust production networks.