If datacenter architecture is the skeleton of modern cloud infrastructure, then network topology is the nervous system—the intricate pattern of connections that determines how quickly and reliably data flows between any two points in the facility.
For decades, datacenter networks followed the same three-tier hierarchical model used in enterprise campus networks: access, distribution, and core layers arranged in a tree-like structure. This model worked well when most traffic flowed north-south (in and out of the datacenter), but it buckles under the weight of modern east-west traffic (server-to-server communication within the datacenter).
Enter the leaf-spine topology—also known as a Clos network after the Bell Labs engineer Charles Clos who formalized its mathematics in 1953. This deceptively simple design has revolutionized datacenter networking, enabling the massive scale, consistent performance, and fault tolerance that cloud services demand.
This page explores leaf-spine topology from first principles through advanced implementation considerations, establishing the foundation for understanding how modern datacenters achieve their remarkable capabilities.
By the end of this page, you will understand why leaf-spine topology emerged as the datacenter standard, the mathematical foundations that ensure non-blocking performance, how Equal-Cost Multi-Path (ECMP) routing distributes traffic, and the specific engineering considerations for designing leaf-spine networks at various scales.
To understand why leaf-spine topology dominates modern datacenters, we must first understand what it replaced and why that replacement was necessary.
The traditional datacenter network followed a three-tier hierarchical architecture:
Access Layer (Tier 1): Top-of-Rack (ToR) switches directly connecting servers. Each rack has one or two switches aggregating all server connections.
Aggregation/Distribution Layer (Tier 2): Larger switches that aggregate multiple access switches. Typically deployed at the end of each row or for a group of racks.
Core Layer (Tier 3): High-speed backbone switches connecting aggregation switches and providing connectivity to WAN/Internet.
This model has intuitive appeal: it mirrors how traffic was assumed to flow (from servers through aggregation to the core and out), and it maps to the physical layout of racks in rows in data halls.
Two fundamental problems plague the three-tier model:
Problem 1: Spanning Tree Protocol (STP) Limitations
The three-tier topology typically uses Spanning Tree Protocol (STP) to prevent Layer 2 loops. STP works by blocking redundant paths, leaving only a single active path between any two points. This means redundant links sit idle rather than carrying traffic, aggregate bandwidth is capped at that single active path, and any topology change forces a spanning-tree reconvergence that can disrupt traffic for seconds.
Problem 2: Aggregation Layer Bottleneck
The aggregation layer creates inherent oversubscription. If 20 access switches, each with 40 Gbps uplink capacity, connect to 2 aggregation switches with 200 Gbps capacity each, the oversubscription ratio is:
(20 × 40 Gbps) / (2 × 200 Gbps) = 800/400 = 2:1
This means if all servers try to communicate simultaneously, only half the traffic can flow. In practice, ratios of 4:1 or higher were common, creating severe congestion during peak loads.
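A quick sketch of the same arithmetic, using the figures from the example above:

```python
# Oversubscription at the aggregation layer: downstream demand vs. upstream capacity
# (figures from the example above).
access_switches = 20
uplink_per_access_gbps = 40
aggregation_switches = 2
capacity_per_agg_gbps = 200

demand_gbps = access_switches * uplink_per_access_gbps        # 800 Gbps
capacity_gbps = aggregation_switches * capacity_per_agg_gbps  # 400 Gbps
print(f"oversubscription = {demand_gbps / capacity_gbps:.0f}:1")  # 2:1
```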
The three-tier model was designed when traffic primarily flowed north-south (clients to servers). But modern distributed applications—MapReduce, microservices, distributed databases—generate massive east-west traffic (server-to-server). Studies show 70-80% of datacenter traffic is now east-west. The aggregation bottleneck that was tolerable for north-south traffic becomes catastrophic for east-west patterns.
The leaf-spine topology (also called a Clos network or folded Clos) addresses the limitations of three-tier architectures through a fundamentally different design philosophy.
Two Layers Only: The network consists of only two switch layers—leaves and spines. No aggregation layer exists.
Full Mesh Connectivity: Every leaf switch connects to every spine switch. There are no direct leaf-to-leaf or spine-to-spine connections.
Non-Blocking Design: With proper provisioning, the network provides full bandwidth between any two servers—no oversubscription.
Equal Path Length: Any two servers are exactly two hops apart (leaf → spine → leaf), ensuring consistent latency.
Leaf Switches: Connect directly to servers (and storage/appliances). Each rack typically has one or more leaf switches. Leaves also connect to every spine.
Spine Switches: Form the backbone of the fabric. They connect only to leaves—never to servers or other spines. Their sole purpose is to interconnect all leaves.
The full mesh connectivity might seem excessive—why does every leaf need to connect to every spine? The answer lies in the mathematical properties this creates:
Path Diversity: Between any two leaves, there are as many paths as there are spines. With 8 spines, there are 8 independent paths between any server pair.
Load Distribution: Traffic can be spread across all paths using ECMP (Equal-Cost Multi-Path) routing, utilizing all available bandwidth.
Fault Tolerance: Losing a spine reduces capacity by 1/N (where N is spine count) but doesn't isolate any servers. Losing a spine in an 8-spine network reduces capacity by 12.5%—a graceful degradation.
Uniform Behavior: Every leaf-to-leaf path traverses exactly one spine, so latency is consistent regardless of which servers are communicating.
Charles Clos discovered this topology in 1953 while designing telephone switching networks at Bell Labs. He proved mathematically that a properly designed multi-stage network could be 'strictly non-blocking'—any input could connect to any output without rearranging existing connections. Sixty years later, this same mathematics powers the world's largest datacenters.
Understanding the mathematical foundations of leaf-spine networks enables correct sizing and performance prediction.
For a leaf-spine network, let L be the number of leaf switches, S the number of spine switches, p the number of server-facing ports per leaf, and u the number of uplinks per leaf (one to each spine, so u = S).
A leaf-spine network is non-blocking when any server can send to any other server at full line rate simultaneously. The condition for non-blocking:
u × (uplink speed) ≥ p × (server port speed), i.e., each leaf's total uplink bandwidth must equal or exceed its total server-facing bandwidth.
If each leaf has 48 server-facing ports at 25 Gbps and 8 uplinks at 100 Gbps, the downlink bandwidth is 48 × 25 = 1,200 Gbps against only 8 × 100 = 800 Gbps of uplink bandwidth, a 1.5:1 oversubscription.
To achieve non-blocking: uplink capacity must equal or exceed 1,200 Gbps (e.g., 12 × 100 Gbps uplinks).
Total fabric bisection bandwidth (the minimum bandwidth available for any traffic split):
Bisection BW = S × (per-uplink speed) × L / 2
For a network with 8 spines, 32 leaves, and 100 Gbps uplinks: Bisection BW = 8 × 100 Gbps × 32 / 2 = 12,800 Gbps = 12.8 Tbps.
The maximum number of servers supportable in a single leaf-spine network:
Max servers = L × p, where L is capped by the spine port count (each spine needs one port per leaf) and, in a non-blocking design, p is at most half of each leaf's ports (the other half are uplinks).
With 64-port switches: 32 server ports and 32 uplinks per leaf, at most 64 leaves per fabric, so a maximum of 64 × 32 = 2,048 servers.
For larger deployments, multi-stage Clos networks or super-spines extend the architecture.
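The sizing formulas above can be collected into a small illustrative helper. It assumes one uplink from every leaf to every spine; the example values are the medium fabric from the table that follows.

```python
def leaf_spine_sizing(leaves, spines, server_ports_per_leaf,
                      server_speed_gbps, uplink_speed_gbps):
    """Sketch of the sizing formulas in this section.

    Assumes every leaf has one uplink to every spine, so uplinks per
    leaf equals the spine count.
    """
    downlink_gbps = server_ports_per_leaf * server_speed_gbps  # per leaf
    uplink_gbps = spines * uplink_speed_gbps                   # per leaf
    return {
        "non_blocking": uplink_gbps >= downlink_gbps,
        "oversubscription": round(downlink_gbps / uplink_gbps, 2),
        "bisection_tbps": spines * uplink_speed_gbps * leaves / 2 / 1000,
        "total_servers": leaves * server_ports_per_leaf,
    }

# Medium fabric: 8 spines, 32 leaves, 48 x 25 Gbps server ports, 100 Gbps uplinks
# -> 1.5:1 oversubscription, 12.8 Tbps bisection bandwidth, 1,536 servers.
print(leaf_spine_sizing(32, 8, 48, 25, 100))
```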
| Parameter | Small Fabric | Medium Fabric | Large Fabric |
|---|---|---|---|
| Spine switches | 4 | 8 | 16 |
| Leaf switches | 8 | 32 | 64 |
| Server ports per leaf | 48 | 48 | 48 |
| Uplink speed | 100 Gbps | 100 Gbps | 400 Gbps |
| Total servers | 384 | 1,536 | 3,072 |
| Bisection bandwidth | 1.6 Tbps | 12.8 Tbps | 204.8 Tbps |
| Oversubscription | 3:1 | 1.5:1 | Non-blocking |
True non-blocking fabrics are expensive. In practice, most datacenters accept 2:1 to 4:1 oversubscription, which works well because not all servers transmit at full capacity simultaneously. The key is matching oversubscription to actual traffic patterns—compute-intensive workloads tolerate more oversubscription than storage-intensive ones.
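To make that trade-off concrete, a leaf's uplink count can be derived from a target oversubscription ratio. The helper below is a sketch using the port speeds from the earlier example.

```python
import math

def uplinks_needed(server_ports, server_speed_gbps,
                   uplink_speed_gbps, target_ratio):
    """Uplinks per leaf required to hit a target oversubscription ratio
    (a ratio of 1.0 means non-blocking)."""
    downlink_gbps = server_ports * server_speed_gbps
    return math.ceil(downlink_gbps / target_ratio / uplink_speed_gbps)

# 48 x 25 Gbps server ports with 100 Gbps uplinks:
print(uplinks_needed(48, 25, 100, 1.0))  # 12 uplinks -> non-blocking
print(uplinks_needed(48, 25, 100, 3.0))  # 4 uplinks  -> 3:1 oversubscription
```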
The leaf-spine topology provides multiple equal-cost paths between any two servers, but this only improves performance if traffic is actually distributed across all paths. ECMP (Equal-Cost Multi-Path) is the mechanism that makes this possible.
When a router has multiple equal-cost routes to a destination, ECMP allows traffic to be distributed across all of them instead of choosing a single best path. The distribution is performed by hashing:
Standard 5-tuple hashing:
Hash Input = {Source IP, Destination IP, Source Port, Destination Port, Protocol}
Path = Hash(Input) mod Number_of_Paths
For example, with 8 spines (8 equal paths), a flow's hash modulo 8 selects one of the eight spines. Distinct flows (different addresses or ports) tend to land on different spines, spreading load across all uplinks, while all packets of a single flow follow the same path and therefore arrive in order.
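A minimal sketch of this flow-to-path mapping (the addresses and ports are arbitrary, and real switches use hardware hash functions rather than MD5):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Toy ECMP: hash the 5-tuple and take it modulo the number of paths."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()  # stand-in for a hardware hash
    return int.from_bytes(digest[:4], "big") % num_paths

# Four flows to the same destination differ only by source port, yet they
# typically hash to different spines (collisions are possible), spreading load.
for sport in (49152, 49153, 49154, 49155):
    spine = ecmp_path("10.1.1.10", "10.2.2.20", sport, 443, "tcp", 8)
    print(f"src port {sport} -> spine {spine}")
```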
A critical ECMP challenge is hash polarization—when multiple switch layers use the same hash function, they may make identical path selections, causing some links to be overused while others remain idle.
Example of polarization: in a multi-stage fabric, if two successive layers apply the identical hash function to the identical packet fields, the second layer receives only flows that already produced a particular hash result at the first layer, so its own selection collapses onto a subset of links while the others sit idle.
Mitigation strategies: configure a different hash seed or a different hash algorithm at each layer, or include additional switch-specific fields in the hash input so that the layers make independent choices.
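The sketch below illustrates the effect with a toy hash: two layers using the identical function make identical choices, while a per-switch seed de-correlates them (the seed value is arbitrary).

```python
import hashlib

def path_choice(flow_key, num_paths, seed=""):
    """Hash-based path selection; an optional per-switch seed changes the mapping."""
    digest = hashlib.md5((seed + flow_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

flows = [f"10.1.1.{i}|10.2.2.{i}|{40000 + i}|443|tcp" for i in range(8)]

# Same hash at two layers -> identical decisions (polarization).
layer1 = [path_choice(f, 4) for f in flows]
layer2 = [path_choice(f, 4) for f in flows]
print(layer1 == layer2)          # True: the second layer adds no spreading

# A per-switch seed de-correlates the layers.
layer2_seeded = [path_choice(f, 4, seed="switch-7f") for f in flows]
print(layer1)
print(layer2_seeded)             # generally a different mapping
```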
ECMP distributes flows, not bytes. A single large 'elephant flow' (like a storage backup or large data transfer) hashing to one path can saturate that link while others remain idle. Solutions include flow-aware load balancing, per-packet spraying (which requires reordering tolerance), or application-level sharding of large transfers across multiple connections.
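A toy illustration of the flow-versus-byte distinction: flow counts balance across paths, but a single elephant flow pins all of its bytes to whichever path it hashes to (the flow names and sizes here are made up):

```python
import hashlib
from collections import Counter

def ecmp_path(flow_key, num_paths):
    digest = hashlib.md5(flow_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# 99 small "mice" flows of 10 MB each plus one 50,000 MB "elephant" flow.
flows = {f"mouse-{i}": 10 for i in range(99)}
flows["elephant-backup"] = 50_000

megabytes_per_path = Counter()
for name, size_mb in flows.items():
    megabytes_per_path[ecmp_path(name, 8)] += size_mb

# Flow counts are roughly even, but one path carries all of the elephant's bytes.
print(megabytes_per_path.most_common())
```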
Leaf-spine networks can be implemented as Layer 2 (bridged) or Layer 3 (routed) fabrics, with significant implications for design and operation.
In a Layer 2 fabric, all switches operate as bridges, and the entire network is one large broadcast domain.
Enabling technologies: multi-chassis link aggregation (MLAG/MC-LAG), TRILL, and Shortest Path Bridging (SPB) are typically used so that a Layer 2 fabric can use multiple paths without STP blocking links.
Advantages: simple, flat addressing and seamless VM mobility (a workload can move anywhere in the fabric without changing its IP address), with no routing protocol to run inside the fabric.
Disadvantages: one large broadcast domain, MAC table pressure on every switch, and a wide failure domain in which a loop or broadcast storm can affect the entire fabric, making it difficult to scale beyond a few thousand hosts.
In a Layer 3 fabric, each link is a routed interface, and routing protocols (typically BGP or OSPF) manage path selection.
Key characteristics: every leaf-spine link is a point-to-point routed interface, each switch participates in the routing protocol, and ECMP across the resulting equal-cost routes spreads traffic over all spines.
Advantages: no spanning tree and no fabric-wide broadcast domains, small and well-contained failure domains, proven protocol scalability, and troubleshooting with standard routing tools.
Disadvantages: more up-front configuration (addressing and routing on every link), and workloads that need Layer 2 adjacency or IP mobility require an overlay such as VXLAN/EVPN on top of the routed underlay.
All major hyperscale operators (Google, Facebook/Meta, Microsoft, Amazon) use Layer 3 routed fabrics, typically with BGP as the routing protocol. This approach provides the scalability, fault isolation, and operational simplicity needed at massive scale. Layer 2 capabilities are provided through overlay networks (VXLAN/EVPN) when needed.
A single leaf-spine network has inherent size limits determined by switch port counts. What happens when you need more capacity than one fabric can provide? The answer is multi-stage Clos networks using super-spine switches.
With 64-port switches allocating 32 ports for server connections and 32 for uplinks, a single fabric supports at most 32 spines, 64 leaves, and 64 × 32 = 2,048 servers.
For datacenters with tens of thousands of servers, this isn't enough.
A 3-stage Clos (or 5-stage for even larger deployments) adds a super-spine layer above the spines: the spine switches of each pod connect upward to a set of super-spine switches.
This creates a hierarchy of fabrics, where each 'pod' of leaves and spines is interconnected with every other pod through the super-spine layer.
For a 3-stage Clos built from k-port switches, with each layer splitting its ports between downstream and upstream links, the maximum scale works out to roughly k³ / 8 servers:
With 64-port switches: 64³ / 8 = 32,768 servers
With 128-port switches: 128³ / 8 = 262,144 servers
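A one-line check of these figures, using the k³ / 8 sizing from above:

```python
def max_servers_3stage(k_ports):
    """Maximum servers for the 3-stage Clos sizing used above: k**3 / 8."""
    return k_ports ** 3 // 8

for k in (64, 128):
    print(f"{k}-port switches: {max_servers_3stage(k):,} servers")
# 64-port switches: 32,768 servers
# 128-port switches: 262,144 servers
```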
Hyperscale networks extend to 5-stage Clos for even larger deployments, potentially supporting millions of servers.
Multi-stage networks exhibit traffic locality benefits: traffic between servers in the same pod stays on that pod's leaves and spines (two hops), while only inter-pod traffic must climb to the super-spine layer (four hops).
Application-aware placement can minimize inter-pod traffic, improving performance and reducing super-spine load.
Facebook (Meta) pioneered the 4-post datacenter design where each building contains 4 'fabric clusters' interconnected at the super-spine level. Their published designs show 3-stage Clos fabrics supporting 100,000+ servers per fabric cluster, with multiple clusters per building providing further scale.
Layer 3 leaf-spine networks require routing protocols to distribute reachability information and enable ECMP. The choice of protocol significantly impacts scalability, convergence speed, and operational complexity.
Traditional interior gateway protocols used in enterprise networks.
Characteristics: OSPF and IS-IS are link-state protocols; every router floods link-state advertisements and maintains a complete topology database (LSDB) from which it computes shortest paths, which gives fast, typically sub-second convergence.
Limitations for large fabrics: the LSDB and flooding overhead grow with fabric size, every topology change triggers recomputation across the area, and memory and CPU demands become problematic beyond a few thousand nodes in a single flat area.
Mitigation: Use areas/levels to partition the network, but this adds operational complexity.
BGP, traditionally used for inter-domain routing, has emerged as the preferred protocol for datacenter fabrics.
Why BGP for datacenters: it scales to very large route counts, stores only best paths rather than a full topology database, sends updates only to the peers a change affects, offers fine-grained per-prefix policy control, and enjoys excellent multi-vendor support.
eBGP (External BGP) design: every switch runs eBGP over the point-to-point fabric links using private ASNs, with each leaf and each spine assigned its own ASN (as in the pattern below), and BGP multipath provides ECMP across all spines.
Configuration pattern:
Leaf ASN: 65001, 65002, 65003, ...
Spine ASN: 64001, 64002, 64003, 64004
Leaf 1 peers with:
- Spine 1 (ASN 64001) on link 1
- Spine 2 (ASN 64002) on link 2
- Spine 3 (ASN 64003) on link 3
- Spine 4 (ASN 64004) on link 4
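A small sketch that generates this numbering pattern and the resulting eBGP sessions (the ASN bases simply mirror the example above):

```python
def fabric_peerings(num_leaves, num_spines,
                    leaf_asn_base=65001, spine_asn_base=64001):
    """Assign private ASNs per the pattern above and list every
    leaf-to-spine eBGP session (every leaf peers with every spine)."""
    leaf_asns = {f"leaf{i + 1}": leaf_asn_base + i for i in range(num_leaves)}
    spine_asns = {f"spine{j + 1}": spine_asn_base + j for j in range(num_spines)}
    return [(leaf, l_asn, spine, s_asn)
            for leaf, l_asn in leaf_asns.items()
            for spine, s_asn in spine_asns.items()]

for leaf, l_asn, spine, s_asn in fabric_peerings(num_leaves=3, num_spines=4):
    print(f"{leaf} (AS{l_asn}) <-- eBGP --> {spine} (AS{s_asn})")
```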
| Attribute | OSPF/IS-IS | eBGP |
|---|---|---|
| Convergence speed | Sub-second | 1-3 seconds (tunable) |
| Scalability | 1000s of nodes with areas | 100,000s of nodes |
| Configuration complexity | Lower initial complexity | Higher initial, simpler scaling |
| Memory usage | Higher (full LSDB) | Lower (only best paths) |
| Update efficiency | Flooding (all changes) | Targeted (only affected peers) |
| Multi-vendor support | Good | Excellent |
| Hyperscale adoption | Limited | Universal |
IETF RFC 7938 'Use of BGP for Routing in Large-Scale Data Centers' formalizes the BGP design patterns used by hyperscale operators. It recommends eBGP with private ASNs, aggressive timers for fast convergence, and ECMP for load balancing. This RFC has become the reference architecture for modern datacenter routing.
We've explored datacenter network topology from the historical evolution through modern leaf-spine design to advanced multi-stage architectures. This knowledge forms the foundation for understanding how datacenter networks achieve their remarkable scale and performance.
What's next:
With topology established, we'll examine scalability—how datacenter networks grow to meet increasing demand. You'll understand horizontal vs. vertical scaling, capacity planning methodologies, and the practical constraints that limit growth at each architectural layer.
You now understand the leaf-spine topology that powers modern datacenters—from its mathematical foundations through ECMP load balancing to multi-stage designs for massive scale. This topology knowledge is essential for designing, operating, or troubleshooting datacenter networks at any scale.