On the previous page, we explored how virtual switches create network connectivity within a single physical host. But modern cloud environments span thousands of hosts across multiple datacenters, sometimes across continents. How do we extend virtual networks across this vast physical infrastructure while maintaining the illusion of a single, flat Layer 2 network?
The answer lies in overlay networks—a transformative approach that completely decouples logical network topology from physical network topology. An overlay network is a virtual network built on top of an existing physical network (the underlay), using encapsulation to tunnel Layer 2 traffic across Layer 3 boundaries.
Overlay networks are not merely a technical optimization; they represent a fundamental shift in how we think about network architecture. They enable massive multi-tenancy far beyond the VLAN limit, unrestricted VM mobility across Layer 3 boundaries, and API-driven network provisioning that is independent of the physical topology.
This page will take you from first principles through the complete architecture of overlay networking, preparing you to understand specific implementations like VXLAN in subsequent pages.
By the end of this page, you will understand why overlay networks exist, how they differ from traditional VLANs, the fundamental architecture (underlay vs. overlay, tunnel endpoints, encapsulation), control plane options, the critical problem of mapping (VM location discovery), and how overlay networks enable the elastic, multi-tenant cloud networking we take for granted today.
Before overlay networks, network segmentation relied primarily on VLANs (Virtual Local Area Networks), standardized in IEEE 802.1Q. VLANs work well for traditional enterprise networks, but they suffer from fundamental limitations that make them unsuitable for cloud-scale environments.
The most immediate limitation is the VLAN ID space. The 802.1Q standard allocates 12 bits for the VLAN ID, yielding only 4,094 usable VLANs (IDs 0 and 4095 are reserved). This seems adequate until you consider:
4,094 VLANs simply cannot support a cloud platform serving 10,000 tenants.
VLANs are Layer 2 constructs—they require a continuous Layer 2 domain (a broadcast domain) across all hosts in the VLAN. This creates severe problems:
Spanning Tree Inefficiency: To prevent loops in Layer 2 networks, the Spanning Tree Protocol (STP) blocks redundant paths. This means only a fraction of the available bandwidth is used; the rest sits idle, reserved for failover.
Failure Domain Size: A spanning tree domain is a single failure domain. A misconfigured switch, a broadcast storm, or a rogue device can bring down the entire VLAN, potentially affecting thousands of hosts.
Physical Topology Coupling: VMs can only move within the Layer 2 domain. This constrains VM placement—you cannot freely migrate VMs across datacenters unless you stretch Layer 2 across the WAN (a dangerous practice).
| Limitation | VLAN Impact | Overlay Solution |
|---|---|---|
| Network ID Space | 4,094 VLANs maximum | 16 million+ overlay networks (24-bit VNI) |
| Layer 2 Spanning | Requires Layer 2 end-to-end | Tunnels over Layer 3 IP networks |
| Spanning Tree | Blocks redundant paths, limits bandwidth | Uses IP ECMP, full bandwidth utilization |
| Failure Domain | Entire VLAN is single failure domain | Failures isolated to underlay segments |
| VM Mobility | Constrained to L2 domain | Unrestricted across L3 networks |
| Provisioning | Requires switch configuration (hours/days) | API-driven (seconds) |
| Multi-DC | Requires risky L2 DCI stretching | Native IP routing between datacenters |
Some organizations attempted to solve VM mobility by stretching VLANs across datacenter interconnects (DCI). This creates massive failure domains spanning datacenters—a broadcast storm or misconfiguration in one DC can cascade to another. Overlay networks solve this problem elegantly by using the inherently robust Layer 3 IP network as the transport.
An overlay network is fundamentally a network-over-network architecture where a logical network is constructed atop a physical network through encapsulation and tunneling. Let's precisely define the key concepts:
The underlay is the physical network infrastructure—routers, switches, cables, IP addressing—that provides basic IP connectivity between hosts. The underlay doesn't need to know anything about overlay networks or virtual machines; it simply routes IP packets from source to destination.
Key underlay requirements:
The overlay is the logical network created by tunneling Layer 2 frames inside Layer 3 packets. From the perspective of virtual machines, the overlay appears as a normal Ethernet network—they have MAC addresses, send Ethernet frames, and are unaware that their traffic is being tunneled.
Key overlay properties:
Each overlay network is identified by a Virtual Network Identifier (VNI)—a numerical tag that distinguishes traffic belonging to different overlay networks. VNIs are analogous to VLAN IDs but typically use 24 bits, supporting over 16 million isolated networks.
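A quick calculation makes the contrast concrete (whether any particular VNI values are reserved depends on the specific encapsulation protocol):

```python
# Compare the 12-bit VLAN ID space with a 24-bit VNI space.
vlan_id_bits = 12
vni_bits = 24

usable_vlans = 2 ** vlan_id_bits - 2   # IDs 0 and 4095 are reserved
vni_networks = 2 ** vni_bits           # no reserved values assumed here

print(f"Usable VLANs:     {usable_vlans:,}")   # 4,094
print(f"Overlay networks: {vni_networks:,}")   # 16,777,216
```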
Overlay networks exemplify the computer science principle: 'Any problem can be solved by adding a layer of indirection.' By inserting an overlay layer, we gain flexibility to place and move VMs anywhere, independent of physical network topology. The cost is encapsulation overhead—a tradeoff well worth making for cloud environments.
The critical component that bridges overlay and underlay networks is the Tunnel Endpoint—typically called a VTEP (Virtual Tunnel Endpoint) in overlay terminology. VTEPs perform the encapsulation and decapsulation operations that make overlay networks function.
Encapsulation (Egress): When a VM sends a frame destined for another VM on the same overlay network but on a different physical host, the local VTEP looks up which remote VTEP hosts the destination MAC (keyed by VNI and destination MAC), wraps the original Ethernet frame in new outer Ethernet, IP, and UDP headers plus an overlay header carrying the VNI, and transmits the result across the underlay to the remote VTEP's IP address.
Decapsulation (Ingress): When an encapsulated packet arrives at the destination VTEP, the VTEP strips the outer headers, reads the VNI to identify which overlay network the frame belongs to, and delivers the original Ethernet frame to the local VM whose MAC matches the inner destination, preserving tenant isolation.
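As a rough sketch of the decapsulation path (field and function names are illustrative, mirroring the encapsulation pseudocode shown later on this page):

```python
# Minimal sketch of VTEP decapsulation. Field names are illustrative,
# not a real API.

VXLAN_UDP_PORT = 4789

def decapsulate_packet(outer_packet, local_ports):
    """Strip outer headers and deliver the inner frame to the right local VM.

    local_ports maps (vni, vm_mac) -> a local virtual switch port object.
    """
    # Only handle traffic addressed to the overlay encapsulation port.
    if outer_packet["outer_udp"]["dst_port"] != VXLAN_UDP_PORT:
        return None  # not overlay traffic; hand it to the normal IP stack

    # The VNI identifies which overlay network the inner frame belongs to.
    vni = outer_packet["overlay_header"]["vni"]

    # The payload is the untouched original Ethernet frame.
    inner_frame = outer_packet["payload"]
    dst_mac = inner_frame["destination_mac"]

    # Deliver only within the same VNI -- this is what enforces tenant isolation.
    port = local_ports.get((vni, dst_mac))
    if port is not None:
        port.receive(inner_frame)
    return inner_frame
```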
Software VTEP (Virtual Switch): The most common deployment, in which the VTEP function is implemented within the virtual switch (e.g., Open vSwitch) running on each hypervisor host. Every host is a VTEP.
Advantages: Fine-grained encapsulation, full feature support, works on commodity hardware. Disadvantages: CPU overhead on hypervisor, must scale with host count.
Hardware VTEP (Physical Switch): Top-of-rack switches can implement VTEP functionality in hardware ASICs, encapsulating/decapsulating at the edge of the physical network.
Advantages: Line-rate encapsulation, offloads hypervisor CPU. Disadvantages: Requires compatible switch hardware, less flexible than software.
Gateway VTEP: Dedicated appliances (physical or virtual) that bridge between overlay networks and external networks (physical servers, internet, legacy infrastructure).
Common use cases: Connecting overlay networks to bare-metal servers, providing internet gateway services, bridging to external partners.
| Placement | Performance | Scalability | Cost | Use Case |
|---|---|---|---|---|
| Software VTEP | Good (10-40 Gbps) | Scales with hosts | Included in hypervisor | General cloud/virtualization |
| SmartNIC VTEP | Excellent (100+ Gbps) | Scales with hosts | SmartNIC cost ($500-2000) | High-performance clouds |
| ToR Switch VTEP | Line-rate | Limited by switch ports | Moderate (VTEP-capable switch) | Hardware offload deployments |
| Gateway VTEP | Varies | Centralized bottleneck | Dedicated appliance | External connectivity |
```
// VTEP Forwarding Table (VNI + Destination MAC → Remote VTEP IP)
// This table enables the VTEP to route overlay traffic

VTEP_FORWARDING_TABLE = {
    // VNI 5000: Production Network
    (VNI=5000, MAC="AA:AA:AA:AA:AA:01"): VTEP_IP="192.168.1.10",
    (VNI=5000, MAC="AA:AA:AA:AA:AA:02"): VTEP_IP="192.168.2.20",
    (VNI=5000, MAC="AA:AA:AA:AA:AA:03"): VTEP_IP="192.168.3.30",

    // VNI 6000: Development Network (completely isolated)
    (VNI=6000, MAC="BB:BB:BB:BB:BB:01"): VTEP_IP="192.168.1.10",
    (VNI=6000, MAC="BB:BB:BB:BB:BB:02"): VTEP_IP="192.168.4.40",

    // VNI 7000: Tenant A (isolated from all others)
    (VNI=7000, MAC="CC:CC:CC:CC:CC:01"): VTEP_IP="192.168.5.50",
}

function encapsulate_frame(original_frame, source_vni):
    dst_mac = original_frame.destination_mac

    // Lookup remote VTEP
    key = (VNI=source_vni, MAC=dst_mac)
    if key in VTEP_FORWARDING_TABLE:
        remote_vtep_ip = VTEP_FORWARDING_TABLE[key]
    else:
        // Unknown destination - flood to all VTEPs in VNI
        remote_vtep_ip = get_flood_vteps(source_vni)

    // Build encapsulated packet
    outer_packet = {
        outer_ethernet: {
            dst_mac: next_hop_router_mac,
            src_mac: local_nic_mac,
            ethertype: 0x0800                        // IPv4
        },
        outer_ip: {
            src_ip: local_vtep_ip,
            dst_ip: remote_vtep_ip,
            protocol: 17                             // UDP
        },
        outer_udp: {
            src_port: hash(original_frame) % 65535,  // Entropy for ECMP
            dst_port: 4789                           // VXLAN standard port
        },
        overlay_header: {
            vni: source_vni,
            flags: 0x08                              // VNI valid
        },
        payload: original_frame                      // Complete original Ethernet frame
    }

    return outer_packet
```

The core operation of overlay networking is encapsulation—wrapping the original Layer 2 frame inside a new Layer 3 packet for transport across the underlay. Understanding the exact structure of encapsulated packets is crucial for troubleshooting, MTU planning, and performance optimization.
A typical overlay packet has this structure (using VXLAN as the example):
```
+-------------------------+
| Outer Ethernet Header   |  14 bytes
| (Underlay L2)           |
+-------------------------+
| Outer IP Header         |  20 bytes (IPv4) or 40 bytes (IPv6)
| (Underlay L3)           |
+-------------------------+
| Outer UDP Header        |  8 bytes
| (Encapsulation Layer)   |
+-------------------------+
| VXLAN Header            |  8 bytes
| (VNI + Flags)           |
+-------------------------+
| Original Ethernet Frame |  Variable (14 + payload + optional VLAN tag)
| (Overlay L2)            |
+-------------------------+
| Original Payload        |  Variable
| (Overlay L3+)           |
+-------------------------+
```
The encapsulation adds 50 bytes of overhead for VXLAN over IPv4: 14 bytes (outer Ethernet) + 20 bytes (outer IPv4) + 8 bytes (outer UDP) + 8 bytes (VXLAN header).
For IPv6 underlay, add another 20 bytes (40 bytes for IPv6 header vs. 20 for IPv4).
This overhead has significant implications for Maximum Transmission Unit (MTU):
If the underlay network has a standard 1500-byte MTU, a full-sized 1500-byte packet from a VM becomes a 1550-byte IP packet after encapsulation (inner Ethernet header plus outer IP/UDP/VXLAN headers), which the underlay cannot carry without fragmentation or drops.
Solutions: raise the underlay MTU with jumbo frames (typically 9000 bytes) so encapsulated packets always fit, or lower the guest MTU (for example, to 1450 bytes) so encapsulated packets stay within 1500 bytes.
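The arithmetic behind both remedies, using the header sizes from the packet layout above:

```python
# Back-of-the-envelope MTU math for VXLAN over an IPv4 underlay. "MTU" here is
# the usual convention: the size limit on the IP packet a link can carry.

OUTER_ETH, OUTER_IPV4, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8

def outer_ip_packet_size(guest_mtu: int) -> int:
    inner_frame = guest_mtu + 14                     # inner Ethernet header
    return inner_frame + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR

print(outer_ip_packet_size(1500))                      # 1550 -> exceeds a 1500-byte underlay MTU
print(outer_ip_packet_size(1450))                      # 1500 -> fits exactly
print(OUTER_ETH + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR)  # 50 bytes of on-the-wire overhead
```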
| Protocol | Header Size | Total Overhead (IPv4) | Total Overhead (IPv6) | Standard Port |
|---|---|---|---|---|
| VXLAN | 8 bytes | 50 bytes | 70 bytes | UDP 4789 |
| Geneve | 8+ bytes (variable) | 50+ bytes | 70+ bytes | UDP 6081 |
| GRE | 4-12 bytes (with options) | 38-46 bytes | 58-66 bytes | IP Protocol 47 |
| NVGRE | 8 bytes | 42 bytes | 62 bytes | GRE + VSID |
| STT | 18 bytes | 76 bytes | 96 bytes | TCP-like (proprietary) |
MTU problems in overlay networks are notoriously difficult to diagnose. TCP often works fine (due to MSS negotiation), but UDP-based applications or ICMP may fail mysteriously. Always ensure your underlay MTU exceeds overlay MTU plus encapsulation overhead by a comfortable margin—9000 bytes for underlay is strongly recommended.
Notice that the outer UDP source port is typically set to a hash of the inner packet fields. This is not arbitrary—it serves a critical performance purpose.
Modern datacenter networks use Equal-Cost Multi-Path (ECMP) routing to distribute traffic across multiple paths. ECMP routers hash packet headers to select output paths. If all encapsulated traffic used the same source port, all traffic between two VTEPs would take the same path, defeating ECMP's load-balancing benefits.
By hashing the inner packet's 5-tuple (source IP, destination IP, source port, destination port, protocol) to generate the outer UDP source port, we ensure that different flows between the same VTEPs take different physical paths, achieving proper load distribution.
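A sketch of how a VTEP might derive that entropy (the exact hash function and port range vary by implementation; this version picks from the ephemeral range commonly used):

```python
import hashlib

def outer_udp_source_port(src_ip, dst_ip, src_port, dst_port, proto):
    """Derive an entropy-carrying outer UDP source port from the inner 5-tuple.

    Packets of the same inner flow always hash to the same port (preserving
    per-flow ordering), while different flows spread across ports, so underlay
    ECMP hashing spreads them across different physical paths.
    """
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    digest = hashlib.sha256(flow.encode()).digest()
    entropy = int.from_bytes(digest[:2], "big")
    # Map into the ephemeral port range 49152-65535.
    return 49152 + (entropy % (65536 - 49152))

# Example: two flows between the same pair of VMs get different outer ports,
# and therefore can take different underlay paths.
print(outer_udp_source_port("10.0.1.50", "10.0.1.60", 33100, 443, 6))
print(outer_udp_source_port("10.0.1.50", "10.0.1.60", 33101, 443, 6))
```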
Overlay encapsulation handles the data plane—how packets are formatted and forwarded. But encapsulation alone doesn't answer a critical question: How does a VTEP know which remote VTEP hosts the destination MAC address?
This is the job of the control plane—the mechanism by which VTEPs discover VM locations and populate their forwarding tables. Several control plane approaches exist, each with distinct tradeoffs.
The simplest approach: treat the overlay like a traditional Ethernet network.
How it works: frames with unknown destination MACs (and broadcasts such as ARP) are flooded to all VTEPs participating in the VNI; each VTEP learns remote MAC-to-VTEP mappings from the source addresses of traffic it decapsulates, just as a physical switch learns from incoming frames. Two delivery mechanisms are common:
Multicast-based flooding: Each VNI is mapped to an IP multicast group. Flooded traffic is sent to the multicast group, and underlay multicast routing delivers it to all participating VTEPs.
Ingress replication (head-end replication): If multicast isn't available, the source VTEP unicasts copies of flooded frames to each remote VTEP in the VNI.
Pros: Simple, works without any external controller, familiar Ethernet semantics. Cons: Flooding doesn't scale (n² traffic for n VTEPs), requires underlay multicast or excess bandwidth.
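The learning half of flood-and-learn can be sketched as follows: on decapsulation, a VTEP records that the inner source MAC sits behind the VTEP that sent the outer packet (names and structures are illustrative):

```python
# Illustrative flood-and-learn behaviour at a VTEP: forwarding entries are
# populated as a side effect of receiving encapsulated traffic.

vtep_forwarding_table = {}   # (vni, mac) -> remote VTEP IP

def learn_from_decapsulated(vni, inner_src_mac, outer_src_vtep_ip):
    """Data-plane learning: remember which remote VTEP the source MAC sits behind."""
    vtep_forwarding_table[(vni, inner_src_mac)] = outer_src_vtep_ip

def lookup_or_flood(vni, dst_mac, flood_group):
    """If the destination is unknown, flood (multicast group or unicast copies)."""
    remote = vtep_forwarding_table.get((vni, dst_mac))
    if remote is not None:
        return [remote]              # known: send one encapsulated copy
    return list(flood_group)         # unknown: replicate to every VTEP in the VNI

# Example: the first packet from a new VM teaches the receiving VTEP its location.
learn_from_decapsulated(5000, "AA:AA:AA:AA:AA:02", "192.168.2.20")
print(lookup_or_flood(5000, "AA:AA:AA:AA:AA:02", {"192.168.2.20", "192.168.3.30"}))
```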
An SDN controller maintains a global database of VM-to-VTEP mappings and distributes this information to all VTEPs.
How it works: each VTEP (or the orchestrator on its behalf) reports its locally attached VMs (VNI, MAC, IP) to the controller, which distributes the resulting VM-to-VTEP mappings to every VTEP's forwarding table.
Pros: Eliminates flooding, precise forwarding from the first packet, controller can enforce policy. Cons: Controller becomes potential bottleneck/single point of failure, requires tight orchestration integration.
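Conceptually, the controller reacts to orchestration events and pushes updated mappings to every VTEP. A minimal sketch, with hypothetical class and method names:

```python
# Minimal sketch of a centralized control plane: the controller owns the
# global VM-location database and pushes it to VTEPs. Names are hypothetical.

class OverlayController:
    def __init__(self, vteps):
        self.vteps = vteps                 # objects that accept table updates
        self.locations = {}                # (vni, vm_mac) -> vtep_ip

    def vm_started(self, vni, vm_mac, vm_ip, host_vtep_ip):
        """Called by the orchestrator when a VM is scheduled onto a host."""
        self.locations[(vni, vm_mac)] = host_vtep_ip
        for vtep in self.vteps:
            vtep.install_mapping(vni, vm_mac, vm_ip, host_vtep_ip)

    def vm_migrated(self, vni, vm_mac, vm_ip, new_vtep_ip):
        """Live migration: repoint the mapping everywhere; no flooding needed."""
        self.vm_started(vni, vm_mac, vm_ip, new_vtep_ip)
```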
VTEPs participate in a distributed routing protocol (BGP with EVPN extensions) to exchange MAC/IP reachability information.
How it works: each VTEP runs BGP and advertises the MAC (and optionally IP) addresses of its locally attached VMs as EVPN routes; every other VTEP receives these advertisements and installs the corresponding forwarding entries, so no data-plane flooding is required for learning.
Pros: Scales to very large deployments, industry-standard protocol, no flooding at all, multi-vendor interoperability. Cons: Requires BGP infrastructure, more complex initial setup.
| Approach | Scalability | Complexity | Flooding Required | Multi-Vendor |
|---|---|---|---|---|
| Flood-and-Learn + Multicast | Low-Medium | Low | Yes (multicast) | Good |
| Flood-and-Learn + Ingress Replication | Low | Low | Yes (unicast copies) | Good |
| Centralized Controller | Medium-High | Medium | No | Controller-dependent |
| BGP EVPN | Very High | High | No | Excellent (IETF standard) |
BGP EVPN has emerged as the industry-standard control plane for overlay networks, particularly in large datacenter and cloud provider environments. It provides not only MAC learning but also IP routing, multi-homing support, and sophisticated traffic engineering—all using the well-proven BGP protocol infrastructure.
One of the most significant scalability challenges in overlay networks is ARP (Address Resolution Protocol) traffic. In traditional Ethernet networks, ARP requests are broadcast to all hosts in the subnet. In overlay networks, this means flooding to all VTEPs—exactly the kind of broadcast traffic we want to minimize.
Consider a subnet with 10,000 VMs across 1,000 VTEPs: every ARP broadcast must be replicated to roughly a thousand VTEPs, so even a modest per-VM ARP rate multiplies into millions of packets per minute across the fabric.
This broadcast amplification can overwhelm VTEP CPUs and underlay bandwidth during mass events like datacenter power-on or disaster recovery failover.
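A rough back-of-the-envelope for the scenario above, where the per-VM ARP rate is an assumed, purely illustrative number:

```python
# ARP amplification estimate: 10,000 VMs on one subnet across 1,000 VTEPs,
# with ingress replication of broadcasts. The ARP rate per VM is an assumption
# chosen only to show how the numbers scale.

vms, vteps = 10_000, 1_000
arps_per_vm_per_min = 1                     # assumed rate (illustrative)

arp_broadcasts_per_min = vms * arps_per_vm_per_min
copies_per_broadcast = vteps - 1            # one copy to every other VTEP
fabric_packets_per_min = arp_broadcasts_per_min * copies_per_broadcast

print(f"{fabric_packets_per_min:,} replicated ARP packets per minute")  # ~10 million
```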
The solution is ARP suppression—having the VTEP answer ARP requests locally instead of flooding them.
How it works: the VTEP maintains a local IP-to-MAC table (populated by the control plane or by snooping traffic), intercepts ARP requests from its local VMs, and answers them directly whenever the binding is known, so the request is never flooded across the overlay.
BGP EVPN enables ARP suppression elegantly through Type-2 MAC/IP Advertisement routes. When a VTEP advertises a MAC address, it can include the associated IP address(es):
BGP EVPN Route Type-2 (MAC/IP Advertisement):
```
Route Distinguisher: 192.168.1.10:100
Ethernet Tag ID:     0
MAC Address:         00:50:56:01:02:03
IP Address:          10.0.1.50        ← ARP cache population
VNI:                 5000
Next Hop:            192.168.1.10 (VTEP IP)
```
All VTEPs receiving this route install both the MAC→VTEP mapping and the IP→MAC mapping, enabling them to suppress ARP for this IP entirely.
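A sketch of how a VTEP could use that information to answer ARP locally (illustrative data structures; production implementations do this in the kernel datapath or switch ASIC):

```python
# Illustrative ARP suppression at a VTEP: the (VNI, IP) -> MAC cache is filled
# from received EVPN Type-2 routes, and ARP requests are answered locally
# instead of being flooded across the overlay.

arp_cache = {}   # (vni, ip) -> mac

def install_type2_route(vni, mac, ip, remote_vtep_ip, forwarding_table):
    """Install both the MAC->VTEP entry and the IP->MAC binding."""
    forwarding_table[(vni, mac)] = remote_vtep_ip
    if ip is not None:
        arp_cache[(vni, ip)] = mac

def handle_arp_request(vni, target_ip):
    """Return a locally generated ARP reply if the binding is known."""
    mac = arp_cache.get((vni, target_ip))
    if mac is not None:
        return {"op": "reply", "target_ip": target_ip, "target_mac": mac}
    return None   # unknown: fall back to flooding (or drop, depending on policy)

# Example using the Type-2 route shown above.
table = {}
install_type2_route(5000, "00:50:56:01:02:03", "10.0.1.50", "192.168.1.10", table)
print(handle_arp_request(5000, "10.0.1.50"))
```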
Overlay networks can operate at Layer 2 (bridging) or Layer 3 (routing), with significant implications for architecture and use cases.
L2 overlays extend a single broadcast domain across multiple physical hosts. VMs in the same VNI are on the same logical Ethernet segment and can communicate using their MAC addresses.
Characteristics:
Use cases:
L3 overlays route packets between VMs based on IP addresses, with each host acting as an IP gateway.
Characteristics:
Use cases:
Modern overlay networks combine L2 and L3 functionality using distributed routing with anycast gateways: every VTEP hosts the same gateway IP and MAC address for each subnet, so a VM's default gateway is always local, intra-subnet traffic is bridged in the L2 overlay, and inter-subnet traffic is routed at the first-hop VTEP.
This architecture provides the best of both worlds: L2 semantics within a subnet for compatibility, plus efficient L3 routing between subnets.
| Characteristic | Layer 2 Overlay | Layer 3 Overlay |
|---|---|---|
| Forwarding Decision | MAC address lookup | IP address lookup |
| Broadcast Domain | Shared across VNI | None (routing only) |
| ARP Traffic | Flooded (or suppressed) | Not required (host routes) |
| VM Subnet | Common subnet per VNI | Per-VM /32 or any subnet |
| Live Migration | Within subnet (same L2) | Unrestricted (routing follows) |
| Scalability | Limited by broadcast | Highly scalable |
| Complexity | Lower (familiar Ethernet) | Higher (routing required) |
| Typical Use | Traditional VMs | Containers, microservices |
EVPN supports a sophisticated model called Symmetric Integrated Routing and Bridging (Symmetric IRB). In this model, both ingress and egress VTEPs perform L3 routing, and the VNI in the tunnel header identifies the VRF, not the L2 segment. This enables efficient distributed routing without requiring L2 stretch across the fabric.
Overlay networks create isolated virtual domains, but workloads inevitably need to communicate with the outside world—physical servers, external networks, the internet. Gateway services bridge the gap between overlay and underlay/external networks.
L2 Gateway (Bridging): Bridges an overlay L2 network to a physical VLAN, making overlay VMs appear on the same Ethernet segment as physical servers.
Use case: Integrating VMs with legacy physical database servers or storage arrays.
Implementation: A VTEP (hardware or software) with interfaces in both the overlay VNI and the physical VLAN, performing MAC bridging between them.
L3 Gateway (Routing): Routes between overlay networks and external IP networks (physical subnets, internet, partner WAN links).
Use case: Providing internet access, connecting to external services, multi-site connectivity.
Implementation: Distributed (on every VTEP) or centralized (dedicated gateway appliances). Distributed is preferred for performance and resilience.
NAT Gateway: Provides Network Address Translation between private overlay addresses and public addresses.
Use case: Internet access for VMs with private IPs, hiding internal topology from external networks.
Centralized Gateway:
Distributed Gateway:
Overlay networks represent a fundamental advancement in network architecture—the complete separation of logical network topology from physical infrastructure. This abstraction enables the flexibility, scalability, and multi-tenancy that define modern cloud computing.
You now understand the fundamental architecture of overlay networks—why they exist, how they work, and what problems they solve. This conceptual foundation is essential for understanding specific implementations.
Next Up: We'll dive deep into VXLAN (Virtual eXtensible LAN)—the dominant overlay protocol in datacenter and cloud environments. You'll learn the exact packet format, control plane options (multicast, unicast, EVPN), configuration examples, and operational considerations for production VXLAN deployments.