On the previous page, we explored how virtual switches create network connectivity within a single physical host. But modern cloud environments span thousands of hosts across multiple datacenters, sometimes across continents. How do we extend virtual networks across this vast physical infrastructure while maintaining the illusion of a single, flat Layer 2 network?
The answer lies in overlay networks—a transformative approach that completely decouples logical network topology from physical network topology. An overlay network is a virtual network built on top of an existing physical network (the underlay), using encapsulation to tunnel Layer 2 traffic across Layer 3 boundaries.
Overlay networks are not merely a technical optimization; they represent a fundamental shift in how we think about network architecture. They enable massive multi-tenancy far beyond the VLAN limit, unrestricted VM mobility across Layer 3 boundaries, and API-driven network provisioning that is independent of the physical topology.
This page will take you from first principles through the complete architecture of overlay networking, preparing you to understand specific implementations like VXLAN in subsequent pages.
By the end of this page, you will understand why overlay networks exist, how they differ from traditional VLANs, the fundamental architecture (underlay vs. overlay, tunnel endpoints, encapsulation), control plane options, the critical problem of mapping (VM location discovery), and how overlay networks enable the elastic, multi-tenant cloud networking we take for granted today.
Before overlay networks, network segmentation relied primarily on VLANs (Virtual Local Area Networks), standardized in IEEE 802.1Q. VLANs work well for traditional enterprise networks, but they suffer from fundamental limitations that make them unsuitable for cloud-scale environments.
The most immediate limitation is the VLAN ID space. The 802.1Q standard allocates 12 bits for the VLAN ID, yielding only 4,094 usable VLANs (IDs 0 and 4095 are reserved). This seems adequate until you consider:
4,094 VLANs simply cannot support a cloud platform serving 10,000 tenants.
VLANs are Layer 2 constructs—they require a continuous Layer 2 domain (a broadcast domain) across all hosts in the VLAN. This creates severe problems:
Spanning Tree Inefficiency: To prevent loops in Layer 2 networks, the Spanning Tree Protocol (STP) blocks redundant paths. This means only a fraction of the available bandwidth is used; the rest sits idle, reserved for failover.
Failure Domain Size: A spanning tree domain is a single failure domain. A misconfigured switch, a broadcast storm, or a rogue device can bring down the entire VLAN, potentially affecting thousands of hosts.
Physical Topology Coupling: VMs can only move within the Layer 2 domain. This constrains VM placement—you cannot freely migrate VMs across datacenters unless you stretch Layer 2 across the WAN (a dangerous practice).
| Limitation | VLAN Impact | Overlay Solution |
|---|---|---|
| Network ID Space | 4,094 VLANs maximum | 16 million+ overlay networks (24-bit VNI) |
| Layer 2 Spanning | Requires Layer 2 end-to-end | Tunnels over Layer 3 IP networks |
| Spanning Tree | Blocks redundant paths, limits bandwidth | Uses IP ECMP, full bandwidth utilization |
| Failure Domain | Entire VLAN is single failure domain | Failures isolated to underlay segments |
| VM Mobility | Constrained to L2 domain | Unrestricted across L3 networks |
| Provisioning | Requires switch configuration (hours/days) | API-driven (seconds) |
| Multi-DC | Requires risky L2 DCI stretching | Native IP routing between datacenters |
Some organizations attempted to solve VM mobility by stretching VLANs across datacenter interconnects (DCI). This creates massive failure domains spanning datacenters—a broadcast storm or misconfiguration in one DC can cascade to another. Overlay networks solve this problem elegantly by using the inherently robust Layer 3 IP network as the transport.
An overlay network is fundamentally a network-over-network architecture where a logical network is constructed atop a physical network through encapsulation and tunneling. Let's precisely define the key concepts:
The underlay is the physical network infrastructure—routers, switches, cables, IP addressing—that provides basic IP connectivity between hosts. The underlay doesn't need to know anything about overlay networks or virtual machines; it simply routes IP packets from source to destination.
Key underlay requirements:
The overlay is the logical network created by tunneling Layer 2 frames inside Layer 3 packets. From the perspective of virtual machines, the overlay appears as a normal Ethernet network—they have MAC addresses, send Ethernet frames, and are unaware that their traffic is being tunneled.
Key overlay properties:
Each overlay network is identified by a Virtual Network Identifier (VNI)—a numerical tag that distinguishes traffic belonging to different overlay networks. VNIs are analogous to VLAN IDs but typically use 24 bits, supporting over 16 million isolated networks.
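A quick calculation makes the contrast concrete (whether any particular VNI values are reserved depends on the specific encapsulation protocol):

```python
# Compare the 12-bit VLAN ID space with a 24-bit VNI space.
vlan_id_bits = 12
vni_bits = 24

usable_vlans = 2 ** vlan_id_bits - 2   # IDs 0 and 4095 are reserved
vni_networks = 2 ** vni_bits           # no reserved values assumed here

print(f"Usable VLANs:     {usable_vlans:,}")   # 4,094
print(f"Overlay networks: {vni_networks:,}")   # 16,777,216
```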
Overlay networks exemplify the computer science principle: 'Any problem can be solved by adding a layer of indirection.' By inserting an overlay layer, we gain flexibility to place and move VMs anywhere, independent of physical network topology. The cost is encapsulation overhead—a tradeoff well worth making for cloud environments.
The critical component that bridges overlay and underlay networks is the Tunnel Endpoint—typically called a VTEP (Virtual Tunnel Endpoint) in overlay terminology. VTEPs perform the encapsulation and decapsulation operations that make overlay networks function.
Encapsulation (Egress): When a VM sends a frame destined for another VM on the same overlay network but on a different physical host, the local VTEP looks up which remote VTEP hosts the destination MAC (keyed by VNI and destination MAC), wraps the original Ethernet frame in new outer Ethernet, IP, and UDP headers plus an overlay header carrying the VNI, and transmits the result across the underlay to the remote VTEP's IP address.
Decapsulation (Ingress): When an encapsulated packet arrives at the destination VTEP, the VTEP strips the outer headers, reads the VNI to identify which overlay network the frame belongs to, and delivers the original Ethernet frame to the local VM whose MAC matches the inner destination, preserving tenant isolation.
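As a rough sketch of the decapsulation path (field and function names are illustrative, mirroring the encapsulation pseudocode shown later on this page):

```python
# Minimal sketch of VTEP decapsulation. Field names are illustrative,
# not a real API.

VXLAN_UDP_PORT = 4789

def decapsulate_packet(outer_packet, local_ports):
    """Strip outer headers and deliver the inner frame to the right local VM.

    local_ports maps (vni, vm_mac) -> a local virtual switch port object.
    """
    # Only handle traffic addressed to the overlay encapsulation port.
    if outer_packet["outer_udp"]["dst_port"] != VXLAN_UDP_PORT:
        return None  # not overlay traffic; hand it to the normal IP stack

    # The VNI identifies which overlay network the inner frame belongs to.
    vni = outer_packet["overlay_header"]["vni"]

    # The payload is the untouched original Ethernet frame.
    inner_frame = outer_packet["payload"]
    dst_mac = inner_frame["destination_mac"]

    # Deliver only within the same VNI -- this is what enforces tenant isolation.
    port = local_ports.get((vni, dst_mac))
    if port is not None:
        port.receive(inner_frame)
    return inner_frame
```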
Software VTEP (Virtual Switch): The most common deployment, in which the VTEP function is implemented within the virtual switch (e.g., Open vSwitch) running on each hypervisor host. Every host is a VTEP.
Advantages: Fine-grained encapsulation, full feature support, works on commodity hardware. Disadvantages: CPU overhead on hypervisor, must scale with host count.
Hardware VTEP (Physical Switch): Top-of-rack switches can implement VTEP functionality in hardware ASICs, encapsulating/decapsulating at the edge of the physical network.
Advantages: Line-rate encapsulation, offloads hypervisor CPU. Disadvantages: Requires compatible switch hardware, less flexible than software.
Gateway VTEP: Dedicated appliances (physical or virtual) that bridge between overlay networks and external networks (physical servers, internet, legacy infrastructure).
Common use cases: Connecting overlay networks to bare-metal servers, providing internet gateway services, bridging to external partners.
| Placement | Performance | Scalability | Cost | Use Case |
|---|---|---|---|---|
| Software VTEP | Good (10-40 Gbps) | Scales with hosts | Included in hypervisor | General cloud/virtualization |
| SmartNIC VTEP | Excellent (100+ Gbps) | Scales with hosts | SmartNIC cost ($500-2000) | High-performance clouds |
| ToR Switch VTEP | Line-rate | Limited by switch ports | Moderate (VTEP-capable switch) | Hardware offload deployments |
| Gateway VTEP | Varies | Centralized bottleneck | Dedicated appliance | External connectivity |
```
// VTEP Forwarding Table (VNI + Destination MAC → Remote VTEP IP)
// This table enables the VTEP to route overlay traffic

VTEP_FORWARDING_TABLE = {
    // VNI 5000: Production Network
    (VNI=5000, MAC="AA:AA:AA:AA:AA:01"): VTEP_IP="192.168.1.10",
    (VNI=5000, MAC="AA:AA:AA:AA:AA:02"): VTEP_IP="192.168.2.20",
    (VNI=5000, MAC="AA:AA:AA:AA:AA:03"): VTEP_IP="192.168.3.30",

    // VNI 6000: Development Network (completely isolated)
    (VNI=6000, MAC="BB:BB:BB:BB:BB:01"): VTEP_IP="192.168.1.10",
    (VNI=6000, MAC="BB:BB:BB:BB:BB:02"): VTEP_IP="192.168.4.40",

    // VNI 7000: Tenant A (isolated from all others)
    (VNI=7000, MAC="CC:CC:CC:CC:CC:01"): VTEP_IP="192.168.5.50",
}

function encapsulate_frame(original_frame, source_vni):
    dst_mac = original_frame.destination_mac

    // Lookup remote VTEP
    key = (VNI=source_vni, MAC=dst_mac)
    if key in VTEP_FORWARDING_TABLE:
        remote_vtep_ip = VTEP_FORWARDING_TABLE[key]
    else:
        // Unknown destination - flood to all VTEPs in VNI
        remote_vtep_ip = get_flood_vteps(source_vni)

    // Build encapsulated packet
    outer_packet = {
        outer_ethernet: {
            dst_mac: next_hop_router_mac,
            src_mac: local_nic_mac,
            ethertype: 0x0800                        // IPv4
        },
        outer_ip: {
            src_ip: local_vtep_ip,
            dst_ip: remote_vtep_ip,
            protocol: 17                             // UDP
        },
        outer_udp: {
            src_port: hash(original_frame) % 65535,  // Entropy for ECMP
            dst_port: 4789                           // VXLAN standard port
        },
        overlay_header: {
            vni: source_vni,
            flags: 0x08                              // VNI valid
        },
        payload: original_frame                      // Complete original Ethernet frame
    }

    return outer_packet
```

The core operation of overlay networking is encapsulation—wrapping the original Layer 2 frame inside a new Layer 3 packet for transport across the underlay. Understanding the exact structure of encapsulated packets is crucial for troubleshooting, MTU planning, and performance optimization.
A typical overlay packet has this structure (using VXLAN as the example):
```
+-------------------------+
| Outer Ethernet Header   |  14 bytes
| (Underlay L2)           |
+-------------------------+
| Outer IP Header         |  20 bytes (IPv4) or 40 bytes (IPv6)
| (Underlay L3)           |
+-------------------------+
| Outer UDP Header        |  8 bytes
| (Encapsulation Layer)   |
+-------------------------+
| VXLAN Header            |  8 bytes
| (VNI + Flags)           |
+-------------------------+
| Original Ethernet Frame |  Variable (14 + payload + optional VLAN tag)
| (Overlay L2)            |
+-------------------------+
| Original Payload        |  Variable
| (Overlay L3+)           |
+-------------------------+
```
The encapsulation adds 50 bytes of overhead for VXLAN over IPv4: 14 bytes (outer Ethernet) + 20 bytes (outer IPv4) + 8 bytes (outer UDP) + 8 bytes (VXLAN header).
For IPv6 underlay, add another 20 bytes (40 bytes for IPv6 header vs. 20 for IPv4).
This overhead has significant implications for Maximum Transmission Unit (MTU):
If the underlay network has a standard 1500-byte MTU, a full-sized 1500-byte packet from a VM becomes a 1550-byte IP packet after encapsulation (inner Ethernet header plus outer IP/UDP/VXLAN headers), which the underlay cannot carry without fragmentation or drops.
Solutions: raise the underlay MTU with jumbo frames (typically 9000 bytes) so encapsulated packets always fit, or lower the guest MTU (for example, to 1450 bytes) so encapsulated packets stay within 1500 bytes.
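The arithmetic behind both remedies, using the header sizes from the packet layout above:

```python
# Back-of-the-envelope MTU math for VXLAN over an IPv4 underlay. "MTU" here is
# the usual convention: the size limit on the IP packet a link can carry.

OUTER_ETH, OUTER_IPV4, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8

def outer_ip_packet_size(guest_mtu: int) -> int:
    inner_frame = guest_mtu + 14                     # inner Ethernet header
    return inner_frame + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR

print(outer_ip_packet_size(1500))                      # 1550 -> exceeds a 1500-byte underlay MTU
print(outer_ip_packet_size(1450))                      # 1500 -> fits exactly
print(OUTER_ETH + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR)  # 50 bytes of on-the-wire overhead
```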
| Protocol | Header Size | Total Overhead (IPv4) | Total Overhead (IPv6) | Standard Port |
|---|---|---|---|---|
| VXLAN | 8 bytes | 50 bytes | 70 bytes | UDP 4789 |
| Geneve | 8+ bytes (variable) | 50+ bytes | 70+ bytes | UDP 6081 |
| GRE | 4-12 bytes (with options) | 38-46 bytes | 58-66 bytes | IP Protocol 47 |
| NVGRE | 8 bytes | 42 bytes | 62 bytes | GRE + VSID |
| STT | 18 bytes | 76 bytes | 96 bytes | TCP-like (proprietary) |
MTU problems in overlay networks are notoriously difficult to diagnose. TCP often works fine (due to MSS negotiation), but UDP-based applications or ICMP may fail mysteriously. Always ensure your underlay MTU exceeds overlay MTU plus encapsulation overhead by a comfortable margin—9000 bytes for underlay is strongly recommended.
Notice that the outer UDP source port is typically set to a hash of the inner packet fields. This is not arbitrary—it serves a critical performance purpose.
Modern datacenter networks use Equal-Cost Multi-Path (ECMP) routing to distribute traffic across multiple paths. ECMP routers hash packet headers to select output paths. If all encapsulated traffic used the same source port, all traffic between two VTEPs would take the same path, defeating ECMP's load-balancing benefits.
By hashing the inner packet's 5-tuple (source IP, destination IP, source port, destination port, protocol) to generate the outer UDP source port, we ensure that different flows between the same VTEPs take different physical paths, achieving proper load distribution.
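A sketch of how a VTEP might derive that entropy (the exact hash function and port range vary by implementation; this version picks from the ephemeral range commonly used):

```python
import hashlib

def outer_udp_source_port(src_ip, dst_ip, src_port, dst_port, proto):
    """Derive an entropy-carrying outer UDP source port from the inner 5-tuple.

    Packets of the same inner flow always hash to the same port (preserving
    per-flow ordering), while different flows spread across ports, so underlay
    ECMP hashing spreads them across different physical paths.
    """
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
    digest = hashlib.sha256(flow.encode()).digest()
    entropy = int.from_bytes(digest[:2], "big")
    # Map into the ephemeral port range 49152-65535.
    return 49152 + (entropy % (65536 - 49152))

# Example: two flows between the same pair of VMs get different outer ports,
# and therefore can take different underlay paths.
print(outer_udp_source_port("10.0.1.50", "10.0.1.60", 33100, 443, 6))
print(outer_udp_source_port("10.0.1.50", "10.0.1.60", 33101, 443, 6))
```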
Overlay encapsulation handles the data plane—how packets are formatted and forwarded. But encapsulation alone doesn't answer a critical question: How does a VTEP know which remote VTEP hosts the destination MAC address?
This is the job of the control plane—the mechanism by which VTEPs discover VM locations and populate their forwarding tables. Several control plane approaches exist, each with distinct tradeoffs.
The simplest approach: treat the overlay like a traditional Ethernet network.
How it works: frames with unknown destination MACs (and broadcasts such as ARP) are flooded to all VTEPs participating in the VNI; each VTEP learns remote MAC-to-VTEP mappings from the source addresses of traffic it decapsulates, just as a physical switch learns from incoming frames. Two delivery mechanisms are common:
Multicast-based flooding: Each VNI is mapped to an IP multicast group. Flooded traffic is sent to the multicast group, and underlay multicast routing delivers it to all participating VTEPs.
Ingress replication (head-end replication): If multicast isn't available, the source VTEP unicasts copies of flooded frames to each remote VTEP in the VNI.
Pros: Simple, works without any external controller, familiar Ethernet semantics. Cons: Flooding doesn't scale (n² traffic for n VTEPs), requires underlay multicast or excess bandwidth.
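The learning half of flood-and-learn can be sketched as follows: on decapsulation, a VTEP records that the inner source MAC sits behind the VTEP that sent the outer packet (names and structures are illustrative):

```python
# Illustrative flood-and-learn behaviour at a VTEP: forwarding entries are
# populated as a side effect of receiving encapsulated traffic.

vtep_forwarding_table = {}   # (vni, mac) -> remote VTEP IP

def learn_from_decapsulated(vni, inner_src_mac, outer_src_vtep_ip):
    """Data-plane learning: remember which remote VTEP the source MAC sits behind."""
    vtep_forwarding_table[(vni, inner_src_mac)] = outer_src_vtep_ip

def lookup_or_flood(vni, dst_mac, flood_group):
    """If the destination is unknown, flood (multicast group or unicast copies)."""
    remote = vtep_forwarding_table.get((vni, dst_mac))
    if remote is not None:
        return [remote]              # known: send one encapsulated copy
    return list(flood_group)         # unknown: replicate to every VTEP in the VNI

# Example: the first packet from a new VM teaches the receiving VTEP its location.
learn_from_decapsulated(5000, "AA:AA:AA:AA:AA:02", "192.168.2.20")
print(lookup_or_flood(5000, "AA:AA:AA:AA:AA:02", {"192.168.2.20", "192.168.3.30"}))
```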
An SDN controller maintains a global database of VM-to-VTEP mappings and distributes this information to all VTEPs.
How it works: each VTEP (or the orchestrator on its behalf) reports its locally attached VMs (VNI, MAC, IP) to the controller, which distributes the resulting VM-to-VTEP mappings to every VTEP's forwarding table.
Pros: Eliminates flooding, precise forwarding from the first packet, controller can enforce policy. Cons: Controller becomes potential bottleneck/single point of failure, requires tight orchestration integration.
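Conceptually, the controller reacts to orchestration events and pushes updated mappings to every VTEP. A minimal sketch, with hypothetical class and method names:

```python
# Minimal sketch of a centralized control plane: the controller owns the
# global VM-location database and pushes it to VTEPs. Names are hypothetical.

class OverlayController:
    def __init__(self, vteps):
        self.vteps = vteps                 # objects that accept table updates
        self.locations = {}                # (vni, vm_mac) -> vtep_ip

    def vm_started(self, vni, vm_mac, vm_ip, host_vtep_ip):
        """Called by the orchestrator when a VM is scheduled onto a host."""
        self.locations[(vni, vm_mac)] = host_vtep_ip
        for vtep in self.vteps:
            vtep.install_mapping(vni, vm_mac, vm_ip, host_vtep_ip)

    def vm_migrated(self, vni, vm_mac, vm_ip, new_vtep_ip):
        """Live migration: repoint the mapping everywhere; no flooding needed."""
        self.vm_started(vni, vm_mac, vm_ip, new_vtep_ip)
```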
VTEPs participate in a distributed routing protocol (BGP with EVPN extensions) to exchange MAC/IP reachability information.
How it works: each VTEP runs BGP and advertises the MAC (and optionally IP) addresses of its locally attached VMs as EVPN routes; every other VTEP receives these advertisements and installs the corresponding forwarding entries, so no data-plane flooding is required for learning.
Pros: Scales to very large deployments, industry-standard protocol, no flooding at all, multi-vendor interoperability. Cons: Requires BGP infrastructure, more complex initial setup.
| Approach | Scalability | Complexity | Flooding Required | Multi-Vendor |
|---|---|---|---|---|
| Flood-and-Learn + Multicast | Low-Medium | Low | Yes (multicast) | Good |
| Flood-and-Learn + Ingress Replication | Low | Low | Yes (unicast copies) | Good |
| Centralized Controller | Medium-High | Medium | No | Controller-dependent |
| BGP EVPN | Very High | High | No | Excellent (IETF standard) |
BGP EVPN has emerged as the industry-standard control plane for overlay networks, particularly in large datacenter and cloud provider environments. It provides not only MAC learning but also IP routing, multi-homing support, and sophisticated traffic engineering—all using the well-proven BGP protocol infrastructure.
One of the most significant scalability challenges in overlay networks is ARP (Address Resolution Protocol) traffic. In traditional Ethernet networks, ARP requests are broadcast to all hosts in the subnet. In overlay networks, this means flooding to all VTEPs—exactly the kind of broadcast traffic we want to minimize.
Consider a subnet with 10,000 VMs across 1,000 VTEPs: every ARP broadcast must be replicated to roughly a thousand VTEPs, so even a modest per-VM ARP rate multiplies into millions of packets per minute across the fabric.
This broadcast amplification can overwhelm VTEP CPUs and underlay bandwidth during mass events like datacenter power-on or disaster recovery failover.
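A rough back-of-the-envelope for the scenario above, where the per-VM ARP rate is an assumed, purely illustrative number:

```python
# ARP amplification estimate: 10,000 VMs on one subnet across 1,000 VTEPs,
# with ingress replication of broadcasts. The ARP rate per VM is an assumption
# chosen only to show how the numbers scale.

vms, vteps = 10_000, 1_000
arps_per_vm_per_min = 1                     # assumed rate (illustrative)

arp_broadcasts_per_min = vms * arps_per_vm_per_min
copies_per_broadcast = vteps - 1            # one copy to every other VTEP
fabric_packets_per_min = arp_broadcasts_per_min * copies_per_broadcast

print(f"{fabric_packets_per_min:,} replicated ARP packets per minute")  # ~10 million
```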
The solution is ARP suppression—having the VTEP answer ARP requests locally instead of flooding them.
How it works: the VTEP maintains a local IP-to-MAC table (populated by the control plane or by snooping traffic), intercepts ARP requests from its local VMs, and answers them directly whenever the binding is known, so the request is never flooded across the overlay.
BGP EVPN enables ARP suppression elegantly through Type-2 MAC/IP Advertisement routes. When a VTEP advertises a MAC address, it can include the associated IP address(es):
BGP EVPN Route Type-2 (MAC/IP Advertisement):
```
Route Distinguisher: 192.168.1.10:100
Ethernet Tag ID:     0
MAC Address:         00:50:56:01:02:03
IP Address:          10.0.1.50        ← ARP cache population
VNI:                 5000
Next Hop:            192.168.1.10 (VTEP IP)
```
All VTEPs receiving this route install both the MAC→VTEP mapping and the IP→MAC mapping, enabling them to suppress ARP for this IP entirely.
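A sketch of how a VTEP could use that information to answer ARP locally (illustrative data structures; production implementations do this in the kernel datapath or switch ASIC):

```python
# Illustrative ARP suppression at a VTEP: the (VNI, IP) -> MAC cache is filled
# from received EVPN Type-2 routes, and ARP requests are answered locally
# instead of being flooded across the overlay.

arp_cache = {}   # (vni, ip) -> mac

def install_type2_route(vni, mac, ip, remote_vtep_ip, forwarding_table):
    """Install both the MAC->VTEP entry and the IP->MAC binding."""
    forwarding_table[(vni, mac)] = remote_vtep_ip
    if ip is not None:
        arp_cache[(vni, ip)] = mac

def handle_arp_request(vni, target_ip):
    """Return a locally generated ARP reply if the binding is known."""
    mac = arp_cache.get((vni, target_ip))
    if mac is not None:
        return {"op": "reply", "target_ip": target_ip, "target_mac": mac}
    return None   # unknown: fall back to flooding (or drop, depending on policy)

# Example using the Type-2 route shown above.
table = {}
install_type2_route(5000, "00:50:56:01:02:03", "10.0.1.50", "192.168.1.10", table)
print(handle_arp_request(5000, "10.0.1.50"))
```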
Overlay networks can operate at Layer 2 (bridging) or Layer 3 (routing), with significant implications for architecture and use cases.
L2 overlays extend a single broadcast domain across multiple physical hosts. VMs in the same VNI are on the same logical Ethernet segment and can communicate using their MAC addresses.
Characteristics:
Use cases:
L3 overlays route packets between VMs based on IP addresses, with each host acting as an IP gateway.
Characteristics:
Use cases:
Modern overlay networks combine L2 and L3 functionality using distributed routing with anycast gateways: every VTEP hosts the same gateway IP and MAC address for each subnet, so a VM's default gateway is always local, intra-subnet traffic is bridged in the L2 overlay, and inter-subnet traffic is routed at the first-hop VTEP.
This architecture provides the best of both worlds: L2 semantics within a subnet for compatibility, plus efficient L3 routing between subnets.
| Characteristic | Layer 2 Overlay | Layer 3 Overlay |
|---|---|---|
| Forwarding Decision | MAC address lookup | IP address lookup |
| Broadcast Domain | Shared across VNI | None (routing only) |
| ARP Traffic | Flooded (or suppressed) | Not required (host routes) |
| VM Subnet | Common subnet per VNI | Per-VM /32 or any subnet |
| Live Migration | Within subnet (same L2) | Unrestricted (routing follows) |
| Scalability | Limited by broadcast | Highly scalable |
| Complexity | Lower (familiar Ethernet) | Higher (routing required) |
| Typical Use | Traditional VMs | Containers, microservices |
EVPN supports a sophisticated model called Symmetric Integrated Routing and Bridging (Symmetric IRB). In this model, both ingress and egress VTEPs perform L3 routing, and the VNI in the tunnel header identifies the VRF, not the L2 segment. This enables efficient distributed routing without requiring L2 stretch across the fabric.
Overlay networks create isolated virtual domains, but workloads inevitably need to communicate with the outside world—physical servers, external networks, the internet. Gateway services bridge the gap between overlay and underlay/external networks.
L2 Gateway (Bridging): Bridges an overlay L2 network to a physical VLAN, making overlay VMs appear on the same Ethernet segment as physical servers.
Use case: Integrating VMs with legacy physical database servers or storage arrays.
Implementation: A VTEP (hardware or software) with interfaces in both the overlay VNI and the physical VLAN, performing MAC bridging between them.
L3 Gateway (Routing): Routes between overlay networks and external IP networks (physical subnets, internet, partner WAN links).
Use case: Providing internet access, connecting to external services, multi-site connectivity.
Implementation: Distributed (on every VTEP) or centralized (dedicated gateway appliances). Distributed is preferred for performance and resilience.
NAT Gateway: Provides Network Address Translation between private overlay addresses and public addresses.
Use case: Internet access for VMs with private IPs, hiding internal topology from external networks.
Centralized Gateway:
Distributed Gateway:
Overlay networks represent a fundamental advancement in network architecture—the complete separation of logical network topology from physical infrastructure. This abstraction enables the flexibility, scalability, and multi-tenancy that define modern cloud computing.
You now understand the fundamental architecture of overlay networks—why they exist, how they work, and what problems they solve. This conceptual foundation is essential for understanding specific implementations.
Next Up: We'll dive deep into VXLAN (Virtual eXtensible LAN)—the dominant overlay protocol in datacenter and cloud environments. You'll learn the exact packet format, control plane options (multicast, unicast, EVPN), configuration examples, and operational considerations for production VXLAN deployments.