Streaming media has transformed how we consume and create content. From video calls with family across continents to live broadcasts reaching millions simultaneously, from cloud gaming that makes powerful hardware unnecessary to telemedicine that brings specialists to rural communities—streaming media is ubiquitous.

Yet not all streaming is the same. A Netflix viewer watching a movie has fundamentally different requirements than a surgeon performing remote surgery. A sports fan watching a live match needs different infrastructure than a student watching recorded lectures. A video conference participant expects instant interaction; a podcast listener can tolerate minutes of buffering.

This page explores the architectures, protocols, and engineering decisions that enable multimedia streaming across these diverse use cases, focusing on how RTP/RTCP form the foundation for interactive real-time applications.
By the end of this page, you will understand the spectrum from on-demand to real-time streaming, the roles of different streaming architectures (mesh, MCU, SFU), how codecs and adaptive bitrate interact with RTP transport, and the infrastructure that powers modern streaming platforms.
Multimedia streaming exists on a spectrum from highly interactive to completely passive. Understanding where an application falls on this spectrum determines the appropriate architecture and protocols.

Real-time Interactive Streaming:
Two-way communication where participants expect sub-200ms latency for natural conversation. Examples include video calls, teleconferencing, remote surgery, and cloud gaming. RTP/RTCP is the dominant protocol choice because TCP's latency penalties are unacceptable.

Live Streaming:
One-way broadcast to many viewers with latency typically ranging from 2-30 seconds. Examples include live sports, news broadcasts, and gaming streams on Twitch. Some platforms use RTP internally but deliver via HTTP-based protocols (HLS, DASH) to browsers.

On-Demand Streaming:
Pre-recorded content delivered when requested, with latency measured in seconds to minutes of buffering. Netflix, YouTube VOD, and podcast apps fall here. HTTP-based adaptive streaming (HLS, DASH) dominates because reliability matters more than low latency.
| Characteristic | Real-time Interactive | Live Streaming | On-Demand |
|---|---|---|---|
| Typical latency | < 200ms | 2-30 seconds | Seconds to minutes |
| Direction | Bidirectional | One-to-many | One-to-one (server-viewer) |
| Common protocols | RTP/RTCP (WebRTC) | RTP, HLS, DASH | HLS, DASH, HTTP |
| Loss tolerance | High (prefer glitch over delay) | Medium | None (buffering acceptable) |
| Buffering | 50-200ms jitter buffer | 2-10 second buffer | 30-60 second buffer |
| Scaling challenge | Signaling and mesh limits | CDN edge distribution | CDN caching efficiency |
| Example apps | Zoom, Teams, Stadia | Twitch, YouTube Live | Netflix, Spotify |
RTP excels for interactive use cases but requires specialized infrastructure. HTTP-based streaming leverages existing CDN infrastructure and works through any firewall. Many platforms use a hybrid: RTP for ingest (broadcaster to server) and HTTP for distribution (server to viewers).
For real-time interactive streaming with multiple participants, three architectural patterns have emerged, each with distinct tradeoffs in quality, latency, bandwidth, and server cost.

1. Mesh Architecture (Peer-to-Peer)

Each participant sends their media stream directly to every other participant. No central server processes media.

Advantages:
- Lowest possible latency (direct connections)
- No server infrastructure cost
- Maximum quality (no transcoding)

Disadvantages:
- Upload bandwidth scales as O(n-1) per participant
- Impractical beyond 4-5 participants
- NAT traversal required for each peer pair
```
MESH (Peer-to-Peer):

  Alice ◄──────► Bob
    ▲ ╲          ╱ ▲
    │   ╲      ╱   │
    │     ╲  ╱     │
    │      ╳       │
    │     ╱  ╲     │
    ▼   ╱      ╲   ▼
  Carol ◄──────► Dave

  4 participants = 6 connections, each sends 3 streams

═══════════════════════════════════════════════════════════

MCU (Multipoint Control Unit):

  ┌───────┐      ┌────────────┐      ┌───────┐
  │ Alice │─────►│            │─────►│ Alice │
  └───────┘      │    MCU     │      └───────┘
  ┌───────┐      │   (Mix &   │      ┌───────┐
  │  Bob  │─────►│ Transcode) │─────►│  Bob  │
  └───────┘      │            │      └───────┘
  ┌───────┐      │            │      ┌───────┐
  │ Carol │─────►│            │─────►│ Carol │
  └───────┘      └────────────┘      └───────┘

  Each sends 1 stream, receives 1 mixed stream

═══════════════════════════════════════════════════════════

SFU (Selective Forwarding Unit):

  ┌───────┐      ┌──────────────────┐
  │ Alice │─────►│       SFU        │─────► Bob gets Alice, Carol
  └───────┘      │  (Forward only,  │
  ┌───────┐      │   no transcode)  │─────► Carol gets Alice, Bob
  │  Bob  │─────►│                  │
  └───────┘      │                  │─────► Alice gets Bob, Carol
  ┌───────┐      │                  │
  │ Carol │─────►│                  │
  └───────┘      └──────────────────┘

  Each sends 1 stream, receives n-1 separate streams
```

2. MCU Architecture (Multipoint Control Unit)

A central server receives all participant streams, decodes them, mixes/composites them into a single output, re-encodes, and sends this combined stream to each participant.

Advantages:
- Minimal client bandwidth (send 1, receive 1)
- Consistent quality regardless of participant count
- Can serve heterogeneous clients (different codecs/resolutions)

Disadvantages:
- High server CPU cost (decode + encode for each participant)
- Adds transcoding latency (50-200ms)
- Quality loss from re-encoding
- Single point of failure
3. SFU Architecture (Selective Forwarding Unit)

A central server receives participant streams and forwards them to other participants without transcoding. The server makes intelligent decisions about which streams to forward based on subscriber preferences, available bandwidth, and active speaker detection.

Advantages:
- Lower server CPU than MCU (no transcoding)
- No quality loss from re-encoding
- Flexible—participants can receive different stream selections
- Supports simulcast (multiple quality levels per sender)

Disadvantages:
- Higher client download bandwidth than MCU
- Requires clients to decode multiple streams
- More complex server logic for stream selection
SFU has become the dominant architecture for modern video conferencing. WebRTC platforms (Zoom, Teams, Google Meet, Discord) predominantly use SFU because it balances quality, latency, and server cost. MCU is used when clients are constrained (old phones, embedded devices) or for recording/broadcasting.
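To make these bandwidth tradeoffs concrete, the following TypeScript sketch estimates per-participant upload and download for each architecture, assuming every participant publishes a single stream at a fixed bitrate. The numbers are illustrative; real calls add audio, simulcast layers, and FEC overhead.

```typescript
// Back-of-envelope per-participant bandwidth for mesh, MCU, and SFU calls.
// Assumes one stream per participant at `streamKbps` (illustrative values).

interface BandwidthEstimate {
  uploadKbps: number;
  downloadKbps: number;
}

function meshBandwidth(participants: number, streamKbps: number): BandwidthEstimate {
  // Each peer sends a copy to every other peer and receives one from each.
  return {
    uploadKbps: (participants - 1) * streamKbps,
    downloadKbps: (participants - 1) * streamKbps,
  };
}

function mcuBandwidth(_participants: number, streamKbps: number): BandwidthEstimate {
  // Send one stream up, receive one mixed stream back.
  return { uploadKbps: streamKbps, downloadKbps: streamKbps };
}

function sfuBandwidth(participants: number, streamKbps: number): BandwidthEstimate {
  // Send one stream up, receive n-1 forwarded streams.
  return { uploadKbps: streamKbps, downloadKbps: (participants - 1) * streamKbps };
}

// Example: 8 participants, 1.5 Mbps video each.
console.log(meshBandwidth(8, 1500)); // { uploadKbps: 10500, downloadKbps: 10500 }
console.log(mcuBandwidth(8, 1500));  // { uploadKbps: 1500,  downloadKbps: 1500 }
console.log(sfuBandwidth(8, 1500));  // { uploadKbps: 1500,  downloadKbps: 10500 }
```

The mesh numbers show why peer-to-peer calls stop scaling around 4-5 participants: upload grows linearly with group size, and most residential uplinks cannot sustain it.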
In SFU architectures, participants often have varying bandwidth capabilities and display sizes. Sending 1080p video to a participant viewing on a small thumbnail wastes bandwidth. Two techniques address this: Simulcast and Scalable Video Coding (SVC).

Simulcast:
The sender encodes and transmits their video at multiple quality levels simultaneously (e.g., 1080p, 720p, 360p). The SFU selects which layer to forward to each receiver based on their bandwidth and viewing needs.
```
Simulcast: Sender encodes 3 independent streams

Sender CPU encodes:
  Layer 0: 1080p @ 2.5 Mbps (high)   ─────► full-size viewers
  Layer 1:  720p @ 1.0 Mbps (medium) ─────► mobile/constrained viewers
  Layer 2:  360p @ 0.3 Mbps (low)    ─────► thumbnails

SFU receives all 3 layers and forwards selectively:
  Subscriber A (good network, active speaker view) → receives Layer 0 (1080p)
  Subscriber B (mobile, bandwidth limited)         → receives Layer 1 (720p)
  Subscriber C (viewing 4x4 grid of participants)  → receives Layer 2 for all (360p thumbnails)

Advantages:
+ Each layer is independently decodable
+ Standard codecs (H.264, VP8) work
+ SFU logic is simple—just drop/forward

Disadvantages:
- 3x encoding load on sender
- Total upload bandwidth is sum of all layers (~3.8 Mbps example)
- Inefficient—layers don't share common information
```

Scalable Video Coding (SVC):
The sender encodes a single video stream with embedded layers. A base layer provides minimum quality, and enhancement layers progressively improve quality. The SFU can drop enhancement layers to reduce bandwidth without re-encoding.

SVC layer types:
- Spatial layers: Different resolutions (similar to simulcast)
- Temporal layers: Different frame rates (30fps base, enhancements for 60fps)
- Quality layers (SNR): Same resolution but higher fidelity
```
VP9 SVC Example: Spatial + Temporal Layers

Stream structure (single encoded output):
  S2/T2: 1080p @ 30fps   (depends on S2/T1, S2/T0)
  S2/T1: 1080p @ 15fps   (depends on S2/T0)
  S2/T0: 1080p @ 7.5fps  (base for spatial 2)
  ------------------------------------------------
  S1/T2:  720p @ 30fps   (depends on S1/T1, S1/T0)
  S1/T1:  720p @ 15fps   (depends on S1/T0)
  S1/T0:  720p @ 7.5fps  (base for spatial 1)
  ------------------------------------------------
  S0/T2:  360p @ 30fps   (depends on S0/T1, S0/T0)
  S0/T1:  360p @ 15fps   (depends on S0/T0)
  S0/T0:  360p @ 7.5fps  (MANDATORY BASE)

SFU Layer Selection:
- "Give me 720p/30fps"  → Forward: S0/*, S1/* (drop S2)
- "Give me 1080p/15fps" → Forward: S0/*, S1/*, S2/T0, S2/T1 (drop T2)
- "Just thumbnail"      → Forward: S0/T0, S0/T1 (360p/15fps)

Advantages over Simulcast:
+ Single encode (lower sender CPU)
+ ~40% more bandwidth efficient (layers share motion vectors)
+ Finer granularity (temporal layers enable smooth degradation)

Disadvantages:
- Requires SVC-capable codec (VP9-SVC, H.264-SVC, AV1)
- More complex SFU logic (dependency tracking)
- Some decoder limitations
```

Simulcast remains more widely deployed due to simpler implementation. VP9-SVC is gaining adoption (used by Google Meet). AV1-SVC promises further efficiency gains but requires significant compute. The trend is toward SVC as codec support matures.
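As a concrete example of how a sender enables the simulcast layers described above, the sketch below uses the standard WebRTC `sendEncodings` option on `addTransceiver`. The `rid` names, bitrates, and scaling factors are illustrative choices; actual layer handling depends on the browser and the SFU.

```typescript
// Minimal sketch: publishing three simulcast layers from a browser client.
// Layer names and bitrates are illustrative, mirroring the example above.

async function publishWithSimulcast(pc: RTCPeerConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { width: 1920, height: 1080 },
  });
  const [videoTrack] = stream.getVideoTracks();

  pc.addTransceiver(videoTrack, {
    direction: "sendonly",
    sendEncodings: [
      { rid: "high", maxBitrate: 2_500_000 },                             // ~1080p
      { rid: "mid",  maxBitrate: 1_000_000, scaleResolutionDownBy: 1.5 }, // ~720p
      { rid: "low",  maxBitrate:   300_000, scaleResolutionDownBy: 3 },   // ~360p
    ],
  });
  // The SFU receives all three encodings and forwards one per subscriber.
}
```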
Network conditions fluctuate constantly—WiFi interference, mobile handoffs, congested links, background downloads. Adaptive bitrate streaming adjusts video quality in real-time to maintain smooth playback despite these variations.

For HTTP-based streaming (HLS/DASH):
Content is pre-encoded at multiple bitrates. The player monitors buffer levels and network throughput, requesting higher or lower quality segments based on conditions. The decision happens at segment boundaries (2-10 seconds).

For RTP-based streaming:
Adaptation must happen continuously based on RTCP feedback. Senders receive loss reports, delay measurements, and explicit feedback, adjusting encoding parameters or switching simulcast/SVC layers within milliseconds (see the sketch after the table below).
| Technique | RTP Response Time | HTTP Response Time | Mechanism |
|---|---|---|---|
| Encoder rate control | 100-500ms | N/A (pre-encoded) | Adjust QP/bitrate targets |
| Simulcast layer switch | Immediate | 2-10 seconds | SFU forwards different layer |
| SVC layer dropping | Immediate | 2-10 seconds | Drop enhancement layers |
| Resolution change | Next keyframe | 2-10 seconds | Encode at lower resolution |
| Frame rate reduction | Immediate | 2-10 seconds | Skip temporal layers or frames |
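To illustrate the encoder rate control row of the table, here is a minimal sketch that caps a WebRTC sender's bitrate via `RTCRtpSender.setParameters`. The `targetKbps` input is assumed to come from whatever congestion controller is in use; the helper itself is illustrative, not part of any specific library.

```typescript
// Minimal sketch: applying a bitrate cap from a congestion controller to a
// WebRTC sender. `targetKbps` is an assumed input, e.g. from GCC-style logic.

async function applyBitrateCap(sender: RTCRtpSender, targetKbps: number): Promise<void> {
  const params = sender.getParameters();
  if (!params.encodings || params.encodings.length === 0) {
    params.encodings = [{}]; // some browsers start with an empty encodings list
  }
  for (const encoding of params.encodings) {
    encoding.maxBitrate = targetKbps * 1000; // maxBitrate is in bits per second
  }
  await sender.setParameters(params);
}
```

Because the cap is applied to the live encoder rather than waiting for a segment boundary, the change takes effect within a few hundred milliseconds.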
Congestion control for RTP:

Modern RTP implementations use sophisticated algorithms to detect congestion and adapt accordingly:

Google Congestion Control (GCC):
Used by WebRTC, GCC combines delay-based and loss-based detection. It monitors one-way delay variation (via RTCP TWCC feedback) and adjusts the target bitrate. Increases are gentle; decreases are aggressive, following AIMD principles.

NADA (Network-Assisted Dynamic Adaptation):
An algorithm specified by the IETF that uses ECN (Explicit Congestion Notification) marks when available and falls back to delay-based detection otherwise.

SCReAM (Self-Clocked Rate Adaptation for Multimedia):
Designed for cellular networks with self-inflicted queuing delays. Uses packet pacing and careful RTT estimation.
```
Google Congestion Control (GCC) Operation:

Input:  TWCC feedback (packet arrival times)
Output: Target sending bitrate

1. Estimate one-way delay trend:
   For each packet pair, compute the delay difference:
     d(i) = (arrival[i] - arrival[i-1]) - (send[i] - send[i-1])
   Positive d(i) → queuing delay increasing → congestion
   Negative d(i) → queuing delay decreasing → recovery

2. Apply Kalman filter to smooth the delay estimate:
   Produces smoothed delay gradient m_hat

3. Compare m_hat to threshold:
   |m_hat| < threshold  → NORMAL   (can increase rate)
   m_hat   > threshold  → OVERUSE  (decrease rate)
   m_hat   < -threshold → UNDERUSE (can increase faster)

4. Adjust rate with AIMD:
   NORMAL:   rate *= 1.08  (increase ~8% per second)
   OVERUSE:  rate *= 0.85  (decrease 15%)
   UNDERUSE: rate = link_capacity estimate

5. Apply constraints:
   rate = min(rate, receiver_estimated_capacity)
   rate = max(rate, minimum_bitrate)

Result: Smoothly adapts bitrate to available capacity,
backing off quickly when congestion is detected
```

Large router buffers can hide congestion—packets queue rather than drop, causing delay to increase before loss occurs. Delay-based algorithms like GCC detect this early. Loss-based algorithms only react after buffers fill and overflow.
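The sketch below captures the delay-gradient and AIMD logic above in simplified form. It substitutes an exponential moving average for GCC's Kalman/trendline filter, and the thresholds and gains are illustrative rather than the values real WebRTC stacks use.

```typescript
// Simplified sketch of delay-gradient congestion control (not real GCC).
// An EMA stands in for the Kalman filter; constants are illustrative.

type NetworkState = "normal" | "overuse" | "underuse";

class DelayBasedRateController {
  private smoothedGradientMs = 0;

  constructor(
    private rateKbps: number,
    private readonly minKbps = 150,
    private readonly thresholdMs = 2,
    private readonly alpha = 0.1 // EMA smoothing factor
  ) {}

  // Call once per feedback report with the send/arrival spacing of a packet pair.
  onFeedback(sendDeltaMs: number, arrivalDeltaMs: number): number {
    const gradient = arrivalDeltaMs - sendDeltaMs; // > 0 means the queue is growing
    this.smoothedGradientMs =
      (1 - this.alpha) * this.smoothedGradientMs + this.alpha * gradient;

    switch (this.classify()) {
      case "overuse":
        this.rateKbps = Math.max(this.minKbps, this.rateKbps * 0.85); // back off 15%
        break;
      case "underuse": // a fuller implementation would jump toward its capacity estimate
      case "normal":
        this.rateKbps *= 1.08; // GCC targets ~8% growth per second; per report here
        break;
    }
    return this.rateKbps; // feed into the encoder or a setParameters() cap
  }

  private classify(): NetworkState {
    if (this.smoothedGradientMs > this.thresholdMs) return "overuse";
    if (this.smoothedGradientMs < -this.thresholdMs) return "underuse";
    return "normal";
  }
}
```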
Streaming media flows through a complex pipeline from capture to display. Understanding this pipeline reveals where latency accumulates and how quality is preserved or degraded.

Sender-side pipeline:

1. Capture: Camera/microphone provides raw samples (frames/audio buffers)
2. Pre-processing: Noise reduction, echo cancellation, video enhancement
3. Encoding: Compress to codec format (H.264, VP9, Opus)
4. Packetization: Split encoded data into RTP packets
5. Transmission: Send over network with pacing
```
End-to-End Video Call Latency Budget:

SENDER SIDE:
  Capture latency      16-33ms   (frame interval)
  Pre-processing        0-10ms   (enhancement)
  Encoding             10-50ms   (codec dependent)
  Packetization         < 1ms
  OS/Driver buffer      0-10ms
  ------------------------------------------------
  Sender total         ~30-100ms

NETWORK:
  Transmission delay   <1ms (LAN) to 100ms+ (international)
  Propagation delay    Light speed (~5ms per 1000km)
  Queuing delay        0-50ms (varies with load)
  ------------------------------------------------
  Network total        ~5-200ms

RECEIVER SIDE:
  Jitter buffer        20-100ms  (adaptive)
  Reassembly            < 1ms
  Decoding              5-30ms   (codec dependent)
  Post-processing       0-10ms   (deinterlacing etc.)
  Render to display     8-16ms   (vsync dependent)
  ------------------------------------------------
  Receiver total       ~40-150ms

TOTAL END-TO-END:
  Optimal (LAN, tuned)       ~80ms
  Typical (WAN, WebRTC)      ~150-250ms
  Acceptable interactive     < 400ms
  Noticeable delay           > 400ms
```

Receiver-side pipeline:

1. Reception: Receive RTP packets, extract sequence and timing
2. Jitter buffer: Hold packets to smooth arrival variation
3. Reassembly: Combine packets into complete frames
4. Decoding: Decompress codec data to raw samples
5. Post-processing: Deinterlacing, scaling, color correction
6. Rendering: Display frames, play audio samples
The biggest latency contributors are encoding, jitter buffering, and network delay. Hardware encoders (NVENC, QuickSync) reduce encoding latency. Adaptive jitter buffers minimize buffering when network is stable. CDNs reduce geographic network distance.
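As a rough sanity check on the budget above, the following sketch sums illustrative midpoints for each pipeline stage. The values are assumptions for a typical WAN call, not measurements.

```typescript
// Rough one-way latency estimate, mirroring the stages in the budget above.
// All numbers are illustrative midpoints.

const latencyBudgetMs = {
  capture: 33,          // one frame interval at 30 fps
  encode: 25,
  packetizeAndSend: 5,
  network: 60,          // propagation + queuing on a typical WAN path
  jitterBuffer: 50,
  decode: 15,
  render: 12,
};

const totalMs = Object.values(latencyBudgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`Estimated one-way latency: ${totalMs} ms`); // 200 ms, within the interactive budget
```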
Large-scale streaming requires infrastructure beyond single servers. Production deployments use distributed systems with specialized components for different functions.

Media servers:
Core processing nodes that handle RTP/RTCP sessions. For SFU architecture, these forward packets between participants. They may also include recording, transcoding for compatibility, or stream mixing capabilities.

Signaling servers:
Handle session setup (SIP, WebRTC signaling) separate from media flow. They coordinate room management, authentication, and participant discovery, and use significantly less bandwidth than media servers.

TURN servers:
Relay servers for NAT traversal when direct connections fail. Required for roughly 15-20% of WebRTC connections that can't establish direct paths. Bandwidth-intensive, but simple to scale horizontally since they only relay packets.
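As a small illustration of how clients are pointed at this infrastructure, the sketch below configures a WebRTC connection with STUN and TURN servers. The URLs and credentials are placeholders; real deployments issue short-lived TURN credentials from the signaling server.

```typescript
// Minimal sketch: client-side ICE configuration with placeholder servers.

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.example.org:3478" },
    {
      urls: [
        "turn:turn.example.org:3478?transport=udp",
        "turns:turn.example.org:5349?transport=tcp",
      ],
      username: "ephemeral-user",      // placeholder, issued by signaling
      credential: "ephemeral-password", // placeholder, issued by signaling
    },
  ],
});
// ICE tries direct (host/STUN) candidates first and falls back to the TURN
// relay only when no direct path can be established.
```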
| Component | Function | Scaling Strategy | Example Services |
|---|---|---|---|
| Media Server (SFU) | Forward RTP/RTCP packets | Horizontal (by rooms/participants) | Jitsi, mediasoup, Janus |
| Media Server (MCU) | Transcode and mix streams | Vertical (more CPU per node) | Kurento, OpenVidu |
| Signaling Server | Session setup, room management | Horizontal (stateless) | Socket.io, custom WebSocket |
| TURN Relay | NAT traversal relay | Horizontal (by bandwidth) | coturn, Twilio TURN |
| Recording Server | Capture and store streams | By storage capacity | Kurento, Jibri |
| CDN Edge | HTTP streaming distribution | Geographic (edge POPs) | CloudFront, Akamai |
Geographic distribution:

For global streaming platforms, server placement matters enormously. A participant in Tokyo connecting to a server in New York adds 80-100ms of network RTT—before any processing latency. Production deployments:

- Deploy SFU/MCU nodes in multiple regions
- Route participants to the nearest server
- Cascade between servers for multi-region calls
- Use anycast addressing for automatic failover
```
Cascaded SFU for Global Video Call:

Tokyo Region:
  ┌──────┐  ┌──────┐  ┌──────┐
  │User A│  │User B│  │User C│   (Tokyo users)
  └──┬───┘  └──┬───┘  └──┬───┘
     └─────────┼─────────┘
               ▼
          Tokyo SFU  (local forwarding)
               │
               │  Cascade link (server-to-server)
               ▼
New York Region:
        New York SFU  (cascade reception)
               │
     ┌─────────┼─────────┐
  ┌──┴───┐  ┌──┴───┐  ┌──┴───┐
  │User X│  │User Y│  │User Z│   (NY users)
  └──────┘  └──────┘  └──────┘

Benefits:
• Tokyo↔Tokyo: ~20ms RTT (local)
• NY↔NY: ~20ms RTT (local)
• Tokyo↔NY: ~150ms RTT (single cascade hop)
  vs. all participants on a single server: ~300ms for half the users
```

Cascading adds implementation complexity—managing inter-server subscriptions, handling server failures, and synchronizing session state. Many platforms start with single-region deployment and add cascading as scale demands.
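Routing a joining participant to the nearest region is often done by probing candidate regions at join time. The sketch below shows one hypothetical approach; the region names and the `/ping` endpoint are assumptions, not part of any specific platform.

```typescript
// Hypothetical sketch: pick the lowest-RTT SFU region before joining a call.

interface RegionProbe {
  region: string;
  rttMs: number;
}

async function probeRegion(region: string): Promise<RegionProbe> {
  const start = performance.now();
  // Assumes each region exposes a tiny health endpoint for latency probing.
  await fetch(`https://sfu-${region}.example.com/ping`, { cache: "no-store" });
  return { region, rttMs: performance.now() - start };
}

async function pickNearestRegion(regions: string[]): Promise<string> {
  const probes = await Promise.all(regions.map(probeRegion));
  probes.sort((a, b) => a.rttMs - b.rttMs);
  return probes[0].region;
}

// Usage: connect to the closest SFU; distant participants reach it over cascade links.
// const region = await pickNearestRegion(["tokyo", "us-east", "eu-west"]);
```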
Codec selection profoundly impacts streaming quality, latency, and compatibility. Different codecs optimize for different priorities.

Video codecs for RTP streaming:
| Codec | Compression | Latency | CPU Cost | Browser Support |
|---|---|---|---|---|
| H.264 | Good | Low (optimized) | Low (HW common) | Universal |
| VP8 | Good | Low | Medium | Chrome, Firefox, Edge |
| VP9 | Better (~40%) | Medium | High | Chrome, Firefox, Edge |
| H.265/HEVC | Better (~50%) | Medium | High (HW needed) | Safari only (web) |
| AV1 | Best (~50-60%) | High | Very High | Chrome 90+, Firefox 98+ |
Audio codecs for RTP streaming:

- Opus: The gold standard for real-time audio. Supports 6-510 kbps, speech and music, variable frame sizes, built-in FEC. Mandatory for WebRTC.
- G.711 (PCMU/PCMA): 64 kbps narrowband audio using simple companding. Near-zero compression latency and universal compatibility, but wastes bandwidth.
- G.722: 64 kbps wideband audio. Better quality than G.711 at the same bitrate.
- AAC: Efficient but typically not used for real-time (licensing, latency).

Key codec features for real-time: low algorithmic delay, resilience to packet loss (FEC and concealment), and the ability to adapt bitrate on the fly.
For maximum compatibility: H.264 + Opus. For best quality/bandwidth: VP9 (with H.264 fallback) + Opus. AV1 adoption is growing but encode latency and compute requirements limit real-time use to powerful devices.
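To apply this recommendation in a WebRTC client, the transceiver's codec list can be reordered with `setCodecPreferences`. The sketch below prefers VP9 and falls back to H.264; browser support for this API and the exact capability list vary, so treat it as an illustration rather than a guaranteed configuration.

```typescript
// Minimal sketch: prefer VP9 with H.264 fallback on a video transceiver.
// Opus is already mandatory for WebRTC audio, so no audio change is needed.

function preferCodecs(transceiver: RTCRtpTransceiver, preferredOrder: string[]): void {
  const capabilities = RTCRtpReceiver.getCapabilities("video");
  if (!capabilities) return; // capability query not available in this browser

  const rank = (c: RTCRtpCodecCapability): number => {
    const i = preferredOrder.indexOf(c.mimeType.toLowerCase());
    return i === -1 ? preferredOrder.length : i; // unlisted codecs go last
  };
  const ranked = [...capabilities.codecs].sort((a, b) => rank(a) - rank(b));
  transceiver.setCodecPreferences(ranked);
}

// Usage (hypothetical transceiver variable):
// preferCodecs(videoTransceiver, ["video/vp9", "video/h264"]);
```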
We've explored the architectures, techniques, and infrastructure that enable multimedia streaming at scale, from intimate video calls to global live broadcasts.
What's next:

Now that we understand streaming architectures and delivery, we'll examine Quality of Service (QoS) considerations—the network-layer mechanisms and policies that ensure real-time media receives appropriate treatment from routers and switches throughout its path.
You now understand how RTP/RTCP enable multimedia streaming across different architectures and scales. This knowledge is essential for designing, implementing, and troubleshooting real-time communication systems.