UDP-Based Protocols - Learning Module

Real-time Transport Protocol (RTP)

The Protocol Behind Every Voice and Video Call

Most real-time voice and video calls, whether a WebRTC session in a browser, a SIP desk phone, or a conferencing app such as Zoom or Discord, rely on one foundational protocol designed specifically for real-time media, or on a close variant of it: the Real-time Transport Protocol (RTP).

RTP emerged from a key insight: real-time media has requirements that conflict with TCP's design. On a video call, a frame that arrives 5 seconds late is useless; you would rather skip it and show the current frame. TCP's insistence on reliable, ordered delivery adds latency that ruins the real-time user experience.

RTP, built on UDP, provides the infrastructure for real-time media while delegating reliability decisions to the application layer, where they can be made intelligently based on media semantics.

What You Will Learn

This page provides a comprehensive exploration of RTP's architecture, header format, and operational mechanisms. You will understand how RTP enables real-time audio and video transmission, how it integrates with RTCP for quality monitoring, and why UDP is the ideal substrate for this critical protocol.

The Real-Time Media Challenge

Real-time media fundamentally differs from file transfer or web browsing. Understanding these differences explains why RTP exists and why it uses UDP.

The Constraints of Real-Time Communication:

Human perception imposes strict timing requirements on real-time media:

Conversational latency: More than 150ms one-way delay makes conversation feel unnatural. Above 300ms, people start talking over each other.
Video synchronization: Audio and video must be synchronized within 45ms, or lip-sync errors become noticeable.
Jitter sensitivity: Variable delay (jitter) causes choppy playback even when average latency is acceptable.
Loss tolerance: Human senses can tolerate some data loss (missing samples, dropped frames) far better than they can tolerate delay.

Human Perception Thresholds for Real-Time Media
Parameter	Threshold	Effect When Exceeded
One-way audio latency	< 150ms	Conversation becomes awkward, overlap
Round-trip audio latency	< 300ms	Echo, confusion, interruptions
Audio-video sync offset	< 45ms	Visible lip-sync errors
Audio jitter	< 30ms	Choppy, robotic audio
Video jitter	< 100ms	Stuttering, freezing video
Audio packet loss	< 1%	Audible pops, gaps
Video packet loss	< 2%	Artifacts, macro-blocking

Why TCP Fails for Real-Time Media:

TCP's reliability guarantees, essential for file transfer, become liabilities for real-time media:

Retransmission delays: When TCP detects loss, it retransmits. But by the time the retransmitted packet arrives (at least one RTT later), the media playout time has passed. The retransmitted frame is useless.
Head-of-line blocking: TCP cannot deliver subsequent packets until missing ones arrive. For video, this means all frames wait for a single missing packet—introducing massive latency spikes.
Congestion window reduction: TCP interprets loss as congestion and reduces sending rate. For real-time media, this causes quality degradation at exactly the wrong moment.
In-order delivery blocking: TCP buffers out-of-order packets until the gap is filled. RTP applications would rather receive and display what's available now.

The RTP Philosophy

RTP's design philosophy: provide the infrastructure for real-time media (timestamps, sequencing, source identification) while leaving reliability decisions to the application. A video codec knows whether it can recover from a lost P-frame; TCP cannot. This separation of concerns enables intelligent loss handling tailored to media semantics.

RTP Architecture Overview

RTP (RFC 3550) is actually two closely related protocols working together:

RTP (Real-time Transport Protocol): Carries the actual media data—audio samples, video frames, or other real-time content.

RTCP (RTP Control Protocol): Provides feedback about transmission quality—packet loss, jitter, round-trip time—enabling adaptive quality adjustment.

This separation allows RTCP traffic (low-bandwidth, periodic reports) to travel alongside RTP media (high-bandwidth, continuous) without interference.

The RTP Session Concept:

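An RTP session is distinguished by its destination transport address pair:

A network (IP) address
A pair of ports, one for RTP and one for RTCP

The payload format does not identify the session; it is signaled per packet in the payload type field. Traditionally, each medium (audio, video) is carried in its own RTP session.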

Each participant in an RTP session is identified by a Synchronization Source (SSRC)—a 32-bit identifier that uniquely identifies a media source within the session. Multiple participants can communicate in the same session (multicast/conference calls), each with their own SSRC.

Port Conventions:

By convention, RTP and RTCP use adjacent port pairs:

RTP: even port number (e.g., 5004)
RTCP: next odd number (e.g., 5005)

This convention makes the RTCP port predictable from the RTP port, but it also means two ports to open in firewalls and two NAT bindings to maintain per stream. For that reason, modern applications often multiplex RTCP on the same port as RTP (RFC 5761).

RTP Is an Application-Layer Protocol

Despite being called a 'transport' protocol, RTP operates at the application layer. It runs on top of UDP and provides no transport-level guarantees. The 'Transport' in RTP refers to its purpose: transporting real-time media data.

RTP Header Format

The RTP header is optimized for real-time media transport, containing exactly the information needed for timing, sequencing, and source identification—nothing more.

RTP Header Structure (12 bytes minimum):

RTP Fixed Header Fields
Field	Size	Description
Version (V)	2 bits	RTP version, always 2
Padding (P)	1 bit	Indicates padding bytes at end of packet
Extension (X)	1 bit	Indicates header extension present
CSRC Count (CC)	4 bits	Number of CSRC identifiers (0-15)
Marker (M)	1 bit	Profile-specific marker (e.g., frame boundary)
Payload Type (PT)	7 bits	Identifies codec/format (0-127)
Sequence Number	16 bits	Increments for each packet (wraps)
Timestamp	32 bits	Sampling instant of first sample
SSRC	32 bits	Synchronization source identifier
CSRC List	0-60 bytes	Contributing source identifiers
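
The field layout in the table above can be sketched as a small parser. This is a minimal illustration, not a full RTP stack: the names RtpHeader and parse_rtp_header are made up for this example, and header extensions and padding are detected but not processed.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed header plus any CSRC identifiers."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:end])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )


# Hand-built example: version 2, marker set, payload type 0 (PCMU), no CSRCs
sample = struct.pack("!BBHII", 0x80, 0x80, 4660, 160, 0xDEADBEEF)
header = parse_rtp_header(sample)
```

The two flag bytes are unpacked with bit masks because V, P, X, and CC share the first byte while M and PT share the second.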

Critical Header Fields Explained:

Sequence Number (16 bits): Increments by one for each RTP packet sent. Used for:

Detecting packet loss (gaps in sequence)
Reordering out-of-sequence packets
Does NOT wrap at media boundaries—continues across frames

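The 16-bit value wraps from 65535 back to 0, so receivers must handle wraparound correctly, typically by comparing sequence numbers with modular arithmetic rather than a plain less-than test.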
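
One possible way to handle wraparound is sketched below. The helper seq_delta and the class SequenceTracker are hypothetical names for this example; the approach (a signed 16-bit difference plus a cycle counter) is a common pattern, simplified here by assuming reordering never exceeds half the sequence space.

```python
def seq_delta(new: int, old: int) -> int:
    """Signed distance from old to new in 16-bit modular arithmetic."""
    return ((new - old + 0x8000) & 0xFFFF) - 0x8000


class SequenceTracker:
    """Extends 16-bit sequence numbers into a monotonically growing count."""

    def __init__(self, first_seq: int) -> None:
        self.cycles = 0          # number of completed 65536-packet cycles
        self.max_seq = first_seq  # highest sequence number seen so far

    def update(self, seq: int) -> int:
        """Return the extended sequence number for this packet."""
        if seq_delta(seq, self.max_seq) > 0:
            # Packet is ahead of everything seen; a numeric drop means a wrap
            if seq < self.max_seq:
                self.cycles += 1
            self.max_seq = seq
            return self.cycles * 65536 + seq
        # Late or duplicate packet: it may belong to the previous cycle
        cycles = self.cycles - 1 if seq > self.max_seq else self.cycles
        return cycles * 65536 + seq


tracker = SequenceTracker(65534)
extended = [tracker.update(s) for s in (65535, 0, 65535, 2)]
```

In this run the second 65535 is a late packet that arrives after the wrap, so it is mapped back into the earlier cycle instead of being treated as a jump forward.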

Timestamp (32 bits): Reflects the sampling instant of the first sample in the packet. Critical for:

Synchronizing playback timing
Calculating inter-packet arrival jitter
Lip-sync between audio and video streams

Timestamp increments are media-dependent:

Audio (8kHz sample rate): +160 per 20ms packet (160 samples)
Video (90kHz clock): +3000 per 33.3ms frame (30fps)
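
The two increments above follow from multiplying the clock rate by the media duration each packet or frame covers. A small sketch of that arithmetic, using a hypothetical helper next_timestamp and exact fractions to avoid rounding surprises:

```python
from fractions import Fraction


def next_timestamp(ts: int, clock_rate_hz: int, duration_s: Fraction) -> int:
    """Advance a 32-bit RTP timestamp by the media duration just sent."""
    step = int(clock_rate_hz * duration_s)
    return (ts + step) & 0xFFFFFFFF  # the timestamp field wraps at 2**32


audio_step = next_timestamp(0, 8000, Fraction(20, 1000))   # 20 ms of 8 kHz audio
video_step = next_timestamp(0, 90000, Fraction(1, 30))     # one frame at 30 fps
wrapped = next_timestamp(0xFFFFFFFF, 8000, Fraction(20, 1000))
```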

Sequence Number vs. Timestamp

Sequence numbers increment per packet; timestamps reflect sampling time. For audio, if silence suppression stops transmission for a while, the timestamp of the next packet jumps ahead by the length of the silence while sequence numbers stay consecutive, which lets a receiver tell silence apart from loss. For video, multiple packets from one frame share the same timestamp but have consecutive sequence numbers.

SSRC (Synchronization Source - 32 bits):

Randomly generated identifier for each media source. Used for:

Distinguishing multiple participants in a conference
Separating streams from the same sender (e.g., camera + screen share)
SSRC collision detection (if two sources randomly pick the same SSRC)

Payload Type (7 bits):

Identifies the codec/format of the payload. Standard assignments from RFC 3551 exist for common codecs, with dynamic assignment (96-127) for others:

Common RTP Payload Types
PT	Codec	Media Type	Clock Rate
0	PCMU (G.711 μ-law)	Audio	8000 Hz
8	PCMA (G.711 A-law)	Audio	8000 Hz
3	GSM	Audio	8000 Hz
26	Motion JPEG	Video	90000 Hz
31	H.261	Video	90000 Hz
34	H.263	Video	90000 Hz
96-127	Dynamic	Any	Negotiated

Dynamic payload types (96-127) are negotiated via signaling protocols like SDP (Session Description Protocol). Modern codecs like H.264, H.265, VP8, VP9, AV1 (video) and Opus (audio) use dynamic assignment.

Marker Bit:

Profile-specific flag. Common uses:

Video: set on the last packet of a video frame
Audio: set on first packet after a silence period

Enables receivers to identify frame boundaries without parsing payload data.

RTCP - Quality Feedback Mechanism

RTP alone provides no feedback about transmission quality. RTCP (RTP Control Protocol) fills this gap, providing out-of-band quality metrics that enable participants to adapt to network conditions.

RTCP's Core Functions:

Quality Reporting: Senders and receivers exchange statistics about packet loss, jitter, and delay
Participant Identification: Canonical names (CNAME) associate SSRC values with participants
Membership Tracking: Periodic reports indicate active participants
Lip-sync Information: Timestamps correlate RTP clock with wall-clock time

RTCP Packet Types:

RTCP Packet Types
Type	Name	Description
200	SR (Sender Report)	Statistics from active senders
201	RR (Receiver Report)	Statistics from receivers
202	SDES (Source Description)	CNAME and participant info
203	BYE	Indicates participant leaving
204	APP	Application-specific data
205	RTPFB	Transport layer feedback (NACK)
206	PSFB	Payload-specific feedback (PLI, FIR)

Sender Report (SR) Contents:

Sent by participants who are also sending RTP:

NTP timestamp: Wall-clock time when report was sent
RTP timestamp: Corresponding RTP timestamp for lip-sync calculation
Sender's packet count: Total RTP packets sent
Sender's octet count: Total bytes sent
Report blocks: Reception statistics for each source this sender receives
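
The NTP/RTP timestamp pair in a Sender Report is what makes lip-sync possible: it anchors each stream's RTP clock to a shared wall clock. The sketch below shows the mapping under simplifying assumptions: rtp_to_wallclock is a hypothetical helper, the NTP time is given as plain seconds, and clock drift between reports is ignored.

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float,
                     clock_rate_hz: int) -> float:
    """Map an RTP timestamp to wall-clock seconds using the latest SR pair."""
    # Signed 32-bit difference so the mapping survives timestamp wraparound
    diff = ((rtp_ts - sr_rtp_ts + 0x80000000) & 0xFFFFFFFF) - 0x80000000
    return sr_ntp_seconds + diff / clock_rate_hz


# Audio SR says RTP 8000 corresponds to wall-clock 1000.0 s (8 kHz clock)
audio_time = rtp_to_wallclock(8800, 8000, 1000.0, 8000)
# Video SR says RTP 90000 corresponds to wall-clock 1000.0 s (90 kHz clock)
video_time = rtp_to_wallclock(99000, 90000, 1000.0, 90000)

# A receiver would delay whichever stream is ahead by this offset
sync_offset = audio_time - video_time
```

Both example packets map to the same wall-clock instant, so they should be played out together even though their raw RTP timestamps are unrelated.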

Receiver Report (RR) Statistics:

Receiver reports contain per-source reception quality metrics:

Fraction lost: Packet loss rate since the last report, as an 8-bit fixed-point value (packets lost / packets expected × 256, so 0-255 maps to roughly 0-100%)
Cumulative packets lost: Total packets lost since session start (24 bits, signed)
Extended highest sequence number: Tracks sequence number wraparounds
Interarrival jitter: Smoothed estimate of packet timing variation
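Last SR timestamp (LSR): Middle 32 bits of the NTP timestamp from the most recently received SR, which identifies when that SR was sent (for RTT calculation)
Delay since last SR (DLSR): Time between receiving that SR and sending this report, in units of 1/65536 second (for RTT calculation)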
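
The interarrival jitter field is a running estimate. As a sketch of the estimator described in RFC 3550 (each new deviation is folded in with a gain of 1/16), assuming arrival times have already been converted to RTP timestamp units and using a made-up function name update_jitter:

```python
def update_jitter(jitter, prev_transit, arrival_ts, rtp_ts):
    """Fold one packet into the smoothed jitter; returns (jitter, transit)."""
    transit = arrival_ts - rtp_ts  # relative transit time for this packet
    if prev_transit is None:
        return 0.0, transit        # first packet: nothing to compare against
    deviation = abs(transit - prev_transit)
    return jitter + (deviation - jitter) / 16.0, transit


# Packets sent every 160 units; the third one arrives 20 units late
packets = [(1000, 0), (1160, 160), (1340, 320), (1480, 480)]  # (arrival, rtp_ts)
jitter, transit = 0.0, None
for arrival, rtp_ts in packets:
    jitter, transit = update_jitter(jitter, transit, arrival, rtp_ts)
```

The 1/16 gain means a single late packet nudges the estimate rather than dominating it, which is why the value is described as smoothed.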

RTT Calculation from RTCP

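RTT = current_time - LSR - DLSR. When sender A receives an RR from B containing LSR and DLSR, A can calculate the round-trip time: LSR identifies when A sent the SR, and DLSR is how long B held it before replying, so subtracting both from the RR's arrival time leaves only the network round trip. All three values must be expressed in the same units (1/65536 second in RTCP).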
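
A worked version of this subtraction, as a sketch: ntp_middle32 and rtt_seconds are hypothetical helpers, and times are given as plain seconds for readability before being converted to the 1/65536-second units that RTCP uses for LSR and DLSR.

```python
def ntp_middle32(seconds: float) -> int:
    """Express a time in 1/65536-second units, truncated to 32 bits."""
    return int(seconds * 65536) & 0xFFFFFFFF


def rtt_seconds(arrival: int, lsr: int, dlsr: int) -> float:
    """RTT = arrival - LSR - DLSR, computed modulo 2**32."""
    return ((arrival - lsr - dlsr) & 0xFFFFFFFF) / 65536


lsr = ntp_middle32(100.0)       # A sent its SR at t = 100.0 s
dlsr = ntp_middle32(2.0)        # B waited 2.0 s before sending the RR
arrival = ntp_middle32(102.25)  # A received the RR at t = 102.25 s
rtt = rtt_seconds(arrival, lsr, dlsr)
```

Note that A needs no clock synchronization with B: both LSR and the arrival time come from A's own clock, and DLSR is a duration.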

RTCP Bandwidth Throttling:

RTCP is designed to consume no more than 5% of session bandwidth. As the number of participants grows, RTCP reporting intervals increase to stay within this budget:

Average RTCP interval ≈ (average RTCP packet size × participants) / (session bandwidth × 0.05)

RFC 3550 also sets a minimum interval of 5 seconds, which is what small calls see in practice. In large conferences with thousands of participants, intervals can stretch to minutes.
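
The scaling behaviour can be checked with a few lines of arithmetic. This is a deliberately simplified sketch: the interval is the time needed for every participant to send one average-sized report within the RTCP bandwidth share, floored at 5 seconds. It ignores the sender/receiver bandwidth split and the randomization that RFC 3550 also specifies, and the 100-byte packet size is an assumed figure.

```python
def rtcp_interval_seconds(session_bw_bps: float, participants: int,
                          avg_rtcp_packet_bytes: float = 100.0,
                          rtcp_fraction: float = 0.05,
                          min_interval_s: float = 5.0) -> float:
    """Approximate average time between RTCP reports from one participant."""
    rtcp_bytes_per_s = session_bw_bps * rtcp_fraction / 8
    interval = avg_rtcp_packet_bytes * participants / rtcp_bytes_per_s
    return max(interval, min_interval_s)


small_call = rtcp_interval_seconds(64_000, participants=2)
large_conference = rtcp_interval_seconds(64_000, participants=5000)
```

With a 64 kbps session the two-party call hits the 5-second floor, while 5000 participants push the interval to roughly 21 minutes.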

Jitter Buffer and Playout

Network jitter—the variation in packet arrival times—is the primary challenge for real-time media playback. Even if average latency is acceptable, high jitter causes packets to miss their playout deadline, creating gaps in audio and video.

The Jitter Buffer Concept:

A jitter buffer sits between the network and the decoder, absorbing timing variations by introducing a small, controlled delay. Packets are held until their scheduled playout time, smoothing out arrival time variations.

Static vs. Adaptive Jitter Buffers:

Static Buffer:

Fixed delay (e.g., always 60ms)
Simple implementation
May be too small (late packets) or too large (unnecessary latency)

Adaptive Buffer:

Dynamically adjusts depth based on observed jitter
Shrinks when network is stable (lower latency)
Grows when jitter increases (fewer dropped packets)
More complex, requires smooth adjustment to avoid audio artifacts

Jitter Buffer Tradeoffs
Buffer Depth	Late Packet Handling	User Experience
Too small (20ms)	Many packets arrive late, dropped	Choppy audio, gaps
Optimal (~50ms)	Most packets arrive in time	Smooth playback, low latency
Too large (200ms)	All packets arrive in time	Noticeable delay, awkward conversation
Adaptive	Adjusts to conditions	Best balance of latency and quality
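
To make the adaptive approach concrete, here is a minimal sketch of one possible policy: track smoothed jitter and set the target delay to a multiple of it, clamped to a range. The class name, the multiplier of 4, and the 20-200 ms bounds are all assumptions for illustration; production buffers use more elaborate estimators and adjust playout gradually.

```python
class AdaptiveJitterBuffer:
    """Derives a target playout delay from a smoothed jitter estimate."""

    def __init__(self, min_delay_ms=20.0, max_delay_ms=200.0, multiplier=4.0):
        self.min_delay_ms = min_delay_ms
        self.max_delay_ms = max_delay_ms
        self.multiplier = multiplier
        self.jitter_ms = 0.0
        self.prev_transit_ms = None
        self.target_delay_ms = min_delay_ms

    def on_packet(self, send_time_ms: float, arrival_time_ms: float) -> float:
        """Update the jitter estimate and return the new target delay."""
        transit = arrival_time_ms - send_time_ms
        if self.prev_transit_ms is not None:
            deviation = abs(transit - self.prev_transit_ms)
            # Exponential smoothing: decays when stable, grows when jittery
            self.jitter_ms += (deviation - self.jitter_ms) / 16.0
        self.prev_transit_ms = transit
        wanted = self.multiplier * self.jitter_ms
        self.target_delay_ms = min(max(wanted, self.min_delay_ms),
                                   self.max_delay_ms)
        return self.target_delay_ms


buf = AdaptiveJitterBuffer()
# Stable network: every packet takes exactly 50 ms
for i in range(20):
    stable_delay = buf.on_packet(i * 20, i * 20 + 50)
# Jittery network: transit time alternates between 50 ms and 90 ms
for i in range(20, 60):
    jittery_delay = buf.on_packet(i * 20, i * 20 + (90 if i % 2 else 50))
```

The same smoothing that grows the buffer under jitter also shrinks it again once the network calms down, which matches the shrink/grow behaviour listed above.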

Packet Loss Concealment:

When packets are lost or arrive too late, the jitter buffer must handle the gap. Common strategies:

For Audio:

Silence insertion: Simple but produces jarring gaps
Packet repetition: Replay the audio from the last received packet
Interpolation: Smooth between surrounding samples
Packet Loss Concealment (PLC): Codec-specific algorithms that synthesize plausible audio based on preceding patterns (Opus excels at this)
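
The simplest of these strategies can be sketched in a few lines. The function below is a toy illustration of repetition with fade-out, not a codec-grade PLC algorithm: conceal_gaps is a hypothetical name, frames are plain lists of integer samples, and the 0.5 fade factor is an arbitrary choice.

```python
def conceal_gaps(frames, first_seq, last_seq, frame_len, fade=0.5):
    """Fill missing frames by repeating the previous one at reduced volume."""
    output = []
    last = [0] * frame_len  # silence until the first frame arrives
    for seq in range(first_seq, last_seq + 1):
        if seq in frames:
            last = frames[seq]
        else:
            # Attenuate each repeat so a long gap fades out instead of buzzing
            last = [int(sample * fade) for sample in last]
        output.append(last)
    return output


received = {1: [100, 100], 3: [40, 40]}  # frames 2 and 4 never arrived
played = conceal_gaps(received, first_seq=1, last_seq=4, frame_len=2)
```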

For Video:

Frame freezing: Repeat the last good frame
Error concealment: Copy blocks from adjacent frames
Request I-frame: Ask sender for a complete keyframe (PLI/FIR via RTCP)
Forward Error Correction: Reconstruct from redundant data (if available)

The Opus Codec Advantage

Modern codecs like Opus include sophisticated PLC algorithms. Opus can typically conceal the loss of an isolated packet (around 20ms of audio) with few perceptible artifacts, and it can also carry in-band forward error correction, making it well suited to VoIP and conferencing where some packet loss is inevitable.

RTP Applications and Use Cases

RTP underpins virtually all real-time communication on the internet. Understanding its application contexts reveals why it was designed the way it was.

Voice over IP (VoIP):

VoIP applications (Skype, WhatsApp, Zoom audio, Discord) use RTP for voice transport:

Codec: Opus (dynamic), G.711 (legacy), G.729 (bandwidth-constrained)
Packet interval: Typically 20ms (160 bytes of payload for G.711 at 64 kbps, 20 bytes for G.729 at 8 kbps; Opus varies with its configured bitrate)
Latency requirement: <150ms one-way
Loss tolerance: ~1% with PLC, higher with aggressive concealment

RTP Application Profiles
Application	Media Type	Typical Bitrate	Latency Target	Loss Tolerance
VoIP Call	Audio	6-64 kbps	< 150ms	~1%
Video Call (720p)	Audio + Video	1-2 Mbps	< 200ms	~2%
Webinar	Audio + Video + Screen	2-5 Mbps	< 500ms	~3%
Game Streaming	Video + Controls	5-50 Mbps	< 50ms	< 0.1%
Surveillance IP Camera	Video	1-8 Mbps	< 500ms	~5%
Live Sports Broadcast	Audio + Video	10-50 Mbps	< 3000ms	~0.5%

Video Conferencing:

Modern video conferencing (Zoom, Teams, Meet) uses RTP with sophisticated enhancements:

Simulcast: Sender transmits multiple quality levels; the server forwards the level that fits each receiver's available bandwidth
SVC (Scalable Video Coding): Single encoded stream with embedded layers that can be dropped to reduce quality
Adaptive bitrate: Real-time encoding adjustments based on RTCP feedback
Selective forwarding: SFU (Selective Forwarding Unit) servers route streams without transcoding
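
The per-receiver choice in a simulcast setup can be illustrated with a small selection function. This is a hypothetical sketch: select_layer, the layer names, and the bitrates are invented for the example, and a real SFU would also consider resolution, frame rate, and switching only at keyframes.

```python
def select_layer(layers: dict, available_bps: int) -> str:
    """Pick the highest-bitrate layer that fits the receiver's bandwidth."""
    fitting = {name: bps for name, bps in layers.items() if bps <= available_bps}
    if not fitting:
        # Nothing fits: fall back to the cheapest layer rather than sending nothing
        return min(layers, key=layers.get)
    return max(fitting, key=fitting.get)


simulcast_layers = {"low": 150_000, "mid": 500_000, "high": 1_500_000}
choice_constrained = select_layer(simulcast_layers, 800_000)
choice_poor = select_layer(simulcast_layers, 100_000)
choice_good = select_layer(simulcast_layers, 2_000_000)
```

Because the server only picks among streams the sender already encoded, no transcoding is needed.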

WebRTC and RTP:

WebRTC (Web Real-Time Communication) brings RTP to web browsers:

Browsers implement full RTP/RTCP stack
SRTP (Secure RTP) mandatory for encryption
ICE (Interactive Connectivity Establishment) for NAT traversal
DTLS for key exchange
JavaScript API for media capture and playback

WebRTC has made RTP accessible to any web application, powering video chat features in everything from telemedicine to online education.

RTP's Ubiquity

RTP carries much of the world's real-time media. From emergency calls over VoIP, to video conferencing, to IP surveillance cameras and remote telepresence, RTP's design has proven adaptable to a wide range of requirements over its 25+ year history.

RTP Security - SRTP

Native RTP provides no security—packets are unencrypted and unauthenticated. For any security-sensitive application, SRTP (Secure Real-time Transport Protocol) is essential.

SRTP Security Services:

Confidentiality: Payload encryption prevents eavesdropping
Message authentication: HMAC prevents packet tampering
Replay protection: Sequence number tracking prevents replay attacks
Header fields: The RTP header is authenticated but not encrypted, so fields such as SSRC and sequence number stay readable for routing and demultiplexing

SRTP Encryption:

SRTP uses AES (Advanced Encryption Standard) in counter mode:

AES-CM (Counter Mode): Stream cipher properties, no block alignment needed
Key derivation: Master key generates separate encryption and authentication keys
Per-packet nonce: Derived from SSRC and packet index (includes rollover counter)
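
The nonce derivation can be sketched with plain integer arithmetic. The formula below follows RFC 3711 as I understand it for AES-CM: the 48-bit packet index is the rollover counter (ROC) followed by the 16-bit sequence number, and the 128-bit IV combines a 112-bit session salt, the SSRC, and the index by XOR. The function names are made up for this example, and the session salt is assumed to have been derived already.

```python
def srtp_packet_index(roc: int, seq: int) -> int:
    """48-bit packet index: rollover counter followed by the sequence number."""
    return (roc << 16) | seq


def aes_cm_iv(session_salt: bytes, ssrc: int, roc: int, seq: int) -> bytes:
    """IV = (salt * 2^16) XOR (SSRC * 2^64) XOR (index * 2^16)."""
    if len(session_salt) != 14:
        raise ValueError("AES-CM session salt must be 112 bits (14 bytes)")
    salt = int.from_bytes(session_salt, "big")
    index = srtp_packet_index(roc, seq)
    iv = (salt << 16) ^ (ssrc << 64) ^ (index << 16)
    return iv.to_bytes(16, "big")


# With SSRC and index both zero, the IV is just the salt followed by 16 zero bits
iv_salt_only = aes_cm_iv(bytes(range(0xF0, 0xFE)), ssrc=0, roc=0, seq=0)
# With a zero salt, the SSRC and index land in their own bit positions
iv_fields = aes_cm_iv(bytes(14), ssrc=1, roc=0, seq=1)
```

The low 16 bits are left at zero because they serve as the block counter within a single packet's keystream.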

Authentication Tag:

SRTP appends an authentication tag (typically 80 or 32 bits) computed over:

RTP header (sent in the clear, but covered by the tag)
Encrypted payload
Rollover counter (ROC), which is included in the computation but not transmitted

Receivers verify the tag before decryption, detecting tampering immediately.
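
A sketch of tag generation and verification using Python's standard hmac module. It assumes the HMAC-SHA1 transform with an 80-bit tag and, following RFC 3711 as I read it, appends the 32-bit rollover counter (ROC) to the authenticated data. The function names are hypothetical and the all-zero key and header are placeholders, not real session values.

```python
import hashlib
import hmac


def srtp_auth_tag(auth_key: bytes, header: bytes, encrypted_payload: bytes,
                  roc: int, tag_len: int = 10) -> bytes:
    """HMAC-SHA1 over header, ciphertext, and ROC, truncated to tag_len bytes."""
    message = header + encrypted_payload + roc.to_bytes(4, "big")
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:tag_len]


def verify_srtp_tag(auth_key: bytes, header: bytes, encrypted_payload: bytes,
                    roc: int, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time before decrypting."""
    expected = srtp_auth_tag(auth_key, header, encrypted_payload, roc, len(tag))
    return hmac.compare_digest(expected, tag)


key = bytes(20)      # placeholder session authentication key
rtp_header = bytes(12)  # placeholder 12-byte RTP header
ciphertext = b"encrypted media"
tag = srtp_auth_tag(key, rtp_header, ciphertext, roc=0)

accepted = verify_srtp_tag(key, rtp_header, ciphertext, 0, tag)
tampered = verify_srtp_tag(key, rtp_header, b"Encrypted media", 0, tag)
```

Using hmac.compare_digest instead of == avoids leaking how many leading tag bytes matched through timing differences.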

SRTP Cryptographic Options
Algorithm	Key Size	Tag Size	Notes
AES_CM_128_HMAC_SHA1_80	128 bits	80 bits	Most common, good security
AES_CM_128_HMAC_SHA1_32	128 bits	32 bits	Reduced overhead, some applications
AES_256_CM_HMAC_SHA1_80	256 bits	80 bits	Higher security, more CPU
AEAD_AES_128_GCM	128 bits	128 bits	Modern, authenticated encryption

Key Exchange is Critical

SRTP protects media, but it does not define how keys are exchanged. In WebRTC, DTLS-SRTP provides secure key negotiation. In VoIP, ZRTP or SDES (SDP Security Descriptions, which require encrypted signaling) is used. Without secure key exchange, SRTP encryption is meaningless: an attacker who intercepts the keys can decrypt everything.

Summary: Real-time Transport Protocol

RTP represents a purpose-built solution for a specific problem domain: real-time media transport. Its design choices—running on UDP, providing timing/sequencing without reliability, separating media from control—reflect deep understanding of real-time communication requirements.

Key Takeaways

•UDP Foundation — Real-time media cannot tolerate TCP's reliability-induced latency. UDP provides the low-latency substrate RTP needs.
•RTP Header Design — Optimized for real-time: timestamps for playback timing, sequence numbers for loss detection, SSRC for source identification.
•RTCP Feedback — Quality reports enable adaptive bitrate, loss concealment decisions, and lip-sync calculation.
•Jitter Buffers — Bridge the gap between network timing variations and smooth playback requirements.
•Ubiquitous Deployment — RTP powers VoIP, video conferencing, streaming, gaming, and surveillance worldwide.
•SRTP Security — Encryption and authentication are essential for any sensitive real-time communication.
•Application Control — RTP provides infrastructure; applications make intelligent decisions about loss, quality, and adaptation.

What's Next:

We'll explore TFTP (Trivial File Transfer Protocol), another UDP-based protocol designed for a very different purpose: simple, lightweight file transfer for network bootstrapping scenarios where TCP's complexity would be a liability.

Page Complete

You now understand RTP's architecture, header format, RTCP feedback mechanisms, and application contexts. RTP's design demonstrates how UDP's simplicity enables specialized protocols that outperform TCP for their intended use cases.
