RTP and RTCP - Learning Module


RTP Header: Anatomy of Real-time Packets

The 12 Bytes That Enable Real-time Communication

Every RTP packet, whether it carries a voice sample from a phone call, a video frame from a conference, or game state from a cloud gaming server, begins with the same 12-byte structure. This compact header contains everything receivers need to sequence, time, and process real-time media.

The RTP header balances information density against overhead: every field in its 12 bytes has a defined role. Understanding it is essential for anyone implementing, debugging, or optimizing real-time communication systems.

This page examines the RTP header field by field, explaining not just what each field contains but why it exists and how implementations use it in practice.

What You Will Learn

By the end of this page, you will understand every field in the RTP header, including version flags, payload type encoding, sequence number behavior, timestamp semantics, SSRC identification, and optional CSRC lists. You'll learn how these fields work together to enable real-time communication.

RTP Header Overview

The RTP header consists of a fixed 12-byte portion present in every packet, optionally followed by a list of CSRC identifiers and a variable-length header extension. The fixed header contains all essential information for basic real-time transport; extensions carry application-specific data when needed.

The header is designed for efficient parsing by both software and hardware implementations: multi-byte fields are aligned on 16-bit or 32-bit boundaries, and the flags that determine the rest of the packet's layout sit in the first byte.

RTP Fixed Header Structure (12 bytes)

Header Layout

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 
Field Summary:
V  (2 bits)  : Version, always 2 for current RTP
P  (1 bit)   : Padding flag, indicates padding bytes at end
X  (1 bit)   : Extension flag, indicates header extension present
CC (4 bits)  : CSRC count, number of CSRC identifiers following
M  (1 bit)   : Marker bit, application-specific significance
PT (7 bits)  : Payload type, identifies media format
Sequence (16 bits): Packet sequence number
Timestamp (32 bits): Media sampling timestamp
SSRC (32 bits): Synchronization source identifier
CSRC (32 bits each): Contributing source identifiers (0-15)

RTP Header Field Summary
Field	Size	Purpose	Key Property
Version (V)	2 bits	Protocol version identification	Always 2 for current RTP
Padding (P)	1 bit	Indicates padding at packet end	For encryption block alignment
Extension (X)	1 bit	Indicates header extension present	Enables future extensibility
CSRC Count (CC)	4 bits	Number of CSRC identifiers	0-15 sources possible
Marker (M)	1 bit	Application-defined significance	Often marks frame boundaries
Payload Type (PT)	7 bits	Identifies media format	0-127, format negotiated via SDP
Sequence Number	16 bits	Packet ordering	Increments by 1 per packet
Timestamp	32 bits	Media capture time	Media-specific clock rate
SSRC	32 bits	Stream source identifier	Randomly generated, unique
CSRC	32 bits each	Contributing sources	Present when CC > 0
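
The field layout above maps directly onto a handful of shifts and masks. The following is a minimal sketch of a fixed-header parser; the `RtpFixedHeader` interface and the `parseRtpFixedHeader` name are illustrative choices, not part of any particular library.

```typescript
// Hypothetical shape for the parsed fixed header; field names are illustrative.
interface RtpFixedHeader {
    version: number;
    padding: boolean;
    extension: boolean;
    csrcCount: number;
    marker: boolean;
    payloadType: number;
    sequenceNumber: number;
    timestamp: number;
    ssrc: number;
}

function parseRtpFixedHeader(packet: Uint8Array): RtpFixedHeader | null {
    if (packet.length < 12) {
        return null; // too short to hold the fixed header
    }
    // DataView reads multi-byte fields in network (big-endian) order by default.
    const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
    const b0 = view.getUint8(0);
    const b1 = view.getUint8(1);
    return {
        version: b0 >> 6,               // top 2 bits
        padding: (b0 & 0x20) !== 0,     // bit 5
        extension: (b0 & 0x10) !== 0,   // bit 4
        csrcCount: b0 & 0x0f,           // low 4 bits
        marker: (b1 & 0x80) !== 0,      // top bit of second byte
        payloadType: b1 & 0x7f,         // low 7 bits of second byte
        sequenceNumber: view.getUint16(2),
        timestamp: view.getUint32(4),
        ssrc: view.getUint32(8),
    };
}
```

The parser only covers the fixed 12 bytes; any CSRC list or header extension starts at offset 12 and has to be handled separately.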

Version, Padding, and Extension Flags

The first byte of the RTP header begins with three control fields that receivers check immediately upon packet arrival.

Version (V) - 2 bits

The version field identifies the RTP protocol version. The current version is 2. Receivers encountering any other value should discard the packet as malformed or belonging to a different protocol.

Value 0 was used by the protocol originally implemented in the vat audio tool, and value 1 by the first draft version of RTP. Version 2 was defined in RFC 1889 (1996, now obsolete) and retained by its successor, RFC 3550 (2003).

Version Field Handling
// Extract version from first byte of RTP header
function parseRtpVersion(firstByte: number): number {
    // Version is in bits 6-7 (most significant 2 bits)
    return (firstByte >> 6) & 0x03;
}
 
// Validate RTP packet version
function validateRtpPacket(packet: Uint8Array): boolean {
    if (packet.length < 12) {
        console.error("Packet too short for RTP header");
        return false;
    }
    
    const version = parseRtpVersion(packet[0]);
    if (version !== 2) {
        console.error(`Invalid RTP version: ${version}`);
        return false;
    }
    
    return true;
}

Padding (P) - 1 bit

When set, the padding bit indicates that the packet ends with padding bytes that are not part of the payload. The last byte of the padding gives the total number of padding bytes, including itself.

Padding is typically used when an encryption algorithm requires block-aligned data. For example, AES operates on 16-byte blocks, so payloads whose length is not a multiple of 16 bytes require padding. RTP's padding mechanism standardizes how this is signaled.

Padding Handling Example

Visualization

Packet with padding (P=1):
┌──────────┬────────────────────────────┬─────────────┐
│  Header  │       Payload Data         │  Padding    │
│ 12 bytes │     Variable length        │  N bytes    │
└──────────┴────────────────────────────┴─────────────┘
                                                    ▲
                                            Last byte = N
                                         (padding length)
 
Example: 47-byte payload needs AES alignment (16 bytes)
- Next 16-byte boundary: 48 bytes
- Padding needed: 1 byte
- Padding content: 0x01 (1 byte of padding, value = count)
 
Example: 42-byte payload needs AES alignment
- Next 16-byte boundary: 48 bytes
- Padding needed: 6 bytes  
- Padding content: 0x00 0x00 0x00 0x00 0x00 0x06
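
The arithmetic in these examples can be sketched in a few lines. Both function names below are hypothetical, and the receiver-side check only validates the padding count against the 12-byte fixed header, ignoring any CSRC list or header extension.

```typescript
// Sender side: how many padding bytes bring payloadLength up to a multiple of blockSize.
function paddingNeeded(payloadLength: number, blockSize: number): number {
    return (blockSize - (payloadLength % blockSize)) % blockSize;
}

// Receiver side: offset one past the last real payload byte.
function payloadEndOffset(packet: Uint8Array): number {
    if (packet.length < 12) {
        throw new RangeError("Packet too short for RTP header");
    }
    const hasPadding = (packet[0] & 0x20) !== 0; // P bit
    if (!hasPadding) {
        return packet.length;
    }
    // The last byte holds the padding count, which includes that byte itself.
    const padLength = packet[packet.length - 1];
    if (padLength === 0 || padLength > packet.length - 12) {
        throw new RangeError("Invalid padding length");
    }
    return packet.length - padLength;
}
```

With a 16-byte block size, `paddingNeeded(47, 16)` gives 1 and `paddingNeeded(42, 16)` gives 6, matching the two examples.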

Extension (X) - 1 bit

When set, the extension bit indicates that the fixed header (and any CSRC list) is followed by exactly one header extension. Extensions allow profile-specific or application-specific information to be carried in RTP packets without modifying the base protocol.

The header extension begins with a 16-bit profile-defined identifier and a 16-bit length field that counts the 32-bit words of extension data that follow. Common extensions include:

- Absolute capture time: Precise wall-clock timestamps for synchronization
- Audio level: Volume information for voice activity detection
- Video orientation: Rotation hints for mobile devices
- Transport-wide sequence numbers: Enhanced loss detection for congestion control

One-Byte vs Two-Byte Header Extensions

RFC 8285 (which obsoletes RFC 5285) defines two element formats: one-byte headers (IDs 1-14, 1 to 16 bytes of data per element) and two-byte headers (IDs 1-255, 0 to 255 bytes of data per element). WebRTC typically uses the one-byte format for efficiency. Either format lets a single RTP header extension carry multiple elements with different IDs.
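
As a rough illustration of the one-byte format, the sketch below walks the elements of a header extension. It assumes the input starts at the 16-bit profile identifier (0xBEDE for the one-byte format); `parseOneByteExtensions` is a hypothetical name, and error handling is reduced to stopping at the first inconsistency.

```typescript
// Returns a map from extension ID (1-14) to that element's data bytes.
function parseOneByteExtensions(ext: Uint8Array): Map<number, Uint8Array> {
    const elements = new Map<number, Uint8Array>();
    if (ext.length < 4) {
        return elements;
    }
    const view = new DataView(ext.buffer, ext.byteOffset, ext.byteLength);
    const profile = view.getUint16(0);
    const lengthWords = view.getUint16(2); // 32-bit words after this 4-byte header
    if (profile !== 0xbede) {
        return elements; // not the one-byte format
    }
    const end = Math.min(ext.length, 4 + lengthWords * 4);
    let pos = 4;
    while (pos < end) {
        const header = ext[pos];
        const id = header >> 4;
        if (id === 0) {
            pos += 1; // padding byte between or after elements
            continue;
        }
        if (id === 15) {
            break; // reserved ID: stop processing
        }
        const dataLength = (header & 0x0f) + 1; // stored as length minus one
        if (pos + 1 + dataLength > end) {
            break; // truncated element
        }
        elements.set(id, ext.subarray(pos + 1, pos + 1 + dataLength));
        pos += 1 + dataLength;
    }
    return elements;
}
```

Which ID corresponds to which extension (audio level, video orientation, and so on) is not fixed; it is assigned during session negotiation.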

CSRC Count and Marker Bit

CSRC Count (CC) - 4 bits

The CC field contains the number of Contributing Source (CSRC) identifiers that follow the fixed header. Valid values range from 0 to 15.

This field is primarily relevant when RTP mixers combine multiple input streams into a single output stream. The mixer creates new packets with its own SSRC as the source but includes the original SSRCs as CSRC entries, maintaining attribution to the original speakers.

When CC > 0:

The fixed header is followed by CC × 4 bytes of CSRC identifiers before the payload begins. Each CSRC is a 32-bit value identifying one of the original sources whose data contributed to this packet.

CSRC in Mixer Scenario

Mixing Example

Conference call scenario with mixer:
 
Participant A (SSRC: 0x12345678) speaks
Participant B (SSRC: 0x87654321) speaks simultaneously
Participant C (SSRC: 0xABCDEF00) is silent
 
Mixer combines A and B's audio, sends combined stream:
┌────────────────────────────────────────────────────┐
│ RTP Header (CC=2)                                  │
│   SSRC: 0xDEADBEEF (mixer's own identifier)       │
│   CSRC[0]: 0x12345678 (Participant A)             │
│   CSRC[1]: 0x87654321 (Participant B)             │
├────────────────────────────────────────────────────┤
│ Payload: Mixed audio from A + B                    │
└────────────────────────────────────────────────────┘
 
Benefits of CSRC:
• Receivers know who is speaking
• UI can highlight active speakers
• Recording can attribute audio sources
• Billing/analytics can track participation
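
On the receiving side, recovering the contributing sources is a matter of reading CC 32-bit values starting at byte 12. A minimal sketch, with `readCsrcList` as a hypothetical helper name:

```typescript
// Reads the CSRC list and reports where the header (fixed part + CSRCs) ends.
function readCsrcList(packet: Uint8Array): { csrcs: number[]; headerLength: number } {
    if (packet.length < 12) {
        throw new RangeError("Packet too short for RTP header");
    }
    const csrcCount = packet[0] & 0x0f; // CC: low 4 bits of the first byte
    const headerLength = 12 + csrcCount * 4;
    if (packet.length < headerLength) {
        throw new RangeError("Packet too short for its CSRC list");
    }
    const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
    const csrcs: number[] = [];
    for (let i = 0; i < csrcCount; i++) {
        csrcs.push(view.getUint32(12 + i * 4)); // big-endian, unsigned
    }
    return { csrcs, headerLength };
}
```

`headerLength` does not account for a header extension; if the X bit is set, the extension starts at that offset and the payload follows it.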

Marker Bit (M) - 1 bit

The marker bit has profile-specific semantics: its meaning depends on the application and payload format. The RTP specification intentionally leaves this bit's interpretation to profile documents, allowing different applications to use it for their specific needs.

Common marker bit uses:

Marker Bit Semantics by Application
Media / Payload Format	Marker Meaning	Purpose
Audio (voice)	Start of talkspurt	Indicates speech after silence, aids jitter buffer
Video (H.264)	End of video frame	Indicates last packet of frame for reassembly
Video (VP8/VP9)	End of frame	Similar to H.264, marks frame boundaries
Generic audio	Often unused	Set to 0 when no special meaning
Text/T.140	Start of text after an idle period	Indicates the first packet of a session or after a pause in typing

Why Variable Marker Semantics?

Different media types have fundamentally different needs. Audio needs to signal 'someone started speaking' to optimize jitter buffers. Video needs to signal 'frame is complete' so decoders can process immediately. Rather than forcing a universal interpretation, RTP allows each profile to define what makes sense for its domain.

Payload Type

Payload Type (PT) - 7 bits

The payload type field identifies the format of the RTP payload and determines its interpretation by the receiver. Values range from 0 to 127.

The payload type serves as a shorthand reference to a complete codec definition negotiated during session setup. When two endpoints establish an RTP session (via SIP/SDP, WebRTC, or other signaling), they agree on which payload type numbers correspond to which codecs, sample rates, and parameters.

Static vs. Dynamic Payload Types:

RTP defines two categories of payload type assignments:

Static Payload Types (0-34)

• Predefined by IANA with fixed meanings
• PT 0: PCMU (G.711 µ-law, 8kHz audio)
• PT 8: PCMA (G.711 A-law, 8kHz audio)
• PT 9: G.722 (7kHz audio)
• PT 26: Motion JPEG video
• PT 34: H.263 video
• Many are obsolete but values remain reserved

Dynamic Payload Types (96-127)

• Application-defined during session setup
• Mapped via SDP a=rtpmap attribute
• Common: H.264, VP8, Opus, AAC
• The same PT number can mean different things in different sessions
• Example: PT 97 = Opus in one call, H.264 in another
• Values 35-71 and 77-95 are unassigned; 72-76 are reserved to avoid conflicts with RTCP

SDP Payload Type Mapping

SDP

v=0
o=- 123456 123456 IN IP4 192.168.1.1
s=WebRTC Session
t=0 0
m=audio 49170 RTP/SAVPF 111 103 9 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:103 ISAC/16000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=fmtp:111 minptime=10; useinbandfec=1
 
Payload Type Meanings After Negotiation:
PT 0   → PCMU (static, always means G.711 µ-law)
PT 8   → PCMA (static, always means G.711 A-law)
PT 9   → G722 (static, wideband audio)
PT 103 → iSAC 16kHz (dynamic, session-specific)
PT 111 → Opus 48kHz stereo (dynamic, session-specific)
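
A receiver needs this mapping as a lookup table keyed by payload type. The sketch below extracts it from `a=rtpmap` lines only; it is not a full SDP parser, and the `RtpMapEntry` and `parseRtpMaps` names are illustrative.

```typescript
interface RtpMapEntry {
    payloadType: number;
    encodingName: string;
    clockRate: number;
    channels: number; // assumed to be 1 when the attribute omits it
}

// Collects every a=rtpmap attribute in an SDP blob, keyed by payload type.
function parseRtpMaps(sdp: string): Map<number, RtpMapEntry> {
    const result = new Map<number, RtpMapEntry>();
    // a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
    const pattern = /^a=rtpmap:(\d+) ([^\/\s]+)\/(\d+)(?:\/(\d+))?$/;
    for (const rawLine of sdp.split(/\r?\n/)) {
        const match = pattern.exec(rawLine.trim());
        if (!match) {
            continue;
        }
        const payloadType = Number(match[1]);
        result.set(payloadType, {
            payloadType,
            encodingName: match[2],
            clockRate: Number(match[3]),
            channels: match[4] ? Number(match[4]) : 1,
        });
    }
    return result;
}
```

Static payload types that appear on the `m=` line without an `a=rtpmap` attribute would not show up in this map, so a real receiver also needs the static table as a fallback.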

Payload Type Changes Mid-Stream

Senders can change payload types during an RTP session without signaling renegotiation, for example switching between Opus and G.711 when both were negotiated. Receivers must be prepared to handle any payload type negotiated during session setup at any time.

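Collision avoidance with RTCP:

Payload type values 72-76 are avoided in RTP because the RTCP packet type field occupies the same byte as the RTP marker bit and payload type. RTCP packet types 200-204 (SR, RR, SDES, BYE, APP) are indistinguishable from RTP payload types 72-76 with the marker bit set. When RTP and RTCP share a single port (rtcp-mux, commonly used together with BUNDLE), keeping payload types out of the conflicting range lets receivers demultiplex the two. RFC 5761 specifies this approach and widens the range to avoid to 64-95.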
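
One common way to apply this rule is to look at the second byte as a whole: values 192-223 are treated as RTCP packet types, everything else as RTP. The sketch below assumes that convention and that payload types 64-95 are never negotiated; `classifyMuxedPacket` is a hypothetical name.

```typescript
// Classifies a packet received on a port shared by RTP and RTCP.
function classifyMuxedPacket(packet: Uint8Array): 'rtp' | 'rtcp' | 'unknown' {
    if (packet.length < 2 || (packet[0] >> 6) !== 2) {
        return 'unknown'; // too short, or not version 2
    }
    // For RTCP this byte is the packet type; for RTP it is marker bit + payload type.
    const secondByte = packet[1];
    return secondByte >= 192 && secondByte <= 223 ? 'rtcp' : 'rtp';
}
```

A sender report (packet type 200) is classified as RTCP, while an RTP packet with payload type 111 and the marker bit set (second byte 239) is classified as RTP.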

Sequence Number

Sequence Number - 16 bits

The sequence number increments by one for each RTP packet sent and wraps around from 65535 back to 0. The initial sequence number is chosen randomly for security reasons (making stream injection attacks harder).

Primary purposes:

1. Loss detection: Gaps in sequence numbers indicate lost packets
2. Reordering detection: Out-of-order arrivals are detected by sequence comparison
3. Duplicate detection: The same sequence number appears twice
4. Packet counting: Enables statistics such as jitter and loss rate calculation

Sequence Number Analysis
TypeScript
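interface RtpStreamStats {
    lastSeqNum: number;        // most recent sequence number accepted in forward order
    expectedSeqNum: number;    // next sequence number we expect to see
    packetsReceived: number;
    packetsLost: number;
    sequenceWrapCount: number; // how many times the 16-bit counter has wrapped
}

const SEQ_MOD = 65536; // sequence numbers are 16 bits wide
const MAX_GAP = 3000;  // largest forward jump treated as packet loss
const MAX_LATE = 2000; // largest backward distance treated as a late packet

// Move the window forward to seqNum, counting a wrap if the number went down.
function advanceSequence(seqNum: number, stats: RtpStreamStats): void {
    if (seqNum < stats.lastSeqNum) {
        stats.sequenceWrapCount++;
    }
    stats.lastSeqNum = seqNum;
    stats.expectedSeqNum = (seqNum + 1) % SEQ_MOD;
    stats.packetsReceived++;
}

function processRtpSequence(
    seqNum: number,
    stats: RtpStreamStats
): 'ok' | 'lost' | 'duplicate' | 'reordered' {
    // Forward distance from the expected number, modulo 2^16
    const seqDiff = (seqNum - stats.expectedSeqNum + SEQ_MOD) % SEQ_MOD;

    if (seqDiff === 0) {
        // Perfect in-order delivery
        advanceSequence(seqNum, stats);
        return 'ok';
    }

    if (seqDiff < MAX_GAP) {
        // Gap in sequence: seqDiff packets were skipped
        stats.packetsLost += seqDiff;
        advanceSequence(seqNum, stats);
        console.log(`Detected ${seqDiff} lost packets`);
        return 'lost';
    }

    if (seqDiff > SEQ_MOD - MAX_LATE) {
        // Late packet: it was counted as lost when the gap was detected.
        // A true duplicate of an already received packet also lands here;
        // telling the two apart requires tracking received sequence numbers.
        if (stats.packetsLost > 0) {
            stats.packetsLost--;
        }
        stats.packetsReceived++;
        console.log(`Reordered packet: ${seqNum}`);
        return 'reordered';
    }

    // Jump too large to be a gap or a late packet (for example after a
    // stream restart). Reported as 'duplicate' so the caller drops it;
    // a fuller implementation would resynchronize instead.
    console.log(`Duplicate or out-of-range packet: ${seqNum}`);
    return 'duplicate';
}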

Why Random Initial Sequence Numbers?

Random initial sequence numbers prevent an attacker from easily injecting packets into an existing stream. Without knowing the current sequence number, an attacker's packets will be detected as out-of-sequence or duplicates. This provides defense in depth alongside encryption.

Sequence number vs. timestamp:

A common confusion is the difference between sequence numbers and timestamps. They serve different purposes:

- Sequence number: Increments by 1 per packet. Used for loss/reordering detection.
- Timestamp: Increments by samples per packet. Used for playback timing.

For audio at 48kHz with 20ms packets (960 samples per packet):
- Sequence: 100, 101, 102, 103...
- Timestamp: 0, 960, 1920, 2880...

Both are needed: sequence numbers tell you if packets arrived correctly; timestamps tell you when their content should play.
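
From the sender's side, the two counters advance independently and wrap at different widths. A minimal sketch, where `SenderCounters` and `nextPacketCounters` are hypothetical names:

```typescript
interface SenderCounters {
    sequenceNumber: number; // 16-bit, +1 per packet
    timestamp: number;      // 32-bit, +samples per packet
}

// Returns the counters for the packet that follows `current`.
function nextPacketCounters(current: SenderCounters, samplesPerPacket: number): SenderCounters {
    return {
        sequenceNumber: (current.sequenceNumber + 1) & 0xffff,   // wrap at 2^16
        timestamp: (current.timestamp + samplesPerPacket) >>> 0, // wrap at 2^32
    };
}
```

Starting from sequence 100 and timestamp 0 with 960 samples per packet, repeated calls produce the sequence 101, 102, 103 and the timestamps 960, 1920, 2880.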

Timestamp

Timestamp - 32 bits

The RTP timestamp reflects the sampling instant of the first octet of the media data in that packet. This is not wall-clock time; it's a media-specific value that increments at a rate defined by the payload format.

Clock rates by media type:

Different media types use different clock rates, reflecting their sampling characteristics:

Common RTP Timestamp Clock Rates
Codec/Format	Clock Rate	Typical Packet Duration	Timestamp Increment
G.711 (PCMU/PCMA)	8,000 Hz	20 ms	160
G.722	8,000 Hz*	20 ms	160
Opus	48,000 Hz	20 ms	960
AAC	Varies (44.1k, 48k)	~23 ms (1024 samples)	1024
H.264/H.265	90,000 Hz	33.33 ms (30fps)	3,000
VP8/VP9/AV1	90,000 Hz	33.33 ms (30fps)	3,000

G.722's Historical Anomaly

G.722 samples audio at 16kHz but uses an 8kHz RTP clock rate. The 8,000 Hz value was erroneously assigned in RFC 1890, and RFC 3551 keeps it unchanged for backward compatibility. As a result, a 20 ms G.722 packet carries 320 samples but advances the timestamp by only 160.

Timestamp behavior and interpretation:

For audio:
Timestamps typically increment linearly by the number of samples in each packet. If a sender transmits 20ms of 48kHz Opus audio, the timestamp increments by 960 (48000 × 0.020) per packet. During silence suppression (when silent packets are not transmitted), timestamps continue incrementing, creating gaps that receivers fill with silence.

Timestamp Calculation Examples

Audio Examples

Opus Audio at 48kHz, 20ms packets:
 
Time    Action                          Timestamp
0ms     Capture samples 0-959           0
20ms    Capture samples 960-1919        960
40ms    Capture samples 1920-2879       1920
60ms    [Silence, no packet sent]       (would be 2880)
80ms    [Silence, no packet sent]       (would be 3840)
100ms   Capture samples 4800-5759       4800
 
Note: When transmission resumes, timestamp reflects actual
sample position, creating a gap. Receiver inserts silence.
 
Video at 90kHz clock, 30fps:
 
Time       Frame    Timestamp    Increment
0.00ms     1        0            -
33.33ms    2        3000         +3000
66.67ms    3        6000         +3000
100.00ms   4        9000         +3000
 
90000 Hz ÷ 30 fps = 3000 timestamp units per frame
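
Receivers work with timestamp differences rather than absolute values. The sketch below converts a difference to milliseconds and tolerates 32-bit wraparound by treating the difference as a signed 32-bit value; `timestampDeltaMs` is a hypothetical helper, and the clock rate is assumed to be known from session negotiation.

```typescript
// Elapsed media time between two RTP timestamps, in milliseconds.
// Negative results mean `later` actually precedes `earlier`.
function timestampDeltaMs(earlier: number, later: number, clockRateHz: number): number {
    // "| 0" reduces the difference to a signed 32-bit value, which handles wraparound.
    const deltaTicks = (later - earlier) | 0;
    return (deltaTicks * 1000) / clockRateHz;
}
```

In the Opus example, the packets stamped 1920 and 4800 are 60 ms apart in media time; since the earlier packet covers 20 ms of audio, the receiver has 40 ms of silence to fill.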

For video:
All packets belonging to the same video frame share the same timestamp. A single 720p H.264 frame might span 10 RTP packets, all with identical timestamps. The marker bit indicates the last packet of a frame, signaling that a complete frame is available for decoding.

Random initial timestamp:
Like sequence numbers, timestamps start from a random value for security. Receivers calculate relative timing from timestamp differences rather than assuming they start at zero.
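
The shared-timestamp rule for video suggests a simple way to group packets into frames. The sketch below is deliberately naive: it assumes packets arrive in order and without loss, and the `VideoPacket` and `FrameAssembler` names are illustrative.

```typescript
interface VideoPacket {
    sequenceNumber: number;
    timestamp: number;
    marker: boolean;
    payload: Uint8Array;
}

// Collects packets that share a timestamp and releases them when the marker bit is seen.
class FrameAssembler {
    private current: VideoPacket[] = [];

    // Returns the payloads of a completed frame, or null if the frame is still open.
    push(packet: VideoPacket): Uint8Array[] | null {
        if (this.current.length > 0 && this.current[0].timestamp !== packet.timestamp) {
            // A new frame started before the previous one was marked complete:
            // drop the incomplete frame.
            this.current = [];
        }
        this.current.push(packet);
        if (!packet.marker) {
            return null;
        }
        const frame = this.current.map((p) => p.payload);
        this.current = [];
        return frame;
    }
}
```

A production receiver would reorder by sequence number and check for gaps before handing a frame to the decoder; this sketch only shows how timestamp and marker bit delimit a frame.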

SSRC and CSRC Identifiers

Synchronization Source (SSRC) - 32 bits

The SSRC identifier uniquely identifies the source of an RTP stream within a session. Each sender picks a random 32-bit value as its SSRC when joining a session.

Why random SSRC matters:

- Collision resistance: Even without coordination, two random 32-bit values rarely collide
- Security: Attackers can't easily predict SSRCs to inject traffic
- Mobility: The SSRC is independent of the network address, so a source can keep its SSRC when its IP address changes
- Independence: No central allocation authority is required

SSRC and CSRC Relationships

Scenario Diagram

Direct Communication (no mixer):
 
Alice (SSRC: 0xA1A1A1A1) ──────> Bob (receives SSRC: 0xA1A1A1A1)
Bob   (SSRC: 0xB2B2B2B2) ──────> Alice (receives SSRC: 0xB2B2B2B2)
 
No CSRC entries, CC=0
 
────────────────────────────────────────────────────────────
 
Conference with MCU Mixer:
 
Alice (SSRC: 0xA1A1A1A1) ────┐
                              ├──> Mixer (SSRC: 0xDEADBEEF)
Bob   (SSRC: 0xB2B2B2B2) ────┤         │
                              │         │
Carol (SSRC: 0xC3C3C3C3) ────┘         ▼
                                    Mixer sends to everyone:
                                    SSRC: 0xDEADBEEF
                                    CC: 2 (varies by who's speaking)
                                    CSRC[0]: 0xA1A1A1A1 (Alice)
                                    CSRC[1]: 0xB2B2B2B2 (Bob)
                                    Payload: Mixed audio
 
Receivers see:
- SSRC 0xDEADBEEF as sending source
- CSRC list shows Alice and Bob are speaking
- Can show "Alice, Bob speaking" in UI

Contributing Source (CSRC) - 32 bits each

When a mixer combines multiple streams, it uses CSRC identifiers to preserve information about the original sources. The CC field indicates how many CSRCs are present (0-15).

SSRC collision handling:

If two participants randomly generate the same SSRC (probability about 1 in 4 billion for any given pair), the collision is detected from incoming traffic: a participant sees RTP or RTCP packets carrying its own SSRC from a different transport address, or RTCP SDES packets associating that SSRC with a different CNAME (canonical name). A participant that detects a collision with its own SSRC sends an RTCP BYE for the old identifier and picks a new one.
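
The pairwise figure grows with session size. Under the assumption that every participant picks its SSRC uniformly at random, the standard birthday approximation gives the chance that at least two of n participants collide; `ssrcCollisionProbability` is a hypothetical helper name.

```typescript
// Approximate probability that at least two of `participants` share an SSRC,
// using the birthday bound p ≈ 1 - exp(-n(n-1) / (2 * 2^32)).
function ssrcCollisionProbability(participants: number): number {
    const space = Math.pow(2, 32); // number of possible SSRC values
    const exponent = -(participants * (participants - 1)) / (2 * space);
    // -expm1(x) computes 1 - e^x without losing precision for tiny x.
    return -Math.expm1(exponent);
}
```

For two participants this gives roughly 2.3 × 10^-10 (1 in 2^32), and for 1,000 participants roughly 10^-4, which is still low but no longer negligible for very large sessions.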

SSRC vs. IP:Port

Don't confuse SSRC with network addressing. While each participant has a unique IP:port, they may produce multiple streams (audio + video) each needing a unique SSRC. SSRC identifies streams; IP:port identifies endpoints. With BUNDLE multiplexing, multiple SSRCs share the same IP:port.

Summary: Mastering the RTP Header

We've examined every field of the RTP header, understanding how each contributes to enabling real-time communication over unreliable networks.

Key Takeaways

• Version, Padding, Extension: Control bits enabling protocol identification, encryption alignment, and extensibility.
• CSRC Count and Marker: CC enables mixer attribution; the marker bit has profile-specific semantics for boundaries and events.
• Payload Type: 7-bit shorthand for the codec negotiated during session setup; dynamic types (96-127) are most common today.
• Sequence Number: 16-bit counter for loss, reordering, and duplicate detection; starts at a random value.
• Timestamp: 32-bit media clock for playback timing; advances at the payload format's clock rate, not once per packet.
• SSRC/CSRC: Unique stream identifiers enabling multi-party and multi-stream scenarios.

What's next:

RTP provides the data transport, but real-time communication requires feedback about quality, synchronization across streams, and participant management. The next page explores RTCP (Real-time Transport Control Protocol), the companion protocol that provides these essential control functions.
