Every RTP packet, whether carrying a voice sample from a phone call, a video frame from a conference, or game state from a cloud gaming server, begins with the same 12-byte structure. This compact header contains everything receivers need to properly sequence, time, and process real-time media.

The RTP header represents decades of protocol engineering wisdom, carefully balancing information density against overhead minimization. Every bit serves a purpose; nothing is wasted. Understanding this header is essential for anyone implementing, debugging, or optimizing real-time communication systems.

This page provides a field-by-field examination of the RTP header, explaining not just what each field contains but why it exists and how implementations use it in practice.
By the end of this page, you will understand every field in the RTP header, including version flags, payload type encoding, sequence number behavior, timestamp semantics, SSRC identification, and optional CSRC lists. You'll learn how these fields work together to enable real-time communication.
The RTP header consists of a fixed 12-byte portion present in every packet, plus an optional extension of variable length. The fixed header contains all essential information for basic real-time transport; extensions provide application-specific data when needed.

The header is designed for efficient parsing by both software and hardware implementations. Fields are aligned on byte or word boundaries where possible, and the most frequently accessed information appears first.
```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Field Summary:
V         (2 bits)       : Version, always 2 for current RTP
P         (1 bit)        : Padding flag, indicates padding bytes at end
X         (1 bit)        : Extension flag, indicates header extension present
CC        (4 bits)       : CSRC count, number of CSRC identifiers following
M         (1 bit)        : Marker bit, application-specific significance
PT        (7 bits)       : Payload type, identifies media format
Sequence  (16 bits)      : Packet sequence number
Timestamp (32 bits)      : Media sampling timestamp
SSRC      (32 bits)      : Synchronization source identifier
CSRC      (32 bits each) : Contributing source identifiers (0-15)
```

| Field | Size | Purpose | Key Property |
|---|---|---|---|
| Version (V) | 2 bits | Protocol version identification | Always 2 for current RTP |
| Padding (P) | 1 bit | Indicates padding at packet end | For encryption block alignment |
| Extension (X) | 1 bit | Indicates header extension present | Enables future extensibility |
| CSRC Count (CC) | 4 bits | Number of CSRC identifiers | 0-15 sources possible |
| Marker (M) | 1 bit | Application-defined significance | Often marks frame boundaries |
| Payload Type (PT) | 7 bits | Identifies media format | 0-127, format negotiated via SDP |
| Sequence Number | 16 bits | Packet ordering | Increments by 1 per packet |
| Timestamp | 32 bits | Media capture time | Media-specific clock rate |
| SSRC | 32 bits | Stream source identifier | Randomly generated, unique |
| CSRC | 32 bits each | Contributing sources | Present when CC > 0 |
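
The layout in the table maps directly onto a few shifts and masks. The following is a minimal sketch, not a reference implementation: the `RtpFixedHeader` interface and the `parseRtpFixedHeader` name are assumptions made for illustration, and the function reads only the fixed 12 bytes (it does not touch CSRC entries, extensions, or padding).

```typescript
// Hypothetical result shape; the field names are assumptions of this sketch.
interface RtpFixedHeader {
  version: number;
  padding: boolean;
  extension: boolean;
  csrcCount: number;
  marker: boolean;
  payloadType: number;
  sequenceNumber: number;
  timestamp: number;
  ssrc: number;
}

// Decode the fixed 12-byte RTP header; returns null if the buffer is too short.
function parseRtpFixedHeader(packet: Uint8Array): RtpFixedHeader | null {
  if (packet.length < 12) return null;

  // DataView reads multi-byte fields in network (big-endian) order by default.
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  const b0 = view.getUint8(0);
  const b1 = view.getUint8(1);

  return {
    version: b0 >> 6,                  // top 2 bits of byte 0
    padding: (b0 & 0x20) !== 0,        // bit 5 of byte 0
    extension: (b0 & 0x10) !== 0,      // bit 4 of byte 0
    csrcCount: b0 & 0x0f,              // low 4 bits of byte 0
    marker: (b1 & 0x80) !== 0,         // top bit of byte 1
    payloadType: b1 & 0x7f,            // low 7 bits of byte 1
    sequenceNumber: view.getUint16(2), // bytes 2-3
    timestamp: view.getUint32(4),      // bytes 4-7
    ssrc: view.getUint32(8),           // bytes 8-11
  };
}
```

A caller would typically check `version === 2` on the result before trusting any other field.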
The first byte of the RTP header begins with three control fields that receivers check immediately upon packet arrival.

Version (V) - 2 bits

The version field identifies the RTP protocol version. The current version is 2, and no updates to this version number are anticipated. Receivers encountering version values other than 2 should discard the packet as malformed or belonging to a different protocol.

Version 1 was used by the first draft of RTP, and version 0 by the protocol originally implemented in the vat audio tool. Version 2 has been in use since the first published RTP standard (RFC 1889, 1996, now obsolete) and is retained by its successor, RFC 3550 (2003).
```typescript
// Extract version from first byte of RTP header
function parseRtpVersion(firstByte: number): number {
  // Version is in bits 6-7 (most significant 2 bits)
  return (firstByte >> 6) & 0x03;
}

// Validate RTP packet version
function validateRtpPacket(packet: Uint8Array): boolean {
  if (packet.length < 12) {
    console.error("Packet too short for RTP header");
    return false;
  }

  const version = parseRtpVersion(packet[0]);
  if (version !== 2) {
    console.error(`Invalid RTP version: ${version}`);
    return false;
  }

  return true;
}
```

Padding (P) - 1 bit

When set, the padding bit indicates that the packet contains padding bytes at the end that are not part of the payload. The last byte of the padding indicates the total number of padding bytes (including itself).

Padding is typically used when encryption algorithms require block-aligned data. For example, AES encryption operates on 16-byte blocks, so payloads not naturally aligned to 16 bytes require padding. RTP's padding mechanism standardizes how this is handled.
```
Packet with padding (P=1):
┌──────────┬────────────────────────────┬─────────────┐
│  Header  │        Payload Data        │   Padding   │
│ 12 bytes │      Variable length       │   N bytes   │
└──────────┴────────────────────────────┴─────────────┘
                                                     ▲
                                     Last byte = N (padding length)

Example: 47-byte payload needs AES alignment (16 bytes)
- Next 16-byte boundary: 48 bytes
- Padding needed: 1 byte
- Padding content: 0x01 (1 byte of padding, value = count)

Example: 42-byte payload needs AES alignment
- Next 16-byte boundary: 48 bytes
- Padding needed: 6 bytes
- Padding content: 0x00 0x00 0x00 0x00 0x00 0x06
```

Extension (X) - 1 bit

When set, the extension bit indicates that the fixed header is followed by exactly one header extension. Extensions allow profile-specific or application-specific information to be carried in RTP packets without modifying the base protocol.

The header extension structure begins with a 16-bit extension profile identifier and a 16-bit length field, followed by the extension data. Common extensions include:

- Absolute capture time: Precise wall-clock timestamps for synchronization
- Audio level: Volume information for voice activity detection
- Video orientation: Rotation hints for mobile devices
- Transport-wide sequence numbers: Enhanced loss detection for congestion control
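RFC 8285 (which obsoletes RFC 5285) defines two formats for carrying several extension elements inside the single header extension: a one-byte header form (IDs 1-14, with 1 to 16 bytes of data per element) and a two-byte header form (IDs up to 255, with 0 to 255 bytes of data per element). WebRTC typically uses the one-byte form for efficiency. Either form allows multiple elements with different IDs in a single packet.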
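
To make the one-byte form concrete, here is a minimal sketch of walking the elements inside an extension body. It assumes the caller has already confirmed the 16-bit profile identifier is 0xBEDE (the value RFC 8285 assigns to the one-byte form) and passes only the bytes after the 4-byte extension header. The `ExtensionElement` type and the `parseOneByteExtensions` name are hypothetical, introduced here for illustration.

```typescript
// Hypothetical element shape for this sketch.
interface ExtensionElement {
  id: number;       // local identifier, 1-14
  data: Uint8Array; // 1-16 bytes of element data
}

// Walk one-byte-header extension elements in `body`, the extension data that
// follows the 4-byte extension header (profile identifier + length).
function parseOneByteExtensions(body: Uint8Array): ExtensionElement[] {
  const elements: ExtensionElement[] = [];
  let i = 0;

  while (i < body.length) {
    const head = body[i];

    // A zero byte is padding between or after elements; skip it.
    if (head === 0) {
      i += 1;
      continue;
    }

    const id = head >> 4;
    // ID 15 is reserved, and ID 0 with a non-zero length is invalid:
    // in both cases processing of the extension stops here.
    if (id === 15 || id === 0) break;

    // The low 4 bits store (length - 1), so elements carry 1-16 bytes.
    const len = (head & 0x0f) + 1;
    if (i + 1 + len > body.length) break; // truncated element

    elements.push({ id, data: body.subarray(i + 1, i + 1 + len) });
    i += 1 + len;
  }

  return elements;
}
```

Which ID corresponds to which extension (audio level, video orientation, and so on) is not fixed by the format; it is agreed per session through signaling.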
CSRC Count (CC) - 4 bits

The CC field contains the number of Contributing Source (CSRC) identifiers that follow the fixed header. Valid values range from 0 to 15.

This field is primarily relevant when RTP mixers combine multiple input streams into a single output stream. The mixer creates new packets with its own SSRC as the source but includes the original SSRCs as CSRC entries, maintaining attribution to the original speakers.

When CC > 0:

The fixed header is followed by CC × 4 bytes of CSRC identifiers before the payload begins. Each CSRC is a 32-bit value identifying one of the original sources whose data contributed to this packet.
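
The P, X, and CC fields together determine where the payload actually sits in the packet. The sketch below shows one way to compute those bounds, under the assumptions stated in the comments; `locateRtpPayload` is a hypothetical helper name, and the function returns `null` for anything that does not fit inside the buffer rather than trying to recover.

```typescript
// Compute the byte range [start, end) of the payload inside an RTP packet.
// Returns null when the header fields point outside the buffer.
function locateRtpPayload(
  packet: Uint8Array
): { start: number; end: number } | null {
  if (packet.length < 12) return null;

  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  const b0 = view.getUint8(0);

  // Skip the fixed header plus CC x 4 bytes of CSRC identifiers.
  const csrcCount = b0 & 0x0f;
  let start = 12 + csrcCount * 4;
  if (start > packet.length) return null;

  // X bit: a 4-byte extension header whose second 16-bit field counts the
  // extension data in 32-bit words (not including those 4 bytes).
  if ((b0 & 0x10) !== 0) {
    if (start + 4 > packet.length) return null;
    const words = view.getUint16(start + 2);
    start += 4 + words * 4;
    if (start > packet.length) return null;
  }

  // P bit: the last byte of the packet holds the padding length,
  // which includes that last byte itself.
  let end = packet.length;
  if ((b0 & 0x20) !== 0) {
    if (end === start) return null;
    const padLen = view.getUint8(end - 1);
    if (padLen === 0 || padLen > end - start) return null;
    end -= padLen;
  }

  return { start, end };
}
```

With the range in hand, `packet.subarray(start, end)` yields the payload without copying.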
```
Conference call scenario with mixer:

Participant A (SSRC: 0x12345678) speaks
Participant B (SSRC: 0x87654321) speaks simultaneously
Participant C (SSRC: 0xABCDEF00) is silent

Mixer combines A and B's audio, sends combined stream:
┌────────────────────────────────────────────────────┐
│ RTP Header (CC=2)                                  │
│ SSRC: 0xDEADBEEF (mixer's own identifier)          │
│ CSRC[0]: 0x12345678 (Participant A)                │
│ CSRC[1]: 0x87654321 (Participant B)                │
├────────────────────────────────────────────────────┤
│ Payload: Mixed audio from A + B                    │
└────────────────────────────────────────────────────┘

Benefits of CSRC:
• Receivers know who is speaking
• UI can highlight active speakers
• Recording can attribute audio sources
• Billing/analytics can track participation
```

Marker Bit (M) - 1 bit

The marker bit has profile-specific semantics: its meaning depends on the application and payload type. The RTP specification intentionally leaves this bit's interpretation to profile documents, allowing different applications to use it for their specific needs.

Common marker bit uses:

| Media / Payload Format | Marker Meaning | Purpose |
|---|---|---|
| Audio (voice) | Start of talkspurt | Indicates speech after silence, aids jitter buffer |
| Video (H.264) | End of video frame | Indicates last packet of frame for reassembly |
| Video (VP8/VP9) | End of frame | Similar to H.264, marks frame boundaries |
| Generic audio | Often unused | Set to 0 when no special meaning |
| Text/T.140 | End of message | Indicates complete text unit |
Different media types have fundamentally different needs. Audio needs to signal 'someone started speaking' to optimize jitter buffers. Video needs to signal 'frame is complete' so decoders can process immediately. Rather than forcing a universal interpretation, RTP allows each profile to define what makes sense for its domain.
Payload Type (PT) - 7 bits

The payload type field identifies the format of the RTP payload and determines its interpretation by the receiver. Values range from 0 to 127.

The payload type serves as a shorthand reference to a complete codec definition negotiated during session setup. When two endpoints establish an RTP session (via SIP/SDP, WebRTC, or other signaling), they agree on which payload type numbers correspond to which codecs, sample rates, and parameters.

Static vs. Dynamic Payload Types:

RTP defines two categories of payload type assignments:

- Static: numbers assigned to specific codecs by the RTP audio/video profile (RFC 3551), such as 0 for PCMU and 8 for PCMA.
- Dynamic: numbers in the range 96-127, bound to a codec for one session through signaling (for example, an SDP a=rtpmap line).
```
v=0
o=- 123456 123456 IN IP4 192.168.1.1
s=WebRTC Session
t=0 0
m=audio 49170 RTP/SAVPF 111 103 9 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:103 ISAC/16000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=fmtp:111 minptime=10; useinbandfec=1

Payload Type Meanings After Negotiation:
PT 0   → PCMU (static, always means G.711 µ-law)
PT 8   → PCMA (static, always means G.711 A-law)
PT 9   → G722 (static, wideband audio)
PT 103 → iSAC 16kHz (dynamic, session-specific)
PT 111 → Opus 48kHz stereo (dynamic, session-specific)
```

Senders can change payload types during an RTP session without signaling renegotiation, for example switching between Opus and G.711 as conditions change. Receivers must be prepared to handle any payload type negotiated during session setup at any time.
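Collision avoidance with RTCP:

Payload type values 72-76 are avoided in RTP because they would collide with RTCP. The RTCP packet types 200-204 (SR, RR, SDES, BYE, APP) occupy the whole second byte of an RTCP packet; read as an RTP header, that byte looks like a set marker bit followed by a payload type of 72-76. When RTP and RTCP share the same port (common with BUNDLE), keeping these values unused lets receivers demultiplex RTP from RTCP traffic. RFC 5761 specifies this demultiplexing approach and excludes the wider range 64-95 from use as RTP payload types.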
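
On a shared port, the second byte is therefore enough to tell the two protocols apart. The check below is a sketch of the commonly used rule that follows from RFC 5761 (payload type bits in the range 64-95 mean RTCP); `looksLikeRtcp` is a hypothetical helper name, and a production demultiplexer would normally also validate the version bits and packet length.

```typescript
// Classify a packet received on a port shared by RTP and RTCP.
// Assumes RTP payload types 64-95 are never negotiated for this session.
function looksLikeRtcp(packet: Uint8Array): boolean {
  if (packet.length < 2) return false;

  // Drop the top bit (the RTP marker position) and look at the low 7 bits.
  const payloadTypeBits = packet[1] & 0x7f;
  return payloadTypeBits >= 64 && payloadTypeBits <= 95;
}
```

For example, an RTCP sender report has 200 in its second byte, which masks down to 72 and is classified as RTCP, while an Opus packet using payload type 111 is classified as RTP whether or not its marker bit is set.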
Sequence Number - 16 bits

The sequence number increments by one for each RTP packet sent and wraps around after 65535 back to 0. The initial sequence number is chosen randomly for security reasons (making stream injection attacks harder).

Primary purposes:

1. Loss detection: Gaps in sequence numbers indicate lost packets
2. Reordering detection: Out-of-order arrivals detected by sequence comparison
3. Duplicate detection: Same sequence number appearing twice
4. Packet counting: Enables statistics for jitter, loss rate calculation
```typescript
interface RtpStreamStats {
  lastSeqNum: number;
  expectedSeqNum: number;
  packetsReceived: number;
  packetsLost: number;
  sequenceWrapCount: number;
}

function processRtpSequence(
  seqNum: number,
  stats: RtpStreamStats
): 'ok' | 'lost' | 'duplicate' | 'reordered' {
  // Forward distance from the expected number, modulo 2^16
  const seqDiff = (seqNum - stats.expectedSeqNum + 65536) % 65536;

  // Move the expected number past seqNum, counting 16-bit wraparounds
  const advance = (): void => {
    const next = (seqNum + 1) % 65536;
    if (next < stats.expectedSeqNum) stats.sequenceWrapCount++;
    stats.expectedSeqNum = next;
    stats.lastSeqNum = seqNum;
    stats.packetsReceived++;
  };

  if (seqNum === stats.expectedSeqNum) {
    // Perfect in-order delivery
    advance();
    return 'ok';
  }

  if (seqDiff > 0 && seqDiff < 3000) {
    // Packets were lost (gap in sequence); 3000 is a heuristic window
    const lostCount = seqDiff;
    stats.packetsLost += lostCount;
    advance();
    console.log(`Detected ${lostCount} lost packets`);
    return 'lost';
  }

  if (seqDiff > 63536) {
    // Packet from up to 2000 numbers behind: treated as a late arrival that
    // was already counted as lost. A repeat of an already-received packet
    // also lands here; telling the two apart needs a record of received numbers.
    if (stats.packetsLost > 0) stats.packetsLost--;
    stats.packetsReceived++;
    console.log(`Reordered packet: ${seqNum}`);
    return 'reordered';
  }

  // Too far from the expected number to classify; reported as a duplicate
  console.log(`Duplicate packet: ${seqNum}`);
  return 'duplicate';
}
```

Random initial sequence numbers make it harder for an attacker to inject packets into an existing stream. Without knowing the current sequence number, an attacker's packets tend to be detected as out-of-sequence or duplicates. RFC 3550 also gives a second reason: an unpredictable starting value makes known-plaintext attacks on encryption more difficult. This provides defense in depth alongside encryption.
Sequence number vs. timestamp:

A common confusion is the difference between sequence numbers and timestamps. They serve different purposes:

- Sequence number: Increments by 1 per packet. Used for loss/reordering detection.
- Timestamp: Increments by samples per packet. Used for playback timing.

For audio at 48kHz with 20ms packets (960 samples per packet):

- Sequence: 100, 101, 102, 103...
- Timestamp: 0, 960, 1920, 2880...

Both are needed: sequence numbers tell you if packets arrived correctly; timestamps tell you when their content should play.
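
On the sending side, the two counters advance independently and each wraps at its own width. A minimal sketch, assuming a sender that emits fixed-size audio packets; `nextPacketCounters` is a hypothetical helper name, not part of any RTP library.

```typescript
// Advance both header counters for the next outgoing packet.
// The sequence number wraps at 16 bits, the timestamp at 32 bits.
function nextPacketCounters(
  seq: number,
  timestamp: number,
  samplesPerPacket: number
): { seq: number; timestamp: number } {
  return {
    seq: (seq + 1) & 0xffff,                       // +1 per packet
    timestamp: (timestamp + samplesPerPacket) >>> 0, // +samples per packet
  };
}
```

Starting from sequence 100 and timestamp 0 with 960 samples per packet, repeated calls produce the 101/960, 102/1920, 103/2880 progression from the example above.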
Timestamp - 32 bits

The RTP timestamp reflects the sampling instant of the first octet of the media data in that packet. This is not wall-clock time; it's a media-specific value that increments at a rate defined by the payload format.

Clock rates by media type:

Different media types use different clock rates, reflecting their sampling characteristics:

| Codec/Format | Clock Rate | Typical Packet Duration | Timestamp Increment |
|---|---|---|---|
| G.711 (PCMU/PCMA) | 8,000 Hz | 20 ms | 160 |
| G.722 | 8,000 Hz* | 20 ms | 160 |
| Opus | 48,000 Hz | 20 ms | 960 |
| AAC | Varies (44.1k, 48k) | ~23 ms (1024 samples) | 1024 |
| H.264/H.265 | 90,000 Hz | 33.33 ms (30fps) | 3,000 |
| VP8/VP9/AV1 | 90,000 Hz | 33.33 ms (30fps) | 3,000 |
*G.722 samples at 16 kHz but uses an 8 kHz RTP clock rate. This is a historical mistake, commonly attributed to G.722 compressing 16 kHz audio to 64 kbps like G.711, so the clock rate was assumed to match. RFC 3551 notes that the 8,000 Hz value was assigned erroneously and keeps it unchanged for backward compatibility.
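
The increments in the table are simply the clock rate multiplied by the packet duration. A short sketch of that arithmetic; the function name `timestampIncrement` is an assumption for illustration, and rounding is included only to absorb floating-point error from durations such as 1000/30 ms.

```typescript
// Timestamp units that elapse during one packet (or one video frame).
// clockRateHz is the RTP clock rate of the payload format, which for
// G.722 is 8000 even though the codec samples at 16 kHz.
function timestampIncrement(clockRateHz: number, durationMs: number): number {
  return Math.round((clockRateHz * durationMs) / 1000);
}
```

For instance, `timestampIncrement(8000, 20)` covers the G.711 and G.722 rows, `timestampIncrement(48000, 20)` the Opus row, and `timestampIncrement(90000, 1000 / 30)` the 30 fps video rows.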
Timestamp behavior and interpretation:

For audio:

Timestamps typically increment linearly by the number of samples in each packet. If a sender transmits 20ms of 48kHz Opus audio, the timestamp increments by 960 (48000 × 0.020) per packet. During silence suppression (not transmitting silence), timestamps continue incrementing, creating gaps that receivers use to insert silence.
```
Opus Audio at 48kHz, 20ms packets:

Time     Action                          Timestamp
0ms      Capture samples 0-959           0
20ms     Capture samples 960-1919        960
40ms     Capture samples 1920-2879       1920
60ms     [Silence, no packet sent]       (would be 2880)
80ms     [Silence, no packet sent]       (would be 3840)
100ms    Capture samples 4800-5759       4800

Note: When transmission resumes, timestamp reflects actual
sample position, creating a gap. Receiver inserts silence.

Video at 90kHz clock, 30fps:

Time        Frame    Timestamp    Increment
0.00ms      1        0            -
33.33ms     2        3000         +3000
66.67ms     3        6000         +3000
100.00ms    4        9000         +3000

90000 Hz ÷ 30 fps = 3000 timestamp units per frame
```

For video:

All packets belonging to the same video frame share the same timestamp. A single 720p H.264 frame might span 10 RTP packets, all with identical timestamps. The marker bit indicates the last packet of a frame, signaling that a complete frame is available for decoding.

Random initial timestamp:

Like sequence numbers, timestamps start from a random value for security. Receivers calculate relative timing from timestamp differences rather than assuming they start at zero.
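
Working from differences also has to survive the 32-bit wraparound. One way to sketch it, assuming both timestamps come from the same stream and are less than half the 32-bit range apart; `timestampDelta` and `deltaToMs` are hypothetical helper names.

```typescript
// Signed difference between two 32-bit RTP timestamps, correct across
// wraparound as long as the true distance is under 2^31 units.
function timestampDelta(later: number, earlier: number): number {
  // "| 0" reduces the result to a signed 32-bit integer.
  return (later - earlier) | 0;
}

// Convert a timestamp difference to milliseconds of media time.
function deltaToMs(delta: number, clockRateHz: number): number {
  return (delta * 1000) / clockRateHz;
}
```

Applied to the Opus example, the jump from timestamp 1920 to 4800 is 2880 units, or 60 ms at 48 kHz: the normal 20 ms step plus 40 ms of suppressed silence for the receiver to fill.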
Synchronization Source (SSRC) - 32 bits

The SSRC identifier uniquely identifies the source of an RTP stream within a session. Each sender picks a random 32-bit value as their SSRC when joining a session.

Why random SSRC matters:

- Collision resistance: Even without coordination, 32-bit random values have very low collision probability
- Security: Attackers can't easily predict SSRCs to inject traffic
- Mobility: The SSRC is independent of the network address, so a source can keep the same SSRC when its IP address changes
- Independence: No central allocation authority required
```
Direct Communication (no mixer):

Alice (SSRC: 0xA1A1A1A1) ──────> Bob   (receives SSRC: 0xA1A1A1A1)
Bob   (SSRC: 0xB2B2B2B2) ──────> Alice (receives SSRC: 0xB2B2B2B2)

No CSRC entries, CC=0

────────────────────────────────────────────────────────────

Conference with MCU Mixer:

Alice (SSRC: 0xA1A1A1A1) ────┐
                             ├──> Mixer (SSRC: 0xDEADBEEF)
Bob   (SSRC: 0xB2B2B2B2) ────┤      │
                             │      │
Carol (SSRC: 0xC3C3C3C3) ────┘      ▼
                             Mixer sends to everyone:
                               SSRC: 0xDEADBEEF
                               CC: 2 (varies by who's speaking)
                               CSRC[0]: 0xA1A1A1A1 (Alice)
                               CSRC[1]: 0xB2B2B2B2 (Bob)
                               Payload: Mixed audio

Receivers see:
- SSRC 0xDEADBEEF as sending source
- CSRC list shows Alice and Bob are speaking
- Can show "Alice, Bob speaking" in UI
```

Contributing Source (CSRC) - 32 bits each

When a mixer combines multiple streams, it uses CSRC identifiers to preserve information about the original sources. The CC field indicates how many CSRCs are present (0-15).

SSRC collision handling:

If two participants randomly generate the same SSRC (probability about 1 in 4 billion for any given pair), RTCP helps detect this. Participants watch incoming packets and RTCP SDES items; seeing their own SSRC arrive from a different transport address or with a different CNAME (canonical name) indicates a collision. Per RFC 3550, a participant that detects a collision with its own SSRC sends an RTCP BYE for the old identifier and picks a new random one.
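
The 1-in-4-billion figure applies to a single pair of sources; with more sources in a session the chance that some pair collides grows roughly with the square of the count. The sketch below uses the standard birthday approximation and is only an estimate, assuming every SSRC is drawn uniformly at random; `ssrcCollisionProbability` is a hypothetical name.

```typescript
// Approximate probability that at least two of `sources` independently
// chosen random 32-bit SSRCs are equal (birthday approximation).
function ssrcCollisionProbability(sources: number): number {
  const SSRC_SPACE = 4294967296; // 2^32 possible identifiers
  const pairs = (sources * (sources - 1)) / 2;
  // 1 - exp(-pairs / space), written with expm1 to keep precision
  // when the probability is tiny.
  return -Math.expm1(-pairs / SSRC_SPACE);
}
```

For two sources this gives about 2.3e-10, and for a 1,000-source session it is on the order of 1e-4, which is why collision handling still matters in large sessions.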
Don't confuse SSRC with network addressing. While each participant has a unique IP:port, they may produce multiple streams (audio + video) each needing a unique SSRC. SSRC identifies streams; IP:port identifies endpoints. With BUNDLE multiplexing, multiple SSRCs share the same IP:port.
We've examined every field of the RTP header, understanding how each contributes to enabling real-time communication over unreliable networks.
What's next:

RTP provides the data transport, but real-time communication requires feedback about quality, synchronization across streams, and participant management. The next page explores RTCP (Real-time Transport Control Protocol), the companion protocol that provides these essential control functions.
You now understand every field in the RTP header and how implementations use this information for timing, sequencing, and stream identification. This knowledge is foundational for implementing and debugging real-time communication systems.