While text messages form the foundation of messaging, media now dominates bandwidth. Users share billions of photos, videos, voice notes, and documents daily. WhatsApp processes an estimated 7 billion photos and 1 billion videos every day, roughly 90,000 media files per second.
Media handling introduces challenges that dwarf text messaging: a single 1-minute video can be 50MB, representing 500,000 times the data of a typical text message. Storing, transmitting, and delivering media efficiently while maintaining end-to-end encryption requires sophisticated architecture spanning upload pipelines, transcoding systems, CDN infrastructure, and intelligent caching.
This page explores the complete media handling architecture, from the moment a user selects a photo until it appears on the recipient's screen.
You will understand media upload pipelines with resumable uploads, storage strategies for petabytes of media, thumbnail and preview generation, CDN integration for global delivery, and how end-to-end encryption is maintained for all media types. These patterns apply to any media-heavy application.
Different media types have vastly different requirements for storage, processing, and delivery. Understanding these characteristics drives architectural decisions.
| Media Type | Typical Size | Processing Needed | Delivery Pattern |
|---|---|---|---|
| Photo (Original) | 2-5 MB | Compression, thumbnail generation | Full image on tap |
| Photo (Thumbnail) | 5-20 KB | Pre-generated | Immediate inline display |
| Video | 10-100 MB | Transcoding, multiple bitrates, thumbnails | Progressive/adaptive streaming |
| Voice Note | 0.1-2 MB | Compression (Opus codec) | Full download before play |
| Document (PDF) | 1-100 MB | Preview generation | Full download for viewing |
| GIF/Sticker | 0.1-2 MB | Palette optimization | Cached, immediate display |
| Location | < 1 KB | Map tile URL generation | Map API integration |
| Contact vCard | < 10 KB | None | Direct delivery in message |
Let's derive the storage and bandwidth requirements:
```
DAILY VOLUME ESTIMATES
══════════════════════
Photos:     7 billion/day   × 3 MB avg   = 21 PB/day
Videos:     1 billion/day   × 30 MB avg  = 30 PB/day
Voice:      3 billion/day   × 0.5 MB avg = 1.5 PB/day
Documents:  0.5 billion/day × 10 MB avg  = 5 PB/day

Total daily ingestion: ~57.5 PB/day
Annual growth: ~21 EB/year (exabytes!)

RETENTION ASSUMPTIONS
═════════════════════
• Media stored until explicitly deleted (or account deletion)
• Average media lifetime: ~2 years
• Total storage: ~40+ EB (with compression and deduplication)

BANDWIDTH REQUIREMENTS
══════════════════════
Upload:   57.5 PB/day = ~5.3 Tbps sustained
Download: Assuming each media viewed 2x on average
          115 PB/day  = ~10.6 Tbps sustained
Peak:     3x average  = ~48 Tbps

For comparison:
• Total global internet traffic: ~400 Tbps
• WhatsApp media alone: ~12% of global traffic (order of magnitude)
```

At $0.02/GB/month for cloud object storage, 40 EB costs ~$800 million/month. Efficient storage tiers (hot/warm/cold), compression, and dedicated infrastructure reduce this dramatically, but media storage remains a major cost center for messaging platforms.
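The arithmetic above is easy to reproduce. A minimal sketch of the same estimate, where the per-type averages and the $0.02/GB/month price are the assumptions stated above:

```typescript
// Back-of-envelope media volume estimate (all inputs are the assumptions above).
const GB = 1e9;   // decimal units, to match the PB/EB figures
const PB = 1e15;
const EB = 1e18;

const dailyIngestBytes =
  7e9 * 3e6 +     // photos: 7 B/day × 3 MB
  1e9 * 30e6 +    // videos: 1 B/day × 30 MB
  3e9 * 0.5e6 +   // voice:  3 B/day × 0.5 MB
  0.5e9 * 10e6;   // docs:   0.5 B/day × 10 MB

const uploadTbps = (dailyIngestBytes * 8) / 86_400 / 1e12;    // sustained upload bandwidth
const totalStorageBytes = 40 * EB;                             // assumed steady-state footprint
const monthlyCostUsd = (totalStorageBytes / GB) * 0.02;        // at $0.02/GB/month

console.log((dailyIngestBytes / PB).toFixed(1), 'PB/day');     // ≈ 57.5
console.log(uploadTbps.toFixed(1), 'Tbps upload');             // ≈ 5.3
console.log((monthlyCostUsd / 1e6).toFixed(0), 'M USD/month'); // ≈ 800
```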
Uploading a 50MB video over a mobile network is fraught with failure risks. The upload architecture must handle unreliable networks gracefully.
Rather than uploading files as a single payload, resumable uploads break files into chunks:
```
PROTOCOL FLOW
═════════════

1. INITIATE UPLOAD
   ────────────────
   Client → Server:
   {
     filename: "video.mp4",
     filesize: 52_428_800,          // 50 MB
     mime_type: "video/mp4",
     checksum: "sha256:abc123...",
     chunk_size: 1_048_576          // 1 MB chunks
   }

   Server → Client:
   {
     upload_id: "upload_xyz123",
     upload_url: "https://upload.example.com/upload_xyz123",
     expires_at: 1704672000         // Upload session expires in 1 hour
   }

2. UPLOAD CHUNKS
   ──────────────
   For each 1 MB chunk:

   Client → Server:
   PUT /upload_xyz123
   Content-Range: bytes 0-1048575/52428800
   Body: [chunk 1 data]

   Server → Client:
   HTTP 308 Resume Incomplete
   Range: bytes=0-1048575           // Confirming received

   [Continue for all 50 chunks...]

3. FINALIZE UPLOAD
   ────────────────
   After last chunk:

   Server → Client:
   HTTP 200 OK
   {
     media_id: "media_abc123",
     url: "https://cdn.example.com/media/abc123",
     size: 52_428_800,
     checksum_verified: true
   }

4. RESUME AFTER FAILURE
   ─────────────────────
   If connection lost after chunk 25:

   Client → Server:
   PUT /upload_xyz123
   Content-Range: bytes */52428800  // Asking what's been received

   Server → Client:
   HTTP 308 Resume Incomplete
   Range: bytes=0-26214399          // Got chunks 0-24 (25 MB)

   Client resumes from chunk 25, not from beginning!
```
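A minimal client-side sketch of this flow. The Content-Range / 308 convention mirrors the protocol above; the function names and error handling are illustrative, not a specific library's API:

```typescript
const CHUNK_SIZE = 1_048_576; // 1 MB, matching the protocol above

// Ask the server how many bytes it already has (step 4: resume after failure).
async function getCommittedOffset(uploadUrl: string, totalSize: number): Promise<number> {
  const res = await fetch(uploadUrl, {
    method: 'PUT',
    headers: { 'Content-Range': `bytes */${totalSize}` },
  });
  if (res.status === 308) {
    const range = res.headers.get('Range');       // e.g. "bytes=0-26214399"
    return range ? Number(range.split('-')[1]) + 1 : 0;
  }
  return res.ok ? totalSize : 0;                  // 200 → upload already complete
}

// Upload (or resume) a file chunk by chunk.
async function resumableUpload(uploadUrl: string, file: Blob): Promise<void> {
  let offset = await getCommittedOffset(uploadUrl, file.size);

  while (offset < file.size) {
    const end = Math.min(offset + CHUNK_SIZE, file.size);
    const res = await fetch(uploadUrl, {
      method: 'PUT',
      headers: { 'Content-Range': `bytes ${offset}-${end - 1}/${file.size}` },
      body: file.slice(offset, end),
    });
    if (res.status !== 308 && !res.ok) {
      throw new Error(`chunk upload failed: ${res.status}`); // caller retries later, resuming at offset
    }
    offset = end;
  }
}
```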
```
┌─────────────────────────────────────────────────────────────────────────┐
│                          MEDIA UPLOAD PIPELINE                           │
└─────────────────────────────────────────────────────────────────────────┘

            ┌──────────────────┐
 Client ───►│  Upload Gateway  │  • Handles chunked uploads
            │  (Edge Server)   │  • Stores chunks temporarily in local SSD
            └────────┬─────────┘  • Validates checksums per-chunk
                     │
                     │ On complete upload:
                     ▼
            ┌──────────────────┐
            │ Chunk Assembler  │  • Assembles chunks into complete file
            │                  │  • Verifies final checksum
            └────────┬─────────┘  • Uploads to object storage
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
 ┌────────────┐ ┌────────────┐ ┌────────────┐
 │ Thumbnail  │ │ Transcoding│ │ Encryption │
 │ Generator  │ │  Pipeline  │ │  Service   │
 └──────┬─────┘ └──────┬─────┘ └──────┬─────┘
        │              │              │
        └──────────────┴──────────────┘
                       │
                       ▼
            ┌──────────────────┐
            │  Object Storage  │  • S3-compatible storage
            │   (Encrypted)    │  • Multiple regions
            └────────┬─────────┘  • Lifecycle policies
                     │
                     ▼
            ┌──────────────────┐
            │    CDN Origin    │  • Edge caching
            │                  │  • Global distribution
            └──────────────────┘
```

Modern apps compress media client-side before upload. A 5MB photo can compress to 200KB with acceptable quality loss. Video is transcoded to H.264/H.265 at lower bitrates. This reduces upload time and storage costs by 10-20x, making the user experience much better on slow networks.
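A browser-side sketch of that pre-upload compression step, using standard canvas APIs. The 1600px cap and 0.7 JPEG quality are illustrative choices, not values from the source:

```typescript
// Downscale and re-encode an image before upload (browser only).
async function compressImage(
  file: Blob,
  maxDimension = 1600,
  quality = 0.7,
): Promise<Blob> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxDimension / Math.max(bitmap.width, bitmap.height));

  const canvas = new OffscreenCanvas(
    Math.round(bitmap.width * scale),
    Math.round(bitmap.height * scale),
  );
  const ctx = canvas.getContext('2d');
  if (!ctx) throw new Error('2D context unavailable');
  ctx.drawImage(bitmap, 0, 0, canvas.width, canvas.height);

  // Re-encode as JPEG; quality trades size for fidelity.
  const compressed = await canvas.convertToBlob({ type: 'image/jpeg', quality });

  // Fall back to the original if compression did not help (e.g., small PNGs).
  return compressed.size < file.size ? compressed : file;
}
```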
Storing exabytes of media requires a carefully designed storage architecture with multiple tiers, geographic distribution, and efficient deduplication.
Not all media is accessed equally. A tiered storage approach optimizes cost and performance:
| Tier | Access Pattern | Storage Type | Cost/GB/Mo |
|---|---|---|---|
| Hot (0-7 days) | Frequent access (viewing, forwarding) | SSD/NVMe, CDN edge cache | ~$0.10 |
| Warm (7-90 days) | Occasional access | Standard object storage (S3, GCS) | ~$0.02 |
| Cold (90-365 days) | Rare access (search, export) | Infrequent access storage (S3-IA) | ~$0.01 |
| Archive (1+ year) | Very rare (legal holds, recovery) | Glacier, Archive storage | ~$0.004 |
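A sketch of the tiering decision this table implies, driven by age and last access. The thresholds mirror the table; the type and function names are illustrative:

```typescript
type StorageTier = 'HOT' | 'WARM' | 'COLD' | 'ARCHIVE';

// Decide which tier a media object belongs in, per the table above.
function chooseTier(uploadedAt: Date, lastAccessedAt: Date, now = new Date()): StorageTier {
  const DAY = 86_400_000;
  const ageDays = (now.getTime() - uploadedAt.getTime()) / DAY;
  const idleDays = (now.getTime() - lastAccessedAt.getTime()) / DAY;

  if (ageDays <= 7) return 'HOT';
  // Recently re-accessed media stays hot even if old (forwards, re-shares).
  if (idleDays <= 7) return 'HOT';
  if (ageDays <= 90) return 'WARM';
  if (ageDays <= 365) return 'COLD';
  return 'ARCHIVE';
}
```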
Each media item requires metadata for retrieval, processing status, and access control:
```typescript
interface MediaObject {
  // Identification
  id: string;                    // Globally unique media ID
  uploaderId: string;            // Who uploaded this media
  uploadedAt: Date;              // When it was uploaded

  // Content properties
  mimeType: string;              // "image/jpeg", "video/mp4", etc.
  originalFilename: string;      // User's original filename (encrypted)
  sizeBytes: number;             // Original size
  duration?: number;             // For video/audio, in seconds
  dimensions?: {                 // For images/videos
    width: number;
    height: number;
  };

  // Encryption (E2EE media)
  encryptionKey: string;         // Media key, encrypted with message key
  encryptionIv: string;          // Initialization vector
  keyHash: string;               // Hash for verification

  // Storage references
  storageKey: string;            // Object storage key (e.g., S3 key)
  storageBucket: string;         // Which bucket
  storageRegion: string;         // Primary region
  replicatedRegions: string[];   // Where replicated
  storageTier: StorageTier;      // HOT | WARM | COLD | ARCHIVE

  // Derived content
  thumbnails: {
    small: ThumbnailRef;         // 150px max dimension
    medium: ThumbnailRef;        // 300px
    large: ThumbnailRef;         // 800px (for preview)
  };
  transcodes?: {                 // For video
    quality_240p: TranscodeRef;
    quality_480p: TranscodeRef;
    quality_720p: TranscodeRef;
    quality_1080p: TranscodeRef;
  };

  // Lifecycle
  lastAccessedAt: Date;          // For tiering decisions
  expiresAt?: Date;              // For disappearing messages
  deletedAt?: Date;              // Soft delete
}

interface ThumbnailRef {
  storageKey: string;
  sizeBytes: number;
  dimensions: { width: number; height: number };
  encryptionKey: string;         // Thumbnails are also E2EE
}
```

Many users share the same memes, news clips, and viral content. Deduplication can save significant storage:
Content-based hashing: Before encryption, compute a content hash. If identical content exists, reference the existing blob.
Challenge with E2EE: Each sender encrypts with different keys, so the ciphertext differs even for identical plaintext. Deduplication must happen before encryption.
Privacy consideration: Deduplication leaks information ("this content has been shared before"). Most E2EE systems skip deduplication to avoid this metadata leak, accepting the storage cost.
Practical approach: Deduplicate only for media sent by the same user (forwarding their own content uses same encrypted blob).
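A sketch of that same-sender dedup check, performed before encryption. Node-style, using `node:crypto`; the `DedupIndex` interface is a hypothetical store keyed by uploader plus content hash:

```typescript
import { createHash } from 'node:crypto';

interface DedupIndex {
  get(key: string): Promise<string | undefined>;   // existing media_id, if any
  put(key: string, mediaId: string): Promise<void>;
}

// Returns an existing media_id when the same user re-sends identical content,
// so the already-encrypted blob can be referenced instead of re-uploaded.
async function findOrRegister(
  uploaderId: string,
  plaintext: Buffer,
  newMediaId: string,
  index: DedupIndex,
): Promise<{ mediaId: string; deduplicated: boolean }> {
  // Hash the *plaintext* — after encryption, identical content diverges.
  const contentHash = createHash('sha256').update(plaintext).digest('hex');
  const key = `${uploaderId}:${contentHash}`;      // scoped per user to avoid cross-user metadata leaks

  const existing = await index.get(key);
  if (existing) return { mediaId: existing, deduplicated: true };

  await index.put(key, newMediaId);
  return { mediaId: newMediaId, deduplicated: false };
}
```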
Media is typically replicated to 2-3 regions for durability and latency. A user in Brazil receives media faster from São Paulo than from US-East. But full replication of 40 EB is expensive. Strategies: replicate only hot tier globally, store cold/archive in primary region only.
Users shouldn't wait for a 50MB video to download just to see what it contains. Thumbnails and previews provide instant visual feedback while the full content loads.
```
IMAGE THUMBNAILS
════════════════

┌─────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Original   │─────►│  Image Decoder  │─────►│    Resize +     │
│   (3 MB)    │      │    (libvips)    │      │    Compress     │
└─────────────┘      └─────────────────┘      └────────┬────────┘
                                                       │
                      ┌────────────────────────────────┴──────┐
                      │                                        │
                      ▼                                        ▼
               ┌─────────────┐                          ┌─────────────┐
               │   150x150   │                          │   800x800   │
               │  (5-10 KB)  │                          │ (50-100 KB) │
               │  Tiny thumb │                          │   Preview   │
               └─────────────┘                          └─────────────┘

VIDEO THUMBNAILS
════════════════

┌─────────────┐      ┌─────────────────┐
│    Video    │─────►│ FFmpeg extract  │──────► 3 key frames at 10%, 50%, 90%
│   (50 MB)   │      │   key frames    │                      │
└─────────────┘      └─────────────────┘                      ▼
                                         ┌───────────────────────────────┐
                                         │    Animated Preview (GIF)     │
                                         │    or first frame as JPEG     │
                                         │    Size: 50-200 KB            │
                                         └───────────────────────────────┘

VOICE NOTES
═══════════

┌─────────────┐      ┌─────────────────┐      ┌─────────────────┐
│    Audio    │─────►│    Waveform     │─────►│   Serialized    │
│  (500 KB)   │      │   Extraction    │      │  Visualization  │
└─────────────┘      └─────────────────┘      │  Data (1-2 KB)  │
                                              └─────────────────┘

Client renders waveform from data, avoiding image file overhead.
```

Even thumbnails take time to download. BlurHash provides instant visual placeholders that render from ~30 bytes of data:
```typescript
// BlurHash encodes an image into a short string (~30 characters)
// that can be decoded into a blurry placeholder image on the client.

// Server-side: Generate BlurHash during upload
import { encode } from 'blurhash';

// Assumed helper: decodes an image buffer into raw RGBA pixels (e.g., via sharp/libvips).
declare function decodeImage(
  buffer: Buffer,
): Promise<{ width: number; height: number; data: Uint8ClampedArray }>;

async function generateBlurHash(imageBuffer: Buffer): Promise<string> {
  const { width, height, data } = await decodeImage(imageBuffer);

  // Components determine detail level
  // 4x3 = 12 components, good balance of size vs detail
  const hash = encode(data, width, height, 4, 3);
  return hash; // Example: "LEHV6nWB2yk8pyo0adR*.7kCMdnj"
}

// Client-side: Render BlurHash while real image loads
import { decode } from 'blurhash';

function renderPlaceholder(hash: string, width: number, height: number): ImageData {
  const pixels = decode(hash, width, height);
  // Returns Uint8ClampedArray of RGBA pixel data
  // Can be drawn directly to canvas
  return new ImageData(pixels, width, height);
}

// Message payload includes BlurHash:
interface MediaMessage {
  messageId: string;
  mediaId: string;
  blurHash: string;      // ~30 bytes, in message payload
  thumbnailUrl: string;  // ~10 KB, fetched separately
  fullUrl: string;       // ~3 MB, loaded on tap
  dimensions: { width: number; height: number };
}

// Rendering sequence:
// 1. Immediately:  Render BlurHash (instant, from message payload)
// 2. ~100-500ms:   Load thumbnail, replace BlurHash
// 3. On tap:       Load full image, replace thumbnail
```

The pattern of BlurHash → Thumbnail → Full image provides excellent perceived performance. Users see 'something' instantly, and details emerge progressively. This three-stage loading is now standard in image-heavy applications: Instagram, Pinterest, and WhatsApp all use variants of this approach.
Videos come in countless formats, codecs, and resolutions. A transcoding pipeline normalizes these into consistent, optimized formats for delivery.
Format normalization: iPhones produce HEVC (H.265), some cameras produce ProRes, screen recordings may be VP9. Recipients may not support all codecs.
Adaptive bitrate: Different network conditions require different quality levels. Transcoding creates multiple quality versions.
Size reduction: Uploaded 4K 100Mbps video transcodes to 720p H.264 at 2Mbps—a 50x size reduction with acceptable quality.
Fast start: Reorder video atoms (moov atom at start) for progressive playback without full download.
```
VIDEO TRANSCODING PIPELINE
═══════════════════════════

┌─────────────┐
│  Uploaded   │
│   Video     │───────────────────────────────────────────────────┐
│  (100 MB)   │                                                   │
└──────┬──────┘                                                   │
       │                                                          │
       ▼                                                          │
┌─────────────────┐                                               │
│ Input Analysis  │  FFprobe: codec, resolution, duration, bitrate│
└────────┬────────┘                                               │
         │                                                        │
         ▼                                                        │
┌─────────────────┐                                               │
│ Transcode Jobs  │  Create jobs for each output quality          │
│   Scheduler     │                                               │
└────────┬────────┘                                               │
         │                                                        │
    ┌────┴────┬─────────┬─────────┐                               │
    │         │         │         │                               │
    ▼         ▼         ▼         ▼                               ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐               ┌───────────┐
│ 240p  │ │ 480p  │ │ 720p  │ │ 1080p │  Parallel     │ Thumbnail │
│150kbps│ │600kbps│ │1.5Mbps│ │ 3Mbps │  Workers      │ Extractor │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘  (FFmpeg)     └─────┬─────┘
    │         │         │         │                         │
    └─────────┴─────────┴─────────┴─────────────────────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ Output Storage  │  Each quality level
               │  (Encrypted)    │  stored separately
               └─────────────────┘

FFMPEG EXAMPLE COMMAND:
═══════════════════════
ffmpeg -i input.mp4 \
  -c:v libx264 -preset fast \
  -vf scale=-2:720 \
  -b:v 1500k -maxrate 2000k \
  -bufsize 3000k \
  -c:a aac -b:a 128k \
  -movflags +faststart \
  output_720p.mp4

Output sizes for 1 minute video:
• 240p:  ~1 MB
• 480p:  ~4 MB
• 720p:  ~10 MB
• 1080p: ~20 MB
• Original stored: 100 MB
```

Transcoding is CPU-intensive. At 1 billion videos/day, brute-force transcoding is infeasible:
Parallelization: Split video into segments (e.g., 10-second chunks), transcode in parallel across machines, concatenate.
Priority queuing: Recent uploads get priority. Older media in the queue can wait. Short videos (<30s) often get synchronous transcoding for immediate availability.
Hardware acceleration: GPU-based encoding (NVENC, Quick Sync) is 5-10x faster than CPU for equivalent quality.
Tiered transcoding: Create 240p/480p first (fast, small), delay 1080p for later. Most mobile views use lower quality anyway.
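A sketch of how priority queuing and tiered transcoding might combine when a new upload is fanned out into jobs. The queue fields and thresholds are illustrative:

```typescript
type Quality = '240p' | '480p' | '720p' | '1080p';

interface TranscodeJob {
  mediaId: string;
  quality: Quality;
  priority: number;      // lower value = scheduled sooner
  synchronous: boolean;  // block availability until this rendition exists?
}

// Fan a new upload out into transcode jobs: cheap renditions first, HD deferred.
function planTranscodes(mediaId: string, durationSec: number): TranscodeJob[] {
  const shortClip = durationSec < 30;
  return [
    // Low qualities: fast to produce, cover most mobile playback.
    { mediaId, quality: '240p', priority: 0, synchronous: shortClip },
    { mediaId, quality: '480p', priority: 0, synchronous: shortClip },
    // HD renditions: produced later by background workers.
    { mediaId, quality: '720p', priority: 5, synchronous: false },
    { mediaId, quality: '1080p', priority: 9, synchronous: false },
  ];
}
```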
WhatsApp encourages client-side compression before upload. The client transcodes video to H.264, caps resolution, and uses efficient bitrates. This shifts transcoding work to billions of devices, drastically reducing server-side transcoding needs. Most videos are uploaded already optimized.
In an E2EE system, media must be encrypted on the sender's device before upload. The server stores only encrypted blobs it cannot decrypt.
```
SENDER'S DEVICE: UPLOAD
═══════════════════════

1. Generate random media encryption key:
   media_key = random(32 bytes)
   media_iv  = random(16 bytes)

2. Encrypt media locally:
   encrypted_media     = AES-256-CBC(media_key, media_iv, plaintext_media)
   encrypted_thumbnail = AES-256-CBC(media_key, thumbnail_iv, plaintext_thumbnail)

3. Compute integrity hash:
   file_hash = SHA256(encrypted_media)

4. Upload encrypted blobs:
   POST /upload/media
   Body: encrypted_media
   Response: { media_url: "https://cdn.example.com/encrypted/abc123" }

   POST /upload/thumbnail
   Body: encrypted_thumbnail
   Response: { thumb_url: "https://cdn.example.com/encrypted/thumb456" }

5. Include key in message (encrypted with recipient's chat key):
   message_payload = {
     type: "image",
     media_url: "https://cdn.example.com/encrypted/abc123",
     thumb_url: "https://cdn.example.com/encrypted/thumb456",
     media_key: base64(media_key),        // Encrypted in message
     media_iv: base64(media_iv),
     thumbnail_iv: base64(thumbnail_iv),
     file_size: 3145728,
     file_hash: "sha256:abc123...",
     mime_type: "image/jpeg"
   }

   E2EE_message = encrypt_with_signal_protocol(message_payload)

RECIPIENT'S DEVICE: DOWNLOAD
════════════════════════════

1. Receive and decrypt message using Signal Protocol:
   message_payload = decrypt_with_signal_protocol(E2EE_message)

2. Extract media key and URLs from decrypted payload

3. Download encrypted media from CDN:
   GET https://cdn.example.com/encrypted/abc123
   Response: encrypted_media (server cannot decrypt)

4. Decrypt locally:
   plaintext_media = AES_decrypt(media_key, media_iv, encrypted_media)

5. Verify integrity:
   assert SHA256(encrypted_media) == file_hash

6. Display decrypted media to user
```

The encrypted blob is stored on servers, but the decryption key travels only through the E2EE message channel:
Even if an attacker hacks the media storage and steals all encrypted blobs, they're useless without the keys—which only exist on sender and recipient devices.
| Aspect | What Server Sees | Actual Content |
|---|---|---|
| Photo | 3MB encrypted blob | Birthday party photo |
| Video | 50MB encrypted blob | Child's piano recital |
| Document | 10MB encrypted blob | Tax returns PDF |
| Filename | Nothing (encrypted) | vacation_2024.jpg |
| Thumbnail | Encrypted blob | Blurry preview image |
While content is encrypted, the server sees: upload timestamp, file size, MIME type (if included), IP address of uploader, who the message was sent to. This metadata can be revealing. Some protocols encrypt even metadata, but WhatsApp's implementation protects content while exposing metadata.
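A minimal Node-style sketch of sender-side steps 1–3 above: random per-media key, AES-256-CBC encryption, then SHA-256 over the ciphertext. Production schemes (including WhatsApp's attachment encryption) also derive separate cipher/MAC keys and append an HMAC, which this sketch omits:

```typescript
import { createCipheriv, createHash, randomBytes } from 'node:crypto';

interface EncryptedMedia {
  ciphertext: Buffer;  // what gets uploaded and cached by the CDN
  mediaKey: Buffer;    // travels only inside the E2EE message payload
  iv: Buffer;
  fileHash: string;    // SHA-256 of the ciphertext, for integrity checks
}

function encryptMediaForUpload(plaintext: Buffer): EncryptedMedia {
  const mediaKey = randomBytes(32);  // step 1: random 256-bit key
  const iv = randomBytes(16);

  // Step 2: encrypt locally; the server only ever sees this ciphertext.
  const cipher = createCipheriv('aes-256-cbc', mediaKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);

  // Step 3: integrity hash that the recipient re-checks after download.
  const fileHash = createHash('sha256').update(ciphertext).digest('hex');

  return { ciphertext, mediaKey, iv, fileHash };
}
```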
Delivering petabytes of media daily requires a global Content Delivery Network (CDN). CDNs cache content at edge locations close to users, reducing latency and origin server load.
```
CDN ARCHITECTURE
════════════════

        ┌─────────────────┐
        │  User in Tokyo  │
        └────────┬────────┘
                 │
                 │ 1. Request media
                 ▼
        ┌─────────────────┐
        │    Tokyo PoP    │◄───── Cache HIT? Return immediately
        │  (Edge Server)  │       (~10ms latency)
        └────────┬────────┘
                 │
                 │ Cache MISS
                 ▼
        ┌─────────────────┐
        │ Regional Cache  │◄───── Regional cache HIT?
        │   (Singapore)   │       (~50ms latency)
        └────────┬────────┘
                 │
                 │ Still MISS (rare for popular content)
                 ▼
        ┌─────────────────┐
        │  Origin Shield  │◄───── Coalesce requests to origin
        │                 │       (prevent thundering herd)
        └────────┬────────┘
                 │
                 ▼
        ┌─────────────────┐
        │   Origin (S3)   │  Media storage
        │     US-East     │  (~200ms from Tokyo)
        └─────────────────┘

CDN CONFIGURATION FOR MESSAGING:
════════════════════════════════

1. CACHE POLICY
   ─────────────
   • Media is immutable (content never changes at same URL)
   • Cache-Control: public, max-age=31536000 (1 year)
   • E2EE media: content is encrypted, safe to cache anywhere
   • Unique URLs per media item prevent stale content

2. ACCESS CONTROL
   ───────────────
   • Signed URLs: Token-based access expiring after N hours
   • Example: https://cdn.example.com/media/abc?token=xyz&expires=1704672000
   • Prevents unauthorized access even though content is encrypted
   • Rate limiting at edge to prevent abuse

3. EDGE FEATURES
   ──────────────
   • TLS termination at edge (reduces latency)
   • Brotli/gzip compression for compressible formats
   • HTTP/2 for multiplexed downloads
   • Range request support for video seeking
```

Hit rate is everything: CDN costs scale with origin requests. High cache hit rates (>95%) dramatically reduce costs.
Challenge with E2EE: Each message to different recipients uses different encryption keys, creating different ciphertext. The same photo sent to 10 people = 10 different encrypted blobs = 10 cache misses.
WhatsApp's approach: For group messages and forwards, the same encrypted blob can be reused if the message references the same media. But 1:1 messages to different people use different keys.
Long-tail problem: Rarely accessed media (old photos from years ago) may never be in cache. Accept higher origin load for cold content.
At petabyte scale, CDN egress costs dominate. Strategies: negotiate volume discounts, use multiple CDNs and route by price/performance, implement tiered caching, use efficient codecs (AVIF for images, HEVC for video) to reduce file sizes, and leverage CDN's free tier for thumbnail delivery.
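A sketch of the signed-URL access control described in the CDN configuration above, using an HMAC over the storage key and expiry. The parameter names mirror the example URL; the secret handling and hostname are illustrative:

```typescript
import { createHmac } from 'node:crypto';

// Issue a time-limited URL for an encrypted media object.
function signMediaUrl(storageKey: string, secret: string, ttlSeconds = 3600): string {
  const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
  const token = createHmac('sha256', secret)
    .update(`${storageKey}:${expires}`)
    .digest('hex');
  return `https://cdn.example.com/media/${storageKey}?expires=${expires}&token=${token}`;
}

// Edge-side check: recompute the HMAC and reject expired or forged tokens.
function verifyMediaUrl(storageKey: string, expires: number, token: string, secret: string): boolean {
  if (expires < Math.floor(Date.now() / 1000)) return false;
  const expected = createHmac('sha256', secret)
    .update(`${storageKey}:${expires}`)
    .digest('hex');
  return token === expected; // production code would use a constant-time comparison
}
```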
Media doesn't live forever. Managing the lifecycle—from upload through access patterns to eventual deletion—is essential for cost control and compliance.
```
MEDIA LIFECYCLE STAGES
══════════════════════

Day 0: UPLOAD
├── Store in HOT tier
├── Generate thumbnails
├── CDN edge caching active
└── High access probability

Day 1-7: ACTIVE
├── Remain in HOT tier
├── Frequent views (recipients, forwards)
├── CDN cache hits likely
└── No action needed

Day 7-30: COOLING
├── Monitor access frequency
├── If access_count < threshold: transition to WARM
├── Remove from CDN edge cache
└── Still quick retrieval (~100ms)

Day 30-90: WARM
├── Standard object storage
├── Moved from premium SSD
├── On-demand CDN caching only
└── ~200ms retrieval

Day 90-365: COLD
├── Infrequent access tier (S3-IA)
├── Higher retrieval cost, lower storage cost
├── Minimum storage duration charges apply
└── ~500ms retrieval

Day 365+: ARCHIVE (Optional)
├── Glacier or equivalent
├── Minutes to hours for retrieval
├── Extremely low storage cost
└── Used for legal holds, backups

DELETION TRIGGERS:
══════════════════
• User deletes message/media
• Disappearing messages timer expires
• Account deleted
• Legal hold expired
• Storage quota enforcement (if any)
```

Disappearing messages create special challenges for media:
Timer starts on view: The 7-day timer typically starts when recipient views the message, not when sent. Server must track first-view timestamp.
Multi-device sync: If recipient has 3 devices, disappearing message must disappear from all once any device views it + timer expires.
Deletion must be thorough: the encrypted blob, its thumbnails, and any transcoded variants must be removed from object storage, purged from CDN caches, and deleted from recipient devices.
Recovery concern: E2EE means once media is deleted, it's unrecoverable. Users who set 24-hour disappearing messages may accidentally lose important content.
True deletion across all replicas, caches, and backups is challenging. CDNs may serve cached content for seconds or minutes after deletion. Database replicas may apply the delete with some lag. Best effort: mark as deleted immediately, let background jobs enforce actual deletion, and accept a brief inconsistency window.
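A sketch of that best-effort approach: mark immediately, purge asynchronously. The storage and CDN interfaces here are hypothetical placeholders for whatever backends are in use:

```typescript
interface MediaStore {
  markDeleted(mediaId: string): Promise<void>;        // soft delete: hides the media instantly
  listPendingPurge(limit: number): Promise<string[]>; // soft-deleted items awaiting hard delete
  hardDelete(mediaId: string): Promise<void>;         // remove blob, thumbnails, transcodes
}

interface CdnClient {
  purge(mediaId: string): Promise<void>;              // invalidate cached copies at the edge
}

// Called when a user deletes a message or a disappearing-message timer expires.
async function deleteMedia(mediaId: string, store: MediaStore): Promise<void> {
  await store.markDeleted(mediaId);                   // fast path: immediate, user-visible
}

// Background job: enforce actual deletion, accepting a brief inconsistency window.
async function purgeWorker(store: MediaStore, cdn: CdnClient): Promise<void> {
  const pending = await store.listPendingPurge(1000);
  for (const mediaId of pending) {
    await cdn.purge(mediaId);                         // CDN may still serve briefly
    await store.hardDelete(mediaId);                  // object storage + derived assets
  }
}
```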
How media is delivered to the client affects user experience dramatically. Different strategies suit different media types and network conditions.
| Media Type | Strategy | Rationale |
|---|---|---|
| Photos | Download thumbnail inline, full on tap | Quick preview, full quality on demand |
| Short videos (<30s) | Progressive download | Download while playing; simple |
| Long videos (>30s) | Adaptive bitrate streaming (HLS/DASH) | Adjust quality to network; seek support |
| Voice notes | Full download before play | Small files; need complete for scrubbing |
| Documents | Download on tap | No preview needed during chat scroll |
| GIFs/Stickers | Preload next messages' GIFs | Ensure instant animation on scroll |
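A sketch of how a client might map the table above onto a delivery decision. The union types and the 30-second threshold are illustrative:

```typescript
type MediaKind = 'photo' | 'video' | 'voice' | 'document' | 'gif';
type DeliveryStrategy =
  | 'thumbnail-then-full-on-tap'
  | 'progressive-download'
  | 'adaptive-streaming'
  | 'full-download'
  | 'preload';

function chooseDelivery(kind: MediaKind, durationSec = 0): DeliveryStrategy {
  switch (kind) {
    case 'photo':    return 'thumbnail-then-full-on-tap';
    case 'video':    return durationSec < 30 ? 'progressive-download' : 'adaptive-streaming';
    case 'voice':    return 'full-download'; // small; needed in full for scrubbing
    case 'document': return 'full-download'; // fetched on tap, no inline preview
    case 'gif':      return 'preload';       // must animate instantly on scroll
  }
}
```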
For longer videos, adaptive streaming adjusts quality based on network conditions:
```
ADAPTIVE BITRATE STREAMING WITH E2EE
═════════════════════════════════════

CHALLENGE:
• Standard HLS/DASH expects server to segment video
• With E2EE, server cannot decode video to segment it

SOLUTION: Client-side segment decryption

1. Upload:
   Client encrypts full video with media_key
   Server stores as single encrypted blob

2. Manifest creation:
   Client/server creates HLS manifest pointing to byte ranges
   of the encrypted file:

   #EXTM3U
   #EXT-X-KEY:METHOD=NONE          // No server-side key needed
   #EXTINF:10.0,
   https://cdn.example.com/media/abc?range=0-1048576
   #EXTINF:10.0,
   https://cdn.example.com/media/abc?range=1048577-2097152
   ...

3. Playback:
   Client downloads byte ranges, decrypts each segment locally,
   feeds to video player

ALTERNATIVE: Pre-segment before encryption
• Client segments video into 10-second chunks
• Encrypts each chunk separately (same key, different IVs)
• Uploads chunks individually
• Server can serve chunks without full-file access
• More complex upload, simpler playback

QUALITY SWITCHING:
• Client monitors download speed
• If network degrades: request lower-quality segments
• Encoder ladder stored: 240p/480p/720p/1080p versions
• Seamless switching between qualities
```

Not all users have unlimited data. Smart downloading respects network conditions:
WiFi-only mode: Option to download media only on WiFi, showing placeholders on cellular.
Data saver mode: Download only thumbnails; full media on explicit tap. Reduces data by 90%+.
Quality preferences: Let users choose: 'Always HD,' 'Automatic,' or 'Data Saver.' Store per-user preference.
Network type detection: Detect cellular vs WiFi, 4G vs 3G. Adjust behavior automatically.
Background sync: When on WiFi at night, pre-download media from recent conversations for offline access.
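A sketch of a client-side auto-download policy combining these signals. The preference names and the 10 MB cellular cap are illustrative choices:

```typescript
type NetworkType = 'wifi' | 'cellular';
type QualityPreference = 'always-hd' | 'automatic' | 'data-saver';

interface DownloadContext {
  network: NetworkType;
  preference: QualityPreference;
  wifiOnlyMedia: boolean;   // user setting: download media over WiFi only
  mediaSizeBytes: number;
}

// Decide whether to auto-download full media or show a thumbnail placeholder.
function shouldAutoDownload(ctx: DownloadContext): boolean {
  if (ctx.preference === 'data-saver') return false;             // thumbnails only, tap for full
  if (ctx.wifiOnlyMedia && ctx.network !== 'wifi') return false; // placeholder on cellular

  // On cellular in 'automatic' mode, skip large files.
  if (ctx.network === 'cellular' && ctx.preference === 'automatic') {
    return ctx.mediaSizeBytes <= 10 * 1024 * 1024;
  }
  return true;                                                   // WiFi or 'always-hd'
}
```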
Smart pre-fetching improves perceived performance: when user scrolls to a chat, pre-fetch thumbnails for next ~20 messages. When user opens an image, pre-fetch next/previous images in the conversation. This makes browsing feel instant. But be careful: pre-fetched content that's never viewed wastes bandwidth.
Media handling transforms messaging from a text system into a rich multimedia platform. The architecture must handle enormous scale while maintaining the privacy guarantees of E2EE.
What's next:
With media handling covered, we'll explore presence and delivery receipts—how the system tracks who's online, when messages are delivered, and when they're read. We'll examine the real-time presence infrastructure and the privacy trade-offs of visibility features.
You now understand the complete media handling pipeline for messaging systems—from upload through storage to delivery. These patterns of chunked uploads, progressive loading, E2EE media, and CDN integration apply to any application dealing with user-generated media at scale.